Tips for using tar to archive data https://docs.rc.fas.harvard.edu/kb/tips-for-tar-archiving/ Thu, 17 Apr 2025 17:53:30 +0000 https://docs.rc.fas.harvard.edu/?post_type=epkb_post_type_1&p=28615 This document assumes you are creating a tar archive (or archives) of a directory and its contents. If your data is not contained in a single directory, the following will not work for you as written.

Example use cases:

  • You are tar’ing up data to move to tape
  • You are creating an archive for sharing or record-keeping purposes
  • You are tar’ing up data to move to some other storage location or to transfer elsewhere

The gist of this article is to help you think about and plan the creation of a file list, a checksum file, one or many tar files, and to capture any additional metadata as needed (such as file ACLs aka FACLs).

In the examples below we will use an example path of /n/mypath/scans; you should replace this with your own path.

Our initial example will then concentrate on an example sub-directory in /n/mypath/scans called myscans.
We recommend naming the resulting list, checksum, facl, and tar files so that their origin or purpose is obvious.
As such, our examples have names like: n-mypath-scans-myscans-041525.

An Important Note About Long-Term File Integrity

Keep your archive reasonably sized, both for long-term file integrity and with the intended destination in mind. If you need to create one very large tar file for, say, transferring to a colleague, that may make sense as the data will remain intact in its original location. But for archival purposes, very large files increase the potential for data loss from file corruption. A corrupted tar file’s contents may not be recoverable. So limit your footprint so that, should such a worst case happen, you do not lose all your data, only a portion. While this is not a common issue, it should be taken into account.

For example: Let’s say you have one terabyte (1TB) of data and you want to transfer it to tape. While a tape cartridge may hold up to 20TiB (roughly 22TB) of data, we would caution against making a single 1TB archive. In the unlikely event of file corruption, you could lose the entire archive. Instead, you should break the task up into smaller parts of, say, 50GB, 100GB, or 200GB. If you can do so based on directories and sub-directories, even better. So it helps to arrange and plan your archiving ahead of time.
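For instance, if /n/mypath/scans holds several sub-directories, a simple loop like the following (a minimal sketch; the destination path and naming scheme are placeholders to adapt) creates one tar per sub-directory instead of a single monolithic archive:

cd /n/mypath/scans
# one tar per sub-directory, named after its origin
for d in */; do
    name="${d%/}"
    tar -cf /path-to-store-your-tars/n-mypath-scans-"${name}"-041525.tar "$name"
done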

Creating a File List

Get a file list from the directory you intend to create an archive of. You can use this to find files later and observe the original directory structure.

cd /n/mypath/scans
find myscans/ -type f > n-mypath-scans-myscans-041525.txt

Repeat as necessary.

Create a Checksum File

find myscans/ -type f -print0|xargs --null -P $(nproc) shasum|sort > n-mypath-scans-myscans-041525.shasum

This will create SHA-1 checksums for every file under myscans.

You can also view the checksum values live on the command line, which can be useful to make sure nothing has changed since you ran the initial checksum:
cd /n/mypath/scans
find myscans/ -type f -print0|xargs --null -P $(nproc) shasum|sort

NOTE: If the data in the directory is modified after you’ve run the checksum and before you’ve tar’d it, then the checksums will no longer match when you later un-tar and compare. If you need to tar an active filesystem, then checksum’ing will not be useful to you.
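As an alternative to re-running the checksums and comparing by eye, shasum can verify files directly against the saved checksum file (a sketch, run from the same parent directory; it assumes the .shasum file is kept outside the directory being checked, as in the example above):

cd /n/mypath/scans
shasum -c n-mypath-scans-myscans-041525.shasum | grep -v 'OK$'
# any output here indicates a file that has changed or gone missing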

Create a FACLs File (if applicable)

If your filesystem has special ACLs applied and you would want to reapply them to this data if it is restored to the same location later, you should capture the ACLs/FACLs to a file. If you’re unsure, it won’t hurt to just do this regardless.

getfacl -R myscans/ > n-mypath-scans-myscans-041525.facl

If you later restore the data to its original location, you should be able to put the same ACLs back in place by running the following from the parent directory (/n/mypath/scans in our example):

setfacl --restore=n-mypath-scans-myscans-041525.facl

Create Your tar File(s)

Where you intend to initially store your tar files is up to you. If your lab space has room that’s fine, but perhaps consider using netscratch if you have a lot of data or limited lab space. You can then move the files as needed.

Using the same model of directories as above, you can create your tar files like so:

cd  /n/mypath/scans

tar -cf /path-to-store-your-tars/n-mypath-scans-myscans-041525.tar myscans

For example: tar -cf /n/netscratch/jharvard/n-mypath-scans-myscans-041525.tar myscans
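Once the tar file exists, a quick sanity check (a sketch using the example names above) is to compare the number of entries in the archive against your file list:

tar -tf /n/netscratch/jharvard/n-mypath-scans-myscans-041525.tar | grep -v '/$' | wc -l
wc -l < n-mypath-scans-myscans-041525.txt
# the two counts should match; the grep drops directory entries from the tar listing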

Caveats and Recommendations

Bear in mind you will get different checksum results depending on the path used. This is why we recommend you cd to the directory above the one you are about to tar (in our example /n/mypath/scans) and then use the relative path (in our example myscans).

For instance, using the full path:

find /n/mypath/scans/myscans/ -type f -print0|xargs --null -P $(nproc) shasum|sort|shasum
> 1519655cae31924d16e251f6040537d7e30d9a66  -
versus the relative path:
find myscans/ -type f -print0|xargs --null -P $(nproc) shasum|sort|shasum
> 35dbb789dc1c5f820f2ead0fbcd0989501db0692  -

As such, we recommend always running these commands the same way, ideally from the parent directory using just the sub-directory name in question.
If you cannot do this, checksum’ing may not work for you.

Storing Your Files and tar Archives

NOTE: This may be obvious but is worth mentioning: you cannot store the checksum file inside the tar unless you plan on removing it after un-tar’ing and before re-running the checksum, because its presence will change the results.

If this method works for you and you have some or all of the companion files for your tar file, we recommend

A) storing those files together alongside the tar file
-or-
B) storing the companion files in a known, single location in your lab space.

Option B is better when using multiple tapes (or just for peace of mind): if you need to find which tar file contains the files you want, you can check your file lists rather than having to pull multiple tapes back.

To restore your data, you do the original process in reverse. In this example let’s say I’m putting it back in /n/mypath/scans and the tar file is in my lab’s netscratch.

  • cd /n/mypath/scans
  • tar -xf /n/netscratch/my_lab/n-mypath-scans-myscans-041525.tar myscans
  • checksum myscans once tar has finished:
    find myscans/ -type f -print0|xargs --null -P $(nproc) shasum|sort
  • compare that output with the original checksum file (n-mypath-scans-myscans-041525.shasum), as shown below
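One way to do that comparison (a sketch, assuming the original .shasum file has been copied back alongside the restored data) is with diff, since both listings are sorted:

cd /n/mypath/scans
find myscans/ -type f -print0|xargs --null -P $(nproc) shasum|sort > /tmp/restored-041525.shasum
diff /tmp/restored-041525.shasum n-mypath-scans-myscans-041525.shasum && echo "checksums match"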

A Recommendation Regarding Tape

If you are storing these tar archives on tape and need to use multiple tapes, we also recommend keeping a local file which records where each tar file was stored.

That way, if you need n-mypath-scans-myscans-041525.tar you can look at this record and see that it was put on tape #2. This will make retrieval simpler and avoid wasting time pulling back both tapes.

 

Globus: Transfer Data to Tape https://docs.rc.fas.harvard.edu/kb/tape-globus-access/ Wed, 16 Mar 2022 19:40:12 +0000 https://docs.rc.fas.harvard.edu/?post_type=epkb_post_type_1&p=24793 Globus is one of the access mechanisms to transfer data to and from the NESE Tape System. Please review the Storage Service Center and Cold Storage (tape) docs, as these have references for tools like Coldfront and allocations.

See the Globus Overview documentation to learn FASRC-specific details that are needed to use Globus.

Globus login

See a video and step-by-step instructions in the Globus login documentation.

Tape Storage via Globus

Permission to access your lab’s tape collection

Access to Tape via Globus, by default, is only given to the PI and General Manager (Project Membership and Roles). The PI and General Manager can also add other lab members to access tape, but the entire group does NOT have access by default to tape storage.

If you follow the instructions below and do not see the tape Collection, that means that the PI or General Manager has not shared the Collection with you. Ask them to add you so you can access it.

If you are the PI and don’t see a tape collection, go to the Storage Service Center and request allocation using Coldfront. To update the existing allocation, go to Coldfront and update the allocation.

Access your lab’s tape collection

After you’ve logged in to Globus

  1. On the left-hand menu, go to the “File Manager”
  2. Click on the “Collections” bar and type your “Name Research Lab Tape HU FASRC”, where Name is your lab’s name (e.g., jharvard_lab will be Jharvard Research Lab Tape HU FASRC).
  3. Check the box “Search All Collections” below the “Collections” bar
    In the screenshot below is an example when searching “Research Lab Tape HU FASRC” (without Name).
  4. Click on your Research Lab collection

Share Lab Tape Collection

PIs and General Managers can share the collection with lab members.

  1. Go to the Collection > Shared with you
  2. Click on the lab tape collection.
    You can also search for your lab collection in the search area on the top menu.
  3. Click on the Permissions tab.
  4. On the top right corner, there is an “Add Permissions” button to add users. Search for a user there.
    1. To add a user, search for their name in the Username or Email field. If you are not able to search for that user, there are a few possibilities:
      1. You have the wrong Globus identity. Ask them what their Globus identity is.
      2. They don’t have a Globus account yet. Please send them the Globus File Transfer page and ask them to log in to Globus.
      3. If you need further assistance, feel free to write to rchelp@rc.fas.harvard.edu or join our office hours.
    2. Give read and write access. If it’s for a collaborator and you want to share only part of the data, please follow How To Share Data Using Globus

Transferring Data from FASRC storage to Tape

Note 1: The tape system is designed to place large amounts of data onto tape cartridges for cold storage (see the Cold Storage documentation).

  1. This is designed to be a slow I/O system; data is placed on tapes for long-term storage. User-initiated retrieval from tape should only be done at low volumes.
  2. Bulk retrieval of data from tape needs to be handled by the FASRC/NESE team. A retrieval request must be made well in advance of data needs.

Note 2: Carefully read the Tips for using tar to archive data documentation

The File Manager in Globus is used to transfer data between Tape and other FASRC storage. The Globus File Transfer doc provides more details about Globus and its features. Below is a simple example of transferring data from Tier 0 storage in Holyoke to Tape.

In the Globus File Manager, there is a Panels button on the top right; select the split-panel view. In this example, the Harvard FAS RC Holyoke collection (where most of the Holyoke storage, such as holylfsxx and holystore01, is mounted) is on the left, and the tape collection is on the right.

Before moving any data to tape, make sure the data is either compressed or consists of files larger than 100MB. Log in to the cluster and review your files, creating compressed archives for folders that contain many small files; you can use zip, tar, gzip, or your favorite compression tool. Also, name the files appropriately so you can find them from the Globus interface: in the example below, we created a folder named with the month and year, and you can use the project name, user, etc. to make discovery easier. Also see Tips for using tar to archive data.
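For instance, a folder full of small files could be bundled into a single, clearly named archive before starting the Globus transfer (a minimal sketch; the lab storage path, folder, and archive name are placeholders):

cd /n/netscratch/jharvard_lab        # placeholder for your own lab storage
tar -czf 2025-04_scan_results.tar.gz scan_results/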
You will need to review the Globus menu to see what you can do with the share you have.
After you select the data you want to transfer from storage to tape, the Start button at the top will begin the transfer. To transfer data back from tape, use the Start button on the right-hand panel. Data transfer from tape can take time: only the last few hours of data remain on buffer storage, and anything already written out to tape must first be moved back to buffer storage before it can be transferred to the requested storage/collection.

Please follow NESE documentation and Globus documentation, or write to rchelp@rc.fas.harvard.edu or join our office hours for further assistance.

Data Transfers with rclone https://docs.rc.fas.harvard.edu/kb/rclone/ Thu, 24 Oct 2019 15:19:58 +0000 https://www.rc.fas.harvard.edu/?page_id=22027  

Introduction

rclone is a convenient and performant command-line tool for transferring files and synchronizing directories directly between FAS RC file systems and Google Drive (or other supported cloud storage systems). If you are eligible, and don’t already have a Harvard Google Workspace account, see the Getting started with Google Drive page. If you require help or support for your Harvard Google Workspace account or for Google Drive itself, please contact HUIT (ithelp@harvard.edu).

Configuring rclone

rclone must be configured before first use. Each cloud service has a specific configuration. Visit rclone documentation to find the specific cloud service that you need, click on its specific “config”, and follow the rclone config steps.

Google Shared Drives

To configure access to a Google shared drive, visit rclone google drive configuration. During the configuration, there is an option to select “Configure this as a Shared Drive (Team Drive)?”

Using rclone

rclone supports many subcommands (see the complete list of rclone subcommands). A few commonly-used subcommands (assuming a Google Drive configured as gdrive):

Listing / moving / deleting objects
rclone command                                                         analogous Unix command
rclone lsf gdrive:fasrc/subfolder                                      ls fasrc/subdir
rclone lsf --format stp --separator ' ' gdrive:fasrc/subfolder         ls -l fasrc/subdir
rclone mkdir gdrive:fasrc/subfolder                                    mkdir fasrc/subdir
rclone move gdrive:fasrc/subfolder1/file1 gdrive:fasrc/subfolder2/     mv fasrc/subdir1/file1 fasrc/subdir2/
rclone rmdir gdrive:fasrc/subfolder                                    rmdir fasrc/subdir
rclone delete gdrive:fasrc/file                                        rm fasrc/file
rclone purge gdrive:fasrc/subfolder                                    rm -r fasrc/subdir

 

Transferring data

Small data transfers may be done on FAS RC cluster login nodes, while large data transfers should be done within an interactive job so that data transfer is done from a compute node; e.g.:

salloc -p test --mem 1G -t 6:00

Operands with the gdrive: prefix (assuming a Google Drive has been configured as gdrive) access Google Drive storage, while operands without gdrive: refer to a path on the FAS RC file system.

rclone copy gdrive:sourcepath destpath
rclone copy sourcepath gdrive:destpath

If sourcepath is a file, copy it to destpath.
If sourcepath is a directory/folder, recursively copy its contents to destpath. Contents of destpath that are not in sourcepath will be retained.
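For example (a sketch; gdrive, the local results directory, and the destination folder are placeholder names), --progress shows live transfer status and --transfers controls how many files are copied in parallel:

rclone copy --progress --transfers 4 results/ gdrive:fasrc/results/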

rclone sync --progress gdrive:sourcefolder destdir
rclone sync --progress sourcedir gdrive:destfolder

Replace contents of destdir/destfolder with the contents of sourcedir/sourcefolder (deleting any files not in the source).
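Because sync deletes files at the destination, it is worth previewing the operation first with rclone's --dry-run flag (shown here as a sketch with the same placeholder names):

rclone sync --dry-run sourcedir gdrive:destfolder
# review the planned copies and deletions, then re-run without --dry-run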

 

Mounting Google Drive on a FAS RC compute node

Alternatively, rclone mount can make a Google Drive (subfolder) available on a FAS RC compute node as a regular file system (e.g., supporting common commands such as cp, mv, and ls that are used to interact with a POSIX file system), with limitations.

The directory on the FAS RC node at which the Google Drive will be made available as a file system (i.e., the mountpoint) must be on a node-local file system (such as /scratch) to avoid permissions issues when unmounting the file system. In particular, the mountpoint must not be within a file system in the /n/ directory, as these are all remote / network file systems.
The following example demonstrates this capability:

$ rclone lsf gdrive:fasrc/
cactus:2019.03.01--py27hdbcaa40_1.sif
ifxpong:1.4.7-ood.sif
jbrowse:1.16.5_2019-06-14.sif
subfolder/
$ mkdir /scratch/$USER
$ mkdir -m 700 /scratch/$USER/gdrive
$ rclone mount --daemon gdrive:fasrc /scratch/$USER/gdrive
$ ls -l /scratch/$USER/gdrive/
total 543900
-rw-r--r-- 1 fasrcuser fasrcgroup 495247360 May  1 16:27 cactus:2019.03.01--py27hdbcaa40_1.sif
-rw-r--r-- 1 fasrcuser fasrcgroup 50700288 Aug 22 16:05 ifxpong:1.4.7-ood.sif
-rw-r--r-- 1 fasrcuser fasrcgroup 11005952 Jun 14 15:16 jbrowse:1.16.5_2019-06-14.sif
drwxr-xr-x 1 fasrcuser fasrcgroup 0 Oct 24 10:21 subfolder
$ fusermount -uz /scratch/$USER/gdrive/
[1]+  Done                    rclone mount gdrive:fasrc /scratch/$USER/gdrive

Comments:

  • The mountpoint (/scratch/$USER/gdrive) is created with appropriate permissions (via mkdir -m 700) to ensure only the owner has access.
  • The rclone mount command is executed in the background (daemon mode).
  • fusermount -uz explicitly unmounts the Google Drive (causing the rclone mount process to terminate).
    • This performs a “lazy unmount”, which requests that the OS perform the unmount when there are no processes whose current working directory is within the directory tree rooted at the mountpoint. To guard against accidentally leaving the directory mounted if a job or interactive session is prematurely terminated, the fusermount -uz command can be immediately issued after setting the working directory of the shell process that issues the rclone mount command to the gdrive mountpoint; e.g.:
      rclone mount --daemon gdrive:fasrc /scratch/$USER/gdrive
      cd /scratch/$USER/gdrive && fusermount -uz .

      Then /scratch/$USER/gdrive will be automatically unmounted when the shell’s process has terminated or its working directory changed to a directory outside of /scratch/$USER/gdrive:

      cd ..
      [1]+ Done rclone mount gdrive:fasrc /scratch/$USER/gdrive
      

Limitations

At most 2 file transfers to Google Drive can be initiated per second. Consider bundling many small files into a .zip or .tar(.gz) file.
Other Google drive limitations are listed in the rclone Google Drive documentation.
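A minimal sketch of that bundling step (the archive and folder names are illustrative):

tar -czf small-files-bundle.tar.gz many_small_files/
rclone copy --progress small-files-bundle.tar.gz gdrive:fasrc/archives/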

Transferring Data on the Cluster https://docs.rc.fas.harvard.edu/kb/transferring-data-on-the-cluster/ Tue, 29 Jan 2019 10:20:55 +0000 https://www.rc.fas.harvard.edu/?page_id=19654

Watch Data movement on FAS Storage video

See also our Introduction to FASRC Cluster Storage video

There are several ways to move data around the cluster. The first consideration before deciding on a technique is which filesystems you are moving data between and how they are connected to the cluster. By and large, for most filesystems, especially those connected to the cluster via Infiniband, using the compute nodes themselves to move data around is your best bet. Thus, before doing any data transfers, you should either start an interactive session on the cluster or put together a batch script containing the commands you want to use to move the data (a sketch of such a script is shown below). The advantage of the batch script is that it allows you to fire off the move without having to babysit an open session, and you can also run multiple transfers at once, leveraging the power of the cluster. That said, be sure the filesystems you are transferring from and to can handle the parallel traffic: in general, Lustre filesystems can handle many parallel requests while NFS cannot.
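As a sketch of that batch-script approach (the partition, time limit, memory, and paths here are placeholders to adapt; rsync itself is covered below):

#!/bin/bash
#SBATCH -p test
#SBATCH -t 06:00:00
#SBATCH --mem=4G
#SBATCH -c 1
#SBATCH -o transfer_%j.out

# copy a folder from lab storage to netscratch
rsync -avx --progress /n/lablfs/folder/ /n/netscratch/lab/folder/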

For actually moving the data the following commands, in order of complexity, can be used:

  • cp/mv
  • rsync
  • fpsync

With rsync being the generally most useful of the commands.

cp/mv

Both cp and mv are standard Unix commands that will copy or move the data to a new location. They are easy and relatively straightforward to use. cp makes a second copy of the data; adding the -R option will copy a folder recursively. On the other hand, mv moves the data, leaving only one copy at the new location. mv is also the preferred tool for renaming files and folders, as well as for moving data within a filesystem, as all it does is change the pointer to the data. The downside of cp and mv is that neither gives any indication of how well it is performing, and neither can pick up from an incomplete transfer. Thus for bulk transfers cp and mv should be avoided. Examples of cp and mv are below:

cp file.txt /n/netscratch/lab/.
cp -R folder /n/netscratch/lab/.
mv file.txt /n/netscratch/lab/.
mv folder /n/netscratch/lab/.

rsync

For the vast majority of transfers rsync will get the job done. We have a lengthy page on rsync here. In summary though rsync can allow you to copy entire directories as well as pick up from where you left off in the transfer if the transfer fails for some reason. In addition rsync is very handy for matching the contents of two directories. The most common rsync command for data transfer is as follows:

rsync -avx --progress folder/ /n/netscratch/lab/folder/

This will ensure that the folder is mirrored exactly over to the other filesystem. It will also make sure that the copy will not traverse symlinks to other filesystems that you do not wish to copy. Be aware, though, that rsync will match the time stamps between the copies; thus the transfer will look old to the scratch cleaner if you are copying to our scratch filesystems. To have rsync use the timestamp of the date you actually did the transfer, add the --no-times option.
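For example, when copying into one of the scratch filesystems you might use this variant of the command above (a sketch; the folder names are placeholders):

rsync -avx --no-times --progress folder/ /n/netscratch/lab/folder/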

fpsync

Note: The following assumes you are running fpsync from a job or interactive session with more than one core. You cannot use fpsync on the login nodes.

rsync is really great for single-stream moves, especially when you have large files. However, for very large directories or many files, one needs to take rsync to the next level. This is what fpsync does. fpsync is essentially a parallel rsync: it generates lists of files to transfer and then spawns multiple rsync processes to do the transfers. You can set the total number of rsyncs, which parallelizes your transfer. fpsync needs to be used with care, though, as it can overwhelm nonparallel filesystems like NFS. However, for transfers between Lustre filesystems, fpsync can move data very quickly. In general the fpsync command will be:

fpsync -n NUMRSYNC -o "RSYNC OPTIONS" -O "FPSYNC OPTIONS" /n/lablfs/folder/ /n/netscratch/lab/folder/

In most situations your fpsync line will look like:

fpsync -n NUMRSYNC -o "-ax" -O "-b" /n/lablfs/folder/ /n/netscratch/lab/folder/

Note that the fpsync logs are found in /tmp on the host you are doing the transfer on, so it’s harder to get an idea as to how far along fpsync is. As a general rule it is best not to set NUMRSYNC higher than the number of cores on a host. If you submit this via a job you should also wrap fpsync in srun to get the full usage, like so:

srun -c $SLURM_CPUS_PER_TASK fpsync -n $SLURM_CPUS_PER_TASK -o "-ax" -O "-b" /n/lablfs/folder/ /n/netscratch/lab/folder/

Where the number of CPUS you request for Slurm is the number of parallel rsyncs you want to run.

The /tmp path can be changed if needed using the -t option.
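Putting the pieces together, a batch job for fpsync might look like the following sketch (the partition, resources, and the temporary/log directory are placeholders to adapt):

#!/bin/bash
#SBATCH -p test
#SBATCH -t 08:00:00
#SBATCH -c 8
#SBATCH --mem=8G
#SBATCH -o fpsync_%j.out

# keep fpsync's working/log files somewhere visible instead of the node's /tmp
srun -c $SLURM_CPUS_PER_TASK fpsync -n $SLURM_CPUS_PER_TASK -o "-ax" -O "-b" \
  -t /n/netscratch/lab/fpsync_tmp /n/lablfs/folder/ /n/netscratch/lab/folder/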

WARNING: DO NOT USE --delete as an option for fpsync

Globus File Transfer https://docs.rc.fas.harvard.edu/kb/globus-file-transfer/ Tue, 25 Oct 2016 17:02:02 +0000 https://rc.fas.harvard.edu/?p=15555 Overview

Globus is a service enabling file sharing with external collaborators without requiring them to have accounts on FAS Research Computing systems. A collaborator has to use their Globus account login and their Globus shared collection, while a FAS Research Computing user has to follow the steps described in this document to gain access to the Globus service.

Globus is a third-party service that FASRC and other universities use as a nexus to share and transfer data from/to their filesystems. It is not run by FASRC.

For more information about Globus, watch What is Globus? from the Globus Team.

Collections: Harvard FAS RC Holyoke or Boston

FAS RC has four collections (or endpoints) based on data centers and data security level

  1. “Harvard FAS RC Holyoke”: for lab shares located in the Holyoke data center
  2. “Harvard FAS RC Holyoke Secure”: for lab shares with data security level 3 located in the Holyoke data center
  3. “Harvard FAS RC Boston”: for lab shares located in the Boston data center
  4. “Harvard FAS RC Boston Secure”: for lab shares with data security level 3 located in the Boston data center

Available storage

Almost all FASRC storage options are available through Globus, with the caveat of “Available folders” below. Data security level 3 storage shares are not automatically added to Globus.

Home directories are not and will not be added to Globus under any circumstances. Sharing home directories creates a security vulnerability, as authentication keys/passwords may be shared inadvertently. Therefore, home directories cannot be shared via Globus.

Available folders

Globus can only see the folders (or directories) Lab and Users/$USER — as of 2025, the Users folder has been deprecated.  When using Globus, any folder must be inside Lab or Users/$USER. Anything outside Lab and Users/$USER, such as Everyone and Transfer, is not available through Globus.

If you find that your lab share is not seen by Globus or these sub directories don’t exist in your lab’s share, in the Holyoke or Boston collections, please contact us.

Important notes

  • You can share many things, but be careful not to share more than you should.
  • Symlinks do not work in Globus
  • To share files in various directories, you should create a parent directory and copy the various directories to the new parent directory.
  • Read Tips for using tar to archive data documentation before tarring your data
  • Globus will use your FASRC account’s permissions the same way you would if you were accessing your lab storage from a node on the cluster. This dictates what Globus can see and not see when inside the Globus File manager. Therefore:
    • You need read access to the file and directory to transfer files out
    • You need write access to the directory to copy files in
  • See Globus limits

Globus login

For how to log in to Globus, watch the video or go through the steps below.

  1. Go to Globus.
  2. In the top right corner, click on “Login”.
  3. In the organizational login page, select “Harvard University.”
  4. Log in with HarvardKey. After logging in, you will land on Globus File Manager page.
    If you are unable to complete this step, please contact HUIT to ensure your HarvardKey is current/enabled.
  5. In the Globus file manager page:
    1. Collection (or endpoint): FAS RC has four collections based on data centers
      1. “Harvard FAS RC Holyoke”: for lab shares located in the Holyoke data center
      2. “Harvard FAS RC Holyoke Secure”: for lab shares with data security level 3 located in the Holyoke data center
      3. “Harvard FAS RC Boston”: for lab shares located in the Boston data center
      4. “Harvard FAS RC Boston Secure”: for lab shares with data security level 3 located in the Boston data center
    2. Type one of these names based on the storage location.
    3. Click the Collection, and you will be asked to authenticate your FASRC account.
    4. Click Continue.
    5. Log in with FASRC username and Token (FASRC two-factor authentication with 6 digits)
      FASRC login with two text boxes. The first text box requires the FASRC username. The second text box requires the FASRC two-factor authentication. There is also a Log In button to click after the required information has been entered.
    6. In the “Path” bar, type the storage location that you would like to access.

If you have difficulty connecting at any point in the process, connect to the RC VPN and try again, as certain steps require connectivity to our internal networks.

Transferring data with Globus

Step 0: Prerequisite knowledge

If your data contains directories with hundreds or thousands of files, you will need to tar those directories up into a smaller number of archive files. (Type man tar at the command line to view the manual page for tar.)

Too many files in a single directory, while generally never a good idea, will cause Globus to go into an ‘endpoint is too busy’ state, and your transfer will time out, restart, time out, and so on.

We recommend that your tar files range in size from 1-100 GiB. There are several reasons this size range is ideal:
  1. The file will transfer more quickly, especially if the transfer is interrupted
  2. The file will be smaller if it needs to be retrieved from Tape

We highly recommend carefully reading the Tips for using tar to archive data documentation before tarring your data.
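To plan archives in that size range, it can help to check sub-directory sizes first (a simple sketch, run from the directory you intend to tar up):

du -sh */ | sort -h
# group or split sub-directories so each resulting tar file lands in the 1-100 GiB range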

Example 1: Transfer to/from a collection shared by a collaborator

In this example, we show how to transfer files to/from your folder and an external collaborator’s Globus shared collection

  1. You should have received an email or link from your collaborator
  2. Open the link in a web browser (e.g., Chrome, Firefox)
  3. Select the files to transfer
  4. Specify the transfer settings
  5. Initiate the transfer by clicking on the large arrow icon

Example 2: Share a subfolder that you own on FASRC with a collaborator

The How To Share Data Using Globus tutorial shows, step-by-step, how to share out a subfolder by first creating a Guest Collection and authorizing a collaborator’s Globus account to access your shared Guest Collection. Below is a summary of the steps. For more details, we recommend following the step-by-step instructions in How To Share Data Using Globus.

Note that this works regardless of the collaborator’s affiliation with FASRC. An external collaborator without an FASRC account and with a Globus account can transfer to/from your shared Guest Collection.

  1. Get your collaborator’s Globus account
  2. Go to Globus File Manager
  3. Select the files to share
  4. Click on the “Share” icon (or right-click and select “Share”)
  5. Click on “Add Guest Collection”
  6. Fill in the information for the new Guest Collection
  7. Click on “Create Collection”
  8. In the “Permissions” tab, click on “Add Permissions — Share With”
  9. Share with the collaborator account
  10. The collaborator will receive a notification email
  11. Collaborator will then be able to initiate file transfers to/from your shared collection

Example 3: Transfer to/from a laptop or desktop

In this example, we show how to transfer files to/from a laptop or a desktop machine to/from FASRC.

  1. Set up a Guest Collection on it by installing Globus Connect Personal software
    Note: A premium Globus account is not required to transfer files between an institutional (FAS RC) shared collection and the Globus Connect Personal created collection.
  2. Once the installation is finished, go to Globus File Manager
  3. Find your newly created collection in the “Your Collections” tab
  4. Select files to transfer
  5. Specify the transfer settings
  6. Initiate the transfer by clicking on the large arrow icon
    Note: Make sure the Globus Connect Personal app is running and connected to the internet whenever you are transferring files. If you lose internet connection, the transfer will be paused and can be resumed at a later time.

Installing Globus Connect Personal on your computer

To share data from or to your local machine with other Globus users (not just with FASRC), you can install and run a Globus Connect Personal on your computer.

Using Globus With Tier 3 Tape

Please see the Globus: Transfer Data to Tape documentation.

Globus Docs and Videos

Transferring files to DATA.RC https://docs.rc.fas.harvard.edu/kb/data-rc/ Mon, 08 Jun 2015 16:22:16 +0000 https://rc.fas.harvard.edu/?page_id=13526 Users of the data.rc.fas.harvard.edu server have three options for transferring files. Please note that if you choose option 2, the settings for FTP/S are not the same as the regular SFTP settings you might use to transfer files to other servers.

The connection methods in order of preference:

  1. Via Web browser – This is the default means of accessing data.rc

     

  2. Filezilla (If you don’t have Filezilla installed, download here)
    • Open ‘Site Manager’ (screenshot) in Filezilla and create a New Site
      host: data.rc.fas.harvard.edu
      Protocol: SFTP
      Logon Type: Interactive
      User: [your RC username]
    • IMPORTANT
      Click the Transfer tab and check the Limit number of simultaneous connections box and set Maximum number of connections to “1”.(screenshot)
    • Click Connect to connect now, or OK if you’re setting up for later use

     

  3. Map a drive to the share (Drive mapping instructions).
    Note: This may be the easiest method, but it is also often the slowest method.
    \\rcstore02.rc.fas.harvard.edu\data (Windows)
    smb://rcstore02.rc.fas.harvard.edu/data (Mac)
SFTP file transfer using Filezilla – Filtering https://docs.rc.fas.harvard.edu/kb/sftp-file-transfer-filtering/ Mon, 15 Sep 2014 12:31:09 +0000 https://rc.fas.harvard.edu/?page_id=12170 There may be times when you wish to filter the file listing in the local or remote pane. If you need to do this often, you may want to set up a filter. Unlike the search feature (binoculars icon), filters modify what is shown in the Remote Site: or Local Site: pane.
If you simply need to see files grouped together by name, date modified, filesize, etc., you do not need to use a filter; you can sort on those criteria using the attributes at the top of the file listing. Example: To sort based on date modified, click Last Modified. Click it again to reverse the sort (ascending/descending).
A NOTE ABOUT FILTERS: One of the pitfalls of using filters is forgetting they are enabled. Keep in mind that if you open up a session and files seem to be missing or oddly sorted, you may have left a filter engaged. Simply open Filename Filters and disable the filter to return to normal.

CREATING/EDITING A FILTER IN FILEZILLA

To create a filter, select View then Filename Filters from the main menu (or click its icon, 4th from the left of the ‘Search’ binoculars) to open the Directory Listing Filters window. Note that filter rules can be applied to either pane (local or remote).
Click Edit filter rules to create a new filter or edit an existing one.
Click New to add a new filter rule (or select an existing one if you wish to edit).
Give your new rule a name that will make sense to you later.
Set the criteria for your filter. You can add multiple conditions. In the example shown, only files and folders which begin with ‘Resource’ will be shown. I’ve also chosen to make the filter case-sensitive.
CAUTION: If you plan to change directories/folders with a filter enabled, you will likely want to not check the Directories box so that you can still see the directory structure. Otherwise, they may also be filtered out and you’ll have to turn the filter off in order to change directories.
Click OK to save the filter. You can now enable this new filter rule from the Directory listing filters window. Simply tick its check box (on whichever side you wish to apply it) and click OK to engage the filter.
CAUTION: It’s easy to forget you have a filter engaged. If you create or use filter rules in Filezilla and a directory/file listing does not look right or you don’t see files you expected to see, first check whether any filters are enabled.

SFTP file transfer using Filezilla (Mac/Windows/Linux) https://docs.rc.fas.harvard.edu/kb/sftp-file-transfer/ Wed, 10 Sep 2014 12:06:46 +0000 https://rc.fas.harvard.edu/?page_id=12102 Filezilla is a free and open source SFTP client which is built on modern standards. It is available cross-platform (Mac, Windows and Linux) and is actively maintained. As such Research Computing is recommending its use over previous clients, especially as it does not have some of the quirks of clients like Cyberduck or SecureFX. This document will outline setting up a bookmark in Filezilla to connect to the cluster or other RC file resources you have access to. NOTE: If your SFTP session constantly disconnects after several seconds, see this FAQ entry.

Download and Install

First you will need to download and install the Filezilla client. You can download the latest version from Filezilla-project.org. NOTE: Please download from this page and not the big green button so as to avoid bundled adware. Linux users may be able to install Filezilla using their respective package manager.


IMPORTANT: If you have never logged into the cluster before, please ensure you’ve gone through the setup process and set up your OpenAuth token before proceeding.

STEP 1

Once installed, launch Filezilla and click the Site Manager icon in the upper left to begin setting up a connection bookmark for future use.

STEP 2

Click New Site to add a new bookmark. Enter the connection details in the General tab.

  • Host:
    • If you are connecting to Cannon, enter login.rc.fas.harvard.edu
    • If you are connecting to FASSE, enter fasselogin.rc.fas.harvard.edu
  • Protocol: select SFTP – SSH File Transfer Protocol
  • Login Type: select Interactive (this is crucial, otherwise you will not be prompted for your OpenAuth token)
  • User: enter your RC account username
  • In newer versions of Filezilla, the password box will not exist, and in older versions of Filezilla it will be greyed out because we’re using Interactive login, which will instead prompt you for a password when you click Connect
  • Now click the Transfer tab

STEP 3

IMPORTANT Click the Transfer tab and check the Limit number of simultaneous connections box and set Maximum number of connections to “1”. Otherwise you will be prompted for your password and token each time the token expires and for every new simultaneous connection during file transfers.

OPTIONAL In the Advanced tab, select the local (i.e. – on your computer) directory/folder you’d like to start in when connecting. You can type this in or click the Browse button and find the directory you want. You can leave Default remote directory: blank if you simply wish to connect to your RC account’s home directory. Or, if you wish to connect to a specific directory (for instance, your lab’s shared storage or a particular folder in your home directory), you can enter this here.

Click Connect to initiate a connection. If you’re just making a bookmark for later, click OK. The first time you connect you will see a window titled “Unknown host key”. Check the “Always trust this host, add this key to the cache” box and click OK. This will store the cluster’s host key for future use.

STEP 4

A password prompt box will pop up. Enter your RC account password here.

  • Check “Remember password until FileZilla is closed”, otherwise it will prompt you periodically and interrupt transfers
  • Click OK

STEP 5

Another password box will pop up. This is for your OpenAuth token. Enter the code shown in your OpenAuth token window (or Google Authenticator or Duo Mobile, if you are using one of the alternative token generators) and click OK.

 STEP 6

You should now be connected to the cluster and see your local files in the left-hand pane and the remote files in the right-hand pane. You can drag and drop between them or drag and drop to/from file windows on your computer. When done, click the red X icon up top to disconnect.

ADVANCED TOPIC:  Filename filtering rules in Filezilla

Transferring Data Externally https://docs.rc.fas.harvard.edu/kb/transferring-data/ Mon, 14 Jun 2010 20:28:38 +0000 http://rc-dev.rc.fas.harvard.edu/transferring-data/ There are different ways in which to transfer data to and from research computing facilities. The appropriate choice will depend on the size of your data, your need to secure it and also who you wish to share it with.

To copy the data to or from a location for yourself (or a collaborator who has a Research Computing account):

When sending data to a collaborator without an account on research computing systems:

  • For files (or folders) under 20GB in size that need to be sent to individuals please use the Accellion secure file transfer.
  • For large data sets and/or for access by external users, consider using Globus
  • For unsecured long-term publishing of data on the web contact rchelp@rc.fas.harvard.edu. We can make your data available (readable to the world) over a URL. Not recommended for very large data sets. If you wish to use this option, please let us know the overall size up front.

Please contact rchelp@rc.fas.harvard.edu if your needs fall outside these directions.

rsync https://docs.rc.fas.harvard.edu/kb/rsync/ Tue, 23 Mar 2010 20:07:21 +0000 http://rc-dev.rc.fas.harvard.edu/rsync/ Rsync is a fast, versatile, remote (and local) file-copying tool. It is famous for its delta-transfer algorithm, which reduces the amount of data sent over the network by sending only the differences between the source files and the existing files in the destination. It is available on most Unix-like systems, including the FAS RC cluster and Mac OS X.
The basic syntax is: rsync SOURCE DESTINATION where SOURCE and DESTINATION are filesystem paths.
They can be local, either absolute or relative to the current working directory, or they can be remote by prefixing something like USERNAME@HOSTNAME: to the front of them.
Unlike cp and most shell commands, a trailing / character on a directory name is significant — it means the contents of the directory as opposed to the directory itself.

Examples

  • As a replacement for cp — copying a single large file, but with a progress meter:
    rsync --progress bigfile bigfile-copy
  • Make a recursive copy of local directory foo as foo-copy:

    rsync -aAvx foo/ foo-copy/

    NOTE: Never use the capital -X option. Only the lowercase -x

    The trailing slash on foo-copy/ is optional, but if it’s not on foo/, the file foo/myfile will appear as foo-copy/foo/myfile instead of foo-copy/myfile.

  • Upload the directory foo on the local machine to your home directory on the cluster:
    rsync -avxz foo/ MYUSERNAME@login.rc.fas.harvard.edu:~/foo/

    This works for individual files, too, just don’t put the trailing slashes on them.

  • Download the directory foo in your home directory on the cluster to the local machine:
    rsync -avz MYUSERNAME@login.rc.fas.harvard.edu:~/foo .
  • Update a previously made copy of foo on the cluster after you’ve made changes to the local copy:
    rsync -avz --delete foo/ MYUSERNAME@login.rc.fas.harvard.edu:~/foo/

    The --delete option has no effect when making a new copy, and therefore can be used in the previous example, too (making the commands identical), but since it recursively deletes files, it’s best to use it sparingly.

  • Update a previously made copy of foo on the cluster after you or someone else has already updated it from a different source:
    rsync -aAvz --update foo/ MYUSERNAME@login.rc.fas.harvard.edu:~/foo/

    The --update option has no effect when making a new copy, and can freely be specified in that case as well.

  • Make a backup of your entire linux system to /mnt/MYBACKUPDRIVE:
    rsync -a --exclude /proc/ --exclude /sys/ --exclude /tmp/ --exclude /var/tmp/ --exclude /mnt/ --exclude /media/ / /mnt/MYBACKUPDRIVE

    Add additional --exclude options, if appropriate.
    See rdiff-backup for a better way of making backups.

Compression

If the SOURCE and DESTINATION are on different machines with fast CPUs, especially if they’re on different networks (e.g. your home computer and the FASRC cluster), it’s recommended to add the -z option to compress the data that’s transferred.
This will cause more CPU to be used on both ends, but it is usually faster.

File Attributes, Permissions, Ownership, etc.

By default, rsync does not copy recursively, preserve timestamps, preserve non-default permissions, etc.
There are individual options for all of these things, but the option -a, which is short for archive mode, sums up many of these (-rlptgoD) and is best for producing the most exact copy.
(-A (preserve ACLs), -X (preserve extended attributes), and -H (preserve hardlinks) may also be desired on rare occasions.)
Note that if you are copying files not owned by you, preserving file ownership only works if you are root at the destination. If you are copying between systems on different authentication infrastructures, and the user/group does not exist at the destination, the numeric id is used. If that numeric id corresponds to a different user/group, the files will appear to be owned by that other user/group. If the user/group does exist on the destination, and the numeric id is different, the numeric id changes accordingly. The option --numeric-ids changes this behavior, but introduces some issues of its own, so is not recommended by default.

Updating a Copy

Rsync’s delta-transfer algorithm allows you to efficiently update copies you’ve previously made by only sending the differences needed to update the DESTINATION instead of re-copying it from scratch.
However, there are some additional options you will probably want to use depending on the type of copy you’re trying to maintain.
If you want to maintain a mirror, i.e. the DESTINATION is to be an exact copy of the SOURCE, then you will want to add the --delete option.
This deletes stuff in the DESTINATION that is no longer in the SOURCE
Be careful with this option!
If you incorrectly specify the DESTINATION you may accidentally delete many files.
See also the --delete-excluded option if you’re adding --exclude options that were not used when making the original copy.
If you’re updating a master copy, i.e. the DESTINATION may have files that are newer than the versions in SOURCE, you will want to add the --update option.
This will leave those files alone, not revert them to the older copy in SOURCE.

Progress, Verbosity, Statistics

  • -v
    Verbose mode — list each file transferred.
    Adding more -v options makes it more verbose.
  • --progress
    Show a progress meter for each file transfer (not a progress meter for the whole operation).
    If you have many small files, this can significantly slow down the transfer.
  • --stats
    Print a short paragraph of statistics at the end of the session, like average transfer rate, total numbers of files transferred, etc.

Other Useful Options

  • --dry-run
    Perform a dry-run of the session instead of actually modifying the DESTINATION.
    Most useful when adding multiple -v options, especially for verifying --delete is doing what you want (see the combined example after this list).
  • --exclude PATTERN
    Skip some parts of the SOURCE.
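Putting several of these options together (a sketch with the same placeholder names used above), you can preview exactly what a mirroring update would change before committing to it:

rsync -av --dry-run --delete --exclude '*.tmp' --stats foo/ MYUSERNAME@login.rc.fas.harvard.edu:~/foo/
# inspect the output, then re-run without --dry-run to perform the transfer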