Transferring files with remote computers
Questions
- How do I transfer files to (and from) the cluster?
Objectives
- Transfer files to and from a computing cluster.
Performing work on a remote computer is not very useful if we cannot get files to or from the cluster. There are several options for transferring data between computing resources using CLI and GUI utilities, a few of which we will cover.
Download Lesson Files From the Internet
One of the most straightforward ways to download files is to use either curl or wget. At least one of these is usually available on most Linux systems, in the macOS Terminal, and in Git Bash on Windows. Any file that can be downloaded in your web browser through a direct link can be downloaded using curl or wget. This is a quick way to download datasets or source code. The syntax for these commands is:
wget [-O new_name] https://some/link/to/a/file
curl [-o new_name] https://some/link/to/a/file
Try it out by downloading some material we’ll use later on, from a terminal on your local machine, using the URL of the current codebase:
https://github.com/hpc-carpentry/amdahl/tarball/main
Download the “Tarball”
The word “tarball” in the above URL refers to a compressed archive format commonly used on Linux, which is the operating system the majority of HPC cluster machines run. A tarball is a lot like a .zip file. The actual file extension is .tar.gz, which reflects the two-stage process used to create the file: the files or folders are merged into a single file using tar, which is then compressed using gzip, so the file extension is “tar-dot-g-z.” That’s a mouthful, so people often say “the xyz tarball” instead.
You may also see the extension .tgz, which is just an abbreviation of .tar.gz.
By default, curl and wget download files to the same name as the URL: in this case, main. Use one of the above commands to save the tarball as amdahl.tar.gz.
Solution
[user@laptop ~]$ wget -O amdahl.tar.gz https://github.com/hpc-carpentry/amdahl/tarball/main
# or
[user@laptop ~]$ curl -o amdahl.tar.gz -L https://github.com/hpc-carpentry/amdahl/tarball/main
The -L option to curl tells it to follow URL redirects (which wget does by default).
After downloading the file, use ls to see it in your working directory:
[user@laptop ~]$ ls
Archiving Files
One of the biggest challenges we often face when transferring data between remote HPC systems is dealing with large numbers of files. There is an overhead to transferring each individual file, and when we transfer many files these overheads add up and can slow the transfer considerably.
The solution to this problem is to archive multiple files into smaller numbers of larger files before we transfer the data to improve our transfer efficiency. Sometimes we will combine archiving with compression to reduce the amount of data we have to transfer and so speed up the transfer. The most common archiving command you will use on a (Linux) HPC cluster is tar.
tar can be used to combine files and folders into a single archive file and, optionally, compress the result. Let’s look at the file we downloaded from the lesson site, amdahl.tar.gz.
The .gz part stands for gzip, which is a compression library. It’s common (but not necessary!) that this kind of file can be interpreted by reading its name: it appears somebody took files and folders relating to something called “amdahl,” wrapped them all up into a single file with tar, then compressed that archive with gzip to save space.
Let’s see if that is the case, without unpacking the file. tar prints the “table of contents” with the -t flag, for the file specified with the -f flag followed by the filename. Note that you can concatenate the two flags: writing -t -f is interchangeable with writing -tf together. However, the argument following -f must be a filename, so writing -ft will not work.
[user@laptop ~]$ tar -tf amdahl.tar.gz
hpc-carpentry-amdahl-46c9b4b/
hpc-carpentry-amdahl-46c9b4b/.github/
hpc-carpentry-amdahl-46c9b4b/.github/workflows/
hpc-carpentry-amdahl-46c9b4b/.github/workflows/python-publish.yml
hpc-carpentry-amdahl-46c9b4b/.gitignore
hpc-carpentry-amdahl-46c9b4b/LICENSE
hpc-carpentry-amdahl-46c9b4b/README.md
hpc-carpentry-amdahl-46c9b4b/amdahl/
hpc-carpentry-amdahl-46c9b4b/amdahl/__init__.py
hpc-carpentry-amdahl-46c9b4b/amdahl/__main__.py
hpc-carpentry-amdahl-46c9b4b/amdahl/amdahl.py
hpc-carpentry-amdahl-46c9b4b/requirements.txt
hpc-carpentry-amdahl-46c9b4b/setup.py
This example output shows a folder which contains a few files, where 46c9b4b is a shortened git commit hash that will change when the source material is updated.
Now let’s unpack the archive. We’ll run tar with a few common flags:
- -x to extract the archive
- -v for verbose output
- -z for gzip compression
- -f «tarball» for the file to be unpacked
Using the flags above, unpack the source code tarball into a new directory named “amdahl” using tar.
[user@laptop ~]$ tar -xvzf amdahl.tar.gz
hpc-carpentry-amdahl-46c9b4b/
hpc-carpentry-amdahl-46c9b4b/.github/
hpc-carpentry-amdahl-46c9b4b/.github/workflows/
hpc-carpentry-amdahl-46c9b4b/.github/workflows/python-publish.yml
hpc-carpentry-amdahl-46c9b4b/.gitignore
hpc-carpentry-amdahl-46c9b4b/LICENSE
hpc-carpentry-amdahl-46c9b4b/README.md
hpc-carpentry-amdahl-46c9b4b/amdahl/
hpc-carpentry-amdahl-46c9b4b/amdahl/__init__.py
hpc-carpentry-amdahl-46c9b4b/amdahl/__main__.py
hpc-carpentry-amdahl-46c9b4b/amdahl/amdahl.py
hpc-carpentry-amdahl-46c9b4b/requirements.txt
hpc-carpentry-amdahl-46c9b4b/setup.py
Note that we did not need to type out -x -v -z -f, thanks to flag concatenation, though the command works identically either way – so long as the concatenated list ends with f, because the next string must specify the name of the file to extract.
The folder has an unfortunate name, so let’s change that to something more convenient.
[user@laptop ~]$ mv hpc-carpentry-amdahl-46c9b4b amdahl
Check the size of the extracted directory and compare to the compressed file size, using du for “disk usage”.
[user@laptop ~]$ du -sh amdahl.tar.gz
8.0K amdahl.tar.gz
[user@laptop ~]$ du -sh amdahl
48K amdahl
Text files (including Python source code) compress nicely: the “tarball” is one-sixth the total size of the raw data!
If you want to reverse the process – compressing raw data instead of extracting it – set a c flag instead of x, set the archive filename, then provide a directory to compress:
[user@laptop ~]$ tar -cvzf compressed_code.tar.gz amdahl
amdahl/
amdahl/.github/
amdahl/.github/workflows/
amdahl/.github/workflows/python-publish.yml
amdahl/.gitignore
amdahl/LICENSE
amdahl/README.md
amdahl/amdahl/
amdahl/amdahl/__init__.py
amdahl/amdahl/__main__.py
amdahl/amdahl/amdahl.py
amdahl/requirements.txt
amdahl/setup.py
Be careful which filename you give: if you used amdahl.tar.gz in the above command, tar would overwrite the tarball you downloaded earlier with a fresh archive containing only the renamed amdahl folder, and the original download would be lost.
Working with Windows
When you transfer text files from a Windows system to a Unix system (Mac, Linux, BSD, Solaris, etc.), this can cause problems. Windows encodes its text files slightly differently from Unix, adding an extra character to every line.
On a Unix system, every line in a file ends with a \n (newline). On Windows, every line in a file ends with a \r\n (carriage return + newline). This causes problems sometimes.
Though most modern programming languages and software handle this correctly, in some rare instances you may run into an issue. The solution is to convert a file from Windows to Unix encoding with the dos2unix command.
You can identify if a file has Windows line endings with cat -A filename. A file with Windows line endings will have ^M$ at the end of every line. A file with Unix line endings will have $ at the end of a line.
To convert the file, just run dos2unix filename. (Conversely, to convert back to Windows format, you can run unix2dos filename.)
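As a quick sketch (the filename script.sh and its contents are just an example), checking and fixing a file might look like this:
[user@laptop ~]$ cat -A script.sh
echo "hello"^M$
[user@laptop ~]$ dos2unix script.sh
[user@laptop ~]$ cat -A script.sh
echo "hello"$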
Transferring Single Files and Folders With scp
To copy a single file to or from the cluster, we can use scp (“secure copy”). The syntax can be a little complex for new users, but we’ll break it down. The scp command is a relative of the ssh command we used to access the system, and can use the same public-key authentication mechanism.
To upload to another computer, the template command is
[user@laptop ~]$ scp local_file NetID@greene.hpc.nyu.edu:remote_destination
in which @ and : are field separators and remote_destination is a path relative to your remote home directory, or a new filename if you wish to change it, or both a relative path and a new filename. If you don’t have a specific folder in mind you can omit the remote_destination and the file will be copied to your home directory on the remote computer (with its original name). If you include a remote_destination, note that scp interprets this the same way cp does when making local copies: if it exists and is a folder, the file is copied inside the folder; if it exists and is a file, the file is overwritten with the contents of local_file; if it does not exist, it is assumed to be a destination filename for local_file.
Upload the lesson material to your remote home directory like so:
[user@laptop ~]$ scp amdahl.tar.gz NetID@greene.hpc.nyu.edu:
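If you wanted to place the file inside an existing remote folder, or give it a different name as it is copied, you could add a destination after the colon. As a sketch (the project folder here is hypothetical and would need to exist on the remote side for the first command to copy into it):
[user@laptop ~]$ scp amdahl.tar.gz NetID@greene.hpc.nyu.edu:project/
[user@laptop ~]$ scp amdahl.tar.gz NetID@greene.hpc.nyu.edu:project/amdahl-latest.tar.gz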
Transferring a Directory
To transfer an entire directory, we add the -r flag for “recursive”: copy the item specified, and every item below it, and every item below those… until it reaches the bottom of the directory tree rooted at the folder name you provided.
[user@laptop ~]$ scp -r amdahl NetID@greene.hpc.nyu.edu:
For a large directory – either in size or number of files – copying with -r can take a long time to complete.
When using scp, you may have noticed that a : always follows the remote computer name. A string after the : specifies the remote directory you wish to transfer the file or folder to, including a new name if you wish to rename the remote material. If you leave this field blank, scp defaults to your home directory and the name of the local material to be transferred.
On Linux computers, / is the separator in file or directory paths. A path starting with a / is called absolute, since there can be nothing above the root /. A path that does not start with / is called relative, since it is not anchored to the root.
If you want to upload a file to a location inside your home directory – which is often the case – then you don’t need a leading /. After the :, you can type the destination path relative to your home directory. If your home directory is the destination, you can leave the destination field blank, or type ~ – the shorthand for your home directory – for completeness.
With scp, a trailing slash on the target directory is optional, and has no effect. A trailing slash on a source directory is important for other commands, like rsync.
As you gain experience with transferring files, you may find the scp command limiting. The rsync utility provides advanced features for file transfer and is typically faster than both scp and sftp (see below). It is especially useful for transferring large and/or many files and for synchronizing folder contents between computers.
The syntax is similar to scp. To transfer to another computer with commonly used options:
[user@laptop ~]$ rsync -avP amdahl.tar.gz NetID@greene.hpc.nyu.edu:
The options are:
- -a (archive) to preserve file timestamps, permissions, and folders, among other things; implies recursion
- -v (verbose) to get verbose output to help monitor the transfer
- -P (partial/progress) to preserve partially transferred files in case of an interruption and also display the progress of the transfer
To recursively copy a directory, we can use the same options:
[user@laptop ~]$ rsync -avP amdahl NetID@greene.hpc.nyu.edu:~/
As written, this will place the local directory and its contents under your home directory on the remote system. If a trailing slash is added to the source, a new directory corresponding to the transferred directory will not be created, and the contents of the source directory will be copied directly into the destination directory.
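For example, adding a trailing slash to the command above copies only the contents of amdahl, placing files like README.md and setup.py loose in your remote home directory (probably not what you want here):
[user@laptop ~]$ rsync -avP amdahl/ NetID@greene.hpc.nyu.edu:~/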
To download a file, we simply change the source and destination:
[user@laptop ~]$ rsync -avP NetID@greene.hpc.nyu.edu:amdahl ./
File transfers using both scp and rsync use SSH to encrypt data sent through the network. So, if you can connect via SSH, you will be able to transfer files. By default, SSH uses network port 22. If a custom SSH port is in use, you will have to specify it using the appropriate flag, often -p, -P, or --port. Check --help or the man page if you’re unsure.
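For example, with scp the relevant flag is -P (capital P). If a system used a hypothetical custom port 2222, the upload from earlier would become:
[user@laptop ~]$ scp -P 2222 amdahl.tar.gz NetID@greene.hpc.nyu.edu: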
Change the Rsync Port
Say we have to connect rsync through port 768 instead of 22. How would we modify this command?
[user@laptop ~]$ rsync amdahl.tar.gz NetID@greene.hpc.nyu.edu:
Hint: check the man page or “help” for rsync.
Solution
[user@laptop ~]$ man rsync
[user@laptop ~]$ rsync --help | grep port
--port=PORT specify double-colon alternate port number
See http://rsync.samba.org/ for updates, bug reports, and answers
[user@laptop ~]$ rsync --port=768 amdahl.tar.gz NetID@greene.hpc.nyu.edu:
This command will fail, as the correct port in this case is the default: 22.
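A note on the option above: rsync's --port applies to connections made to an rsync daemon (the "double-colon" syntax mentioned in the help output). For SSH-based transfers like the ones in this lesson, a custom port is more commonly passed to ssh itself with the -e option, for example:
[user@laptop ~]$ rsync -e 'ssh -p 768' amdahl.tar.gz NetID@greene.hpc.nyu.edu: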
Transferring Files Interactively with FileZilla
FileZilla is a cross-platform client for downloading and uploading files to and from a remote computer. It is free, straightforward to set up, and generally works well. It uses the sftp protocol. You can read more about using the sftp protocol in the command line in the lesson discussion.
Download and install the FileZilla client from https://filezilla-project.org. After installing and opening the program, you should end up with a window with a file browser of your local system on the left hand side of the screen. When you connect to the cluster, your cluster files will appear on the right hand side.
To connect to the cluster, we’ll just need to enter our credentials at the top of the screen:
- Host: sftp://greene.hpc.nyu.edu
- User: Your NetID
- Password: Your NetID password
- Port: (leave blank to use the default port)
Hit “Quickconnect” to connect. You should see your remote files appear on the right hand side of the screen. You can drag-and-drop files between the left (local) and right (remote) sides of the screen to transfer files.
Finally, if you need to move large files (typically larger than a gigabyte) from one remote computer to another remote computer, SSH in to the computer hosting the files and use scp or rsync to transfer them to the other. This will be more efficient than using FileZilla (or related applications), which would copy from the source to your local machine, then to the destination machine.
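As a rough sketch (the second hostname, username, and folder name below are placeholders, and the remote prompt is illustrative), that workflow might look like:
[user@laptop ~]$ ssh NetID@greene.hpc.nyu.edu
[NetID@greene ~]$ rsync -avP large_dataset other_user@other.cluster.example.org: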
Limiting Factors of Data Transfer Speed
Data transfer speed may be limited by many different factors so the best data transfer mechanism to use depends on the type of data being transferred and where the data is going. The components between your data’s source and destination have varying levels of performance, and in particular, may have different capabilities with respect to bandwidth and latency.
Bandwidth is generally the raw amount of data per unit time a device is capable of transmitting or receiving. It’s a common and generally well-understood metric.
Latency is a bit more subtle. For data transfers, it may be thought of as the amount of time it takes to get data out of storage and into a transmittable form. Latency issues are the reason it’s advisable to execute data transfers by moving a small number of large files, rather than the converse.
Some of the key components and their associated issues are:
- Disk speed: File systems on HPC systems are often highly parallel, consisting of a very large number of high performance disk drives. This allows them to support a very high data bandwidth. Unless the remote system has a similar parallel file system you may find your transfer speed limited by disk performance at that end.
- Meta-data performance: Meta-data operations such as opening and closing files or listing the owner or size of a file are much less parallel than read/write operations. If your data consists of a very large number of small files you may find your transfer speed is limited by meta-data operations. Meta-data operations performed by other users of the system can also interact strongly with those you perform so reducing the number of such operations you use (by combining multiple files into a single file) may reduce variability in your transfer rates and increase transfer speeds.
- Network speed: Data transfer performance can be limited by network speed. More importantly it is limited by the slowest section of the network between source and destination. If you are transferring to your laptop/workstation, this is likely to be its connection (either via LAN or WiFi).
- Firewall speed: Most modern networks are protected by some form of firewall that filters out malicious traffic. This filtering has some overhead and can result in a reduction in data transfer performance. The needs of a general purpose network that hosts email/web-servers and desktop machines are quite different from a research network that needs to support high volume data transfers. If you are trying to transfer data to or from a host on a general purpose network you may find the firewall for that network will limit the transfer rate you can achieve.
As mentioned above, if you have related data that consists of a large number of small files it is strongly recommended to pack the files into a larger archive file for long term storage and transfer. A single large file makes more efficient use of the file system and is easier to move, copy and transfer because significantly fewer metadata operations are required. Archive files can be created using tools like tar and zip. We have already met tar when we talked about data transfer earlier.
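For instance, a zip archive of a folder named data (a hypothetical example here) could be created and later unpacked like this:
[user@laptop ~]$ zip -r data.zip data
[user@laptop ~]$ unzip data.zip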
[Figure: schematic diagram of bandwidth and latency for disk and network I/O. Each component is connected by a blue line whose width is proportional to the interface bandwidth; the small mazes at the link points illustrate latency, with more tortuous mazes indicating higher latency.]
Consider the Best Way to Transfer Data
If you are transferring large amounts of data you will need to think about what may affect your transfer performance. It is always useful to run some tests that you can use to extrapolate how long it will take to transfer your data.
Say you have a “data” folder containing 10,000 or so files, a healthy mix of small and large ASCII and binary data. Which of the following would be the best way to transfer them to Greene?
[user@laptop ~]$ scp -r data NYUNetID@greene.hpc.nyu.edu:~/
[user@laptop ~]$ rsync -ra data NYUNetID@greene.hpc.nyu.edu:~/
[user@laptop ~]$ rsync -raz data NYUNetID@greene.hpc.nyu.edu:~/
[user@laptop ~]$ tar -cvf data.tar data
[user@laptop ~]$ rsync -raz data.tar NYUNetID@greene.hpc.nyu.edu:~/
[user@laptop ~]$ tar -cvzf data.tar.gz data
[user@laptop ~]$ rsync -ra data.tar.gz NYUNetID@greene.hpc.nyu.edu:~/
Solution
- scp will recursively copy the directory. This works, but without compression.
- rsync -ra works like scp -r, but preserves file information like creation times. This is marginally better.
- rsync -raz adds compression, which will save some bandwidth. If you have a strong CPU at both ends of the line, and you're on a slow network, this is a good choice.
- This command first uses tar to merge everything into a single file, then rsync -z to transfer it with compression. If you have a large number of files, metadata overhead can hamper your transfer, so this is a good idea.
- This command uses tar -z to compress the archive, then rsync to transfer it. This may perform similarly to the command directly above, but in most cases (for large datasets), it's the best combination of high throughput and low latency (making the most of your time and network connection).
Key Points
- wget and curl -O download a file from the internet.
- scp and rsync transfer files to and from your computer.
- You can use an SFTP client like FileZilla to transfer files through a GUI.