[HPC 101] Data Transfer: How to Move Files In and Out

5 minute read

We have the computing power. Now we need the Data.

Moving files between your local machine (laptop/workstation) and the HPC cluster is a daily routine for researchers. You have your code, input data, and eventually, the results. This guide covers the best practices for file transfer, from “packing” your files to handling massive datasets.

> 1. The Golden Rule: Pack Before You Move

Think of this process like moving into a new house.

In the previous post, we compared the HPC cluster to a Hotel. Your laptop is your old house. Now, you need to move your belongings (data) to the new place.

Imagine you have 10,000 pairs of socks (small data files). Would you carry them one by one to the moving truck? No, that would take forever. You would put them in a box first.

In HPC, transferring thousands of small files individually kills transfer performance, because every file adds its own connection and metadata overhead. Always archive your folder first.

Choose Your Box: Tar vs. Zip

# Packing (create archive)
$ tar -czf my_data.tar.gz my_folder
# -c: Create
# -z: Gzip compression
# -f: File name

# Unpacking (extract archive)
$ tar -xf my_data.tar.gz
# -x: Extract
# -f: File name
# (No need to type -z, tar detects it automatically)

# Packing (create archive)
$ zip -r my_data.zip my_folder
# -r: Recursive (includes all subdirectories)

# Unpacking (extract archive)
$ unzip my_data.zip
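
Tip: before unpacking an archive someone sent you, it can be worth peeking inside it first. This is just a quick sanity check using the standard listing options of tar and unzip:

# List contents without extracting
$ tar -tzf my_data.tar.gz
$ unzip -l my_data.zip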


> 2. Direct Download (Web to HPC)

Scenario: Your data is hosted on a website (GitHub, Kaggle, Data Portal).

Do not download it to your laptop just to upload it again to the cluster. That is double work. Just order your “delivery” directly to the hotel (Cluster)!

Use wget or curl on the login node to pull files directly.

# Option 1: Using wget
$ wget https://example.com/dataset.tar.gz

# Option 2: Using curl
$ curl -o dataset.tar.gz https://example.com/dataset.tar.gz
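
If a large download gets interrupted, both tools can usually pick up where they stopped (assuming the server supports resuming):

# Resume a partial download
$ wget -c https://example.com/dataset.tar.gz
$ curl -C - -o dataset.tar.gz https://example.com/dataset.tar.gz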


> 3. Transfer Tools: SCP vs. Rsync

Scenario: The files are on your laptop. (Note: Run these commands on your Local Terminal, not inside the cluster.)

SCP (The “Throw”)

If you have a small file or a single packed archive, use scp (Secure Copy). It is simple and quick.

# Upload: Laptop -> Cluster
$ scp my_data.tar.gz <USER>@<HOST_NAME>:~/
# Example: scp data.tar.gz jane@login.university.edu:~/

# Download: Cluster -> Laptop
$ scp <USER>@<HOST_NAME>:~/results.tar.gz ./
# Example: scp jane@login.university.edu:~/results.tar.gz ./
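
If your cluster listens on a non-standard SSH port, scp takes an uppercase -P flag. The port below is only an illustration; check your cluster's documentation for the real one:

# Upload over a custom port (e.g., 2222)
$ scp -P 2222 my_data.tar.gz <USER>@<HOST_NAME>:~/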

Rsync (The “Smart Mover”)

What if your file is huge (e.g., 100GB)? And what if your WiFi disconnects at 99%? scp will fail, and you have to start from 0%. That is a nightmare.

Use rsync. It checks the difference between source and destination. If the connection drops, it resumes from where it left off.

$ rsync -azP my_big_data/ <USER>@<HOST_NAME>:~/data/
# Example: rsync -azP my_big_data/ jane@login.university.edu:~/data/

Understanding the flags (-azP):

  • -a: Archive mode. Preserves permissions, timestamps, and symbolic links.
  • -z: Compress file data during the transfer for faster speed.
  • -P: Shows a progress bar and keeps partially transferred files, so an interrupted transfer can resume.
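
Before sending a large directory, you can also preview exactly what rsync would copy with a dry run. This is a small sketch using rsync's -n (--dry-run) flag:

# Preview the transfer without copying anything
$ rsync -azPn my_big_data/ <USER>@<HOST_NAME>:~/data/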

Rule of Thumb:

  • Small file? Use SCP.
  • Big file or Unstable network? Use Rsync.


> 4. GUI Clients (WinSCP & FileZilla)

“I hate the terminal. Can I just drag and drop?”

Yes, you can! If you are not comfortable with command-line tools yet, or if you just want to browse files visually, use an SFTP Client.

How to Connect

The settings are exactly the same as your SSH connection.

  1. File Protocol: SFTP
  2. Host name: Your cluster address (e.g., login.university.edu)
  3. Port number: 22 (Default SSH port)
  4. User/Password: Your credentials

Once connected, you will see your laptop’s files on the left and the cluster’s files on the right. Just drag and drop to transfer!
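
If you later grow comfortable with the terminal, the same protocol also has a command-line client, sftp. A minimal session (same host and credentials as above) looks roughly like this:

# Interactive SFTP session: put = upload, get = download
$ sftp <USER>@<HOST_NAME>
sftp> put my_data.tar.gz
sftp> get results.tar.gz
sftp> exit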

Note for Globus Users: If you need to transfer massive datasets (Terabytes/Petabytes) between institutions, ask your system administrator about Globus. It is a high-performance transfer service often supported by research centers. It’s much faster and more reliable than SCP/SFTP for massive data.


> 5. Code Management with Git

Scenario: Moving your Python/C++ scripts.

Should you use rsync for your code? You can, but please don’t. Treat your code like books in a library. Use Git.

  1. Laptop: Push your code to GitHub/GitLab.
  2. Cluster: Clone or Pull the repository.
# On the Cluster
$ git clone https://github.com/username/my-project.git

This keeps your version history safe and makes collaboration much easier.
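
Once the repository exists in both places, the day-to-day loop is simple. A typical (illustrative) workflow:

# On your Laptop
$ git add -A
$ git commit -m "Update analysis script"
$ git push

# On the Cluster
$ git pull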


> 6. Storage Quota

Warning: Remember the “Hotel Room” analogy? Your room has an occupancy limit. We call it Quota.

If you fill up your disk space, your jobs will crash immediately, and you might not even be able to save a file.

How to check? Commands vary by institution. Common examples include:

  • $ quota -s
  • $ lfs quota -h ~
  • $ check_usage

Please check your user documentation or ask your help desk for the specific command. Always check your available space before transferring a massive dataset.
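
Two widely available commands can also help you estimate space before you move data. Note that filesystem free space is not the same thing as your personal quota, but it is a useful first check:

# How big is the folder I am about to transfer?
$ du -sh my_big_data/

# How much free space does the filesystem holding my home directory have?
$ df -h ~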


Summary

  1. Pack your small files (tar or zip).
  2. Use wget for web data.
  3. Use scp for quick, small transfers.
  4. Use rsync -azP for large, robust transfers.
  5. Use git for code.

Nice job! You have learned how to prepare your data. In the next post, we will learn how to manage software environments using Conda.

Happy Computing!