Using shared resources responsibly

Overview

Questions

How can I be a responsible user?
How can I protect my data?
How can I best get large amounts of data off an HPC system?

Objectives

Describe how the actions of a single user can affect the experience of others on a shared system.
Discuss the behaviour of a considerate shared system citizen.
Explain the importance of backing up critical data.
Describe the challenges with transferring large amounts of data off HPC systems.
Convert many files to a single archive file using tar.

One of the major differences between using remote HPC resources and your own system (e.g. your laptop) is that remote resources are shared. How many users the resource is shared between at any one time varies from system to system, but it is unlikely you will ever be the only user logged into or using such a system.

The widespread usage of scheduling systems where users submit jobs on HPC resources is a natural outcome of the shared nature of these resources. There are other things you, as an upstanding member of the community, need to consider.

Be Kind to the Login Nodes

The login node is often busy managing all of the logged in users, creating and editing files and compiling software. If the machine runs out of memory or processing capacity, it will become very slow and unusable for everyone. While the machine is meant to be used, be sure to do so responsibly – in ways that will not adversely impact other users’ experience.

Login nodes are always the right place to launch jobs, but data transfers should be done on the Greene Data Transfer Nodes (gDTNs). Please see more about gDTNs at Data Transfers. Similarly, computationally intensive tasks should all be done on compute nodes. This applies not just to computational analysis and research tasks, but also to processor-intensive software installations and similar tasks.

Remember, the login node is shared with all other users and your actions could cause issues for other people. Think carefully about the potential implications of issuing commands that may use large amounts of resource.

Unsure? Ask your friendly systems administrator (“sysadmin”) if the thing you’re contemplating is suitable for the login node, or if there’s another mechanism to get it done safely. Please email hpc@nyu.edu with questions.

You can always use the commands top and ps ux to list the processes that are running on the login node along with the amount of CPU and memory they are using. If this check reveals that the login node is somewhat idle, you can safely use it for your non-routine processing task. If something goes wrong – the process takes too long, or doesn’t respond – you can use the kill command along with the PID to terminate the process.
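
As a minimal sketch of that check-and-terminate workflow, the commands below start a harmless sleep process as a stand-in for a task that has stopped responding, list your processes with ps ux, and then stop the stand-in with kill. The stand-in process and the variable names are illustrative assumptions, not part of any Greene tooling.

```shell
# Start a harmless background process as a stand-in for a stuck task
# (illustrative only; on a real system you would already have the process).
sleep 300 &
stuck_pid=$!

# List your own processes with their CPU and memory usage.
# The PID is in the second column; top shows a live view of similar data.
ps ux | head -n 5

# Send the default SIGTERM signal, which asks the process to exit.
kill "$stuck_pid"

# Reap the background job so the shell does not keep a stale entry for it.
wait "$stuck_pid" 2>/dev/null || true

# kill -0 sends no signal; it only reports whether the PID still exists.
if kill -0 "$stuck_pid" 2>/dev/null; then
    process_state="still running"
else
    process_state="terminated"
fi
echo "Process $stuck_pid is $process_state"
```

If a process ignores the default signal, kill -9 followed by the PID forces it to stop, at the cost of giving the process no chance to clean up.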

Which of these commands would be a routine task to run on the login node?

1. python physics_sim.py
2. make
3. create_directories.sh
4. molecular_dynamics_2
5. tar -xzf R-3.3.0.tar.gz

Solution

Building software, creating directories, and unpacking software are common and acceptable tasks for the login node: options #2 (make), #3 (create_directories.sh), and #5 (tar) are probably OK.

Note

Script names do not always reflect their contents, so before launching #3, please inspect it with less create_directories.sh and make sure it’s not a Trojan horse.

Running resource-intensive applications is frowned upon. Unless you are sure it will not affect other users, do not run jobs like #1 (python) or #4 (custom MD code).
If you’re unsure, ask your friendly sysadmin for advice by emailing hpc@nyu.edu.

If you experience performance issues with a login node, report them to the system staff by emailing hpc@nyu.edu so that they can investigate.

Test Before Scaling

Remember that you are generally charged for usage on shared systems. A simple mistake in a job script can end up costing a large amount of resource budget. Imagine a job script with a mistake that makes it sit doing nothing for 24 hours on 1000 cores or one where you have requested 2000 cores by mistake and only use 100 of them! This problem can be compounded when people write scripts that automate job submission (for example, when running the same calculation or analysis over lots of different parameters or files). When this happens it hurts both you (as you waste lots of charged resource) and other users (who are blocked from accessing the idle compute nodes). On very busy resources you may wait many days in a queue for your job to fail within 10 seconds of starting due to a trivial typo in the job script. This is extremely frustrating!
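
To put numbers on the two mistakes described above, the shell arithmetic below works them through. It assumes usage is counted in core-hours (cores multiplied by hours), which is a common accounting scheme but is an assumption here, not a statement of how any particular system charges.

```shell
# Case 1: a job that sits doing nothing for 24 hours on 1000 cores.
idle_cores=1000
idle_hours=24
wasted_core_hours=$(( idle_cores * idle_hours ))
echo "Wasted: $wasted_core_hours core-hours"

# Case 2: 2000 cores requested by mistake, only 100 actually used.
requested_cores=2000
used_cores=100
unused_percent=$(( (requested_cores - used_cores) * 100 / requested_cores ))
echo "Unused: $unused_percent% of the requested cores"
```

Under that assumption the first mistake wastes 24000 core-hours, and in the second 95% of the requested cores sit idle for the whole run.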

Test Job Submission Scripts That Use Large Amounts of Resources

We suggest that you test a subset of your data or analysis on an interactive node prior to running full batch jobs. This way you can request a smaller set of resources and less time, which should decrease your wait time in the queue, and you'll be able to iterate quickly on your code in interactive mode. When you've got everything working well on smaller problems, you can submit batch jobs for larger ones. Even after doing the above, it's often wise to start with a small batch job first to make sure that you don't have any errors in your batch script.
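
One way to build such a trial run is sketched below: carve a small subset out of the full input, run the analysis on it, and check that the output looks sensible before scaling up. The file names and the run_analysis function are hypothetical placeholders; substitute your own input and command.

```shell
# Work in a scratch directory so the trial leaves nothing behind elsewhere.
workdir=$(mktemp -d)

# Hypothetical full input: 1000 parameter values, one per line.
seq 1 1000 > "$workdir/params_full.txt"

# Keep only the first 10 lines for the trial run.
head -n 10 "$workdir/params_full.txt" > "$workdir/params_test.txt"

# Placeholder analysis (squares each value); substitute your real command.
run_analysis() {
    awk '{ print $1 * $1 }' "$1"
}

run_analysis "$workdir/params_test.txt" > "$workdir/results_test.txt"

# Sanity check: expect one result per input line before scaling up.
n_inputs=$(( $(wc -l < "$workdir/params_test.txt") ))
n_results=$(( $(wc -l < "$workdir/results_test.txt") ))
if [ "$n_inputs" -eq "$n_results" ]; then
    trial_status="ok"
else
    trial_status="mismatch"
fi
echo "Trial run: $trial_status ($n_results of $n_inputs results)"
```

The same check can then go at the end of the full batch script, so that a run with missing output is flagged instead of passing silently.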

Have a Backup Plan

Version control systems (such as Git) often have free, cloud-based offerings (e.g., GitHub and GitLab) that are generally used for storing source code. Even if you are not writing your own programs, these can be very useful for storing job scripts, analysis scripts and small input files. This can provide a layer of redundant protection for some of your files.
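
As a rough sketch of putting a job script under version control with Git, the commands below create a repository and record one commit. The directory, the script name run_job.sh, and the identity values are placeholders chosen for illustration, and the remote URL is left for you to fill in with your own GitHub or GitLab repository.

```shell
# Scratch directory standing in for your project folder.
project_dir=$(mktemp -d)
cd "$project_dir"

# Hypothetical job script to put under version control.
printf '#!/bin/bash\necho "running analysis"\n' > run_job.sh

# Create the repository and record the first version of the script.
git init -q
git add run_job.sh
# The -c options set an identity for this one command only (placeholder values).
git -c user.name="Example User" -c user.email="user@example.com" \
    commit -q -m "Add job script"

# Count the commits to confirm the script was recorded.
n_commits=$(( $(git log --oneline | wc -l) ))
echo "Commits recorded: $n_commits"

# To get the off-system copy, add your own remote and push, for example:
#   git remote add origin <your-repository-URL>
#   git push -u origin <your-branch-name>
```

Until the commit is pushed to a remote, the history lives only on the same file system as the script, so the push is the step that provides the redundancy.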

While the Greene HPC system does offer some backups, it is important to understand which storage options are backed up and what the limitations of those backups are. Please see HPC Storage for details.

It is also important to remember that your access to the shared HPC system will generally be time-limited, so you should ensure you have a plan for transferring your data off the system before your access finishes. The time required to transfer large amounts of data should not be underestimated and you should ensure you have planned for this early enough (ideally, before you even start using the system for your research).

In all these cases, please contact hpc@nyu.edu if you have questions about data transfer and storage for the volumes of data you will be using.

Your Data Is Your Responsibility

Make sure you understand what the backup policy is on the system you are using and what implications this has for your work if you lose your data on the system. Plan your own personal backups of critical data and how you will transfer data off the system throughout the project.

Transferring Data

The most important point about transferring data responsibly on Greene is to be sure to use Greene Data Transfer Nodes (gDTNs) or other options like Globus. Please see Data Transfers for details. By doing this you'll help to keep the login nodes responsive for all users.

Being efficient in how you transfer data on the gDTNs is also important. It will not only reduce the load on the gDTNs, but also save you time. Be sure to archive and compress your files with tar and gzip if possible. This removes the overhead of transferring many individual files and shrinks the size of the transfer. Please see Transferring Files with Remote Computers for details.
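
The archive-and-compress step can be sketched as follows. The results directory and its five small files are made up for the example; only the tar options are the point.

```shell
# Scratch area with a hypothetical results directory holding several files.
workdir=$(mktemp -d)
mkdir "$workdir/results"
for i in 1 2 3 4 5; do
    echo "result $i" > "$workdir/results/run_$i.txt"
done

# c = create an archive, z = compress it with gzip, f = archive file name.
# -C switches into $workdir first so the archive stores relative paths.
tar -czf "$workdir/results.tar.gz" -C "$workdir" results

# t = list the contents, a quick check before starting the transfer.
tar -tzf "$workdir/results.tar.gz"

# Count the archived data files to confirm nothing was left out.
n_archived=$(tar -tzf "$workdir/results.tar.gz" | grep -c 'run_[0-9]*\.txt$')
echo "Files in archive: $n_archived"
```

After the transfer, tar -xzf results.tar.gz unpacks the archive at the destination.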

Key Points

Be careful how you use the login node.
Your data on the system is your responsibility.
Always use Greene Data Transfer Nodes (gDTNs) for large data transfers.
Plan and test large data transfers.
It is often best to convert many files to a single archive file before transferring.
