
Torch Utility Applications

Torch has several utility applications that can give you information related to your account and jobs on the cluster:

myquota

You can check your current quota utilization with the myquota command. It reports the quota limits on mounted filesystems, your current usage, and the percentage of each quota that is in use.

In the following example, the user who runs myquota has exceeded the inode quota in their home directory. The inode quota limit on the /home filesystem is 30.0K inodes and the user has 33000 inodes, which is 110% of the limit.

$ myquota
Quota Information for NetID
Hostname: torch-login-2 at 2025-12-09 17:18:24

Filesystem  Environment  Backed up?  Allocation      Current Usage
Space       Variable     /Flushed?   Space / Files   Space(%) / Files(%)

/home       $HOME        YES/NO      0.05TB/0.03M    0.0TB(0.0%)/33000(110%)
/scratch    $SCRATCH     NO/YES      5.0TB/5.0M      0.0TB(0.0%)/1(0%)
/archive    $ARCHIVE     YES/NO      2.0TB/0.02M     0.0TB(0.0%)/1(0%)
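
To catch a quota problem before it blocks a job, the report can be filtered for filesystems that are close to or over a limit. The following is a minimal sketch, assuming the myquota output keeps the layout shown above (one line per filesystem, with the usage percentages in the last column). quota_over is a hypothetical helper name, not a tool provided on Torch.

```shell
# quota_over: hypothetical helper, not part of the cluster tooling.
# Reads myquota output on stdin and prints each filesystem whose space or
# inode (file) usage is at or above the given percentage (default: 90).
# Assumes the layout shown above: usage is the last column, formatted as
# SPACE(PCT%)/FILES(PCT%).
quota_over() {
  awk -v limit="${1:-90}" '
    $1 ~ /^\// {
      # Splitting on parentheses leaves the two percentages in parts[2]
      # (space) and parts[4] (files).
      split($NF, parts, /[()]/)
      space = parts[2]; files = parts[4]
      sub(/%/, "", space); sub(/%/, "", files)
      if (space + 0 >= limit || files + 0 >= limit)
        printf "%s space=%s%% files=%s%%\n", $1, space, files
    }'
}

# Usage on a login node:
#   myquota | quota_over 90
```

For the example report above, this would print only the /home line, because its inode usage (110%) is over the threshold.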

my_slurm_accounts

my_slurm_accounts returns a list of Slurm accounts associated with your HPC account:

[NetID@torch-login-b-1 ~]$ my_slurm_accounts
Account                          Descr
-------------------------------- ------------------------------------------------------------
torch_pr_XXX_XXXXX               project description

Use the appropriate entry in the Account column for the job you are submitting.
You will need to specify the account on the command line like:

srun --account=torch_pr_XXX_XXXXX --pty bash
sbatch -c4 -t2:00:00 --mem=4G --account=torch_pr_XXX_XXXXX my_script.sh

or in your sbatch file you'll need to add a line like:

#SBATCH --account=torch_pr_XXX_XXXXX

You'll need to modify the above to use your actual account.
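
If you submit from a script, the account name can be read from my_slurm_accounts instead of being typed by hand. This is a rough sketch, assuming the output keeps the layout shown above (a header line, a dashed separator, then one account per line). list_accounts is a hypothetical helper name.

```shell
# list_accounts: hypothetical helper. Reads my_slurm_accounts output on stdin
# and prints only the account names (first column), skipping the header and
# the dashed separator line. Assumes the layout shown above.
list_accounts() {
  awk '
    seen_separator && NF { print $1 }
    /^-+/ { seen_separator = 1 }
  '
}

# Usage on a login node (picks the first account listed):
#   account=$(my_slurm_accounts | list_accounts | head -n 1)
#   sbatch --account="$account" my_script.sh
```

If you belong to more than one account, choose the one that matches the project rather than taking the first entry.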

Please see Slurm: Command reference for details.

For more information about Slurm accounts, please see Slurm Accounts.

nvidia-smi

nvidia-smi (NVIDIA System Management Interface) is a command-line utility, based on the NVIDIA Management Library (NVML), used to monitor and manage NVIDIA GPU devices.

It will provide detailed information like:

  • GPU utilization
  • Memory usage
  • P-States: Performance states from P0 (max performance) to P12 (minimum idle)
  • Device details like power consumption and temperature

tip

You can get output refreshed every 5 seconds with:

nvidia-smi -l 5

Alternatively, you can use:

/share/apps/images/run-nvtop-3.2.0.bash nvtop

tip

You can get very detailed information about the GPU with:

[NetID@gl001 ~]$ nvidia-smi -q
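
For logging or scripting, nvidia-smi can also print selected fields as CSV through its --query-gpu and --format options. The sketch below assumes you only want the GPU index, utilization, and memory figures. gpu_summary is a hypothetical helper that turns those CSV lines into a short summary.

```shell
# gpu_summary: hypothetical helper. Reads CSV lines with the fields
# index, utilization.gpu, memory.used, memory.total (no header, no units)
# and prints one summary line per GPU. Memory values are in MiB.
gpu_summary() {
  awk -F', *' '{
    mem_pct = ($4 > 0) ? 100 * $3 / $4 : 0
    printf "GPU %s: util=%s%% mem=%.0f%% (%s/%s MiB)\n", $1, $2, mem_pct, $3, $4
  }'
}

# Usage on a GPU node:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
#              --format=csv,noheader,nounits | gpu_summary
```

Running nvidia-smi --help-query-gpu lists the other fields that can be requested in the same way.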

sdiag

sdiag is a scheduling diagnostic tool for Slurm. It reports information about slurmctld execution, including threads, agents, jobs, and scheduling algorithms.

[NetID@torch-login-b-2 ~]$ sdiag
*******************************************************
sdiag output at Tue May 05 17:03:58 2026 (1778015038)
Data since Mon May 04 20:00:00 2026 (1777939200)
*******************************************************
Server thread count: 1
RPC queue enabled: 0
Agent queue size: 0
Agent count: 0
Agent thread count: 0
DBD Agent queue size: 3928

Jobs submitted: 68600
Jobs started: 47338
Jobs completed: 46650
Jobs canceled: 3004
Jobs failed: 0

Job states ts: Tue May 05 17:03:55 2026 (1778015035)
Jobs pending: 25570
Jobs running: 3531

Main schedule statistics (microseconds):
Last cycle: 439197
Max cycle: 213062424
Total cycles: 23887
Mean cycle: 147838
Mean depth cycle: 3901
Cycles per minute: 18
Last queue length: 8588

Main scheduler exit:
End of job queue:22329
Hit default_queue_depth: 0
Hit sched_max_job_start: 0
Blocked on licenses: 0
Hit max_rpc_cnt: 0
Timeout (max_sched_time):761

Backfilling stats
Total backfilled jobs (since last slurm start): 2491
Total backfilled jobs (since last stats cycle start): 2144
Total backfilled heterogeneous job components: 0
Total cycles: 1132
etc....

Backfill exit
End of job queue: 0
Hit bf_max_job_start: 0
Hit bf_max_job_test:1117
System state changed:15
Hit table size limit (bf_node_space_size): 0
Timeout (bf_max_time): 0

Latency for 1000 calls to gettimeofday(): 37 microseconds

Remote Procedure Call statistics by message type
REQUEST_PARTITION_INFO ( 2009) count:265039 ave_time:862 total_time:228596549
REQUEST_JOB_INFO_SINGLE ( 2021) count:148070 ave_time:177411 total_time:26269388019
REQUEST_FED_INFO ( 2049) count:109863 ave_time:131 total_time:14500296
etc....

Remote Procedure Call statistics by user
root ( 0) count:641281 ave_time:136950 total_time:87823642922
NetID1 ( 3316908) count:35412 ave_time:206086 total_time:7297919561
NetID2 ( 4704548) count:32959 ave_time:152431 total_time:5023999046
NetID3 ( 3511186) count:32373 ave_time:102614 total_time:3321933704
etc....

Pending RPC statistics
No pending RPCs

warning

Appearing near the top of the list under "Remote Procedure Call statistics by user" can cause you to be throttled by Slurm for using too many resources.
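
To see where you stand, you can look up your own entry in that section. The following is a minimal sketch, assuming the section layout shown in the example output (a heading line, one line per user with a count: field, then a blank line). rpc_rank is a hypothetical helper name, and the rank is simply the position in the list as sdiag prints it.

```shell
# rpc_rank: hypothetical helper. Reads sdiag output on stdin and prints the
# position and RPC count of one user within the
# "Remote Procedure Call statistics by user" section.
# Argument: user name (defaults to $USER).
rpc_rank() {
  awk -v user="${1:-${USER:-}}" '
    /Remote Procedure Call statistics by user/ { in_users = 1; rank = 0; next }
    in_users && NF == 0 { in_users = 0 }   # a blank line ends the section
    in_users {
      rank++
      if ($1 == user)
        for (i = 2; i <= NF; i++)
          if ($i ~ /^count:/) { sub(/^count:/, "", $i); print rank, $i }
    }'
}

# Usage on a login node (run it once, not in a loop):
#   sdiag | rpc_rank "$USER"
```

The helper prints nothing if your user name does not appear in the section.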

tip

If you find yourself in this position, please reduce the number of calls you make to Slurm services such as squeue and sacct. Do not run these commands under watch. As an alternative, you can use the sbatch --mail-type option to be notified by email when jobs start and end.

If you're running a number of similar jobs, please look into using array jobs, as this will reduce the number of remote procedure calls you generate.
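
Both suggestions can be combined in a single batch script. This is only a sketch under stated assumptions: the account is the placeholder used earlier on this page, the array range and resource requests are arbitrary, and input_for_task with its input_N.txt file names is a hypothetical naming scheme.

```shell
#!/bin/bash
# Replace the account with your actual account.
#SBATCH --account=torch_pr_XXX_XXXXX
# Ten similar tasks submitted as one array job instead of ten separate jobs.
#SBATCH --array=0-9
#SBATCH --time=1:00:00
#SBATCH --mem=4G
# Email notifications replace repeated squeue or sacct calls.
#SBATCH --mail-type=BEGIN,END,FAIL

# input_for_task: hypothetical mapping from array index to input file name.
input_for_task() {
  printf 'input_%s.txt\n' "$1"
}

# Slurm sets SLURM_ARRAY_TASK_ID for each array task; default to 0 so the
# script can also be tried outside of a job.
task_id="${SLURM_ARRAY_TASK_ID:-0}"
echo "Task ${task_id} would process $(input_for_task "$task_id")"
```

Each array task runs the same script with a different SLURM_ARRAY_TASK_ID, so one sbatch call covers all ten inputs.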

seff

The seff script can be used to display status information about a user’s historical or running jobs.

Here's example output for a job:

[NetID@torch-login-b-1 ~]$ seff 6239104
Job ID: 6239104
Cluster: torch
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 5
CPU Utilized: 00:00:07
CPU Efficiency: 28.00% of 00:00:25 core-walltime
Job Wall-clock time: 00:00:05
Memory Utilized: 50.54 MB
Memory Efficiency: 4.94% of 1.00 GB

As you can see above, seff gives information about CPU and memory efficiency to help you more efficiently use our cluster resources.
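
The percentages in the example can be checked by hand: the job had 5 cores for 5 seconds of wall-clock time, which gives the 25 seconds of core-walltime shown, and 1.00 GB is taken here to be 1024 MB. The efficiency function below is a hypothetical helper for that arithmetic, not part of seff.

```shell
# efficiency: percentage of a requested amount that was actually used.
# Hypothetical helper for checking seff's figures by hand.
efficiency() {
  awk -v used="$1" -v requested="$2" \
    'BEGIN { printf "%.2f\n", 100 * used / requested }'
}

# CPU: 7 s of CPU time out of 5 cores x 5 s wall-clock = 25 core-seconds
efficiency 7 $((5 * 5))      # prints 28.00
# Memory: 50.54 MB used out of 1.00 GB (1024 MB) requested
efficiency 50.54 1024        # prints 4.94
```

Low figures like these suggest the job could have requested fewer cores and less memory.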

tip

Requesting the minimum resources needed for your job can help it spend less time in the queue.

show_slurm_qos

show_slurm_qos shows the maximum number of CPUs and GPUs and the maximum amount of memory allowed for each QOS, along with its maximum wall time.

[NetID@torch-login-b-0 ~]$ show_slurm_qos
Name                     MaxWall     MaxTRESPU                Preempt    PreemptExemptTime   PreemptMode
------------------------ ----------- ------------------------ ---------- ------------------- -----------
cpu_short                06:00:00    cpu=32,mem=120G                                         cluster
cpu168                   7-00:00:00  cpu=1000,mem=2000G                                      cluster
cpu48                    2-00:00:00  cpu=3000,mem=6000G                                      cluster
cpuprem                  2-00:00:00  cpu=30000,mem=120000G                                   cluster
gpu168                   7-00:00:00  gres/gpu=4                                              cluster
gpu48                    2-00:00:00  gres/gpu=16                                             cluster
interactive              06:00:00    cpu=16,mem=60G                                          cluster

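You can see that, in general, the entries with shorter wall times allow the use of greater resources. For example, gpu48 allows up to 16 GPUs for at most 2 days, while gpu168 allows up to 4 GPUs for at most 7 days.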
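
The MaxWall column uses the form [days-]hours:minutes:seconds. Converting it to hours makes rows easier to compare and shows that names such as cpu168 and gpu48 match their limits of 168 and 48 hours. The sketch below handles only the two forms that appear in the table above, and walltime_to_hours is a hypothetical helper name.

```shell
# walltime_to_hours: hypothetical helper. Converts a wall time written as
# D-HH:MM:SS or HH:MM:SS (the two forms in the table above) to hours.
walltime_to_hours() {
  printf '%s\n' "$1" | awk -F'[-:]' '{
    if (NF == 4) { d = $1; h = $2; m = $3; s = $4 }   # D-HH:MM:SS
    else         { d = 0;  h = $1; m = $2; s = $3 }   # HH:MM:SS
    printf "%g\n", d * 24 + h + m / 60 + s / 3600
  }'
}

walltime_to_hours 7-00:00:00   # prints 168
walltime_to_hours 2-00:00:00   # prints 48
walltime_to_hours 06:00:00     # prints 6
```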