Job Efficiency – FASRC DOCS

jobstats

Overview

The Princeton Jobstats platform provides profile and summary information for jobs on the FASRC clusters, giving far greater insight into job performance than the standard Slurm commands. We highly encourage using jobstats over the older seff command, especially since jobstats also reports GPU usage. Jobstats works for both running and completed jobs, but it does not work for jobs that run for less than a minute.

Command

To use jobstats, run:

jobstats JOBID

You will then get a summary of your job:

[jharvard@boslogin05 ~]$ jobstats 12345678

================================================================================
                              Slurm Job Statistics 
================================================================================
Job ID: 12345678
User/Account: jharvard/jharvard_lab
Job Name: gpu_example
State: COMPLETED
Nodes: 1
CPU Cores: 32
CPU Memory: 200GB (6.2GB per CPU-core)
GPUs: 1
QOS/Partition: normal/gpu_h200
Cluster: odyssey
Start Time: Tue Nov 25, 2025 at 10:52 AM
Run Time: 02:59:53
Time Limit: 1-00:00:00

                              Overall Utilization 
================================================================================
CPU utilization  [|                                               3%]
CPU memory usage [                                                0%]
GPU utilization  [||||||||||||||||||||||||||||||||||||||||||||||100%]
GPU memory usage [|||||||||||||||                                31%]

                             Detailed Utilization 
================================================================================
CPU utilization per node (CPU time used/run time)
    holygpu8a12103: 03:00:12/3-23:56:16 (efficiency=3.1%)

CPU memory usage per node - used/allocated
    holygpu8a12103: 431.3MB/200GB (13.5MB/6.2GB per core of 32)

GPU utilization per node
    holygpu8a12103 (GPU 1): 100%

GPU memory usage per node - maximum used/total
    holygpu8a12103 (GPU 1): 44.0GB/140.4GB (31.3%)

                                  Notes 
================================================================================
* The max Memory utilization of this job is 0%. This value is low compared
  to the target range of 80% and above. Please investigate the reason for
  the low efficiency. For more info:
    https://docs.rc.fas.harvard.edu/kb/job-efficiency-and-optimization-best-practices/#Memory

* Have a nice day!

The summary gives you an overview of your job's performance, including a breakdown per node. The command also flags underperformance in red and points you to relevant documentation you can use to improve your job efficiency. For example, this user asked for 200GB of memory but used less than 1GB, so in future runs they should request about 1GB instead. Other items not flagged but worth adjusting would be dropping the request to a single core and reducing the requested time to 4 hours instead of a day. These changes would allow the job to run more efficiently, lowering the impact on your fairshare and freeing resources for other users.
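As a sketch, the adjusted batch request for a rerun might look like the following (the job name and script body are placeholders; the partition is taken from the jobstats output above):

#!/bin/bash
#SBATCH --job-name=gpu_example
#SBATCH --partition=gpu_h200      # partition shown in the jobstats output
#SBATCH --gres=gpu:1              # the GPU was fully utilized, so keep one GPU
#SBATCH --cpus-per-task=1         # the job effectively used a single core
#SBATCH --mem=1G                  # comfortably above the ~431MB actually used
#SBATCH --time=04:00:00           # comfortably above the ~3 hour run time

# your_gpu_application goes here (placeholder)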

Note that for CPU utilization, the "run time" in CPU time used/run time is the elapsed wall-clock time multiplied by the number of cores. In an ideal run, CPU time used = NCPUS * Elapsed (wall-clock time). In this case the job ran for almost 3 hours on 32 cores, which makes roughly 4 days of CPU time available, but it only used about 3 hours of CPU time across all of its cores; effectively it used only one core. In future runs you would want to either request a single core or figure out why the code is not parallelizing.
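As a quick sanity check of that arithmetic (numbers rounded from the jobstats output above):

# 32 cores * ~3 hours of wall-clock time = ~96 core-hours available (~4 days)
# only ~3 core-hours of CPU time were actually used
echo "scale=3; 3 / (32 * 3)" | bc    # prints .031, i.e. ~3% efficiency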

Jobstats Dashboard

To see a time-series profile for a job, use the Single Job Stats Dashboard (note: you need to be on the FASRC VPN to access it). Fill in your JobID and select the cluster you are using (note: for the Cannon cluster, select “odyssey”, which is the old name for the cluster). Then select the time range when your job ran to see the profile. You can even narrow the view to specific nodes.

Jobstats Emails

Slurm will put the results of jobstats into your completion emails. To subscribe, add --mail-type=END (or a set of options that includes END) to your submission script. By default, email is sent to the address you have listed with us.
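For example, the relevant submission-script lines might look like this (the --mail-user line is optional and the address is a placeholder; omit it to use the address FASRC has on file):

#SBATCH --mail-type=END,FAIL              # send mail when the job ends; FAIL is optional but often useful
#SBATCH --mail-user=jharvard@example.edu  # placeholder address; optional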

What should my job utilization be?

You can find target ranges for CPU, memory, and GPU usage in the jobstats output. If your job underutilized resources, the “Notes” section will show you the target range for each resource (CPU, memory, and GPU). See this CPU-only job:

[jharvard@holylogin07 ~]$ jobstats 49081039

================================================================================
Slurm Job Statistics
================================================================================
Job ID: 49081039
User/Account: jharvard/jharvard_lab
Job Name: .fasrcood/sys/dashboard/sys/RemoteDesktop
State: TIMEOUT
Nodes: 1
CPU Cores: 4
CPU Memory: 24GB (6GB per CPU-core)
QOS/Partition: normal/test
Cluster: odyssey
Start Time: Fri Dec 5, 2025 at 8:45 AM
Run Time: 01:00:12
Time Limit: 01:00:00

                            Overall Utilization
================================================================================
CPU utilization [                                                    0%]
CPU memory usage [|                                                  2%]

                           Detailed Utilization
================================================================================
CPU utilization per node (CPU time used/run time)
    holy8a24102: 00:00:39/04:00:48 (efficiency=0.3%)

CPU memory usage per node - used/allocated
    holy8a24102: 591.5MB/24GB (147.9MB/6GB per core of 4)

                                   Notes
================================================================================
* The overall CPU utilization of this job is 0.3%. This value is low
  compared to the target range of 90% and above. Please investigate the
  reason for the low efficiency. For instance, have you conducted a scaling
  analysis? For more info:
    https://docs.rc.fas.harvard.edu/kb/job-efficiency-and-optimization-best-practices/#Cores

* The max Memory utilization of this job is 2%. This value is low compared
  to the target range of 80% and above. Please investigate the reason for
  the low efficiency. For more info:
    https://docs.rc.fas.harvard.edu/kb/job-efficiency-and-optimization-best-practices/#Memory

* This job failed because it exceeded the time limit. If there are no other
  problems then the solution is to increase the value of the --time Slurm
  directive and resubmit the job. For more info:
    https://docs.rc.fas.harvard.edu/kb/job-efficiency-and-optimization-best-practices/#Time

* Have a nice day!

The “Notes” section explains what was underutilized and the target range.

In the “Detailed Utilization” section, you can see values per core (and per node for a multi-node job). In this case, jharvard could have requested fewer cores and less memory. The job used ~600MB out of the 24GB requested, so jharvard should instead have requested about 750MB of memory (80% of 750MB = 600MB, which hits the target range). In terms of cores, jharvard should have requested 1 or 2 cores (given that this was an interactive job on Open OnDemand, 2 cores are recommended).
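For an interactive job started from the command line rather than through the Open OnDemand form, the equivalent right-sized request might look like the sketch below (the partition is taken from the jobstats output above; the 2-hour limit is an assumption, chosen to avoid the timeout seen here):

salloc -p test -c 2 --mem=750M -t 02:00:00    # 2 cores, 750MB of memory, more time than the 1 hour that timed out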

 

Slurm Stats

Overview

When you log on to the FASRC clusters you will be greeted by Slurm Stats. On a nightly basis we pull data from the scheduler for the day and display a summary in an easy-to-read table when you log in to the cluster. This should help you understand how your jobs are performing and help you track your usage on a daily basis. Below is a description of the statistics we provide, along with recommendations on where to go to get more information or to improve your performance.

The Statistics

+---------------- Slurm Stats for Aug 20 -----------------------+
|                  End of Day Fairshare                         |
|                    test_lab: 0.003943                         |
+-------------------- Jobs By State ----------------------------+
|       Total | Completed | Canceled | Failed | Out of |  Timed |
|             |           |          |        | Memory |    Out |
| CPU:     25 |         4 |        1 |     20 |      0 |      0 |
| GPU:     98 |        96 |        1 |      1 |      0 |      0 |
+---------------------- Job Stats ------------------------------+
|        | Average | Average   | Average    | Total Usage /     |
|        | Used    | Allocated | Efficiency | Ave. Wait Time    |
| Cores  |     4.3 |       5.5 |      69.4% |    133.00 CPU Hrs |
| Memory |   22.2G |     27.2G |      68.3% |                   |
| GPUS   |     0.5 |       1.0 |      51.4% |    100.20 GPU Hrs |
| Time   |  14.57h |    45.38h |      45.9% |             0.00h |
+---------------------------------------------------------------+

Above is what you will see when you log in to the cluster if you have run jobs in the last day. This data is pulled from the scheduler and covers jobs that finished in the 24-hour day listed. If you would like similar summary information for a longer period of time, use the seff-account command. For instance, if you wanted the data for the last week you would run:

seff-account -u USERNAME -S 2024-08-13 -E 2024-08-20

For more detailed information on specific jobs you can use the jobstats and sacct commands. If you want summary plots of various statistics, please see our XDMoD instance (requires RC VPN). For fairshare usage plots, see our Cannon and FASSE Fairshare Dashboards (requires RC VPN). Below we describe the various fields and what they mean.

Fairshare

The first thing listed is the fairshare for the lab accounts that you belong to. This is as of the end of the day indicated. Lower fairshare means lower priority for your jobs on the cluster.  For more on fairshare and how to improve your score see our comprehensive fairshare document.

Job State

If you have jobs that finished on the day indicated, then a breakdown of their end states is presented. Jobs are sorted first by whether or not they asked for a GPU. Next the total number of jobs in that category is given, followed by a breakdown by state. Completed jobs are those that finished cleanly with no errors that Slurm could detect (there may still be errors that your code has generated internally). Canceled jobs are those that were terminated via the scancel command, either by yourself or by an administrator. Failed jobs are those that the scheduler detected as having a faulty exit. Out of Memory jobs are those that hit the memory limit requested in the job script. Timed Out jobs are those that hit the time limit requested in the job script.

Used, Allocated, and Efficiency

For all the jobs that were not Canceled, we calculate statistics averaged over all the jobs run, broken down by Cores, Memory, GPUs, and Time. Average Used is the average amount of each resource actually used by a job. Average Allocated is the average amount of each resource allocated by the job script. Average Efficiency is the ratio of the amount of resource Used to the amount Allocated, computed per job and then averaged over all the jobs. In an ideal world your jobs would use exactly, or as close as possible to, as much of each resource as they request, and hence have an Average Efficiency of 100%. In practice, some jobs use all the resources they request and others do not. Having allocated resources that go unused means that your code is not utilizing all the space you have set aside for it. This waste ends up driving down your fairshare, as cores, memory, and GPUs you do not use are still charged against your fairshare.
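As a small illustration with made-up numbers for three jobs, note that the average of the per-job efficiencies is generally not the same as the ratio of the Used and Allocated averages:

# Three hypothetical jobs: (cores used, cores allocated) = (2,4), (8,8), (3,6)
awk 'BEGIN {
    split("2 8 3", used); split("4 8 6", alloc)
    for (i = 1; i <= 3; i++) { eff += used[i]/alloc[i]; su += used[i]; sa += alloc[i] }
    printf "average of per-job efficiencies: %.1f%%\n", 100*eff/3   # 66.7%, what the Efficiency column reports
    printf "ratio of the averages:           %.1f%%\n", 100*su/sa   # 72.2%
}'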

To learn which jobs are the culprits, we recommend using tools like seff-account, jobstats, and sacct. These tools can give you an overview of your jobs as well as more detailed information about specific jobs. We also have an in-depth guide to Job Efficiency and Optimization that goes into more depth on techniques for improving your efficiency.
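As a sketch, an sacct query that lists requested versus used resources for your recent jobs might look like this (substitute your username and dates; -X prints one line per job rather than per job step):

sacct -X -u USERNAME -S 2024-08-20 -E 2024-08-21 \
      --format=JobID,JobName%20,State,Elapsed,TotalCPU,AllocCPUS,ReqMem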

Total Usage

Total Usage is the total number of CPU and GPU hours, respectively, allocated by jobs that finished on the day indicated. For example, a job that ran on 4 cores for 10 hours contributes 40 CPU hours. Note that this is the total usage for a job, so a job that ran for multiple days will have all of its usage show up at once in this number, not just its usage for that day. This usage is also not weighted by the type of CPU or GPU requested, which can impact how much fairshare the usage costs. For more on how we handle usage and fairshare, see our general fairshare document.

Wait Time

The number in the lower right-hand corner of the Job Stats table, in the Time row, is the average wait time per job. This is a useful number because your total Time to Science (TtS) is your wait time (aka pending time) plus your run time. Wait time varies depending on the partition used, the size of the job, and the relative priority of your jobs versus other jobs in the queue. To lower wait time, investigate using a different partition, submitting to multiple partitions, resizing your job, or improving your fairshare. A deeper discussion can be found on the Job Efficiency and Optimization page.
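For example, Slurm accepts a comma-separated list of partitions and runs the job in whichever partition can start it first (the partition names below are placeholders; use partitions you have access to):

#SBATCH --partition=shared,test   # placeholder names; the job starts in whichever partition is available first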
