Job Efficiency – FASRC DOCS
https://docs.rc.fas.harvard.edu
Tue, 19 May 2026 19:39:02 +0000

Job Defense Shield
https://docs.rc.fas.harvard.edu/kb/job-defense-shield/
Thu, 02 Apr 2026 15:09:36 +0000

Overview

Job Defense Shield (JDS) is an application built by Princeton University Research Computing for job efficiency monitoring. Leveraging the statistics collected by jobstats, JDS allows administrators to trigger automated emails and other actions (including job cancellation) based on various thresholds. JDS is being used on FASRC clusters to send out weekly emails to users regarding job inefficiencies with the goal of aiding users in improving their usage and job throughput. The cluster is a shared resource and so it behooves all users to use it in the most efficient manner as it helps accelerate research for all cluster users.

JDS Emails

On Tuesday mornings JDS evaluates the previous week’s worth of cluster usage (Tuesday of the previous week to Monday of the current week). It then sends emails to users who meet certain thresholds of job inefficiency. The emails contain a description of the problem, a list of jobs, and recommendations for how to improve efficiency. To stop receiving the emails, users simply need to improve their efficiency beyond the recommended thresholds. The thresholds and recommendations for each alert email are as follows:

Jobs with Zero CPU Utilization

This alert indicates that your job allocated cores but used none of the cores allocated on one or more nodes. This can indicate that:

  • Your job did not start properly: Check your runscript and test to make sure that your script is working as intended.
  • You were idle: This means you started an interactive session but did not do anything in it. Only start interactive sessions if you intend to do work. Close sessions that you are done with.
  • Your job is not properly parallelizing: Check your code’s documentation regarding if and how your code parallelizes. If the documentation does not specify, talk to your colleagues who run the same code or reach out to the primary developer. Note that Slurm does not automatically parallelize code, even if you ask for more than one core.

Serial Jobs Allocating Multiple Cores

This alert indicates that your job is only using a single core but is asking for multiple. This indicates that your job did not parallelize properly or is not able to be parallelized. Check your code’s documentation regarding if and how your code parallelizes. If the documentation does not specify, talk to your colleagues who run the same code or reach out to the primary developer. Note that Slurm does not automatically parallelize code, even if you ask for more than one core.

Jobs with Low CPU Efficiency

This alert indicates that your job was below 80% CPU utilization for the run. This can indicate that:

  • Your job is not well optimized: Check your code’s documentation regarding if there are methods for improving optimization. If the documentation does not specify, talk to your colleagues who run the same code or reach out to the primary developer. We also have a general guide regarding code optimization that you can leverage to diagnose problems.
  • Your job is not properly parallelizing: Check your code’s documentation regarding if and how your code parallelizes. If the documentation does not specify, talk to your colleagues who run the same code or reach out to the primary developer. Note that Slurm does not automatically parallelize code, even if you ask for more than one core.
  • Your job is not scaling: Run a scaling test to find out how many cores your code can optimally run. See if there are newer versions of the code or compilers that have better scaling or work to better optimize your code for higher core counts.
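A scaling test boils down to timing the same problem at several core counts and comparing speedup against the number of cores. The sketch below is only an illustration: the timings are made up, and the 80% cutoff is an assumption borrowed from the CPU utilization threshold of this alert, not an official rule for choosing a core count.

```python
# Hypothetical wall-clock timings (seconds) for the same problem run on
# different core counts; replace these with your own measurements.
timings = {1: 1000.0, 2: 520.0, 4: 280.0, 8: 170.0, 16: 140.0}


def scaling_table(timings):
    """Return (cores, speedup, parallel efficiency) rows.

    Speedup is measured relative to the smallest core count tested.
    """
    base_cores = min(timings)
    base_time = timings[base_cores]
    rows = []
    for cores in sorted(timings):
        speedup = base_time / timings[cores]
        efficiency = speedup * base_cores / cores
        rows.append((cores, speedup, efficiency))
    return rows


def largest_efficient_core_count(timings, min_efficiency=0.80):
    """Largest core count whose parallel efficiency is at least min_efficiency.

    The 0.80 default is an assumption made for this example.
    """
    good = [cores for cores, _, eff in scaling_table(timings) if eff >= min_efficiency]
    return max(good)


for cores, speedup, eff in scaling_table(timings):
    print(f"{cores:>3} cores: speedup {speedup:5.2f}x, efficiency {eff:6.1%}")
print("Suggested core count:", largest_efficient_core_count(timings))
```

With these made-up timings, efficiency drops below 80% beyond 4 cores, so requesting more than 4 cores would mostly add idle cores.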

Jobs Requesting Too Much CPU Memory

This alert indicates that your job was below 80% peak memory utilization for the run. To fix this, tighten your memory request so that it is closer to what the job actually uses.

Requesting Too Much Time for Jobs

This alert indicates that your job used less than 50% of the time that it requested. To fix this, tighten your time request so that it is closer to the job’s actual run time.

Jobs with Zero GPU Utilization

This alert indicates that your job allocated GPUs but used none of them. This can indicate that:

  • Your job did not start properly: Check your runscript and test to make sure that your script is working as intended.
  • Your code was built for a specific GPU: Make sure your code is GPU type agnostic or leverage Slurm flags to specify the GPU type you need.
  • You were idle: This means you started an interactive session but did not do anything in it. Only start interactive sessions if you intend to do work. Close sessions that you are done with.

Jobs with Low GPU Efficiency

This alert indicates that your job was below 25% GPU utilization for the run. This can indicate that:

  • Your job is not well optimized: Check your code’s documentation regarding if there are methods for improving optimization. If the documentation does not specify, talk to your colleagues who run the same code or reach out to the primary developer. We also have a general guide regarding code optimization that you can leverage to diagnose problems.
  • GPU is too powerful: There may in fact be nothing you can do to further optimize your code and the GPU you are using is overkill for your workflow. In these cases it will be beneficial to switch to a less powerful GPU that is more closely aligned with your code performance needs.
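The percentage thresholds quoted in the alerts above can be collected into one small check. This is a rough sketch for self-assessment only, not how JDS is implemented: the function name and arguments are hypothetical, utilization is passed in as a fraction of what was allocated, and the per-node detail of the zero CPU alert is ignored. The "Serial Jobs Allocating Multiple Cores" alert is left out because no numeric threshold is given for it.

```python
def jds_style_alerts(cpu_util, mem_util, time_used_frac, gpus=0, gpu_util=None):
    """Return the alert titles a job would match.

    All utilization arguments are fractions between 0.0 and 1.0.
    Thresholds are the ones quoted in the alert descriptions:
    CPU 80%, memory 80%, time 50%, GPU 25%.
    """
    alerts = []
    if cpu_util == 0:
        alerts.append("Jobs with Zero CPU Utilization")
    elif cpu_util < 0.80:
        alerts.append("Jobs with Low CPU Efficiency")
    if mem_util < 0.80:
        alerts.append("Jobs Requesting Too Much CPU Memory")
    if time_used_frac < 0.50:
        alerts.append("Requesting Too Much Time for Jobs")
    if gpus > 0 and gpu_util is not None:
        if gpu_util == 0:
            alerts.append("Jobs with Zero GPU Utilization")
        elif gpu_util < 0.25:
            alerts.append("Jobs with Low GPU Efficiency")
    return alerts


# Hypothetical job: 3% CPU, 0.2% memory, 12.5% of requested time, GPU fully busy.
for title in jds_style_alerts(0.03, 0.002, 0.125, gpus=1, gpu_util=1.0):
    print(title)
```

For the hypothetical job in the example, the CPU, memory, and time alerts all match while the GPU alerts do not.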

Repeat Offense

Users should endeavour to rectify their workflows. Users who do not will be contacted by FASRC staff. Further failure to improve may lead to fairshare reduction, job cancellation, and banning of your account.

jobstats
https://docs.rc.fas.harvard.edu/kb/jobstats/
Tue, 25 Nov 2025 20:41:28 +0000

Overview

The Princeton Jobstats platform provides profile and summary information for jobs on FASRC Clusters. This allows for greater insight into job performance than the standard Slurm commands. It is highly encouraged to use jobstats over the older seff command, especially as jobstats gives information on GPU usage. Jobstats works for both running and completed jobs, but does not work for jobs that last for under a minute.

Command

To use jobstats run:

jobstats JOBID

You will then get a summary of your job:

[jharvard@boslogin05 ~]# jobstats 12345678

================================================================================
                              Slurm Job Statistics 
================================================================================
Job ID: 12345678
User/Account: jharvard/jharvard_lab
Job Name: gpu_example
State: COMPLETED
Nodes: 1
CPU Cores: 32
CPU Memory: 200GB (6.2GB per CPU-core)
GPUs: 1
QOS/Partition: normal/gpu_h200
Cluster: odyssey
Start Time: Tue Nov 25, 2025 at 10:52 AM
Run Time: 02:59:53
Time Limit: 1-00:00:00

                              Overall Utilization 
================================================================================
CPU utilization  [|                                               3%]
CPU memory usage [                                                0%]
GPU utilization  [||||||||||||||||||||||||||||||||||||||||||||||100%]
GPU memory usage [|||||||||||||||                                31%]

                             Detailed Utilization 
================================================================================
CPU utilization per node (CPU time used/run time)
    holygpu8a12103: 03:00:12/3-23:56:16 (efficiency=3.1%)

CPU memory usage per node - used/allocated
    holygpu8a12103: 431.3MB/200GB (13.5MB/6.2GB per core of 32)

GPU utilization per node
    holygpu8a12103 (GPU 1): 100%

GPU memory usage per node - maximum used/total
    holygpu8a12103 (GPU 1): 44.0GB/140.4GB (31.3%)

                                  Notes 
================================================================================
* The max Memory utilization of this job is 0%. This value is low compared
  to the target range of 80% and above. Please investigate the reason for
  the low efficiency. For more info:
    https://docs.rc.fas.harvard.edu/kb/job-efficiency-and-optimization-best-practices/#Memory

* Have a nice day!

The summary gives you an overview of your job’s performance, including a breakdown per node. In addition, the command flags underperformance in red and points you to relevant documentation that you can use to improve your job efficiency. In the example above, the user asked for 200GB of memory but used less than 1GB, so in future runs they should ask for 1GB of memory instead. Other items that are not flagged but are worth adjusting are the core count, which could drop to a single core, and the requested time, which could be reduced from a day to 4 hours. These changes would allow the job to run more efficiently, lowering its impact on fairshare and freeing resources for other users.

Note that for CPU utilization, the run time in the “CPU time used/run time” figure is effectively the elapsed time multiplied by the number of cores. In an ideal run the CPU time used would equal CPURuntime = NCPUS * Elapsed (wall-clock time). In this case the job ran for almost 3 hours on 32 cores, which gives almost 4 days of CPURuntime, but it actually used only about 3 hours of CPU time across all its cores, which effectively means it is using only 1 core. Hence in future runs you would want to ask for only one core, or figure out why the code is not parallelizing.
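The arithmetic behind the efficiency figure can be reproduced from the values printed in the example output. The sketch below is illustrative only: the helper name is hypothetical, and it assumes durations appear as HH:MM:SS or D-HH:MM:SS, as they do in the output above.

```python
def slurm_time_to_seconds(text):
    """Convert a duration written as [D-]HH:MM:SS to seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


# Values copied from the example jobstats output.
cpu_time_used = slurm_time_to_seconds("03:00:12")         # CPU time actually used
cpu_time_available = slurm_time_to_seconds("3-23:56:16")  # NCPUS * Elapsed
elapsed = slurm_time_to_seconds("02:59:53")               # wall-clock run time

efficiency = cpu_time_used / cpu_time_available
effective_cores = cpu_time_used / elapsed

print(f"CPU efficiency: {efficiency:.1%}")
print(f"Effective cores in use: {effective_cores:.2f}")
```

The available CPU time is exactly 32 times the elapsed time, the efficiency comes out at the 3.1% that jobstats reports, and the effective core count is about 1.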

Jobstats Dashboard

To see a profile for a job you can use the Single Job Stats Dashboard (note: you need to be on the FASRC VPN to access it). Fill in your JobID and select which cluster you are using (note: for the Cannon cluster, select “odyssey”, which is the old name for the cluster). Then select the time range when your job ran to see the profile. You can also focus on specific nodes if you want to see their profile.

Jobstats Emails

Slurm will put the results of jobstats into your completion emails. To subscribe add --mail-type=END, or options that include END, to your submission script. Email by default is sent to the email you have listed with us.

What should my job utilization be?

You can find target usage for memory, CPU, and GPU in the jobstats output. If your job underutilized resources, the “Notes” section will show you target ranges for each resource (CPU, memory, and GPU). See this CPU job:

[jharvard@holylogin07 ~]$ jobstats 49081039

================================================================================
Slurm Job Statistics
================================================================================
Job ID: 49081039
User/Account: jharvard/jharvard_lab
Job Name: .fasrcood/sys/dashboard/sys/RemoteDesktop
State: TIMEOUT
Nodes: 1
CPU Cores: 4
CPU Memory: 24GB (6GB per CPU-core)
QOS/Partition: normal/test
Cluster: odyssey
Start Time: Fri Dec 5, 2025 at 8:45 AM
Run Time: 01:00:12
Time Limit: 01:00:00

                            Overall Utilization
================================================================================
CPU utilization [                                                    0%]
CPU memory usage [|                                                  2%]

                           Detailed Utilization
================================================================================
CPU utilization per node (CPU time used/run time)
holy8a24102: 00:00:39/04:00:48 (efficiency=0.3%)

CPU memory usage per node - used/allocated
holy8a24102: 591.5MB/24GB (147.9MB/6GB per core of 4)

                                   Notes
================================================================================
* The overall CPU utilization of this job is 0.3%. This value is low
compared to the target range of 90% and above. Please investigate the
reason for the low efficiency. For instance, have you conducted a scaling
analysis? For more info:
https://docs.rc.fas.harvard.edu/kb/job-efficiency-and-optimization-best-practices/#Cores

* The max Memory utilization of this job is 2%. This value is low compared
to the target range of 80% and above. Please investigate the reason for
the low efficiency. For more info:
https://docs.rc.fas.harvard.edu/kb/job-efficiency-and-optimization-best-practices/#Memory

* This job failed because it exceeded the time limit. If there are no other
problems then the solution is to increase the value of the --time Slurm
directive and resubmit the job. For more info:
https://docs.rc.fas.harvard.edu/kb/job-efficiency-and-optimization-best-practices/#Time

* Have a nice day!

The “Notes” section explains what was underutilized and the target range.

In the “Detailed Utilization” section, you can see values per core (and per node for a multi-node job). In this case, jharvard could have requested fewer cores and less memory. The job used ~600MB out of the 24GB requested. Instead, jharvard should have requested 750MB of memory (80% of 750=600MB). In terms of cores, jharvard should have requested 1 or 2 cores (given that this was an interactive job on Open OnDemand, 2 cores are recommended).
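The memory recommendation above is the observed peak divided by the 80% target. A minimal sketch of that calculation follows; the function name is hypothetical, and rounding up to a whole megabyte is a choice made here to leave a little headroom.

```python
import math


def suggested_memory_mb(peak_used_mb, target_percent=80):
    """Memory request (in MB) at which the peak usage sits at about target_percent.

    The result is rounded up to a whole MB, so utilization may land
    just under the target.
    """
    return math.ceil(peak_used_mb * 100 / target_percent)


print(suggested_memory_mb(600))    # rounded figure used in the text
print(suggested_memory_mb(591.5))  # exact peak from the jobstats output
```

Using the rounded 600MB figure gives the 750MB request mentioned above, and the exact 591.5MB peak gives 740MB.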

 

Slurm Stats
https://docs.rc.fas.harvard.edu/kb/slurm-stats/
Tue, 27 Aug 2024 15:39:58 +0000

Overview

When you log on to the FASRC clusters you will be greeted by Slurm Stats. On a nightly basis, we pull data from the scheduler for the day and display a summary table when you log in to the cluster. This should help you understand how your jobs are performing and track your usage on a daily basis. Below is a description of the statistics we provide, along with recommendations of where to go to get more information or to improve your jobs’ performance.

The Statistics

+---------------- Slurm Stats for May 18 -----------------------+
|                  End of Day Fairshare                         |
|                     test_lab: 0.063426                        |
|                     jharvard_lab: 0.106328                    |
+-------------------- Jobs By State ----------------------------+
|       Total | Completed | Canceled | Failed | Out of |  Timed |
|             |           |          |        | Memory |    Out |
| CPU:      0 |         0 |        0 |      0 |      0 |      0 |
| GPU:     38 |        38 |        0 |      0 |      0 |      0 |
+---------------------- Job Stats ------------------------------+
|        | Average | Average   | Average    | Total Usage /     |
|        | Used    | Allocated | Efficiency | Ave. Wait Time    |
| Cores  |     1.0 |       8.0 |      12.5% |    481.45 CPU Hrs |
| Memory |    8.7G |     32.0G |      27.3% |                   |
| GPUS   |     1.0 |       1.0 |      98.1% |     60.18 GPU Hrs |
| Time   |   1.58h |     6.00h |      26.4% |            10.33h |
+---------------------------------------------------------------+
| https://docs.rc.fas.harvard.edu/kb/slurm-stats                |
+---------------------------------------------------------------+

Above is the table that you will see when you log in to the cluster if you have run jobs in the last day. This data is pulled from the scheduler and covers jobs that finished in the 24-hour day listed. If you would like similar summary information for a longer period of time, use the seff-account command. For instance, if you wanted the data for the last week you would run:

seff-account -u USERNAME -S 2026-05-13 -E 2026-05-18

For more detailed information, FASRC offers a few tools and dashboards:

Below we will describe the various fields and what they mean.

Fairshare

The first thing listed is the fairshare for the lab accounts that you belong to. The fairshare value is as of the end of the day indicated in the table title. Lower fairshare means lower priority for your jobs on the cluster.  For more on fairshare and how to improve your score see our comprehensive fairshare document.

Jobs by State

If you have jobs that finished in the day indicated, then a breakdown of their end states is presented. Jobs are sorted first by whether or not they asked for a GPU. Next, the total number of jobs in that category is given, followed by a breakdown by state.

  • Completed jobs are those that finished cleanly with no errors that Slurm could detect (there may still be errors that your code has generated internally).
  • Canceled jobs are those jobs which were terminated via the scancel command either by yourself or the administrator.
  • Failed jobs are those jobs that the scheduler has detected as having a faulty exit.
  • Out of Memory jobs are those that hit the requested memory limit set in the job script.
  • Timed Out jobs are those that hit the requested time limit set in the job script.

Used, Allocated, and Efficiency

For all the jobs that were not Canceled, we calculate statistics averaged over all the jobs run. These are broken down by Cores, Memory, GPUs, and Time.

  • Average Used is the average amount actually used by the job.
  • Average Allocated is the average amount of resources allocated by the job script for the job.
  • Average Efficiency is the ratio of the amount of resource Used by the job to the amount of resources Allocated per job, averaged over all the jobs.
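Because Average Efficiency is described as a per-job ratio that is then averaged, it will not in general equal Average Used divided by Average Allocated. The sketch below illustrates the difference with made-up job data; it is not the code that produces the login table.

```python
# Hypothetical (used, allocated) core counts for three jobs.
jobs = [(1.0, 8.0), (4.0, 4.0), (2.0, 16.0)]


def average_efficiency(jobs):
    """Mean of the per-job used/allocated ratios."""
    return sum(used / allocated for used, allocated in jobs) / len(jobs)


def ratio_of_averages(jobs):
    """Average Used divided by Average Allocated, shown for comparison."""
    average_used = sum(used for used, _ in jobs) / len(jobs)
    average_allocated = sum(allocated for _, allocated in jobs) / len(jobs)
    return average_used / average_allocated


print(f"Average of per-job efficiencies: {average_efficiency(jobs):.1%}")
print(f"Average Used / Average Allocated: {ratio_of_averages(jobs):.1%}")
```

In this example the one fully efficient job pulls the per-job average well above the ratio of the two averages, so the two numbers should not be expected to match.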

In an ideal world your jobs would use exactly as many resources as they request, or as close to that as possible, and hence have an Average Efficiency of 100%. In practice, some jobs use all the resources they request and others do not. Having unused resources that you have allocated means that your code is not utilizing all the space you have set aside for it. This wasted space ends up driving down your fairshare, as cores, memory, and GPUs you do not use are still charged against your fairshare.

To learn more about which jobs are the culprits, we recommend using tools like seff-account, jobstats, and sacct. These tools can give you an overview of your jobs and more detailed information about specific jobs. We also have an in-depth guide to Job Efficiency and Optimization which covers techniques for improving your efficiency.

Total Usage

Total usage is the total number of hours allocated for CPUs and GPUs respectively. This is a measure of your total usage of the jobs that finished on the day indicated at the top of the table. Note that this is the total usage for a job, so a job that ran for multiple days will have all its usage show up at once in this number and not just its usage for that day only. This usage is also not weighted by the type of CPU or GPU requested which can impact how much fairshare the usage would cost. For more on how we handle usage and fairshare, see our general fairshare document.

Wait Time

The number in the lower right-hand corner of the Job Stats table, in the Time row, is the average wait time per job. This is a useful number, as your total Time to Science (TtS) is your wait time (aka pending time) plus your run time. In the table above, the average wait time is 10.33h and the average run time is 1.58h, giving an average TtS of about 11.91h. Wait time varies depending on the partition used, the size of the job, and the relative priority of your jobs versus other jobs in the queue. To lower wait time, investigate using a different partition, submitting to multiple partitions, resizing your job, or improving your fairshare. A deeper discussion can be found on the Job Efficiency and Optimization page.
