fairshare – FASRC DOCS https://docs.rc.fas.harvard.edu Tue, 26 May 2026 19:39:56 +0000

Slurm Stats https://docs.rc.fas.harvard.edu/kb/slurm-stats/ Tue, 27 Aug 2024 15:39:58 +0000

Overview

When you log on to the FASRC clusters you will be greeted by Slurm Stats. On a nightly basis, we pull data from the scheduler for the day and display a summary table when you log in to the cluster. This should help you understand how your jobs are performing as well as help you track your usage on a daily basis. Below is a description of the statistics we provide along with recommendations of where to go to get more information or to improve your jobs’ performance.

The Statistics

+---------------- Slurm Stats for May 18 -----------------------+
|                  End of Day Fairshare                         |
|                     test_lab: 0.063426                        |
|                     jharvard_lab: 0.106328                    |
+-------------------- Jobs By State ----------------------------+
|       Total | Completed | Canceled | Failed | Out of |  Timed |
|             |           |          |        | Memory |    Out |
| CPU:      0 |         0 |        0 |      0 |      0 |      0 |
| GPU:     38 |        38 |        0 |      0 |      0 |      0 |
+---------------------- Job Stats ------------------------------+
|        | Average | Average   | Average    | Total Usage /     |
|        | Used    | Allocated | Efficiency | Ave. Wait Time    |
| Cores  |     1.0 |       8.0 |      12.5% |    481.45 CPU Hrs |
| Memory |    8.7G |     32.0G |      27.3% |                   |
| GPUS   |     1.0 |       1.0 |      98.1% |     60.18 GPU Hrs |
| Time   |   1.58h |     6.00h |      26.4% |            10.33h |
+---------------------------------------------------------------+
| https://docs.rc.fas.harvard.edu/kb/slurm-stats                |
+---------------------------------------------------------------+

Above is the table that you will see when you log in to the cluster if you have run jobs in the last day. This data is pulled from the scheduler and covers jobs that finished in the 24-hour day listed. If you would like similar summary information for a longer period of time, use the seff-account command. For instance, if you wanted the data for the last week you would run:

seff-account -u USERNAME -S 2026-05-13 -E 2026-05-18

For more detailed information, FASRC offers a few tools and dashboards:

Below we will describe the various fields and what they mean.

Fairshare

The first thing listed is the fairshare for the lab accounts that you belong to. The fairshare value is as of the end of the day indicated in the table title. Lower fairshare means lower priority for your jobs on the cluster.  For more on fairshare and how to improve your score see our comprehensive fairshare document.

Jobs by State

If you have jobs that finished in the day indicated, then a breakdown of their end states is presented. Jobs are sorted first by whether or not they asked for a GPU. Next, the total number of jobs in that category is given, followed by a breakdown by state.

  • Completed jobs are those that finished cleanly with no errors that Slurm could detect (there may still be errors that your code has generated internally).
  • Canceled jobs are those jobs which were terminated via the scancel command either by yourself or the administrator.
  • Failed jobs are those jobs that the scheduler has detected as having a faulty exit.
  • Out of Memory jobs are those that hit the requested memory limit set in the job script.
  • Timed Out jobs are those that hit the requested time limit set in the job script.

Used, Allocated, and Efficiency

For all the jobs that were not Canceled, we calculate statistics averaged over all the jobs run. These are broken down by Cores, Memory, GPUs, and Time.

  • Average Used is the average amount actually used by the job.
  • Average Allocated is the average amount of resources allocated by the job script for the job.
  • Average Efficiency is the ratio of the amount of resource Used by the job to the amount of resources Allocated per job, averaged over all the jobs.

In an ideal world your jobs should use exactly, or as close as possible to, the resources they request and hence have an Average Efficiency of 100%. In practice, some jobs use all the resources they request and others do not. Having unused resources that you have allocated means that your code is not utilizing all the space you have set aside for it. This wasted space ends up driving down your fairshare, as cores, memory, and GPUs you do not use are still charged against your fairshare.
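As a rough illustration of how an Average Efficiency figure can arise, the sketch below averages the per-job ratio of Used to Allocated. The job records and the function name average_efficiency are hypothetical, and the exact method used to build the Slurm Stats table may differ.

```python
from statistics import mean

# Hypothetical per-job records: (cores actually used, cores allocated).
jobs = [(1.0, 8.0), (2.0, 8.0), (4.0, 4.0)]


def average_efficiency(records):
    """Mean of the per-job used/allocated ratios, expressed as a percentage."""
    return 100.0 * mean(used / allocated for used, allocated in records)


# A single job using 1 of its 8 allocated cores, as in the Cores row of the table.
print(f"{average_efficiency([(1.0, 8.0)]):.1f}%")

# Averaging the ratios of several jobs: one efficient job does not hide
# the waste in the others.
print(f"{average_efficiency(jobs):.1f}%")
```

Because the ratio is taken per job and then averaged, the result can differ slightly from dividing Average Used by Average Allocated.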

To learn more about which jobs are the culprits, we recommend using tools like seff-account, jobstats, and sacct. These tools can give you an overview of your jobs and more detailed information about specific jobs. We also have an in-depth guide to Job Efficiency and Optimization, which covers techniques for improving your efficiency.

Total Usage

Total usage is the total number of hours allocated for CPUs and GPUs respectively. This is a measure of your total usage of the jobs that finished on the day indicated at the top of the table. Note that this is the total usage for a job, so a job that ran for multiple days will have all its usage show up at once in this number and not just its usage for that day only. This usage is also not weighted by the type of CPU or GPU requested which can impact how much fairshare the usage would cost. For more on how we handle usage and fairshare, see our general fairshare document.
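The following sketch shows one way to read the Total Usage numbers: allocated cores (or GPUs) multiplied by elapsed hours, summed over whole jobs. The job list and the helper total_usage are invented for illustration and are not part of any FASRC tool.

```python
# Hypothetical finished jobs: (allocated cores, allocated GPUs, elapsed hours).
finished_jobs = [
    (8, 1, 1.5),
    (8, 1, 2.0),
    (4, 0, 30.0),  # a multi-day job: all 30 hours count on the day it finishes
]


def total_usage(jobs):
    """Return (CPU hours, GPU hours) summed over whole jobs, with no TRES weighting."""
    cpu_hours = sum(cores * hours for cores, _, hours in jobs)
    gpu_hours = sum(gpus * hours for _, gpus, hours in jobs)
    return cpu_hours, gpu_hours


cpu_hours, gpu_hours = total_usage(finished_jobs)
print(f"{cpu_hours:.2f} CPU Hrs, {gpu_hours:.2f} GPU Hrs")
```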

Wait Time

The number in the lower right-hand corner of the Job Stats table, in the Time row, is the average wait time per job. This is a useful number as your total Time to Science (TtS) is your wait time (aka pending time) plus your run time. In the table above, the average wait time is 10.33h and the average run time is 1.58h, so the average TtS is about 11.91h. Wait time varies depending on the partition used, the size of the job, and the relative priority of your jobs versus other jobs in the queue. To lower wait time, investigate using a different partition, submitting to multiple partitions, resizing your job, or improving your fairshare. A deeper discussion can be found in the Job Efficiency and Optimization page.

Fairshare and Job Accounting https://docs.rc.fas.harvard.edu/kb/fairshare/ Wed, 16 Oct 2019 14:49:19 +0000

Summary

In order to ensure that all research labs get their fair share of the cluster and to account for differences in hardware being used, we utilize Slurm’s built-in job accounting and fairshare system. Every lab has a base Share of the community-wide system, which is governed by the Gratis Share purchased by the Faculty of Arts and Sciences and distributed equally to all labs. In addition, Shares purchased by individual labs by buying hardware are added to their base Share. The Fairshare score of a lab is then calculated based on their Share versus the amount of the cluster they have actually used. This Fairshare score is then used to assign priority to their jobs relative to other users on the cluster. This keeps individual labs from monopolizing the resources, which would be unfair to labs that have not used their fair share for quite some time. Currently, we account for the fraction of the compute node used with CPU, GPU, and Memory usage using Slurm’s Trackable RESources (TRES).

What is Fairshare?

Fairshare is a portmanteau that pretty much expresses what it is. Essentially fairshare is a way of ensuring that users get their appropriate portion of a system. Sadly this term is also used confusingly for different parts of fairshare: the fraction of the system users get, the score that the system assigns users based on their usage, and the priority that users are assigned based on their usage. For the sake of the discussion below, we will use the following terms. Share is the portion of the system users have been granted. Usage is the amount of the system users have actually used. Fairshare score is the value the system calculates based on the user’s usage. Priority score is the priority assigned based on the user’s fairshare score.

While Fairshare may seem complex and confusing, it is actually quite logical once you think about it. The scheduler needs some way to adjudicate who gets what resources. Different groups on the cluster have been granted different resources for various reasons. In order to serve the great variety of groups and needs on the cluster a method of fairly adjudicating job priority is required. This is the goal of Fairshare. Fairshare allows those users who have not fully used their resource grant to get higher priority for their jobs on the cluster, while making sure that those groups that have used more than their resource grant do not overuse the cluster. The cluster is a limited resource and Fairshare allows us to ensure everyone gets a fair opportunity to use it regardless of how big or small the group is.

Trackable RESources (TRES)

Slurm Trackable RESources (TRES) allows the scheduler to charge back users for how much they have used different features on the cluster. This is important as the usage of the cluster factors into the Fairshare calculation. These TRES charge backs vary from partition to partition. You can see what the TRES charge back is by running scontrol show partition <partitionname> and looking at the TRESBillingWeights category.

On Cannon we set TRES for CPU, GPU, and Memory usage. For most partitions we charge back for CPUs and GPUs based on the type being used. We normalize TRES to 1.0 for Intel Cascade Lake chips. For other chips we calculate the TRES by taking the theoretical peak Floating Point OPerations (FLOPs) for a single core of that CPU (or entire GPU) and dividing it by the theoretical peak for the Intel Cascade Lake chips. With this weighting we end up with the following TRES per core (or per GPU):

Processor Type           TRES
Intel Skylake            0.5
AMD Milan                0.5
AMD Genoa                0.6
Intel Sapphire Rapids    0.6
Intel Cascade Lake       1.0
Intel Ice Lake           1.15
Nvidia A40               10
Nvidia V100              75
Nvidia A100              209.1
Nvidia H100              546.9
Nvidia H200              546.9

It may seem to be a penalty to charge more for the Cascade Lake than the Sapphire Rapids, but it really is not in the end. The reason is that, by the peak-FLOPs weighting above, a Sapphire Rapids core delivers roughly 40% less than a Cascade Lake core, so the same job should finish correspondingly sooner on Cascade Lake. Thus the actual charge back to the user should be about the same on a per-job basis; it is just a question of picking the right resource for the job you are running.
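A minimal sketch of this argument, under the loudly stated assumption that a job's runtime scales inversely with the per-core TRES weight (real codes rarely track theoretical peak exactly). The dictionary keys and the function core_charge are hypothetical names.

```python
# Per-core TRES weights taken from the table above.
CORE_TRES = {"cascade_lake": 1.0, "sapphire_rapids": 0.6}


def core_charge(core_hours_on_cascade_lake, processor):
    """Charge for the same work, ASSUMING runtime scales as 1 / TRES weight."""
    weight = CORE_TRES[processor]
    runtime = core_hours_on_cascade_lake / weight  # slower cores run longer
    return weight * runtime


for processor in CORE_TRES:
    # Under the assumption above, both processors give the same charge.
    print(processor, round(core_charge(10.0, processor), 6))
```

When a code does not speed up in proportion to peak FLOPs, the cheaper core type can come out ahead, which is why picking the right resource matters.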

In the case of memory we set the TRES using the formula NumCore*CoreTRES/TotalMem, where NumCore is the number of cores per node, CoreTRES is the TRES score for that type of core, and TotalMem is the total available memory for the node. The reason we weight memory like this is that if a user uses up all the memory on the node, the scheduler cannot schedule another job on that node even if there are available cores. The opposite is also true: if all the cores are used up, the scheduler cannot schedule another job there even if there is free memory. Thus memory and CPU are exhaustible resources that impact each other. The above weighting ensures that memory costs the same as the CPUs on a given node. For instance, let’s say you have a node that has 128 GB of RAM and 32 Intel Cascade Lake cores. In this case every 4 GB of RAM used should be equivalent to a single core being used. Thus we should charge a TRES of 1.0 for 4 GB used, or 0.25 for every GB used. In the case of an Intel Sapphire Rapids node with 32 cores and 128 GB of RAM, you have the same scenario, but now the Sapphire Rapids chips are worth 40% less, so the memory is also worth 40% less: 0.15 for every GB used.
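The memory weighting can be checked with a few lines of arithmetic. The function name memory_tres_per_gb is made up for this sketch; the inputs are the two example nodes from the paragraph above.

```python
def memory_tres_per_gb(num_cores, core_tres, total_mem_gb):
    """Memory TRES per GB: NumCore * CoreTRES / TotalMem."""
    return num_cores * core_tres / total_mem_gb


# 32 Cascade Lake cores (TRES 1.0) with 128 GB of RAM.
print(round(memory_tres_per_gb(32, 1.0, 128), 4))

# 32 Sapphire Rapids cores (TRES 0.6) with 128 GB of RAM.
print(round(memory_tres_per_gb(32, 0.6, 128), 4))
```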

There are two exceptions to the above TRES rules: the requeue partitions, such as serial_requeue and gpu_requeue, and the test partitions. Jobs in the requeue partitions can be interrupted by higher priority jobs at any time, which means that there could be a loss of computation time. This is especially true for jobs that are not able to snapshot their progress and restart from where they left off. Studies have shown that to make this type of model break even in terms of cost you need to charge back roughly half of what you normally would. So for the requeue partitions we charge a flat rate of 0.5 for CPU, 104.6 for GPU, and 0.125 per GB for Memory. Since the requeue partitions contain all our hardware, users can get access to normally very high cost CPUs and GPUs for less. Thus if a user needs to run a lot of jobs, the best way to optimize throughput and usage is to build their jobs to leverage the cheap resources in the requeue partitions. One should be aware though that the available cores in these partitions vary wildly depending on how active any given primary partition is.

The other exception is the test partitions, such as test and gpu_test. These partitions are exempted from normal fairshare accounting. This allows users to use these partitions for interactive work, code development, and workflow testing prior to running on the production partitions without fear of exhausting their allocation.

To calculate the amount of TRES usage for a job one would calculate this equation:

Usage = Runtime * (CoreTRES*CoreAlloc + MemTRES*MemAlloc + GPUTRES*GPUAlloc)

Where Runtime is the amount of time the job runs for, Core/Mem/GPUTRES are the TRES weights, and Core/Mem/GPUAlloc are how many resources were allocated. The scalc calculator also has an option for computing the expected usage for a job.
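The usage equation above translates directly into code. This is only a sketch: the function tres_usage and the example job are hypothetical, runtime is taken in seconds to match the TRES-sec unit of RawUsage, and the weights are the Cascade Lake example values (1.0 per core, 0.25 per GB) used earlier in this document.

```python
def tres_usage(runtime_sec, core_alloc, mem_alloc_gb, gpu_alloc,
               core_tres, mem_tres, gpu_tres):
    """Usage = Runtime * (CoreTRES*CoreAlloc + MemTRES*MemAlloc + GPUTRES*GPUAlloc)."""
    return runtime_sec * (core_tres * core_alloc
                          + mem_tres * mem_alloc_gb
                          + gpu_tres * gpu_alloc)


# One hour on 8 Cascade Lake cores with 32 GB of memory and no GPU.
usage = tres_usage(3600, 8, 32, 0, core_tres=1.0, mem_tres=0.25, gpu_tres=0.0)
print(usage)  # TRES-sec charged for this job
```

In this example the 32 GB of memory costs as much as the 8 cores, so trimming an oversized memory request would reduce the charge noticeably.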

Shares

On Cannon each user is associated with their primary group. This lab group is what is called an Account in Slurm. Users belong to Accounts, and Accounts have Shares granted to them. These Shares determine how much of the cluster that group has been granted. Users when they run are charged back for their runs against the Account (i.e. lab) they belong to.

Shares granted an Account come in three types that are summed together. The first type is the Gratis Share. This Gratis Share is the Share given to all labs that are part of the cluster owing to the investment that Research Computing, via the Faculty of Arts and Sciences, has made in Cannon. This Gratis Share is calculated by summing the CPU and GPU TRES for all the nodes in the public partitions, excepting the requeue partitions, and then dividing by the total number of Accounts on Cannon. Thus the Gratis Share roughly corresponds to the number of cores each group has been granted. Currently the Gratis Share is set to 250 for Cannon and 100 for FASSE.

The second type of Share is Lab Share. This Share is the Share given to those Labs who have purchased hardware for their own lab. The CPU and GPU TRES from that purchased hardware is summed and added to the Gratis Share for that Lab’s Account.

The third type of Share is Communal Partition Share. This Communal Partition Share is the Share given to labs who have gone in with other labs and have purchased hardware to be used in common by the group of labs (e.g. a partition for the entire department, or for a school, or a collaboration of labs). In these cases the CPU and GPU TRES is summed and then divided amongst the labs, per their discretion, and added to the Lab’s Account.

Thus the total Share an Account has is simply the addition of all of these types of Share. This Share is global to the whole cluster. So whether the Lab is running on their own dedicated partitions or on the public partitions, their Share is the same. The Share is simply the portion of the entire system they have been granted, and can be moved around as needed by the Lab to any of the resources available to them on the cluster.

Fairshare Score

Probably the easiest way to walk through how a Lab’s Fairshare Score is calculated is to explain what the Slurm tool sshare displays. This tool shows you all the components of your Fairshare calculation. Here is an example:

[root@holyitc01 ~]# sshare --account=test_lab -a
Account    User    RawShares  NormShares  RawUsage  EffectvUsage  FairShare
---------- ------- ---------- ----------- --------- ------------- ----------
test_lab           244        0.001363    45566082  0.000572      0.747627
test_lab   user1   parent     0.001363    8202875   0.000572      0.747627
test_lab   user2   parent     0.001363    248820    0.000572      0.747627
test_lab   user3   parent     0.001363    163318    0.000572      0.747627
test_lab   user4   parent     0.001363    18901027  0.000572      0.747627
test_lab   user5   parent     0.001363    18050039  0.000572      0.747627

The Account we are looking at is test_lab. The first line of the sshare output shows the summary for the whole lab, while the subsequent lines show the information for each user. The test_lab has been granted 244 RawShares. Each user of that lab has a RawShare of parent, this means that all the users pull from the total Share of the Account and do not have their own individual subShares of the Account Share. Thus all users in this lab have full access to the full Share of the Account.

The next column after RawShares is NormShares. NormShares is simply the Account’s RawShares divided by the total number of RawShares given out to all Accounts on the cluster. Essentially NormShares is the fraction of the cluster the Account has been granted, in this case about 0.136%. Given the way we give out RawShares on Cannon, the total number of RawShares should be equivalent to the total CPU TRES on Cannon. Note that 244 is test_lab’s own grant, not the cluster total; the values above imply a total of roughly 244 / 0.001363 ≈ 179,000 RawShares.

Following NormShares we have RawUsage. RawUsage is the amount of TRES-sec the Account/User has used. Thus if a user used a single Cascade Lake core for one second, the user’s account would be charged 1 TRES-sec in RawUsage. This RawUsage is also attenuated by the half-life that is set for the cluster, which is currently 3 days. Thus work done just now counts at full cost, work done 3 days ago counts half, work done 6 days ago one fourth, and so on. So RawUsage is the aggregate of the Account’s past usage with this half-life weighting factor. The RawUsage for the Account is the sum of the RawUsage for each user, thus sshare is an effective way to figure out which users have contributed the most to the Account’s score.

A quick aside, it should be noted that RawUsage is the sum of all usage including: failed jobs, jobs that are requeued, jobs that ran on nodes that failed, etc.  That usage is still counted as part of RawUsage.  The reason for this is that it is up to the user to effectively use the time and resources allocated by the scheduler even if that time is cut short for various reasons.  We highly recommend users test and verify their codes before running.  Users should also ensure their code has checkpointing enabled so that jobs can restart from where they left off in case of node failure.  These steps will minimize the effect of various failures on a user’s usage.

The next column is EffectvUsage. EffectvUsage is the Account’s RawUsage divided by the total RawUsage for the cluster. Thus EffectvUsage is the percentage of the cluster the Account has actually used. In this case, the test_lab has used 0.057% of the cluster.

Finally, we have the Fairshare score. The Fairshare score is calculated using the following formula:

f = 2^(-EffectvUsage/NormShares)

From this one can see that there are five basic regimes for this score, which are as follows:

1.0: Unused. The Account has not run any jobs recently.

1.0 > f > 0.5: Underutilization. The Account is underutilizing their granted Share. For example, when f≈0.71 a lab has recently used about half of their Share of the resources (1:2)

0.5: Average utilization. The Account on average is using exactly as much as their granted Share.

0.5 > f > 0: Over-utilization. The Account has overused their granted Share. For example, when f=0.25 a lab has recently overutilized their Share of the resources 2:1

0: No share left. The Account has vastly overused their granted Share. If there is no contention for resources, the jobs will still start.
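The score and its regimes can be reproduced from the sshare columns. The sketch below applies the formula f = 2^(-EffectvUsage/NormShares) to the test_lab values shown earlier; the function names fairshare_score and regime are illustrative only, and the result differs slightly from the FairShare column because sshare prints rounded inputs.

```python
def fairshare_score(effective_usage, norm_shares):
    """f = 2^(-EffectvUsage / NormShares)."""
    return 2.0 ** (-effective_usage / norm_shares)


def regime(f):
    """Map a score onto the five regimes listed above (exact comparisons, for illustration)."""
    if f == 1.0:
        return "Unused"
    if f > 0.5:
        return "Underutilization"
    if f == 0.5:
        return "Average utilization"
    if f > 0.0:
        return "Over-utilization"
    return "No share left"


# test_lab from the sshare example: EffectvUsage 0.000572, NormShares 0.001363.
f = fairshare_score(0.000572, 0.001363)
print(f"{f:.4f}", regime(f))

# Usage equal to Share gives exactly 0.5; usage at twice the Share gives 0.25.
print(fairshare_score(0.001, 0.001), fairshare_score(0.002, 0.001))
```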

Since the usage of the cluster varies, the scheduler does not stop Accounts from using more than their granted Share. Instead, the scheduler wants to fill idle cycles, so it will take whatever jobs it has available. Thus an Account is essentially borrowing computing resource time in the future to use now. This will continue to drive down the Account’s Fairshare score, but allow jobs for the Account to still start. Eventually, another Account with a higher Fairshare score will start submitting jobs and that lab’s jobs will have a higher priority because they have not used their granted Share. Fairshare only recovers as a lab reduces its workload to allow other Accounts to run. The half-life helps to expedite this recovery.

Given this behavior of Fairshare, Accounts can also bank time for large computations that are beyond their average Share. For instance, say the Lab knows it has a large parallel run to do, or alternatively a deadline to meet. The Lab can, in preparation for this, not run for several weeks. This will drive up their Fairshare as they will not have used their fraction of the cluster for that time period. This banked capacity can then be expended for a large run or series of runs. On the other hand, to continue the financial analogy, a group that has exhausted their Fairshare is in debt to the scheduler as they have used up far more than their granted Share. Thus they have to wait for that debt to be paid off by not running, which allows their Fairshare to recover. Again, when there is no contention for resources, even jobs with low Fairshare scores will continue to start.

Job Priority

Now that we have discussed Fairshare we can now discuss how an individual job’s priority is calculated. Job Priority is an integer number that adjudicates the position of a job in the pending queue relative to other jobs. There are two components of Job Priority on Cannon. The first is the FairShare score multiplied by a weighting factor to turn it into an integer, in this case 10,000,000. A Fairshare of 1 would give a priority of 10,000,000, while a Fairshare of 0.5 would give a value of 5,000,000. We pick large numbers so we have resolution to break ties between Accounts that are close in Fairshare score. This Fairshare Priority evolves dynamically as the Fairshare of the Account changes over time.

The second component is Job Age. This priority accrues over time, reaching its maximum value at 3 days on Cannon and 7 days on FASSE. As the job sits in the queue waiting to be scheduled, its priority gradually increases due to the Job Age. The maximum possible value for Job Age is 1,000,000. Thus a job that has been sitting for 1.5 days on Cannon would have a Job Age Priority of 500,000. We set the Job Age Priority to a maximum of 1,000,000 so that a job from an Account that has a Fairshare of 0 but has been pending for 3 days on Cannon would have the same priority as a job that was just submitted from an Account that has a Fairshare of 0.1. Thus even jobs from Accounts that have low Fairshare will schedule eventually due to the growth in their Job Age Priority.

These two components are summed together to make up an individual Job’s Priority. You can see this calculation for specific jobs by using the sprio command. In addition you can see the Pending queue of a specific partition ordered by job priority by using showq -o -p <partitionname>.
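A small sketch of the two-component sum described above, using the weights quoted in this section. It assumes the Job Age component grows linearly up to its cap, which matches the 1.5-day example but is a simplification; job_priority and the constant names are hypothetical, and sprio remains the authoritative source for a real job.

```python
FAIRSHARE_WEIGHT = 10_000_000  # Fairshare of 1.0 maps to this priority
MAX_AGE_PRIORITY = 1_000_000   # cap on the Job Age component


def job_priority(fairshare, age_days, max_age_days=3.0):
    """Fairshare priority plus Job Age priority (linear accrual ASSUMED)."""
    fairshare_part = round(fairshare * FAIRSHARE_WEIGHT)
    age_fraction = min(age_days / max_age_days, 1.0)
    age_part = round(age_fraction * MAX_AGE_PRIORITY)
    return fairshare_part + age_part


# A job pending 3 days from an Account with Fairshare 0 ties with a
# freshly submitted job from an Account with Fairshare 0.1.
print(job_priority(0.0, 3.0), job_priority(0.1, 0.0))

# On FASSE the age component takes 7 days to reach its maximum.
print(job_priority(0.5, 3.5, max_age_days=7.0))
```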

Nice

Slurm provides a way for users to adjust their own priority by defining a nice value. Similar to the unix nice command, this flag allows users to deprioritize certain jobs. Jobs that are deprioritized should have higher nice values than those that are more important. Values for nice can run between 0 and 2147483645; negative values are not allowed.

Multiple Accounts

While most users are fine with having one Account they are associated with, some users do work for multiple Accounts. Slurm does have the ability to associate users with multiple Accounts, which allows users to charge back individual jobs to individual Accounts. Contact Research Computing if you are interested in this feature.

Historic Data

Research Computing keeps track of historic data for usage and Fairshare score. You can see your historic usage by going to the Cannon and FASSE Lab Fairshare pages and selecting the lab you belong to (note: you must be on the FASRC VPN to see it). sacct-plot is another tool you can use to plot your usage over time from the command line.

scalc

scalc is a calculator available on the cluster for answering various questions about fairshare. It includes a calculator for projecting a new Fairshare score based on a new RawShare, a calculator for figuring out how long it will take to restore fairshare, and a calculator for figuring out how much a set of jobs will cost in terms of cluster utilization and fairshare. When asked to enter an account name, please enter your lab group name (e.g. jharvard_lab). If you have additional calculations that you would like to see, contact us.

stotal

stotal is a tool which calculates CPU-hours, GPU-hours, and TRES-hours for a specified user and account. This can be useful for assessing usage on the cluster without any of the half-life decay that occurs for the values in sshare. Details on how to run this command on the CLI and all the options available with it are provided on its GitHub repo linked above. For example, one can execute the following command to see the compute usage of their lab with usage broken out by members of that lab:

stotal -A <LABNAME> -S <STARTTIME> -E <ENDTIME> -d

where <STARTTIME> & <ENDTIME> are in the format: YYYY-MM-DD

Note: To see statistics for anything beyond your user you will need special permission, contact FASRC if you are interested. 

FAQ

Q: My lab’s fairshare is low, what can I do?

There are several things that can be done when your fairshare is low:

  1. Do not run jobs: Fairshare recovers via two routes.  The first is via your group not running any jobs and letting others use the resource.  That allows your fractional usage to decrease, which in turn increases your fairshare score.  The second is via the half-life we apply to fairshare, which ages out old usage over time.  Both of these methods require not action but inaction on the part of your group.  Thus to recover your fairshare simply stop running jobs until your fairshare reaches the level you desire.  Be warned this could take several weeks to accomplish depending on your current usage.
  2. Be patient: This is a corollary to the previous point but applies if you want to continue to run jobs.  Even if your fairshare is low, your job gains priority by sitting in the queue.  The longer it sits, the higher priority it gains.  So even if you have very low fairshare your jobs will eventually run; it just may take several days.
  3. Leverage Backfill: Slurm runs two scheduling loops.  The first loop is the main loop, which simply looks at the top of the priority chain for the partition and tries to schedule that job.  It will schedule jobs until it hits a job it cannot schedule and then it restarts the loop.  The second loop is the backfill loop.  This loop looks through jobs further down in the queue and asks whether it can schedule a job now without interfering with the start time of the top priority job.  Think of it as the scheduler playing a giant game of three-dimensional Tetris, where the dimensions are number of cores, amount of memory, and amount of time.  If your job will fit in the gaps that the scheduler has, it will put your job in that spot even if it is low priority.  This requires you to be very accurate in specifying the core, memory, and time usage of your job.  The better constrained your job is, the more likely the scheduler is to fit you into these gaps.  The jobstats and seff-account utilities are great ways of figuring out your job performance. See also our page on improving Job Efficiency.
  4. Leverage Requeue: The requeue partitions are cheaper to run in and have a lot of capacity.  You are more likely to find your job pending for a shorter time, even with low fairshare, in those partitions than in the higher demand non-requeue partitions.
  5. Plan: Better planning and knowledge of your historic usage can help you better budget your time on the cluster.  The cluster is not an infinite resource.  You have been allocated a slice of the cluster, thus it is best to budget your usage so that you can run high priority jobs when you need to.  We at FASRC are happy to consult with you as to how to best budget your usage.  Tools like scalc, jobstats, seff-account, seff-array, and the historic usage graphs are invaluable assets for this.  Beyond that, analyzing your code efficiency and memory usage will help dramatically.  Most users vastly overestimate how much memory their job actually needs, dragging down their fairshare score over time.  Trimming these excess requests makes for more efficient usage.  Increasing code efficiency by taking time to optimize your code base can also be very beneficial, as better, more efficient algorithms mean lower usage and therefore better fairshare.
  6. Purchase: If your group has persistent high demand that cannot be met with your current allocation, serious consideration should be given to purchasing hardware for the cluster.  This is not an immediate solution to the problem as it takes time for hardware to be built and installed.  That said once the hardware arrives your Share will be increased and your fairshare will improve commensurately.  Please contact FASRC for more information if you wish to purchase hardware for the cluster.

Q: If I am running jobs on my PI’s private partition, then why am I getting charged?

We give RawShares to everyone that can be used anywhere on the cluster since Fairshare is a global quantity. Hence a user is charged regardless of what partition they use.  Groups who have private partitions are granted RawShares equivalent to the hardware in that partition per the table above. This grant exactly offsets the use of the partition. Since Fairshare is global, a group could decide to leave their partition idle or undersubscribed and use their shares elsewhere on the cluster. This allows groups to be flexible regarding which partitions they decide to use.

Running Jobs https://docs.rc.fas.harvard.edu/kb/running-jobs/ Thu, 27 Feb 2014 16:56:28 +0000

Introduction

Faculty of Arts and Sciences Research Computing (FASRC) hosts several collections of computers in what are called clusters. Each cluster is a large number of individual compute servers networked together with a high speed interconnect and integrated with storage (see our Data Management guide for more). To manage work on these clusters FASRC uses Slurm.

Slurm is an open source scheduler from SchedMD. The job of Slurm is:

1. To govern what user gets what resources on the cluster and when.
2. To create allocations for individual units of work which are called jobs.
3. To ensure maximum utilization of the cluster.
4. To keep a historical record of usage.

Users interact with Slurm by submitting a job to the scheduler. The scheduler then puts that job in the pending queue for the selected subsection of the cluster (called a partition) for consideration. The scheduler will weight the job’s priority based on the user’s prior usage to ensure a fair distribution of resources. It will then try to schedule the highest priority work by playing a large scale game of Tetris. In addition Slurm will take lower priority jobs and try to fit them into various gaps it finds in order to maximize usage without impacting the time when the higher priority work would run.

Below we will walk you through how to submit jobs to the scheduler for work. We will also discuss how the cluster is organized and some best practices for use. For more details on the architecture of the cluster, please see our Job Efficiency and Optimization Best Practices page.


Getting Started

To submit jobs you will first need to set up your account. Once you’ve gone through the account setup procedure, you can log in to the cluster via ssh to a login node and/or use Open OnDemand. The guide below assumes that you will be using the command line (CLI) for interaction with Slurm.

FASRC cluster nodes run the Rocky distribution of the Linux operating system and commands are run under the bash shell. There are a number of Linux and bash references, cheat sheets and tutorials available on the web. FASRC’s own training is also available.

Storage and Scratch on the Cluster

Cluster nodes have file systems mounted for use by labs and individuals to store both on a temporary (called scratch) and long term basis. The Data Storage page covers the various storage options. Please use the appropriate storage for your jobs as each storage type has different purposes and performance characteristics.


Slurm Documentation

Comprehensive documentation for Slurm can be found at the official Slurm website. Note that these docs always describe the latest version of Slurm. FASRC tries to keep up with the latest version, but you will want to cross-check the version we run against the version the docs are for. To find the version of Slurm the cluster is running, run sinfo --version.

You can also get documentation on individual commands by using the unix man command. This command will show you the manual for the command for the version of Slurm the cluster is using. For instance if you want the manual for the sinfo command you would run: man sinfo

Some other useful documentation sites are:

Summary of Slurm Commands

The table below shows a brief list of common Slurm commands. These commands are described in more detail below along with links to the Slurm doc site.

What you want to do | Slurm Command | Slurm Example
Submit a batch serial job | sbatch | sbatch runscript.sh
Run a script or application interactively (do not use salloc on FASSE) | salloc | salloc -p test -t 10 --mem 1G [script or app]
Start interactive session (do not use salloc on FASSE) | salloc | salloc -p test -t 10 --mem 1G
Kill a job | scancel | scancel JOBID
View status of your jobs | sacct | sacct -u USERNAME
Check job by id number | sacct | sacct -j JOBID
Check efficiency of job | jobstats | jobstats JOBID
List of available partitions | spart | spart
Check current partition queue state | showq | showq -o -p PARTITIONNAME
Details on current job, node, partition | scontrol | scontrol show job JOBID; scontrol show node NODENAME; scontrol show partition PARTITIONNAME
Schedule recurring batch job | scrontab | see scrontab document for example
Check fairshare | sshare | sshare -U

 


Slurm Global Limits and Defaults

Before submitting any jobs users should familiarize themselves with:

FASRC has set several global limits that users should be aware of and should plan around. These limits exist to prevent any one person from taking over the cluster and also serve to prevent the cluster being overwhelmed due to poorly formed jobs. Users must work within these limits and should plan their work accordingly. This is typically done by breaking up their workflow into smaller chunks or by deliberately serializing their jobs to increase the job time and decrease the number of cores needed. The limits are as follows:

  • Maximum Number of Jobs per User: 10,100. This is meant to prevent any one user from monopolizing the cluster.
  • Maximum Array Size: 10,000. This applies to both the array index values and the number of elements in the array. This is meant to prevent any one user from monopolizing the cluster. Note that each array index counts as a single job for purposes of the Maximum Number of Jobs per User, so this is intentionally redundant.
  • Maximum Number of Steps: 40,000. A job step is recorded by Slurm for each invocation of srun by a job. This is meant to prevent run-away jobs.

All other limits are partition or node dependent. More on that below.
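As an illustration of breaking a workflow into smaller chunks, the sketch below splits a large task count into job arrays that each stay within the 10,000 array limit. It is a dry run that only prints sbatch commands. The names array_chunks, OFFSET, and process_tasks.sh are hypothetical: the assumed script would compute its own task number as OFFSET plus SLURM_ARRAY_TASK_ID. Because each array index also counts toward the per-user job limit, the printed commands would need to be submitted one after another as earlier arrays finish rather than all at once.

```shell
#!/bin/bash
# Print one "offset count" pair per job array needed to cover TOTAL tasks,
# keeping every array at or below MAX_ARRAY elements.
MAX_ARRAY=10000   # Maximum Array Size from the list above

array_chunks() {
    local total=$1 offset=0 count
    while [ "$offset" -lt "$total" ]; do
        count=$(( total - offset ))
        if [ "$count" -gt "$MAX_ARRAY" ]; then
            count=$MAX_ARRAY
        fi
        echo "$offset $count"
        offset=$(( offset + count ))
    done
}

# Dry run for 25,000 tasks: print the commands instead of submitting them.
# process_tasks.sh is a hypothetical script that reads OFFSET from its environment.
array_chunks 25000 | while read -r offset count; do
    echo sbatch --array=0-$(( count - 1 )) --export=ALL,OFFSET="$offset" process_tasks.sh
done
```

Every array starts again at index 0, so the index values themselves also stay below the limit.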

FASRC also sets the following defaults if nothing is requested:

  • Core Count: 1
  • Memory: 100 MB
  • GPU Count: 0
  • Partition: serial_requeue
  • Time: There is no default time set. Users must always declare time.

Users can set their own defaults by creating a definition file at $HOME/.slurm/defaults. For more, see the CLI Filter doc.


Slurm Partitions

A partition is a block of nodes on the cluster with its own scheduling policy. Partitions have various limits governing what types of jobs are appropriate to run in them. When a job is submitted it is assigned to the specified partition(s) and joins the pending queue. When the job is scheduled in a partition it will join the running queue for that partition. You can find out what partitions you have access to using the spart command. To learn more about a given partition run: scontrol show partition PARTITIONNAME. To learn more about an individual node run: scontrol show node NODENAME. Below is a list of the public partitions on Cannon (FASSE can be found here).

 
Partition | Nodes | Cores per Node | CPU Core Types | Mem per Node (GB) | Time Limit | Max Jobs | Max Cores | GPU Capable? | /scratch size (GB)
sapphire | 186 | 112 | Intel “Sapphire Rapids” | 990 | 3 days | none | none | No | 396
shared | 310 | 48 | Intel “Cascade Lake” | 184 | 3 days | none | none | No | 68
bigmem | 4 | 112 | Intel “Sapphire Rapids” | 1988 | 3 days | none | none | No | 396
bigmem_intermediate | 3 | 64 | Intel “Ice Lake” | 2000 | 14 days | none | none | No | 396
gpu | 36 | 64 | Intel “Ice Lake” | 990 | 3 days | none | none | Yes (4 A100/node) | 396
gpu_h200 | 22 | 112 | Intel “Sapphire Rapids” | 990 | 3 days | none | none | Yes (4 H200/node) | 843
intermediate | 12 | 112 | Intel “Sapphire Rapids” | 990 | 14 days | none | none | No | 396
unrestricted | 8 | 48 | Intel “Cascade Lake” | 184 | none | none | none | No | 68
test | 18 | 112 | Intel “Sapphire Rapids” | 990 | 12 Hours | 5 | 112 | No | 396
gpu_test | 12 | 64 | Intel “Ice Lake” | 487 | 12 Hours | 2 | 64 | Yes (8 A100 MIG 3g.20GB/node) – Limit 8 per job | 172
remoteviz | down | 32 | Intel “Cascade Lake” | 373 | 3 days | none | none | Shared V100 GPUs for rendering | 396
serial_requeue | varies | varies | AMD/Intel | varies | 3 days | none | none | No | varies
gpu_requeue | varies | varies | AMD/Intel | varies | 3 days | none | none | Yes | varies
PI/Lab nodes | varies | varies | varies | varies | none | none | none | varies | varies

Partition Details

sapphire

The sapphire partition has a maximum run time of 3 days. Serial, parallel, and interactive jobs are permitted on this queue, and this is the most appropriate location for MPI jobs. This partition has 186 nodes connected by an NDR InfiniBand (IB) fabric, where each node is configured with 2 Intel Xeon Sapphire Rapids CPUs, 990 GB of RAM, and 400 GB of local scratch space. Each Intel CPU has 56 cores and 100 MB of cache.

When submitting MPI jobs on the sapphire partition, it may be advisable to use the --contiguous option for best communication performance if your code is topology sensitive. Though all of the nodes are connected by InfiniBand fabric, there are multiple switches routing the MPI traffic and Slurm will by default schedule you wherever it can find space. Thus your job may end up scattered across the cluster. The --contiguous option will ensure that the jobs are run on nodes that are adjacent to each other on the IB fabric. Be advised that using --contiguous will make your job pend longer, so only use it if you absolutely need it.

shared

The shared partition has a maximum run time of 3 days. Serial, parallel, and interactive jobs are permitted on this queue, and this is the most appropriate location for MPI jobs. This partition has 310 nodes connected by an HDR InfiniBand (IB) fabric, where each node is configured with 2 Intel Xeon Cascade Lake CPUs, 184 GB of RAM, and 70 GB of local scratch space. Each Intel CPU has 24 cores and 48 MB of cache.

When submitting MPI jobs on the shared partition, it may be advisable to use the --contiguous option for best communication performance if your code is topology sensitive. Though all of the nodes are connected by InfiniBand fabric, there are multiple switches routing the MPI traffic and Slurm will by default schedule you wherever it can find space. Thus your job may end up scattered across the cluster. The --contiguous option will ensure that the jobs are run on nodes that are adjacent to each other on the IB fabric. Be advised that using --contiguous will make your job pend longer, so only use it if you absolutely need it.

bigmem

This partition should be used for large memory work requiring greater than 1000 GB RAM per job. Jobs requesting less than 1000 GB RAM are automatically rejected by the scheduler.

There is a 3 day limit for work here. MPI or low memory work is not appropriate for this partition, and inappropriate jobs may be terminated without warning. This partition has an allocation of 4 nodes with 1988 GB of RAM.

bigmem_intermediate

This partition should be used for large memory work requiring greater than 1000 GB RAM per job. Jobs requesting less than 1000 GB RAM are automatically rejected by the scheduler. There is a minimum run time of 3 days and maximum run time of 14 days.

MPI or low memory work is not appropriate for this partition, and inappropriate jobs may be terminated without warning. This partition has an allocation of 3 nodes with 2000 GB of RAM.

gpu

This 36 node partition is for individuals wishing to use GPGPU resources. You will need to include #SBATCH --gres=gpu:n where n=1-4 in your Slurm submission scripts. Each node has 64 cores and is equipped with 4 x Nvidia A100s. See our GPU Computing section for more info on using and specifying GPU resources.

gpu_h200

This 22 node partition is for individuals wishing to use GPGPU resources. You will need to include #SBATCH --gres=gpu:n where n=1-4 in your Slurm submission scripts. Each node has 112 cores and is equipped with 4 x Nvidia H200s. See our GPU Computing section for more info on using and specifying GPU resources.

intermediate

Serial and parallel (including MPI) jobs are permitted on this partition and this partition is intended for runs needing 3 to 14 days of runtime. This partition has an allocation of 12 nodes of the same configuration as above for the sapphire partition.

unrestricted

Serial and parallel (including MPI) jobs are permitted on this partition, and there is a 365 day limit on run time. Given this, there is no guarantee of 100% uptime. Running on this partition is done at the user’s own risk. Users should understand that if the queue is full it could take weeks or even months for a job to be scheduled to run. unrestricted is made up of 8 nodes of the same configuration as above for the shared partition.

test

This partition is dedicated to interactive (foreground / live) work and to testing code interactively before submitting in batch and scaling up. Small numbers (1 to 5) of serial and parallel jobs with small resource requirements (RAM/cores) are permitted on this partition; large numbers of interactive jobs, or jobs with large resource requirements, should be run on another partition. Multiple partition submissions to this partition are forbidden (i.e. one is not permitted to do #SBATCH -p test,sapphire).

This partition is made up of 18 nodes of the same configuration as above for the sapphire partition. This smaller queue has a 12 hour maximum run time. This queue has a maximum of 112 cores and 1000 GB RAM. Jobs in this queue are not charged fairshare.

gpu_test

This 14 node partition is for individuals wishing to test GPGPU resources. You will need to include #SBATCH --gres=gpu:n where n=1-8 in your Slurm submission scripts. These nodes have 64 cores and are equipped with 4 x Nvidia A100s in Multi-Instance GPU (MIG) mode. Each GPU has two 3g.20GB MIG instances. This queue has a maximum of 2 jobs, 64 cores, 512 GB RAM, 8 MIG GPUs, and a 12 hour run time. Users must request less than 8 CPUs and 64 GB of memory per MIG GPU. This partition is intended for interactive use, testing, and experimentation only. Multiple partition submissions to this partition are forbidden. See our GPU Computing section for more info on using and specifying GPU resources. Jobs in this queue are not charged fairshare.

remoteviz

This single node partition is for individuals who wish to use shared GPUs for rendering graphics. The V100 cards on this node are in shared mode and are intended for rendering rather than for computation. You do not need to request a GPU to use this partition. Multiple partition submissions to this partition are forbidden. For computation please use the gpu and gpu_test partitions.

serial_requeue

This partition is appropriate for single core (serial) jobs, jobs that require up to 8 cores for small periods of time (less than 1 day), or job arrays where each job instance uses less than 8 cores. Multinode jobs may be run in the partition, but be advised that this is a heterogeneous partition and users are strongly encouraged to use the --constraint option to get a homogeneous block of compute and networking. The maximum runtime for this queue is 3 days. GPU jobs are rejected from this partition and should be run in gpu_requeue. As this partition is made up of an assortment of nodes owned by other groups in addition to the general nodes, jobs in this partition may be killed and requeued if a higher priority job (e.g. the job of a node owner) comes in.

Because serial_requeue takes advantage of slack time in owned partitions, times in the PENDING state can potentially be much shorter than in the shared and sapphire partitions. Since jobs may be killed, requeued, and run a second time, ensure that your jobs are a good match for this partition. For example, jobs that append output would not be good for serial_requeue unless the data files are zeroed out at the start to ensure output from a previous (killed) run is removed. Also, so that your job need not redo all of its compute, it is advisable to have checkpointing enabled for your code. We advise that you use the --open-mode=append option so that requeue status and error messages are kept in your log files. Without this option, your log files will be reset at the start of each (requeued) run, with no obvious indication of requeue events.
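The requeue advice above can be sketched as a job script. This is a minimal illustration rather than a recommended template: the file names results.txt and checkpoint.txt, the five-step loop, and the resource requests are all hypothetical stand-ins for real work. It zeroes the output file on a fresh start, resumes from the last checkpoint after a requeue, and uses --open-mode=append so the log keeps messages from earlier runs. SLURM_RESTART_COUNT is set by Slurm once a job has been requeued; the fallback value keeps the script runnable outside of a job.

```shell
#!/bin/bash
#SBATCH -p serial_requeue
#SBATCH -c 1
#SBATCH -t 0-02:00
#SBATCH --mem=1G
#SBATCH --open-mode=append
#SBATCH -o requeue_demo_%j.out
#SBATCH -e requeue_demo_%j.err

RESULTS=results.txt        # hypothetical output file
CHECKPOINT=checkpoint.txt  # hypothetical file holding the last finished step
TOTAL_STEPS=5

echo "Starting run; restart count: ${SLURM_RESTART_COUNT:-0}"

if [ -f "$CHECKPOINT" ]; then
    last=$(cat "$CHECKPOINT")
    # Drop any partial output written after the last checkpoint.
    head -n "$last" "$RESULTS" > "$RESULTS.tmp" && mv "$RESULTS.tmp" "$RESULTS"
else
    last=0
    : > "$RESULTS"   # zero out output left behind by an earlier, killed run
fi

step=$(( last + 1 ))
while [ "$step" -le "$TOTAL_STEPS" ]; do
    echo "result for step $step" >> "$RESULTS"   # stand-in for real work
    echo "$step" > "$CHECKPOINT"                 # record progress
    step=$(( step + 1 ))
done
```

The checkpoint file is left in place when the script finishes, so a later submission in the same directory would treat the work as already done; remove it before starting a new calculation.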

gpu_requeue

This partition is appropriate for GPU jobs that require small periods of time (less than 1 day). Multinode jobs may be run in the partition, but be advised that this is a heterogeneous partition and users are strongly encouraged to use the --constraint option to get a homogeneous block of compute and networking. The maximum runtime for this queue is 3 days. You will need to include #SBATCH --gres=gpu:1 in your Slurm submission scripts to get access to this partition. As this partition is made up of an assortment of GPU nodes owned by other groups in addition to the public nodes, jobs in this partition may be killed and automatically requeued if a higher priority job (e.g. the job of a node owner) comes in.

Because gpu_requeue takes advantage of slack time in owned partitions, times in the PENDING state can potentially be much shorter than in the gpu and gpu_h200 partitions. Since jobs may be killed, requeued, and run a second time, ensure that your jobs are a good match for this partition. For example, jobs that append output would not be good for gpu_requeue unless the data files are zeroed out at the start to ensure output from a previous (killed) run is removed. Also, so that your job need not redo all of its compute, it is advisable to have checkpointing enabled for your code. We advise that you use the --open-mode=append option so that requeue status and error messages are kept in your log files. Without this option, your log files will be reset at the start of each (requeued) run, with no obvious indication of requeue events. See our GPU Computing section for more info on using and specifying GPU resources.

ITC, Kempner, HSPH, HUCE, and SEAS

For information on the partitions for these groups see:


Submitting Batch Jobs Using the sbatch Command

The main way to run jobs on the cluster is by submitting a script with the sbatch command. The command to submit a job is as simple as:

sbatch runscript.sh

The commands specified in the runscript.sh file will then be run on the first available compute node that fits the resources requested in the script. sbatch returns immediately after submission; commands are not run as foreground processes and won’t stop if you disconnect from the cluster.

When sbatch is run, Slurm copies the current user environment and the submission script into the scheduler. Thus the user is free to change their environment and edit the submission script after submitting. Note that this behavior does not apply to anything else: files, folders, executables, etc. will be executed and used as they are on disk at the moment the script accesses them, so do not update those files if you do not want the changes propagated to your job. When the scheduler launches the script, the script will start in the directory the user submitted the job from.

A typical submission script, in this case loading a Python module and having Python print a message, will look like this:

NOTE: It is important to keep all #SBATCH lines together and at the top of the script; no comments, bash code, or variable settings should appear until after the #SBATCH lines. Otherwise, Slurm may assume it’s done interpreting and skip any that follow.

#!/bin/bash
#SBATCH -c 1                # Number of cores (-c)
#SBATCH -t 0-00:10          # Runtime in D-HH:MM, minimum of 10 minutes
#SBATCH -p serial_requeue   # Partition to submit to
#SBATCH --mem=100           # Memory pool for all cores (see also --mem-per-cpu)
#SBATCH -o myoutput_%j.out  # File to which STDOUT will be written, %j inserts jobid
#SBATCH -e myerrors_%j.err  # File to which STDERR will be written, %j inserts jobid

# load modules
module load python/3.10.9-fasrc01

# run code
python -c 'print("Hi there.")'

In general, a submission script is composed of 4 parts:

  • The #!/bin/bash line allows the script to be run as a bash script.
  • The #SBATCH lines which are instructions for Slurm.
  • Commands loading any necessary modules and setting any variables, paths, etc.
  • The execution line itself, in this case calling python and having it print a message.

The #SBATCH lines shown above set the following key parameters:

  • #SBATCH -c 1: Sets the number of cores (threads) that you’re requesting. Make sure that your tool can use multiple cores before requesting more than one. If this parameter is omitted, Slurm assumes -c 1. For more on parallel work see: threads, MPI
  • #SBATCH -t 0-00:10: Specifies the running time for the job in days-hours:minutes (D-HH:MM) format. Other acceptable time formats include “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, and “days-hours:minutes:seconds”. If your job runs longer than the value you specify here, it will be canceled. Jobs have a maximum run time which varies by partition (see table above), though extensions can be done. There is no fairshare penalty for over-requesting time, though it will be harder for the scheduler to backfill your job if you overestimate.
  • #SBATCH -p serial_requeue: Specifies the Slurm partition under which the script will be run. See the partitions description above for more information. If you do not specify this parameter you will be given serial_requeue by default.
  • #SBATCH --mem=100: Specifies how much memory you require per node. Default units are MB, and users can use suffixes for other units [K|M|G|T]. Accurate specifications allow jobs to be run with maximum efficiency on the system. There are two main options, --mem-per-cpu and --mem. The --mem option specifies the total memory pool per node. If you must do work across multiple compute nodes (e.g. MPI code) and want to scale your memory allocation on a per core basis, then you should use the --mem-per-cpu option, as this will allocate the amount specified for each of the cores you’re requesting, whether they are on one node or multiple nodes. If this parameter is omitted, then you are granted 100 MB by default. Chances are good that your job will be killed as it will likely go over this amount, so you should always specify how much memory you require.
  • #SBATCH -o myoutput_%j.out: Specifies the file to which standard out will be written. If a relative file name is used, it will be relative to your current working directory. The %j in the filename will be substituted by the JobID at runtime. If this parameter is omitted, any output will be directed to a file named slurm-JOBID.out in the current directory.
  • #SBATCH -e myerrors_%j.err: Specifies the file to which standard error will be written. Slurm submission and processing errors will also appear in the file. The %j in the filename will be substituted by the JobID at runtime. If this parameter is omitted, standard error will be written to the same file as standard out.

Other useful options not shown above are:

  • #SBATCH --gpus=1: Specifies how many gpus are needed for the computation. For more see the GPU specific section.
  • #SBATCH --test-only: Adding this option to your script will tell the scheduler to return information on what would happen if you submit this job. This is a good and easy way to determine if your script is viable as well as get a rough estimate of how long it would take to schedule under the current queue load.
  • #SBATCH --account=jharvard_lab: If you are in more than one lab, use this option in all of your job scripts to charge your usage to the appropriate group.
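To make the accepted time formats concrete, here is a small helper that converts a time limit string into whole minutes. The function name slurm_time_to_minutes is hypothetical and is not part of Slurm; it only mirrors the formats listed above, and it rounds leftover seconds up to a full minute as a simplification.

```shell
#!/bin/bash
# Convert a Slurm-style time limit to whole minutes.
# Accepts: minutes, minutes:seconds, hours:minutes:seconds,
#          days-hours, days-hours:minutes, days-hours:minutes:seconds
slurm_time_to_minutes() {
    echo "$1" | awk '{
        has_days = (index($0, "-") > 0)
        days = 0; rest = $0
        if (has_days) { split($0, d, "-"); days = d[1]; rest = d[2] }
        n = split(rest, f, ":")
        h = 0; m = 0; s = 0
        if (has_days)    { h = f[1]; m = f[2]; s = f[3] }   # days-hours[:minutes[:seconds]]
        else if (n == 3) { h = f[1]; m = f[2]; s = f[3] }   # hours:minutes:seconds
        else if (n == 2) { m = f[1]; s = f[2] }             # minutes:seconds
        else             { m = f[1] }                       # minutes
        # leftover seconds count as one extra minute
        printf "%d\n", days * 1440 + h * 60 + m + int((s + 59) / 60)
    }'
}

slurm_time_to_minutes 0-00:10     # the limit used in the script above
slurm_time_to_minutes 3-00:00:00  # a 3 day partition limit
```

Note that a bare number is read as minutes, while a value after a day count is read as hours, so 1-12 means one day and twelve hours.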

It should be noted that all options that are prefixed by #SBATCH can also be set on the command line and vice versa. For example, if you wanted to set the partition via the command line instead you would do: sbatch -p PARTITIONNAME runscript.sh
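As a worked example of the difference between --mem and --mem-per-cpu when given on the command line, the sketch below computes the per-node total that matches a per-core request and prints the two corresponding sbatch commands without submitting anything. The core count and memory figures are illustrative values, and the two requests only describe the same allocation when all of the cores land on a single node.

```shell
#!/bin/bash
# Dry run: build and print sbatch command lines rather than submitting jobs.
cores=32
mem_per_cpu_mb=8000

total_mb=$(( cores * mem_per_cpu_mb ))   # 256000 MB across all cores
total_gb=$(( total_mb / 1024 ))          # 250 GB

# Memory scales with the number of cores, wherever they are placed.
per_core_cmd="sbatch -p sapphire -n $cores --mem-per-cpu=${mem_per_cpu_mb}M runscript.sh"
# One fixed pool on a single node.
per_node_cmd="sbatch -p sapphire -N 1 -n $cores --mem=${total_gb}G runscript.sh"

echo "$per_core_cmd"
echo "$per_node_cmd"
```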

Notifications by Email

The scheduler can send email to you for various job states (FAIL and END being the most useful). Please bear in mind that this must be used responsibly, as one user can quickly overwhelm the mail system and affect the notifications of all users by clogging up the mail queue. Keep in mind that tens or even hundreds of thousands of jobs may be in flight at a given time. This is why we strongly caution against using the ALL mail type. If you are using a metascheduler, job arrays, or just many jobs, please try to avoid adding too much burden to the email queue; sending hundreds or thousands of emails can cause email backups, not to mention fill up your inbox.

To add mail notification to your job script you can use the --mail-type option. You can find all the options available in the sbatch documentation. In addition if you specify END you will receive a summary of your job performance from jobstats.

The user to be notified is indicated with --mail-user. If no mail user is specified, Slurm uses the email address that is listed with your account.
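A minimal sketch of a job script with mail notification limited to the END and FAIL states is shown below. The address jharvard@example.edu is a placeholder, and the resource requests simply repeat the earlier example. SLURM_JOB_ID is set by Slurm inside a job; the fallback value only keeps the last lines runnable outside of one.

```shell
#!/bin/bash
#SBATCH -c 1
#SBATCH -t 0-00:10
#SBATCH -p serial_requeue
#SBATCH --mem=100
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=jharvard@example.edu

# Stand-in for real work: report which job ran and where.
message="Job ${SLURM_JOB_ID:-none} ran on $(uname -n)"
echo "$message"
```

For job arrays and other large batches, leaving these two lines out of the script entirely avoids sending one message per job.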


Monitoring Job Progress

To monitor jobs use sacct. Running sacct without any options will print out all the jobs you have run in the past day. sacct -j JOBID will show you a specific job. Note that sacct data is almost live, and that the various accounting fields (such as memory usage) are incomplete until the job finishes. For monitoring live performance stats use the jobstats command. Slurm keeps past job records, so users can look back at their historic usage for up to 6 months. If you need data from further back, contact FASRC to get access to our job archive.

sacct can provide much more detail as it has access to many of the resource accounting fields that SLURM uses. For example, to get a detailed report on the memory and CPU usage for an array job (see below for details about job arrays):

[jharvard@boslogin01 ~]$ sacct -j 44375501 --format JobID,Elapsed,ReqMem,MaxRSS,AllocCPUs,TotalCPU,State
JobID        Elapsed    ReqMem    MaxRSS     AllocCPUS  TotalCPU   State
------------ ---------- --------- ---------- ---------- ---------- ----------
44375501_[1+ 00:00:00   40000Mc              8          00:00:00   PENDING
44375501_1   2-03:50:53 40000Mc              8          2-03:50:23 COMPLETED
44375501_1.+ 2-03:50:53 40000Mc   34372176K  6          2-03:50:23 COMPLETED
44375501_1.+ 2-03:50:53 40000Mc   1236K      8          00:00.004  COMPLETED
44375501_2   1-23:47:35 40000Mc              8          1-23:47:18 COMPLETED
44375501_2.+ 1-23:47:35 40000Mc   34467196K  6          1-23:47:17 COMPLETED
44375501_2.+ 1-23:47:36 40000Mc   1116K      8          00:00.003  COMPLETED
44375501_3   1-23:32:36 40000Mc              8          1-23:32:15 COMPLETED
44375501_3.+ 1-23:32:36 40000Mc   34389040K  6          1-23:32:15 COMPLETED
44375501_3.+ 1-23:32:37 40000Mc   1224K      8          00:00.004  COMPLETED
44375501_4   1-21:59:30 40000Mc              8          1-21:59:07 COMPLETED
44375501_4.+ 1-21:59:30 40000Mc   34389044K  6          1-21:59:07 COMPLETED

The jobstats and seff-account commands are summary commands based off the data in sacct.

Slurm provides information about the job State. This value will typically be one of PENDING, RUNNING, COMPLETED, CANCELLED, or FAILED.

PENDING Job is awaiting a slot suitable for the requested resources. Jobs with high resource demands may spend significant time PENDING.
RUNNING Job is running.
COMPLETED Job has finished and the command(s) have returned successfully (i.e. exit code 0).
CANCELLED Job has been terminated by the user or administrator using scancel.
FAILED Job finished with an exit code other than 0.
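To get a quick tally of how many of your jobs ended in each of these states, sacct output can be piped through a small filter. The helper name count_states is hypothetical. The sketch assumes pipe-separated JobID|State lines, which is what sacct prints with the --parsable2 option, and it keeps only the first word of the state so that entries such as "CANCELLED by 21442" are counted as CANCELLED. The sample lines below are made up so that the block runs without access to the scheduler.

```shell
#!/bin/bash
# Count jobs per state from "JobID|State" lines on standard input.
count_states() {
    awk -F'|' '{ split($2, w, " "); n[w[1]]++ }
               END { for (s in n) print s, n[s] }' | sort
}

# On the cluster this would be fed from sacct, for example:
#   sacct -X --noheader --parsable2 --format=JobID,State | count_states
# Made-up sample input for illustration:
printf '%s\n' '101|COMPLETED' '102|COMPLETED' '103|FAILED' '104|CANCELLED by 21442' | count_states
```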

To learn more detailed information about individual jobs that are in the PENDING or RUNNING state you can run the scontrol command. For example:

[jharvard@boslogin06 general]$ scontrol show job 7000364
JobId=7000364 JobName=run_pros
UserId=jharvard(21442) GroupId=jharvard_lab(10483) MCS_label=N/A
Priority=313513 Nice=0 Account=jharvard_lab QOS=normal
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=04:00:00 TimeMin=N/A
SubmitTime=2026-04-21T05:51:54 EligibleTime=2026-04-21T05:51:54
AccrueTime=2026-04-21T05:51:54
StartTime=2026-04-22T00:20:00 EndTime=2026-04-22T04:20:00 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2026-04-21T09:28:15 Scheduler=Main
Partition=sapphire,shared AllocNode:Sid=holylogin06:928788
ReqNodeList=(null) ExcNodeList=(null)
NodeList= SchedNodeList=holy8a24607
StepMgrEnabled=Yes
NumNodes=1-1 NumCPUs=32 NumTasks=32 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=32,mem=250G,node=1,billing=36
AllocTRES=(null)
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=8000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) LicensesAlloc=(null) Network=(null)
Command=/n/netscratch/jharvard_lab/Lab/jharvard/run_BlueJay.sh
SubmitLine=sbatch run_BlueJay.sh
WorkDir=/n/netscratch/jharvard_lab/Lab/jharvard/finaladd5perc_v28
StdErr=
StdIn=/dev/null
StdOut=/n/netscratch/jharvard_lab/Lab/jharvard/finaladd5perc_v28/slurm-7000364.out

Of particular interest will be the Reason and StartTime fields. The Reason field will state why the job is pending, while the StartTime will give the current best estimate based on current cluster state as to when the job will start. Note that for job arrays this command will print out all elements, so it is best to specify which element you are interested in.
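When only one or two fields are of interest, the KEY=VALUE pairs in scontrol output can be filtered with standard tools. The helper below, job_field, is a hypothetical name and a rough sketch: it splits the output on spaces, so it is only suitable for fields whose values contain no spaces, such as Reason and StartTime. The sample text is copied from the output shown above so that the block runs without the scheduler.

```shell
#!/bin/bash
# Print the value of one KEY=VALUE field from scontrol output on standard input.
# Usage on the cluster: scontrol show job JOBID | job_field Reason
job_field() {
    tr ' ' '\n' | awk -F= -v key="$1" '$1 == key { sub(/^[^=]*=/, ""); print; exit }'
}

# Two lines taken from the example output above:
sample='JobState=PENDING Reason=Resources Dependency=(null)
StartTime=2026-04-22T00:20:00 EndTime=2026-04-22T04:20:00 Deadline=N/A'

echo "$sample" | job_field Reason
echo "$sample" | job_field StartTime
```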

See the Broader Queue

The showq command can be used to show what the rest of the partition looks like. Often your job is pending due to other people in the partition. The showq command then shows you an overview of all the jobs for a specific partition. showq is invoked by doing:

showq -o -p PARTITIONNAME

Where -o orders the pending queue by priority, with the next job to be scheduled at the top. -p specifies the partition that you want to look at.

The sinfo command is used to get the general state of nodes in a partition. Nodes can be in the following states:

IDLE Node is available for work.
MIXED Node is partially used.
ALLOCATED Node is fully used.
COMPLETING Node has jobs which are finishing up.
PLANNED Node will be used by a future job.
RESERVED Node is part of a Reservation.
DRAINING Node is closed to new jobs and existing jobs will run to completion.
DOWN Node is offline.

You can then use scontrol show node NODENAME to get information on a given node including why it may be DOWN or DRAINING.


Canceling Jobs

If, for any reason, you need to cancel a job that you’ve submitted, just use the scancel command with the job ID.

scancel JOBID

If you don’t keep track of the job ID returned from sbatch, you should be able to find it with the sacct command described above. scancel can also do bulk cancellations based on various parameters such as Job Name and Partition.
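A few bulk cancellations are sketched below. Because they can remove many jobs at once, the sketch prints each command instead of running it unless DRY_RUN is set to 0; the run wrapper and the job name my_job_name are hypothetical. The --user, --name, --partition, and --state options are standard scancel filters.

```shell
#!/bin/bash
# Print each command; execute it only when DRY_RUN is set to 0.
DRY_RUN=${DRY_RUN:-1}
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

me=$(id -un)   # your own username

run scancel --user="$me" --name=my_job_name   # all of your jobs with this job name
run scancel --user="$me" --partition=test     # all of your jobs in the test partition
run scancel --user="$me" --state=PENDING      # only your jobs that are still pending
```

Review the printed commands first, then rerun with DRY_RUN=0 to actually cancel the jobs.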


Interactive Jobs and salloc

Though batch submission is the best way to take full advantage of the compute power of the cluster, foreground/interactive jobs can also be run. These can be useful for things like:

  • Iterative data exploration at the command line
  • RAM intensive graphical applications like MATLAB or SAS
  • Interactive “console tools” like R and Jupyter
  • Significant software development and compiling efforts

There are two main types of interactive sessions: Graphical User Interface (GUI) and Command Line Interface (CLI). For graphical sessions FASRC provides Open OnDemand (OOD). With Open OnDemand a user can launch a job which will start a Remote Desktop on the cluster or some other application in OOD.

Command line interactive jobs are instead launched directly from the login nodes using salloc. Please note that salloc is disabled on FASSE due to security considerations; use FASSE OOD instead. salloc has all the same options as sbatch. To start an interactive session run:

salloc -p test -c 1 --mem=4G -t 0-6:00:00

This will ask for 1 core and 4GB of memory on the test partition for 6 hours. If you append a command to salloc, it will run that command and then exit (this includes /bin/bash, which will just exit); if you append no command, it will simply start a remote shell on the node the scheduler selects. Jobs submitted via salloc behave like normal jobs for the sake of scheduling, so salloc may hang for a while if the partition you select is busy. It is therefore wise to select a partition like test or gpu_test where you are guaranteed immediate access. If you intend to use a busy partition, we recommend switching to Open OnDemand Remote Desktop.

Command line interactive sessions require you to be active in the session. If you go more than an hour without any kind of input, it will assume that you have left the session and will terminate it. If you have interactive tasks that must stretch over days, we recommend switching to Open OnDemand Remote Desktop.


Software

Users are permitted to install whatever software is relevant to their research on the cluster, provided it complies with our Acceptable Use Policy. FASRC clusters run a unified Operating System (Rocky Linux 8) and system architecture (x86-64), so software built on one system should generally work on the entire cluster (unless built against a specific hardware type). Users are responsible for managing and maintaining their own software stack. Under no circumstances will a user be given sudo access to install software. See the software guide for more on how to use FASRC provided software modules, how to use Podman or Singularity containers, and how to install software of various types.


Using GPUs

To request a single GPU on Slurm, just add #SBATCH --gpus=1 to your submission script. For more on GPU computing see our more in-depth GPGPU Document.
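
As a hedged sketch, a complete single-GPU submission script might look like the following. The partition, core, memory, and time values are placeholders rather than recommendations, and the check relies on Slurm exporting CUDA_VISIBLE_DEVICES for jobs that were allocated GPUs.

```shell
#!/bin/bash
#SBATCH -p gpu_test          # Partition (placeholder; pick one you can use)
#SBATCH --gpus=1             # Request a single GPU
#SBATCH -c 4                 # CPU cores (placeholder)
#SBATCH --mem=16G            # Memory (placeholder)
#SBATCH -t 0-01:00           # Runtime in D-HH:MM (placeholder)

# Slurm sets CUDA_VISIBLE_DEVICES for jobs with GPUs; outside a job this
# falls back to "none" so the script can still be dry-run.
visible_gpus="${CUDA_VISIBLE_DEVICES:-none}"
echo "GPUs visible to this job: ${visible_gpus}"

# Show GPU details only where the NVIDIA tools are installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi || true
fi
```

If the job log reports "none", the job was not given a GPU and the request should be checked.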

Specifying GPU Type

For users who wish to specify which type of GPU to use, especially those using heterogeneous partitions like gpu_requeue, there are two methods. The first is --constraint="<tag>", which constrains the job to run only on GPUs of a certain class. A full listing of constraints can be found below. The second is to request the specific model you want using --gpus=<model>:1. For example, if you want an A100 with 80GB of onboard memory, specify --gpus=nvidia_a100-sxm4-80gb:1.

a100

  • nvidia_a100-sxm4-40gb: Nvidia A100 SXM4 40GB
  • nvidia_a100-sxm4-80gb: Nvidia A100 SXM4 80GB

h100 & h200

  • nvidia_h100_80gb_hbm3: Nvidia H100 80GB HBM3
  • nvidia_h200: Nvidia H200 140GB

mig

  • nvidia_a100_1g.5gb: Nvidia A100 1g MIG 5GB
  • nvidia_a100_1g.10gb: Nvidia A100 1g MIG 10GB
  • nvidia_a100_3g.20gb: Nvidia A100 3g MIG 20GB

v100

  • tesla_v100-pcie-16gb: Nvidia V100 PCIe 16GB
  • tesla_v100-pcie-32gb: Nvidia V100 PCIe 32GB
  • tesla_v100s-pcie-32gb: Nvidia V100S PCIe 32GB

a40

  • nvidia_a40: Nvidia A40 40GB

rtx

  • nvidia_rtx_a6000: Nvidia RTX A6000 PCIe 48GB

Some of the GPUs listed here were purchased by specific groups and are only available via gpu_requeue. To find out which specific types of GPUs are available on a partition, run scontrol show partition <PartitionName> and look under the TRES category.
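
The TRES line can be long, so it may help to filter it. The sketch below defines a hypothetical helper, list_gpu_types, that pulls the per-model GPU counts out of scontrol show partition output. The sample line is made up for illustration and assumes that typed GPUs appear as gres/gpu:<model>=<count>; the real output on the cluster may differ.

```shell
#!/bin/bash
# list_gpu_types (hypothetical helper): read `scontrol show partition` text
# on stdin and print one "<model>=<count>" line per typed GPU entry.
list_gpu_types() {
    grep -o 'gres/gpu:[^,= ]*=[0-9]*' | sed 's|^gres/gpu:||'
}

# Made-up sample of a TRES line, used here so the sketch runs without Slurm.
sample='   TRES=cpu=256,mem=4000000M,node=4,billing=1000,gres/gpu=16,gres/gpu:nvidia_a100-sxm4-40gb=8,gres/gpu:nvidia_a100-sxm4-80gb=8'

gpu_types="$(printf '%s\n' "$sample" | list_gpu_types)"
echo "$gpu_types"
```

On the cluster the same helper would be fed real output, for example: scontrol show partition gpu_requeue | list_gpu_types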


Parallelization

Using Threads such as OpenMP

One of the basic methods for parallelization is to use a threading library, such as pthreads, OpenMP, or applications that use OpenMP under the hood (e.g. numpy, OpenBLAS). Slurm by default does not know which cores to assign to which process it runs. In addition, for threaded applications you need to make sure that all the cores you request are on the same node. Below is an example script that both ensures all the cores are on the same node and lets Slurm know which process gets the cores that you requested for threading.

#!/bin/bash
#SBATCH -c 8 # Number of threads
#SBATCH -t 0-00:30:00 # Amount of time needed D-HH:MM:SS
#SBATCH -p sapphire # Partition to submit to
#SBATCH --mem-per-cpu=100 # Memory per cpu in MB
module load intel/25.3.1-fasrc01
srun -c $SLURM_CPUS_PER_TASK MYPROGRAM > output.txt 2> errors.txt

The most important aspect of the threaded script above is the -c option, which tells Slurm how many threads you intend to run with. If you are using OpenMP you will want to notify it of how many threads it can use by setting OMP_NUM_THREADS before the executable:

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
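
Putting the two pieces together, a sketch of the threaded script with the thread count wired in might look like this. The fallback to 1 thread is an addition for safety (Slurm only sets SLURM_CPUS_PER_TASK when -c is given), and MYPROGRAM remains a placeholder, so the launch line is left commented out.

```shell
#!/bin/bash
#SBATCH -c 8                  # Number of threads
#SBATCH -t 0-00:30:00         # Amount of time needed D-HH:MM:SS
#SBATCH -p sapphire           # Partition to submit to
#SBATCH --mem-per-cpu=100     # Memory per cpu in MB

# Match the OpenMP thread count to the cores Slurm allocated.
# Fall back to a single thread if the variable is not set.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "OpenMP will use ${OMP_NUM_THREADS} thread(s)"

# Launch line from the example above (MYPROGRAM is a placeholder):
# srun -c "$OMP_NUM_THREADS" MYPROGRAM > output.txt 2> errors.txt
```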

Using MPI

MPI (Message Passing Interface) is a standard that supports communication between separate processes, allowing parallel programs to simulate a large common memory space. OpenMPI, MPICH, and Intel MPI are available as modules on the cluster. As described in the module documentation, MPI libraries are a special class of module, called “Comp”, that is compiler dependent. To load an MPI library, load the compiler first.

module load intel/25.3.1-fasrc01 openmpi/5.0.10-fasrc01

Once an MPI module is loaded, applications built against that library are made available. This dynamic loading mechanism prevents conflicts that can arise between compiler versions and MPI library flavors.

An example MPI script with comments is shown below:

#!/bin/bash
#SBATCH -n 128 # Number of cores
#SBATCH -t 10 # Runtime in minutes
#SBATCH -p sapphire # Partition to submit to
#SBATCH --mem-per-cpu=100 # Memory per cpu in MB (see also --mem)
module load intel/25.3.1-fasrc01 openmpi/5.0.10-fasrc01
module load MYPROGRAM
srun -n $SLURM_NTASKS --mpi=pmix MYPROGRAM > output.txt 2> errors.txt

There are a number of important aspects to an MPI SLURM job.

  • Most partitions have a unified Infiniband fabric except for the requeue partitions. If you use the requeue partitions you will want to specify an IB fabric via the constraint option.
  • Memory should be allocated with the --mem-per-cpu option instead of --mem so that memory matches core utilization.
  • The -np option for mpirun or mpiexec (when these runners are used) should use the bash variable $SLURM_NTASKS so that the correct number of cores is passed to the MPI engine at runtime.
  • If network topology and communications overhead are a concern for your code, try the --contiguous option, which ensures that all the cores you get are adjacent to each other. Use this with caution, as it will make your job pend longer: finding contiguous blocks of compute is difficult. If you do not include this option you will be given cores on whatever nodes Slurm can find, which may be scattered across the cluster. Depending on your code this may or may not be a concern. If you don’t know offhand, test your code in both modes and verify that the boost in performance is worth the extra wait in the queue; the aggregate time of waiting plus runtime may be longer with --contiguous. The sbatch and srun documentation have more information on various fine-tuning options.
  • The application must be MPI-enabled. Applications cannot take advantage of MPI parallelization unless the source code is specifically built for it.
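
To make the --mem-per-cpu point concrete, the total memory the example MPI script requests is the task count times the per-core value. A small arithmetic sketch using the numbers from that script (the variable names are illustrative):

```shell
#!/bin/bash
ntasks=128            # from "#SBATCH -n 128"
mem_per_cpu_mb=100    # from "#SBATCH --mem-per-cpu=100"

# Total memory requested across the whole job, in MB.
total_mb=$(( ntasks * mem_per_cpu_mb ))
echo "Total request: ${total_mb} MB across ${ntasks} cores"
```

Because the request is per core, changing -n rescales the total automatically, whereas --mem is a per-node value and does not follow the core count.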

Job Arrays

SLURM allows you to submit a number of “near identical” jobs simultaneously in the form of a job array. To take advantage of this, you will need a set of jobs that differ only by an “index” of some kind.

For example, say that you would like to run tophat, a splice-aware transcript-to-genome mapping tool, on 30 separate transcript files named trans1.fq, trans2.fq, trans3.fq, etc. First, construct a SLURM batch script, called tophat.sh, using special SLURM job array variables:

#!/bin/bash
#SBATCH -J tophat # A single job name for the array
#SBATCH -c 1 # Number of cores
#SBATCH --array=1-30 # Array range
#SBATCH -p serial_requeue # Partition
#SBATCH --mem 4000 # Memory request in MB (4 GB)
#SBATCH -t 0-2:00 # Maximum execution time (D-HH:MM)
#SBATCH -o tophat_%A_%a.out # Standard output
#SBATCH -e tophat_%A_%a.err # Standard error

source activate tophat
tophat /n/netscratch/informatics_public/ref/ucsc/Mus_musculus/mm10/chromFa trans"${SLURM_ARRAY_TASK_ID}".fq

The --array flag sets the range of array indices to be run. Each array element is treated by the scheduler as an independent job for the sake of scheduling.

In the script, two types of substitution variables are available when running job arrays. The first type, %A and %a, represents the job ID and the job array index, respectively. These can be used in the sbatch parameters to generate unique names. The second, SLURM_ARRAY_TASK_ID, is a bash environment variable that contains the current array index and can be used in the script itself. In this example, 30 jobs will be submitted, each with a different input file and different standard error and standard output files. More detail can be found on the SLURM job array documentation page and our Submitting Large Numbers of Jobs page.
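
Before submitting a large array it can be useful to preview which input each index would pick up. The sketch below uses a hypothetical helper, input_for_task, that mirrors the trans<index>.fq naming from the example; it runs anywhere and does not contact the scheduler.

```shell
#!/bin/bash
# input_for_task (hypothetical helper): map an array index to its input file,
# following the trans1.fq, trans2.fq, ... naming used in the example.
input_for_task() {
    printf 'trans%s.fq\n' "$1"
}

# Preview a few indices from the 1-30 range without submitting anything.
for id in 1 2 30; do
    echo "array index ${id} -> $(input_for_task "$id")"
done

# Inside a real array element Slurm sets SLURM_ARRAY_TASK_ID;
# default to 1 here so the sketch also runs outside a job.
input_file="$(input_for_task "${SLURM_ARRAY_TASK_ID:-1}")"
```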


Checkpointing

Slurm does not automatically checkpoint, i.e. create files that your job can restart from. To protect against job failure (due to code error or node failure) and to allow your job to be broken up into smaller chunks, it is always advisable to checkpoint your code so it can restart from where it left off. This is especially valuable for jobs on partitions subject to requeue, but is also generally useful for any type of job. Checkpointing varies from code type to code type and needs to be implemented by the user as part of their code base. Some resources for checkpointing codes that do not have it built in include Distributed MultiThreaded CheckPointing (DMTCP) and Checkpoint/Restore in Userspace (CRIU).
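
As an illustration of the idea only (not a FASRC-provided mechanism), the sketch below records the last completed step in a file and resumes from the next one on restart. The file name, step count, and the run_steps function are all hypothetical; a real code would save its actual state rather than a step number.

```shell
#!/bin/bash
# Demo location only; a real job would use a persistent path (e.g. scratch).
work_dir="$(mktemp -d)"
checkpoint_file="${work_dir}/checkpoint.txt"
total_steps=5

run_steps() {
    # Resume after the last completed step recorded in the checkpoint, if any.
    step=1
    if [ -f "$checkpoint_file" ]; then
        step=$(( $(cat "$checkpoint_file") + 1 ))
    fi
    while [ "$step" -le "$total_steps" ]; do
        echo "running step ${step}"      # placeholder for the real work
        # Write to a temporary file, then rename, so an interrupted write
        # never leaves a half-written checkpoint behind.
        printf '%s\n' "$step" > "${checkpoint_file}.tmp"
        mv "${checkpoint_file}.tmp" "$checkpoint_file"
        step=$(( step + 1 ))
    done
}

# Simulate a job that was interrupted after step 2 and then restarted.
printf '2\n' > "$checkpoint_file"
run_steps
last_done="$(cat "$checkpoint_file")"
echo "last completed step: ${last_done}"
```

In this simulated restart only steps 3 through 5 are executed, which is the behavior you want after a requeue.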


Job dependencies

Many scientific computing tasks consist of serial processing steps. A genome assembly pipeline, for example, may require sequence quality trimming, assembly, and annotation steps that must occur in series. Launching each of these jobs without manual intervention can be done by repeatedly polling the controller with sacct until the State is COMPLETED. However, it’s much more efficient to let the SLURM controller handle this using the --dependency option.

[jharvard@boslogin01 examples]$ sbatch assemble_genome.sh
Submitted batch job 53013437
[jharvard@boslogin01 examples]$ sbatch --dependency=afterok:53013437 annotate_genome.sh
[jharvard@boslogin01 examples]$

When submitting a job, specify a combination of “dependency type” and job ID in the --dependency option. afterok is an example of a dependency type that will run the dependent job if the parent job completes successfully (state goes to COMPLETED). The full list of dependency types can be found on the SLURM doc site in the man page for sbatch. It is best not to create a chain of dependencies that is greater than 2-3 levels. Any more than that and the scheduler will become significantly slower. Dependencies should only be used if the resource requirements between each step are significantly different, or if you need to wait for an array to complete before you run a single job that processes all the array results. Be sure to think about whether you truly need dependencies or not.
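
Copying the job ID by hand is error-prone in scripts. A minimal sketch of capturing it automatically follows; job_id_from_sbatch is a hypothetical helper that parses the "Submitted batch job" confirmation line, and the sample line is the one from the session above so that the sketch runs without a scheduler.

```shell
#!/bin/bash
# job_id_from_sbatch (hypothetical helper): read sbatch's confirmation line
# on stdin and print the job ID, which is its last field.
job_id_from_sbatch() {
    awk '/^Submitted batch job/ { print $NF }'
}

# Stand-in for: sbatch assemble_genome.sh
assemble_id="$(echo 'Submitted batch job 53013437' | job_id_from_sbatch)"

# Build the dependency option for the follow-up job.
dependency_flag="--dependency=afterok:${assemble_id}"
echo "sbatch ${dependency_flag} annotate_genome.sh"
```

On the cluster the first step would be assemble_id="$(sbatch assemble_genome.sh | job_id_from_sbatch)". sbatch also has a --parsable option that prints the job ID directly, which avoids the parsing step.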


Job Constraints

Sometimes, especially on the requeue partitions, jobs need to be constrained to run on specific hardware. Many times this is due to either the code being compiled for a specific architecture or because the code runs more efficiently on a specific type of host. Slurm provides for this functionality via the --constraint option (see the sbatch documentation for usage details). The features for constraint are defined by FASRC and fall into three broad categories: Processor, GPU, and Network. You can match against several of these, but keep in mind that the more constraints you use, the longer your job will pend, as the scheduler will find it more difficult to find nodes that fit your needs. A list of the features available on the cluster follows. You can also see the features for a specific node by running scontrol show node NODENAME.

Processor

  • amd: All AMD processors
  • intel: All Intel processors
  • avx: All processors that are AVX capable
  • avx2: All processors that are AVX2 capable
  • avx512: All processors that are AVX512 capable
  • milan: AMD Milan chips
  • genoa: AMD Genoa chips
  • skylake: Intel Skylake chips
  • sapphirerapids: Intel Sapphire Rapids
  • cascadelake: Intel Cascade Lake chips
  • icelake: Intel Ice Lake chips

GPU

To specify a GPU model (for example, an A100 with 80GB), refer to Specifying GPU Type.

  • rtxa6000: Nvidia RTX A6000 GPU
  • a40: Nvidia A40 GPU
  • v100: Nvidia V100 GPU
  • a100: Nvidia A100 GPU
  • a100-mig: Nvidia A100 GPU MIG
  • h100: Nvidia H100 GPU
  • h200: Nvidia H200 GPU

Network

  • holyhdr: Holyoke HDR Infiniband Fabric
  • holyndr: Holyoke NDR Infiniband Fabric
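
To match several features at once, Slurm's --constraint option accepts & as an AND operator between feature names (and | for OR). The sketch below builds such a string with a hypothetical helper, join_features. The particular combination shown is only illustrative, and there is no guarantee that nodes matching all three exist on a given partition.

```shell
#!/bin/bash
# join_features (hypothetical helper): join feature names with '&',
# the AND operator understood by --constraint.
join_features() {
    joined=""
    for feature in "$@"; do
        if [ -z "$joined" ]; then
            joined="$feature"
        else
            joined="${joined}&${feature}"
        fi
    done
    printf '%s\n' "$joined"
}

constraint="$(join_features intel avx512 holyhdr)"

# Print the directive you would place in a submission script.
echo "#SBATCH --constraint=\"${constraint}\""
```

Quoting the value matters on the command line, since an unquoted & would be interpreted by the shell.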

Fairshare and Job Prioritization

We use a multifactor method of job scheduling on the cluster. Job priority is assigned by a combination of fair-share and length of time a job has been sitting in the queue. You can find out the priority calculation for your jobs by using the sprio command, such as sprio -j JOBID.

Fairshare is shared on a lab basis, so usage by any member of the lab will impact the score of the whole lab, as the lab is pulling from a common pool. Fairshare has a 3 day half-life and naturally recovers if your lab does not run any jobs. Thus it is wise to store up fairshare if you need to do significant runs, and to plan your runs accordingly in order to maintain a good fairshare score. You can learn more about your fairshare score and Slurm usage by using the sshare command, such as sshare -U, which shows your current score.
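
To get a feel for what a 3 day half-life means, the sketch below computes how much weight a block of past usage still carries after a given number of idle days. It assumes plain exponential decay of usage, which is how a half-life is normally defined; the function name remaining_fraction is illustrative and this is not the scheduler's full fairshare formula.

```shell
#!/bin/bash
# remaining_fraction (illustrative): fraction of past usage that still counts
# after the given number of days, assuming a 3 day half-life.
remaining_fraction() {
    LC_ALL=C awk -v days="$1" -v half_life=3 \
        'BEGIN { printf "%.3f\n", exp(-log(2) * days / half_life) }'
}

for days in 0 3 6 9; do
    echo "after ${days} days: $(remaining_fraction "$days") of the usage remains"
done
```

Under this assumption, usage counts half as much after 3 days and one eighth as much after 9 days.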

The other factor in priority is how long you have been sitting in the queue. The longer your job sits in the queue the higher its priority grows, out to a maximum of 3 days. If everyone’s priority is equal then FIFO (first in first out) is the scheduling method. We weight the age of a job that has pended for 3 days to be equal to a fairshare score of 0.1.
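
As a rough worked example of that weighting, the sketch below converts a job's time in the queue into the fairshare score it is equivalent to. It assumes the age contribution grows linearly up to the 3 day cap, which is how Slurm's multifactor age factor generally behaves but is not stated explicitly here; the function name age_as_fairshare is illustrative.

```shell
#!/bin/bash
# age_as_fairshare (illustrative): fairshare-equivalent value of a job's
# queue age, assuming linear growth capped at 3 days = 0.1.
age_as_fairshare() {
    LC_ALL=C awk -v days="$1" -v max_days=3 -v max_equiv=0.1 \
        'BEGIN {
            if (days > max_days) days = max_days
            printf "%.3f\n", max_equiv * days / max_days
        }'
}

for days in 1.5 3 5; do
    echo "pending ${days} days ~ fairshare $(age_as_fairshare "$days")"
done
```

Under these assumptions a job that has waited a day and a half has gained the equivalent of 0.05 in fairshare, and waiting beyond 3 days adds nothing further.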

We also have backfill turned on. This allows smaller jobs to sneak in while a larger, higher priority job is waiting for nodes to free up. If your job can run in the amount of time it takes for the other job to get all the nodes it needs, SLURM will schedule you to run during that period. This means knowing how long your code will run for is very important, and the run time must be declared if you wish to leverage this feature. Otherwise the scheduler will just assume you will use the maximum allowed time for the partition when you run. The better you constrain your job in terms of CPU, Memory, and Time, the easier it will be for the backfill scheduler to find you space and let your job jump ahead in the queue.
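
One way to arrive at a tight but safe time request is to take a measured run time and add a margin. The sketch below does this and prints the result in the D-HH:MM form accepted by -t. The helper name time_limit_for and the 20% margin are assumptions for illustration, not FASRC guidance.

```shell
#!/bin/bash
# time_limit_for (illustrative): $1 = measured runtime in minutes,
# $2 = safety margin in percent. Prints a D-HH:MM time limit, rounded up.
time_limit_for() {
    total=$(( ($1 * (100 + $2) + 99) / 100 ))
    days=$(( total / 1440 ))
    hours=$(( (total % 1440) / 60 ))
    minutes=$(( total % 60 ))
    printf '%d-%02d:%02d\n' "$days" "$hours" "$minutes"
}

# A job measured at 95 minutes, with a 20% margin.
limit="$(time_limit_for 95 20)"
echo "#SBATCH -t ${limit}"
```

For the 95 minute job this yields a request of 0-01:54, far easier for backfill to place than the partition maximum.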


Troubleshooting

A variety of problems can arise when running jobs on the cluster. Many are related to resource misallocation, but there are other common problems as well.

Error: JOB <jobid> CANCELLED AT <time> DUE TO TIME LIMIT
Likely cause: You did not specify enough time in your batch submission script. The -t option sets time in minutes or can also take D-HH:MM form (0-12:30 for 12.5 hours).

Error: Job <jobid> exceeded <mem> memory limit, being killed
Likely cause: Your job is attempting to use more memory than you’ve requested for it. Either increase the amount of memory requested by --mem or --mem-per-cpu or, if possible, reduce the amount your application is trying to use. For example, many Java programs set heap space using the -Xmx JVM option. This could potentially be reduced. For jobs that require truly large amounts of memory (>1 TB), you may need to use the bigmem SLURM partition. Genome and transcript assembly tools are commonly in this camp.

Error: SLURM_receive_msg: Socket timed out on send/recv operation
Likely cause: This message indicates a failure of the SLURM controller. Though there are many possible explanations, it is generally due to an overwhelming number of jobs being submitted, or, occasionally, finishing simultaneously. If you want to figure out if SLURM is working use the sdiag command. sdiag should respond quickly in these situations and give you an idea as to what the scheduler is up to.

Error: JOB <jobid> CANCELLED AT <time> DUE TO NODE FAILURE
Likely cause: This message may arise for a variety of reasons, but it typically indicates that the host on which your job was running can no longer be contacted by SLURM. Jobs that die from NODE_FAILURE are automatically requeued by the scheduler.
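
For the memory limit error with Java programs, one hedged approach is to derive the -Xmx value from the job's own memory request instead of hard-coding it. The sketch below assumes the job was submitted with --mem, so that Slurm exports SLURM_MEM_PER_NODE in MB. The 80% share and the 4000 MB fallback are illustrative choices, and MYPROGRAM.jar is a placeholder.

```shell
#!/bin/bash
# Memory granted to the job in MB (set by Slurm when --mem is used).
# Fall back to 4000 MB so the sketch can be dry-run outside a job.
mem_mb="${SLURM_MEM_PER_NODE:-4000}"

# Leave headroom for the JVM's non-heap memory: give the heap 80%.
heap_mb=$(( mem_mb * 80 / 100 ))

# Print the launch line (MYPROGRAM.jar is a placeholder).
echo "java -Xmx${heap_mb}m -jar MYPROGRAM.jar"
```

This keeps the heap size and the Slurm request in step when you later raise or lower --mem.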

 

]]>