fairshare – FASRC DOCS https://docs.rc.fas.harvard.edu Tue, 31 Mar 2026 15:32:33 +0000

Slurm Stats https://docs.rc.fas.harvard.edu/kb/slurm-stats/ Tue, 27 Aug 2024 15:39:58 +0000

Overview

When you log on to the FASRC clusters you will be greeted by Slurm Stats. On a nightly basis we pull data from the scheduler for the day and display a summary for you in an easy-to-read table when you log in to the cluster. This should help you understand how your jobs are performing and track your usage on a daily basis. Below is a description of the statistics we provide, along with recommendations of where to go to get more information or to improve your performance.

The Statistics

+---------------- Slurm Stats for Aug 20 -----------------------+
|                  End of Day Fairshare                         |
|                    test_lab: 0.003943                         |
+-------------------- Jobs By State ----------------------------+
|       Total | Completed | Canceled | Failed | Out of |  Timed |
|             |           |          |        | Memory |    Out |
| CPU:     25 |         4 |        1 |     20 |      0 |      0 |
| GPU:     98 |        96 |        1 |      1 |      0 |      0 |
+---------------------- Job Stats ------------------------------+
|        | Average | Average   | Average    | Total Usage /     |
|        | Used    | Allocated | Efficiency | Ave. Wait Time    |
| Cores  |     4.3 |       5.5 |      69.4% |    133.00 CPU Hrs |
| Memory |   22.2G |     27.2G |      68.3% |                   |
| GPUS   |     0.5 |       1.0 |      51.4% |    100.20 GPU Hrs |
| Time   |  14.57h |    45.38h |      45.9% |             0.00h |
+---------------------------------------------------------------+

Above is what you will see when you log in to the cluster if you have run jobs in the last day.  This data is pulled from the scheduler and is for jobs that finished in the 24-hour day listed. If you would like similar summary information for a longer period of time, use the seff-account command. For instance, if you wanted the data for the last week you would do:

seff-account -u USERNAME -S 2024-08-13 -E 2024-08-20

For more detailed information on specific jobs you can use the jobstats and sacct commands. If you want summary plots of various statistics please see our XDMod instance (requires RC VPN). For fairshare usage plots see our Cannon and FASSE Fairshare Dashboards (requires RC VPN).  Below we will describe the various fields and what they mean.

Fairshare

The first thing listed is the fairshare for the lab accounts that you belong to. This is as of the end of the day indicated. Lower fairshare means lower priority for your jobs on the cluster.  For more on fairshare and how to improve your score see our comprehensive fairshare document.

Job State

If you have jobs that finished in the day indicated, then a breakdown of their end states is presented. Jobs are sorted first by whether or not they asked for a GPU.  Next the total number of jobs in that category is given, followed by a breakdown by state. Completed jobs are those that finished cleanly with no errors that Slurm could detect (there may still be errors that your code has generated internally). Canceled jobs are those jobs which were terminated via the scancel command either by yourself or the administrator. Failed jobs are those jobs that the scheduler has detected as having a faulty exit. Out of Memory jobs are those that hit the requested memory limit set in the job script. Timed Out jobs are those that hit the requested time limit set in the job script.

Used, Allocated, and Efficiency

For all the jobs that were not Canceled, we calculate statistics averaged over all the jobs run. These are broken down by Cores, Memory, GPUs, and Time. Average Used is the average amount actually used by the job. Average Allocated is the average amount of resources allocated by the job script for the job. Average Efficiency is the ratio of the amount of resource Used by the job to the amount of resources Allocated per job, averaged over all the jobs. In an ideal world your jobs should use exactly, or as close as possible to, the resources they request and hence have an Average Efficiency of 100%. In practice, some jobs use all the resources they request, others do not.  Having unused resources that you have allocated means that your code is not utilizing all the space you’ve set aside for it. This wasted space ends up driving down your fairshare, as cores, memory, and GPUs you do not use are still charged against your fairshare.
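Because Average Efficiency is averaged per job, it will generally not equal Average Used divided by Average Allocated. The following is a minimal sketch of that difference; the job records are made up for illustration, and the sketch assumes (per the description above) that efficiency is the mean of each job's Used/Allocated ratio.

```python
# Hypothetical per-job (cores_used, cores_allocated) pairs; not real cluster data.
jobs = [(1.0, 4.0), (8.0, 8.0), (2.0, 4.0)]


def average_efficiency(jobs):
    """Mean of the per-job used/allocated ratios (assumed definition)."""
    return sum(used / alloc for used, alloc in jobs) / len(jobs)


def ratio_of_averages(jobs):
    """Average Used divided by Average Allocated, for comparison."""
    avg_used = sum(used for used, _ in jobs) / len(jobs)
    avg_alloc = sum(alloc for _, alloc in jobs) / len(jobs)
    return avg_used / avg_alloc


print(f"Average Efficiency: {average_efficiency(jobs):.1%}")
print(f"Average Used / Average Allocated: {ratio_of_averages(jobs):.1%}")
```

In the sample table above, 4.3 / 5.5 is about 78% while the reported Average Efficiency for Cores is 69.4%; a gap of this kind is what one would expect if efficiency is averaged per job.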

To learn more about which jobs are the culprits, we recommend using tools like seff-account, jobstats, and sacct. These tools can give you an overview of your jobs and more detailed information about specific jobs.  We also have an in-depth guide to Job Efficiency and Optimization which covers techniques for improving your efficiency.

Total Usage

Total usage is the total number of hours allocated for CPUs and GPUs respectively. This is a measure of your total usage of the jobs that finished on the day indicated. Note that this is the total usage for a job, so a job that ran for multiple days will have all its usage show up at once in this number and not just its usage for that day only. This usage is also not weighted by the type of CPU or GPU requested which can impact how much fairshare the usage would cost. For more on how we handle usage and fairshare, see our general fairshare document.

Wait Time

The number in the lower right-hand corner of the Job Stats table, in the Time row, is your average wait time per job. This is a useful number as your total Time to Science (TtS) is your wait time (aka pending time) plus your run time. Wait time varies depending on partition used, size of job, and relative priority of your jobs versus other jobs in the queue. To lower wait time, investigate using a different partition, submitting to multiple partitions, resizing your job, or improving your fairshare. A deeper discussion can be found in the Job Efficiency and Optimization page.

Fairshare and Job Accounting https://docs.rc.fas.harvard.edu/kb/fairshare/ Wed, 16 Oct 2019 14:49:19 +0000

Summary

In order to ensure that all research labs get their fair share of the cluster and to account for differences in hardware being used, we utilize Slurm’s built-in job accounting and fairshare system. Every lab has a base Share of the community-wide system, which is governed by the Gratis Share purchased by the Faculty of Arts and Sciences and distributed equally to all labs. In addition, Shares purchased by individual labs by buying hardware are added to their base Share. The Fairshare score of a lab is then calculated based off of their Share versus the amount of the cluster they have actually used. This Fairshare score is then utilized to assign priority to their jobs relative to other users on the cluster. This keeps individual labs from monopolizing the resources, which would be unfair to labs that have not used their fair share for quite some time. Currently, we account for the fraction of the compute node used with CPU, GPU, and Memory usage using Slurm’s Trackable RESources (TRES).

What is Fairshare?

Fairshare is a portmanteau that pretty much expresses what it is. Essentially fairshare is a way of ensuring that users get their appropriate portion of a system. Sadly this term is also used confusingly for different parts of fairshare. This includes what fraction of the system users get, the score that the system assigns to users based off of their usage, and the priority that users are assigned based off of their usage. For the sake of the discussion below, we will use the following terms. Share is the portion of the system users have been granted. Usage is the amount of the system users have actually used. Fairshare score is the value the system calculates based off of the user’s usage. Priority score is the priority assigned based off of the user’s fairshare score.

While Fairshare may seem complex and confusing, it is actually quite logical once you think about it. The scheduler needs some way to adjudicate who gets what resources. Different groups on the cluster have been granted different resources for various reasons. In order to serve the great variety of groups and needs on the cluster a method of fairly adjudicating job priority is required. This is the goal of Fairshare. Fairshare allows those users who have not fully used their resource grant to get higher priority for their jobs on the cluster, while making sure that those groups that have used more than their resource grant do not overuse the cluster. The cluster is a limited resource and Fairshare allows us to ensure everyone gets a fair opportunity to use it regardless of how big or small the group is.

Trackable RESources (TRES)

Slurm Trackable RESources (TRES) allows the scheduler to charge back users for how much they have used different features on the cluster. This is important as the usage of the cluster factors into the Fairshare calculation. These TRES charge backs vary from partition to partition. You can see what the TRES charge back is by running scontrol show partition <partitionname> and looking at the TRESBillingWeights category.

On Cannon we set TRES for CPU, GPU, and Memory usage. For most partitions we charge back for CPUs and GPUs based off of the type being used. We normalize TRES to 1.0 for Intel Cascade Lake chips. For other chips we calculate the TRES by taking the theoretical peak Floating Point OPerations (FLOPs) for a single core of that CPU (or entire GPU) and dividing it by the theoretical peak for the Intel Cascade Lake chips. With this weighting we end up with the following TRES per core (or per GPU):

Processor Type          TRES
Intel Skylake           0.5
AMD Milan               0.5
AMD Genoa               0.6
Intel Sapphire Rapids   0.6
Intel Cascade Lake      1.0
Intel Ice Lake          1.15
Nvidia A40              10
Nvidia V100             75
Nvidia A100             209.1
Nvidia H100             546.9
Nvidia H200             546.9

It may seem to be a penalty to charge more for the Cascade Lake than the Sapphire Rapids, but it really is not in the end. The reason is that, per the weights above, a Sapphire Rapids core is rated roughly 40% slower than a Cascade Lake core, so the same job will take correspondingly longer to run there. Thus the actual charge back to the user should be the same on a per job basis; it’s just a question of picking the right resource for the job you are running.

In the case of memory we set the TRES based off of the following formula: NumCore*CoreTRES/TotalMem, where NumCore is the number of cores per node, CoreTRES is the TRES score for that type of core, and TotalMem is the total available memory for the node. The reason we weight memory like this is that if a user uses up all the memory on the node, the scheduler cannot schedule another job on that node even if there are available cores. The opposite is also true: if all the cores are used up, the scheduler cannot schedule another job there even if there is free memory. Thus memory and CPU are exhaustible resources that impact each other. The above weighting allows us to ensure that memory costs the same as the CPUs on a given node. For instance, let’s say you have a node that has 128 GB of RAM and 32 Intel Cascade Lake cores. In this case every 4 GB of RAM used should be equivalent to a single core being used. Thus we should charge a TRES of 1.0 for 4 GB used, or 0.25 for every GB used. In the case of an Intel Sapphire Rapids node with 32 cores and 128 GB of RAM, you have the same scenario but now the Sapphire Rapids chips are worth 40% less, thus the memory is also worth 40% less, so it is 0.15 for every GB used.
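The memory weighting above can be checked with a short sketch. The function name is illustrative and not part of any FASRC tool; the node sizes and core weights are the ones used in the two examples in the text.

```python
def mem_tres_per_gb(num_cores, core_tres, total_mem_gb):
    """Memory TRES weight per GB: NumCore * CoreTRES / TotalMem."""
    return num_cores * core_tres / total_mem_gb


# 32-core, 128 GB node examples from the text.
cascade_lake = mem_tres_per_gb(num_cores=32, core_tres=1.0, total_mem_gb=128)
sapphire_rapids = mem_tres_per_gb(num_cores=32, core_tres=0.6, total_mem_gb=128)

print(f"Cascade Lake memory TRES per GB:    {cascade_lake:.3f}")
print(f"Sapphire Rapids memory TRES per GB: {sapphire_rapids:.3f}")
```

With these weights, using all the memory on a node costs the same as using all of its cores.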

There are two exceptions to the above TRES rules: the requeue partitions, such as serial_requeue and gpu_requeue, and the test partitions. Since jobs in the requeue partitions can be interrupted by higher priority jobs at any time, there could be a loss of computation time. This is especially true for jobs that are not able to snapshot their progress and restart from where they left off. Studies have shown that to make this type of model break even in terms of cost you need to charge back roughly half of what you normally would. So for the requeue partitions we charge a flat rate of 0.5 for CPU, 104.6 for GPU, and 0.125 per GB for Memory. Since the requeue partitions contain all our hardware, users can get access to normally very high cost CPUs and GPUs for cheaper. Thus if a user needs to run a lot of jobs, the best way to optimize throughput and usage is to build their jobs to leverage the cheap resources in the requeue partitions. One should be aware though that the available cores in these partitions vary wildly depending on how active any given primary partition is.

The other exception is the test partitions, such as test and gpu_test. These partitions are exempted from normal fairshare accounting. This allows users to use these partitions for interactive work, code development, and workflow testing prior to running on the production partitions without fear of exhausting their allocation.

To calculate the amount of TRES usage for a job one would calculate this equation:

Usage = Runtime * (CoreTRES*CoreAlloc + MemTRES*MemAlloc + GPUTRES*GPUAlloc)

Where Runtime is the amount of time the job runs for, Core/Mem/GPUTRES are the TRES weights, and Core/Mem/GPUAlloc are how many resources were allocated. The scalc calculator also has an option for computing the expected usage for a job.
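As a worked illustration of the equation above, here is a minimal sketch. The function and the example job are hypothetical; it assumes Runtime is given in seconds so that the result is in TRES-sec, and it reuses the 0.25 per GB memory weight from the 32-core, 128 GB Cascade Lake example.

```python
def tres_usage(runtime_sec, core_tres, cores, mem_tres, mem_gb,
               gpu_tres=0.0, gpus=0):
    """Usage = Runtime * (CoreTRES*CoreAlloc + MemTRES*MemAlloc + GPUTRES*GPUAlloc)."""
    return runtime_sec * (core_tres * cores + mem_tres * mem_gb + gpu_tres * gpus)


# Hypothetical job: 4 Cascade Lake cores and 16 GB of memory for one hour, no GPUs.
cpu_job = tres_usage(3600, core_tres=1.0, cores=4, mem_tres=0.25, mem_gb=16)

# The same job with one A100 (weight 209.1 from the table) added.
gpu_job = tres_usage(3600, core_tres=1.0, cores=4, mem_tres=0.25, mem_gb=16,
                     gpu_tres=209.1, gpus=1)

print(f"CPU job usage: {cpu_job:.0f} TRES-sec")
print(f"GPU job usage: {gpu_job:.0f} TRES-sec")
```

Note that the 16 GB of memory costs as much as the 4 cores here, which is why trimming memory requests matters. For real estimates, the scalc calculator mentioned above is the tool to use.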

Shares

On Cannon each user is associated with their primary group. This lab group is what is called an Account in Slurm. Users belong to Accounts, and Accounts have Shares granted to them. These Shares determine how much of the cluster that group has been granted. Users when they run are charged back for their runs against the Account (i.e. lab) they belong to.

Shares granted an Account come in three types that are summed together. The first type is the Gratis Share. This Gratis Share is the Share given to all labs that are part of the cluster owing to the investment that Research Computing, via the Faculty of Arts and Sciences, has made in Cannon. This Gratis Share is calculated by summing the CPU and GPU TRES for all the nodes in the public partitions, excepting the requeue partitions, and then dividing by the total number of Accounts on Cannon. Thus the Gratis Share roughly corresponds to the number of cores each group has been granted. Currently the Gratis Share is set to 250 for Cannon and 100 for FASSE.

The second type of Share is Lab Share. This Share is the Share given to those Labs who have purchased hardware for their own lab. The CPU and GPU TRES from that purchased hardware is summed and added to the Gratis Share for that Lab’s Account.

The third type of Share is Communal Partition Share. This Communal Partition Share is the Share given to labs who have gone in with other labs and have purchased hardware to be used in common by the group of labs (e.g. a partition for the entire department, or for a school, or a collaboration of labs). In these cases the CPU and GPU TRES is summed and then divided amongst the labs, per their discretion, and added to the Lab’s Account.

Thus the total Share an Account has is simply the addition of all of these types of Share. This Share is global to the whole cluster. So whether the Lab is running on their own dedicated partitions or on the public partitions, their Share is the same. The Share is simply the portion of the entire system they have been granted, and can be moved around as needed by the Lab to any of the resources available to them on the cluster.

Fairshare Score

Probably the easiest way to walk through how a Lab’s Fairshare Score is calculated is to explain what the Slurm tool sshare displays. This tool shows you all the components of your Fairshare calculation. Here is an example:

[root@holyitc01 ~]# sshare --account=test_lab -a
Account   User   RawShares  NormShares  RawUsage  EffectvUsage  FairShare
--------  -----  ---------  ----------  --------  ------------  ---------
test_lab         244        0.001363    45566082  0.000572      0.747627
test_lab  user1  parent     0.001363    8202875   0.000572      0.747627
test_lab  user2  parent     0.001363    248820    0.000572      0.747627
test_lab  user3  parent     0.001363    163318    0.000572      0.747627
test_lab  user4  parent     0.001363    18901027  0.000572      0.747627
test_lab  user5  parent     0.001363    18050039  0.000572      0.747627

The Account we are looking at is test_lab. The first line of the sshare output shows the summary for the whole lab, while the subsequent lines show the information for each user. The test_lab has been granted 244 RawShares. Each user of that lab has a RawShare of parent, this means that all the users pull from the total Share of the Account and do not have their own individual subShares of the Account Share. Thus all users in this lab have full access to the full Share of the Account.

The next column after RawShares is NormShares. NormShares is simply the Account’s RawShares divided by the total number of RawShares given out to all Accounts on the cluster. Essentially NormShares is the fraction of the cluster the Account has been granted, in this case about 0.136%. Given the way we give out RawShares on Cannon, the total number of RawShares should be equivalent to the number of CPU TRES on Cannon, so this Account’s grant of 244 RawShares corresponds to roughly 244 Cascade Lake cores.

Following NormShares we have RawUsage. RawUsage is the amount of TRES-sec the Account/User has used. Thus if a user used a single Cascade Lake core for one second, the user’s account would be charged 1 TRES-sec in RawUsage. This RawUsage is also attenuated by the half-life that is set for the cluster, which is currently 3 days. Thus work done just now counts at full cost, work done 3 days ago costs half, work done 6 days ago one fourth, and so on. So RawUsage is the aggregate of the Account’s past usage with this half-life weighting factor. The RawUsage for the Account is the sum of the RawUsage for each user, thus sshare is an effective way to figure out which users have contributed the most to the Account’s score.

A quick aside, it should be noted that RawUsage is the sum of all usage including: failed jobs, jobs that are requeued, jobs that ran on nodes that failed, etc.  That usage is still counted as part of RawUsage.  The reason for this is that it is up to the user to effectively use the time and resources allocated by the scheduler even if that time is cut short for various reasons.  We highly recommend users test and verify their codes before running.  Users should also ensure their code has checkpointing enabled so that jobs can restart from where they left off in case of node failure.  These steps will minimize the effect of various failures on a user’s usage.

The next column is EffectvUsage. EffectvUsage is the Account’s RawUsage divided by the total RawUsage for the cluster. Thus EffectvUsage is the percentage of the cluster the Account has actually used. In this case, the test_lab has used 0.057% of the cluster.

Finally, we have the Fairshare score. The Fairshare score is calculated using the following formula:

f = 2^(-EffectvUsage/NormShares)

From this one can see that there are five basic regimes for this score, which are as follows:

1.0: Unused. The Account has not run any jobs recently.

1.0 > f > 0.5: Underutilization. The Account is underutilizing their granted Share. For example, when f≈0.71 a lab has recently used half of their Share of the resources (1:2).

0.5: Average utilization. The Account on average is using exactly as much as their granted Share.

0.5 > f > 0: Over-utilization. The Account has overused their granted Share. For example, when f=0.25 a lab has recently overutilized their Share of the resources 2:1.

0: No share left. The Account has vastly overused their granted Share. If there is no contention for resources, the jobs will still start.
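The formula and regimes above can be checked numerically with a minimal sketch. The function name is illustrative; the inputs are the NormShares and EffectvUsage values from the sshare example, which are rounded in that output, so the result only approximately matches the FairShare column.

```python
def fairshare_score(effective_usage, norm_shares):
    """f = 2^(-EffectvUsage/NormShares)."""
    return 2.0 ** (-effective_usage / norm_shares)


# Values for test_lab from the sshare output above.
norm_shares = 0.001363
f = fairshare_score(0.000572, norm_shares)
print(f"test_lab fairshare: {f:.4f}")

# Regime boundaries: no usage, usage equal to Share, usage twice the Share.
print(fairshare_score(0.0, norm_shares))
print(fairshare_score(norm_shares, norm_shares))
print(fairshare_score(2 * norm_shares, norm_shares))
```

The first value comes out close to the 0.747627 reported by sshare, and the three boundary cases give 1.0, 0.5, and 0.25 respectively.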

Since the usage of the cluster varies, the scheduler does not stop Accounts from using more than their granted Share. Instead, the scheduler wants to fill idle cycles, so it will take whatever jobs it has available. Thus an Account is essentially borrowing computing resource time in the future to use now. This will continue to drive down the Account’s Fairshare score, but allow jobs for the Account to still start. Eventually, another Account with a higher Fairshare score will start submitting jobs and that lab’s jobs will have a higher priority because they have not used their granted Share. Fairshare only recovers as a lab reduces the workload to allow other Accounts to run. The half-life helps to expedite this recovery.

Given this behavior of Fairshare, Accounts can also bank time for large computations that are beyond their average Share. For instance, say the Lab knows it has a large parallel run to do, or alternatively a deadline to meet. The Lab can, in preparation for this, not run for several weeks. This will drive up their Fairshare as they will not have used their fraction of the cluster for that time period. This banked capacity can then be expended for a large run or series of runs. On the other hand, to continue the financial analogy, a group that has exhausted their Fairshare is in debt to the scheduler as they have used up far more than their granted Share. Thus they have to wait for that debt to be paid off by not running, which allows their Fairshare to recover. Again, when there is no contention for resources, even jobs with low Fairshare scores will continue to start.

Job Priority

Now that we have discussed Fairshare, we can discuss how an individual job’s priority is calculated. Job Priority is an integer number that adjudicates the position of a job in the pending queue relative to other jobs. There are two components of Job Priority on Cannon. The first is the Fairshare score multiplied by a weighting factor to turn it into an integer, in this case 10,000,000. A Fairshare of 1 would give a priority of 10,000,000, while a Fairshare of 0.5 would give a value of 5,000,000. We pick large numbers so we have resolution to break ties between Accounts that are close in Fairshare score. This Fairshare Priority evolves dynamically as the Fairshare of the Account changes over time.

The second component is Job Age. This priority accrues over time, reaching its maximum value at 3 days on Cannon and 7 days on FASSE. As the job sits in the queue waiting to be scheduled, its priority gradually increases due to the Job Age. The maximum possible value for Job Age is 1,000,000. Thus a job that has been sitting for 1.5 days on Cannon would have a Job Age Priority of 500,000. We set the Job Age Priority to a maximum of 1,000,000 so that a job from an Account with a Fairshare of 0 that has been pending for 3 days on Cannon would have the same priority as a job that was just submitted from an Account that has a Fairshare of 0.1. Thus even jobs from Accounts that have low Fairshare will schedule eventually due to the growth in their Job Age Priority.

These two components are summed together to make up an individual Job’s Priority. You can see this calculation for specific jobs by using the sprio command. In addition you can see the Pending queue of a specific partition ordered by job priority by using showq -o -p <partitionname>.
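The sum of the two components can be sketched as follows. This is an illustration only, not the scheduler's implementation: the function and constant names are made up, and the sketch assumes Job Age Priority grows linearly up to its maximum, which is consistent with the 1.5 day example above. Use sprio for the actual values of real jobs.

```python
FAIRSHARE_WEIGHT = 10_000_000  # weighting factor for the Fairshare score
AGE_WEIGHT = 1_000_000         # maximum Job Age Priority


def job_priority(fairshare, age_days, max_age_days=3.0):
    """Fairshare Priority plus Job Age Priority (assumed linear in age)."""
    fairshare_part = fairshare * FAIRSHARE_WEIGHT
    age_part = min(age_days / max_age_days, 1.0) * AGE_WEIGHT
    return round(fairshare_part + age_part)


# A job pending 3 days with Fairshare 0 ties a fresh job with Fairshare 0.1.
print(job_priority(fairshare=0.0, age_days=3.0))
print(job_priority(fairshare=0.1, age_days=0.0))

# On FASSE the age component takes 7 days to reach its maximum.
print(job_priority(fairshare=0.5, age_days=3.5, max_age_days=7.0))
```

The first two calls both give 1,000,000, matching the tie described above.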

Nice

Slurm provides a way for users to adjust their own priority by defining a nice value.  Similar to the unix nice command, this flag allows users to deprioritize certain jobs.  Jobs that are deprioritized should have higher nice values than those that are more important.  Values for nice can run between 0 and 2147483645; negative values are not allowed.

Multiple Accounts

While most users are fine with having one Account they are associated with, some users do work for multiple Accounts. Slurm does have the ability to associate users with multiple Accounts, which allows users to charge back individual jobs to individual Accounts. Contact Research Computing if you are interested in this feature.

Historic Data

Research Computing keeps track of historic data for usage and Fairshare score. You can see your historic usage by going to the Cannon and FASSE Lab Fairshare pages and selecting the lab you belong to (note: you must be on the FASRC VPN to see it).

scalc

scalc is a calculator available on the cluster for figuring out various questions about fairshare. It includes a calculator for projecting a new Fairshare score based on a new RawShare, a calculator for figuring out how long it will take to restore fairshare, and a calculator for figuring out how much a set of jobs will cost in terms of cluster utilization and fairshare. When asked to enter an account name, please enter your lab group name (e.g. jharvard_lab). If you have additional calculations that you would like to see, contact us.

stotal

stotal is a tool which calculates CPU-hours, GPU-hours, and TRES-hours for a specified user and account. This can be useful for assessing usage on the cluster without any of the half-life decay that occurs for the values in sshare. Note that to see statistics for anything beyond your own user you will need special permission; contact FASRC if you are interested.

FAQ

Q: My lab’s fairshare is low, what can I do?

There are several things that can be done when your fairshare is low:

  1. Do not run jobs: Fairshare recovers via two routes.  The first is via your group not running any jobs and letting others use the resource.  That allows your fractional usage to decrease which in turn increases your fairshare score.  The second is via the half-life we apply to fairshare which ages out old usage over time.  Both of these methods require not action but inaction on the part of your group.  Thus to recover your fairshare simply stop running jobs until your fairshare reaches the level you desire.  Be warned this could take several weeks to accomplish depending on your current usage.
  2. Be patient: This is a corollary to the previous point but applies if you want to continue to run jobs.  Even if your fairshare is low, your job gains priority by sitting in the queue.  The longer it sits the higher priority it gains.  So even if you have very low fairshare your jobs will eventually run, it just may take several days to accomplish.
  3. Leverage Backfill: Slurm runs in two scheduling loops.  The first loop is the main loop which simply looks at the top of the priority chain for the partition and tries to schedule that job.  It will schedule jobs until it hits a job it cannot schedule and then it restarts the loop.  The second loop is the backfill loop.  This loop looks through jobs further down in the queue and asks whether it can schedule a given job now without interfering with the start time of the top priority job.  Think of it as the scheduler playing a giant game of three dimensional tetris, where the dimensions are number of cores, amount of memory, and amount of time.  If your job will fit in the gaps that the scheduler has, it will put your job in that spot even if it is low priority.  This requires you to be very accurate in specifying the core, memory, and time usage of your job.  The better constrained your job is the more likely the scheduler is to fit you in to these gaps.  The jobstats and seff-account utilities are great ways of figuring out your job performance. See also our page on improving Job Efficiency.
  4. Leverage Requeue: The requeue partitions are cheaper to run in and have a lot of capacity.  You are more likely to find your job pending for a shorter time, even with low fairshare, in those partitions than in the higher demand non-requeue partitions.
  5. Plan: Better planning and knowledge of your historic usage can help you better budget your time on the cluster.  The cluster is not an infinite resource.  You have been allocated a slice of the cluster, thus it is best to budget your usage so that you can run high priority jobs when you need to.  We at FASRC are happy to consult with you as to how to best budget your usage.  Tools like scalc, jobstats, seff-account, seff-array, and the historic usage graphs are invaluable assets for this.  Beyond that, doing analysis of your code efficiency and memory usage will help dramatically.  Most users vastly overestimate how much memory their job actually needs, dragging down their fairshare score over time.  Trimming these excess requests makes for more efficient usage.  Increasing code efficiency by taking time to optimize your code base can also be very beneficial as better, more efficient algorithms mean lower usage and therefore better fairshare.
  6. Purchase: If your group has persistent high demand that cannot be met with your current allocation, serious consideration should be given to purchasing hardware for the cluster.  This is not an immediate solution to the problem as it takes time for hardware to be built and installed.  That said once the hardware arrives your Share will be increased and your fairshare will improve commensurately.  Please contact FASRC for more information if you wish to purchase hardware for the cluster.

Q: If I am running jobs on my PI’s private partition, then why am I getting charged?

We give RawShares to everyone that can be used anywhere on the cluster since Fairshare is a global quantity. Hence a user is charged regardless of what partition they use.  Groups who have private partitions are granted RawShares equivalent to the hardware in that partition per the table above. This grant exactly offsets the use of the partition. Since Fairshare is global, a group could decide to leave their partition idle or undersubscribed and use their shares elsewhere on the cluster. This allows groups to be flexible regarding which partitions they decide to use.

Running Jobs https://docs.rc.fas.harvard.edu/kb/running-jobs/ Thu, 27 Feb 2014 16:56:28 +0000

Tip: Along with this document, please also see our Data Management Best Practices guide.

Overview: The FASRC Cluster Uses Slurm to Manage Jobs

Slurm (aka SLURM) is a queue management system and stands for Simple Linux Utility for Resource Management. Slurm was originally developed at the Lawrence Livermore National Lab, but is now primarily developed by SchedMD. Slurm is the scheduler that currently runs some of the largest compute clusters in the world.
Slurm is similar in many ways to most other queuing systems. You write a batch script then submit it to the queue manager. The queue manager then schedules your job to run on the queue (or partition in Slurm parlance) that you designate. Below we will provide an outline of how to submit jobs to Slurm, how Slurm decides when to schedule your job, and how to monitor progress.
Slurm has a number of valuable features compared to other job management systems:

  • Stop and Requeue: Slurm’s ability to kill and requeue is superior to that of other systems. It waits for jobs to be cleared before scheduling the high priority job. It also does requeue on memory rather than just on core count.
  • Memory requests are sacrosanct in Slurm. Thus the amount of memory you request at runtime is guaranteed to be there. No one can infringe on that memory space and you cannot exceed the amount of memory that you request.
  • Slurm has a concept called GRES (Generic Resource) that allows for fair scheduling on GPUs and other accelerators. This is very handy in a dynamic research environment like RC’s where various different hardware technologies can be put into the scheduler.
  • Slurm has a back-end database which stores historical information about the cluster. This information can be queried by users who are curious about how many resources they have used. It is also used for adjudicating job priority on the cluster.

Cluster Jobs are Generally Run From the Command Line

Once you’ve gone through the account setup procedure, you can log in to the cluster via ssh to a login node and begin using the cluster.

FASRC cluster nodes run the CentOS distribution of the Linux operating system, and commands are run under the “bash” shell. As with most supercomputers, work is done via the command line, typing commands into a prompt, and not via a GUI (graphical user interface). There are a number of Linux and bash references, cheat sheets and tutorials available on the web. RC’s own training is also available.

Cluster Applications Should Not Be Run From Login Nodes

Once you have logged in to the cluster, you will be on one of a handful of login nodes. These nodes are shared entry points for all users and so cannot be used to run computationally intensive software. Think of them as front-ends for your work, not the place where you do your work.
Simple file copies, light text processing or editing, etc. are fine, but you should not run large graphical applications like Matlab, Mathematica, RStudio, or computationally intensive command line tools. A culling program runs on these nodes that will kill any application that exceeds memory and computational limits.
For interactive work, please start an interactive session or, if you require a GUI, use our VDI system.

Storage and Scratch on the Cluster

Cluster partitions have many owned and general purpose file systems attached for use by labs and individuals to store data long-term. These are shared filesystems and are typically located in a different datacenter from the compute nodes. As such, lab storage is a poor fit for high I/O (Input/Output) from production jobs: it is not designed for jobs that need to write large amounts of data or need quick access to storage.
For best performance while running jobs, please use the temporary scratch storage found at /n/netscratch. This is a VAST file system with 4 PB of storage, connected via InfiniBand fabric. This temporary scratch space is available from all compute nodes.
There is a lab-based 50 TB quota and a 90-day retention policy on netscratch. Please review the scratch policy page here. If you have not moved your data after 90 days, it will be deleted to make space for other users. Please use netscratch only for reading and writing data from the cluster. Please create a subdirectory in your lab group’s folder under /n/netscratch/[lab name]. Please contact us if your lab does not have a netscratch directory or you are unable to create a subdirectory for yourself.
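
As a rough illustration of the subdirectory setup described above, the sketch below builds a per-user path under a lab's netscratch folder and creates it only if the lab folder is present. The lab name jharvard_lab, the user jharvard, the SCRATCH_ROOT variable, and the one-directory-per-user layout are all placeholders chosen for this example, not FASRC conventions.

```shell
#!/bin/bash
# Root of the scratch file system; overridable so the sketch can be tried elsewhere.
SCRATCH_ROOT="${SCRATCH_ROOT:-/n/netscratch}"

# Build the path of a personal working directory inside a lab's scratch folder.
scratch_workdir() {
    local lab=$1 user=${2:-$USER}
    echo "${SCRATCH_ROOT}/${lab}/${user}"
}

workdir=$(scratch_workdir jharvard_lab jharvard)   # hypothetical lab and user
echo "Would use: ${workdir}"

# Create the directory only if the lab folder already exists on this machine.
if [ -d "${SCRATCH_ROOT}/jharvard_lab" ]; then
    mkdir -p "${workdir}"
fi
```

Job scripts can then read and write under that directory instead of lab storage, keeping in mind the 90 day retention policy.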


SLURM Resources

The primary source for documentation on Slurm usage and commands is the Slurm site maintained by SchedMD. Keep in mind that the docs there always describe the latest version of Slurm. A great way to get details on the Slurm commands for the version of Slurm we run is the man pages available from the cluster. For example, if you type the following command:
man sbatch
you’ll get the manual page for the sbatch command.
Though Slurm is not as common as SGE or LSF, documentation is readily available.

Summary of Slurm Commands

The table below shows a summary of Slurm commands. These commands are described in more detail below along with links to the Slurm doc site.

Task | Slurm command | Slurm example
Submit a batch serial job | sbatch | sbatch runscript.sh
Run a script or application interactively (do not use salloc on FASSE) | salloc | salloc -p test -t 10 --mem 1000 [script or app]
Start interactive session (do not use salloc on FASSE) | salloc | salloc -p test -t 10 --mem 1000
Kill a job | scancel | scancel 999999
View status of your jobs | squeue | squeue -u akitzmiller
Check current job by id number | sacct | sacct -j 999999
Schedule recurring batch job | scrontab | see scrontab document for example

NOTE: No single user can submit more than 10,000 jobs at a time.

Slurm Limits

Slurm has several internal limits that users submitting large jobs or large numbers of jobs should be aware of and should plan around. These limits exist to prevent any one person from taking over the cluster and also serve to prevent the cluster being overwhelmed due to poorly formed jobs. Users must work within these limits and should plan their work accordingly. This is typically done by breaking up their workflow into smaller chunks or by deliberately serializing their jobs to increase the job time and decrease the number of cores needed. The limits are as follows:

  • Maximum Number of Jobs per User: 10,100. This is meant to prevent any one user from monopolizing the cluster.
  • Maximum Array Size: 10,000. This is both array index and size. This is meant to prevent any one user from monopolizing the cluster. Note that each array index counts as a single job for purposes of the Maximum Number of Jobs per User, so this is intentionally redundant.
  • Maximum Number of Steps: 40,000. A job step is recorded by Slurm for each invocation of srun by a job. This is meant to prevent run-away jobs.
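
One possible way to break a large workload into pieces that respect the array limit above is to plan fixed-size chunks and pass each chunk an offset. The sketch below only prints the sbatch commands it would run. The function name plan_array_chunks, the OFFSET variable, the 25,000-task workload, and runscript.sh are assumptions made for illustration.

```shell
#!/bin/bash
# Print one "offset count" pair per chunk so that no array exceeds max_size tasks.
plan_array_chunks() {
    local total=$1 max_size=${2:-10000}
    local offset=0 count
    while [ "${offset}" -lt "${total}" ]; do
        count=$(( total - offset ))
        if [ "${count}" -gt "${max_size}" ]; then
            count=${max_size}
        fi
        echo "${offset} ${count}"
        offset=$(( offset + count ))
    done
}

# Show the submissions for a hypothetical 25,000-task workload.
plan_array_chunks 25000 10000 | while read -r offset count; do
    echo "sbatch --array=0-$(( count - 1 )) --export=ALL,OFFSET=${offset} runscript.sh"
done
```

Inside the job script, the real task number would then be OFFSET plus SLURM_ARRAY_TASK_ID. Because each array index counts toward the per-user job limit, the chunks would need to be submitted one after another as earlier ones drain, not all at once.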

Slurm Partitions

Partition is the term that Slurm uses for queues. Partitions can be thought of as a set of resources and parameters around their use (See also: Convenient Slurm Commands). You can find out what partitions you have access to using the spart command. FASSE has different partitions than Cannon.

Note: If no resources are requested explicitly, a job on Cannon or FASSE is allocated the following defaults: the serial_requeue partition, 1 core, and 100 MB of memory. Users must always declare how much time they need.

Partition | Nodes | Cores per Node | CPU Core Types | Mem per Node (GB) | Time Limit | Max Jobs | Max Cores | GPU Capable? | /scratch size (GB)
sapphire | 186 | 112 | Intel “Sapphire Rapids” | 990 | 3 days | none | none | No | 396
shared | 310 | 48 | Intel “Cascade Lake” | 184 | 3 days | none | none | No | 68
bigmem | 4 | 112 | Intel “Sapphire Rapids” | 1988 | 3 days | none | none | No | 396
bigmem_intermediate | 3 | 64 | Intel “Ice Lake” | 2000 | 14 days | none | none | No | 396
gpu | 36 | 64 | Intel “Ice Lake” | 990 | 3 days | none | none | Yes (4 A100/node) | 396
gpu_h200 | 22 | 112 | Intel “Sapphire Rapids” | 990 | 3 days | none | none | Yes (4 H200/node) | 843
intermediate | 12 | 112 | Intel “Sapphire Rapids” | 990 | 14 days | none | none | No | 396
unrestricted | 8 | 48 | Intel “Cascade Lake” | 184 | none | none | none | No | 68
test | 18 | 112 | Intel “Sapphire Rapids” | 990 | 12 Hours | 5 | 112 | No | 396
gpu_test | 12 | 64 | Intel “Ice Lake” | 487 | 12 Hours | 2 | 64 | Yes (8 A100 MIG 3g.20GB/node) – Limit 8 per job | 172
remoteviz | down | 32 | Intel “Cascade Lake” | 373 | 3 days | none | none | Shared V100 GPUs for rendering | 396
serial_requeue | varies | varies | AMD/Intel | varies | 3 days | none | none | Yes | varies
gpu_requeue | varies | varies | Intel (mixed) | varies | 3 days | none | none | Yes | varies
PI/Lab nodes | varies | varies | varies | varies | none | none | none | varies | varies

Partition Details

sapphire

The sapphire partition has a maximum run time of 3 days. Serial, parallel, and interactive jobs are permitted on this queue, and this is the most appropriate location for MPI jobs. This queue is governed by backfill and FairShare (explained below). The sapphire partition is populated with hardware that RC runs at the MGHPCC data center in Holyoke, MA. This partition has 186 nodes connected by an InfiniBand (IB) fabric. Each node is configured with 2 Intel Xeon Sapphire Rapids CPUs, 990 GB of RAM, and 400 GB of local scratch space. Each Intel CPU has 56 cores and 100 MB of cache.

When submitting MPI jobs on the sapphire partition, it may be advisable to use the --contiguous option for best communication performance if your code is topology sensitive. Though all of the nodes are connected by InfiniBand fabric, there are multiple switches routing the MPI traffic, and Slurm will by default schedule you wherever it can find space. Thus your job may end up scattered across the cluster. The --contiguous option will ensure that the jobs are run on nodes that are adjacent to each other on the IB fabric. Be advised that using --contiguous will make your job pend longer, so only use it if you absolutely need it.
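
A minimal sketch of an MPI submission script that asks for contiguous nodes might look like the following. The rank count, memory per core, run time, output file names, and the program name ./my_mpi_program are assumptions for illustration; the script falls back to a message when srun or the program is not available.

```shell
#!/bin/bash
#SBATCH -n 224                 # Number of MPI ranks (two sapphire nodes at 112 cores each)
#SBATCH -p sapphire            # Partition to submit to
#SBATCH -t 0-12:00             # Runtime in D-HH:MM
#SBATCH --mem-per-cpu=4000     # Memory per core in MB
#SBATCH --contiguous           # Ask for adjacent nodes; may lengthen pending time
#SBATCH -o mpi_%j.out          # STDOUT file, %j inserts jobid
#SBATCH -e mpi_%j.err          # STDERR file, %j inserts jobid

# Hypothetical MPI binary; replace with your own program and module loads.
MPI_PROGRAM="${MPI_PROGRAM:-./my_mpi_program}"

if command -v srun >/dev/null 2>&1 && [ -x "${MPI_PROGRAM}" ]; then
    # SLURM_NTASKS is set by Slurm to the number of tasks requested with -n.
    srun -n "${SLURM_NTASKS}" "${MPI_PROGRAM}"
else
    echo "srun or ${MPI_PROGRAM} not available; nothing launched"
fi
```

If pending time matters more than communication performance, the --contiguous line can simply be dropped.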

shared

The shared partition has a maximum run time of 3 days. Serial, parallel, and interactive jobs are permitted on this queue, and this is the most appropriate location for MPI jobs. This queue is governed by backfill and FairShare (explained below). The shared partition is populated with hardware that RC runs at the MGHPCC data center in Holyoke, MA. This partition has 310 nodes connected by an InfiniBand (IB) fabric. Each node is configured with 2 Intel Xeon Cascade Lake CPUs, 184 GB of RAM, and 70 GB of local scratch space. Each Intel CPU has 24 cores (48 per node) and 48 MB of cache.

When submitting MPI jobs on the shared partition, it may be advisable to use the --contiguous option for best communication performance if your code is topology sensitive. Though all of the nodes are connected by InfiniBand fabric, there are multiple switches routing the MPI traffic, and Slurm will by default schedule you wherever it can find space. Thus your job may end up scattered across the cluster. The --contiguous option will ensure that the jobs are run on nodes that are adjacent to each other on the IB fabric. Be advised that using --contiguous will make your job pend longer, so only use it if you absolutely need it.

bigmem

This partition should be used for large memory work requiring greater than 1000 GB RAM per job, like genome / transcript assemblies. Jobs requesting less than 1000 GB RAM are automatically rejected by the scheduler.

There is a 3 day limit for work here. MPI or low memory work is not appropriate for this partition, and inappropriate jobs may be terminated without warning. This partition has an allocation of 4 nodes with 1988 GB of RAM.

bigmem_intermediate

This partition should be used for large memory work requiring greater than 1000 GB RAM per job, like genome / transcript assemblies. Jobs requesting less than 1000 GB RAM are automatically rejected by the scheduler. There is a 14 day limit here.

MPI or low memory work is not appropriate for this partition, and inappropriate jobs may be terminated without warning. This partition has an allocation of 3 nodes with 2000 GB of RAM.

gpu

This 36 node partition is for individuals wishing to use GPGPU resources. You will need to include #SBATCH --gres=gpu:n, where n=1-4, in your Slurm submission scripts. Each node has 64 cores and is equipped with 4 NVIDIA A100s. You can execute scontrol show partition gpu | grep TRES to see the type of GPU available on this partition.

There are also private partitions that may have more GPU resources, but to which access may be controlled by the owners. See our GPU Computing section for more info on using and specifying GPU resources.

gpu_h200

This 24 node partition is for individuals wishing to use GPGPU resources. One will need to include #SBATCH --gres=gpu:n where n=1-4 in your SLURM submission scripts. Each node has 112 cores and is equipped with 4 x NVidia H200s per node.

There are also private partitions that may have more GPU resources, but to which access may be controlled by the owners. See our GPU Computing section for more info on using and specifying GPU resources. One can gain access to these owned resources only via the gpu_requeue partition.

intermediate

Serial and parallel (including MPI) jobs are permitted on this partition and this partition is intended for runs needing 3 to 14 days of runtime.

This partition has an allocation of 12 nodes of the same configuration as above for the shared partition.

unrestricted

Serial and parallel (including MPI) jobs are permitted on this partition, and there is a 365 day limit on run time. Given this, there is no guarantee of 100% uptime. Running on this partition is done at the user’s own risk. Users should understand that if the queue is full, it could take weeks or even months for a job to be scheduled to run.

unrestricted is made up of 8 nodes of the same configuration as above for the shared partition.

test

This partition is dedicated for interactive (foreground / live) work and for testing (interactively) code before submitting in batch and scaling. Small numbers (1 to 5) of serial and parallel jobs with small resource requirements (RAM/cores) are permitted on this partition; large numbers of interactive jobs or those requiring large resource requirements should really be done on another partition. Multiple partition submissions to this partition are forbidden (i.e. one is not permitted to do #SBATCH -p test,sapphire).

This partition is made up of 18 nodes of the same configuration as above for the sapphire partition. This smaller queue has a 12 hour maximum run time. This queue has a maximum of 112 cores and 1000 GB RAM. Jobs in this queue are not charged fairshare.

gpu_test

This 14 node partition is for individuals wishing to test GPGPU resources. You will need to include #SBATCH --gres=gpu:n, where n=1-8, in your Slurm submission scripts. These nodes have 64 cores and are equipped with 4 NVIDIA A100s in Multi-Instance GPU (MIG) mode. Each GPU has two 3g.20GB MIG instances. This queue has a maximum of 2 jobs, 64 cores, 1000 GB RAM, 8 GPUs, and a 12 hour run time. This partition is intended for interactive work, testing, and experimentation only. Multiple partition submissions to this partition are forbidden (i.e. one is not permitted to do #SBATCH -p gpu_test,gpu). You can execute scontrol show partition gpu_test | grep TRES to see the type of GPUs available here.

See our GPU Computing section for more info on using and specifying GPU resources. Jobs in this queue are not charged fairshare.

remoteviz

This single node partition is for individuals who wish to use shared GPUs for rendering graphics. The V100 cards on this node are in shared mode and are intended for rendering, not for computational use. You do not need to request a GPU to use this partition. Multiple partition submissions to this partition are forbidden (i.e. one is not permitted to do #SBATCH -p remoteviz,gpu).

For computation please use the gpu and gpu_test partitions.

serial_requeue

If you do not specify a partition you will be sent to this partition by default.

This partition is appropriate for single core (serial) jobs, jobs that require up to 8 cores for small periods of time (less than 1 day), or job arrays where each job instance uses less than 8 cores. Multinode jobs may be run in the partition but be advised that this is a heterogeneous partition and users are highly recommended to leverage the --constraint option to get a homogeneous block of compute and networking. The maximum runtime for this queue is 3 days. As this partition is made up of an assortment of nodes owned by other groups in addition to the general nodes, jobs in this partition may be killed but automatically requeued if a higher priority job (e.g. the job of a node owner) comes in.

Because serial_requeue takes advantage of slack time in owned partitions, times in the PENDING state can potentially be much shorter than on the shared partition. Since jobs may be killed, requeued, and run a second time, ensure that your jobs are a good match for this partition. For example, jobs that append output would not be good for serial_requeue unless the data files are zeroed out at the start to ensure output from a previous (killed) run is removed. Also, so that your job need not redo all of its compute, it is advisable to have checkpoints or branching instructions that bypass parts of the work that have already been completed. We advise that you use the --open-mode=append option to see the requeue status/error messages in your log files. Without this option, your log files will be reset at the start of each (requeued) run, with no obvious indication of requeue events.
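
The advice above about bypassing completed work can be sketched with marker files: each step leaves a marker when it finishes, and a requeued run skips any step whose marker exists. The run_step helper, the STATE_DIR location, and the two example steps are assumptions made for illustration, not an FASRC-provided tool.

```shell
#!/bin/bash
#SBATCH -p serial_requeue      # Partition to submit to
#SBATCH -t 0-01:00             # Runtime in D-HH:MM
#SBATCH --mem=500              # Memory in MB
#SBATCH --open-mode=append     # Keep log output across requeues

# Directory holding one marker file per finished step (hypothetical layout).
STATE_DIR="${STATE_DIR:-./.job_state}"
mkdir -p "${STATE_DIR}"

# Run a named step only if no marker from an earlier (killed) run exists.
run_step() {
    local name=$1
    shift
    if [ -e "${STATE_DIR}/${name}.done" ]; then
        echo "skipping ${name}: already completed"
        return 0
    fi
    "$@" && touch "${STATE_DIR}/${name}.done"
}

# Each step overwrites its output (>) so partial output from a killed run is discarded.
run_step prepare sh -c 'echo "prepared" > prepare.txt'
run_step analyze sh -c 'echo "analyzed" > analyze.txt'
```

The marker directory has to live on storage that survives the requeue, so a node-local temporary directory would not be suitable for it.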

gpu_requeue

This partition is appropriate for gpu jobs that require small periods of time (less than 1 day). Multinode jobs may be run in the partition but be advised that this is a heterogeneous partition and users are highly recommended to leverage the --constraint option to get a homogeneous block of compute and networking. The maximum runtime for this queue is 3 days. One will need to include #SBATCH --gres=gpu:1 in your SLURM submission scripts to get access to this partition. As this partition is made up of an assortment of gpu nodes owned by other groups in addition to the public nodes, jobs in this partition may be killed but automatically requeued if a higher priority job (e.g. the job of a node owner) comes in.

Because gpu_requeue takes advantage of slack time in owned partitions, times in the PENDING state can potentially be much shorter than on the shared partition. Since jobs may be killed, requeued, and run a second time, ensure that your jobs are a good match for this partition. For example, jobs that append output would not be good for gpu_requeue unless the data files are zeroed out at the start to ensure output from a previous (killed) run is removed. Also, so that your job need not redo all of its compute, it is advisable to have checkpoints or branching instructions that bypass parts of the work that have already been completed. We advise that you use the --open-mode=append option to see the requeue status/error messages in your log files. Without this option, your log files will be reset at the start of each (requeued) run, with no obvious indication of requeue events. See our GPU Computing section for more info on using and specifying GPU resources.

ITC, Kempner, HSPH, HUCE, and SEAS

For information on the partitions for these groups see:

 


Submitting Batch Jobs Using the sbatch Command

The main way to run jobs on the cluster is by submitting a script with the sbatch command. The command to submit a job is as simple as:

sbatch runscript.sh

The commands specified in the runscript.sh file will then be run on the first available compute node that fits the resources requested in the script. sbatch returns immediately after submission; commands are not run as foreground processes and won’t stop if you disconnect from the cluster.

Tip: You can see your jobs on portal.rc.fas.harvard.edu/jobs

A typical submission script, in this case loading a Python module and having Python print a message, will look like this:

NOTE: It is important to keep all #SBATCH lines together and at the top of the script; no comments, bash code, or variable settings should appear until after the #SBATCH lines. Otherwise, Slurm may assume it’s done interpreting and skip any that follow.

#!/bin/bash
#SBATCH -c 1                # Number of cores (-c)
#SBATCH -t 0-00:10          # Runtime in D-HH:MM, minimum of 10 minutes
#SBATCH -p serial_requeue   # Partition to submit to
#SBATCH --mem=100           # Memory pool for all cores (see also --mem-per-cpu)
#SBATCH -o myoutput_%j.out  # File to which STDOUT will be written, %j inserts jobid
#SBATCH -e myerrors_%j.err  # File to which STDERR will be written, %j inserts jobid

# load modules
module load python/3.10.9-fasrc01

# run code
python -c 'print("Hi there.")'

In general, the script is composed of 4 parts.

  • the #!/bin/bash line allows the script to be run as a bash script
  • the #SBATCH lines are technically bash comments, but they set various parameters for the SLURM scheduler
  • loading any necessary modules and setting any variables, paths, etc.
  • the command line itself, in this case calling python and having it print a message

The #SBATCH lines shown above set key parameters. N.B. The Slurm system copies many environment variables from your current session to the compute host where the script is run including PATH and your current working directory. As a result, you can specify files relative to your current location (e.g. ./project/myfiles/myfile.txt).
#SBATCH -c 1
This line sets the number of cores (threads) that you’re requesting. Make sure that your tool can use multiple cores before requesting more than one. If this parameter is omitted, Slurm assumes -c 1. For more on parallel work see: threads, MPI
#SBATCH -t 0-00:10
This line specifies the running time for the job, here 10 minutes in “days-hours:minutes” format. Acceptable time formats include “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, “days-hours:minutes” and “days-hours:minutes:seconds”. If your job runs longer than the value you specify here, it will be canceled. Jobs have a maximum run time which varies by partition (see table above), though extensions can be requested. There is no fairshare penalty for over-requesting time, though it will be harder for the scheduler to backfill your job if you overestimate. NOTE! Users must declare how much time they need.
#SBATCH -p serial_requeue
This line specifies the Slurm partition (AKA queue) under which the script will be run. The serial_requeue partition is good for routine jobs that can handle being occasionally stopped and restarted. PENDING times are typically short for this queue. See the partitions description above for more information. If you do not specify this parameter you will be given serial_requeue by default.
#SBATCH --mem=100
The FASRC cluster requires that you specify the amount of memory (in MB) that you will be using for your job. Accurate specifications allow jobs to be run with maximum efficiency on the system. There are two main options, --mem-per-cpu and --mem. The --mem option specifies the total memory pool for one or more cores, and is the recommended option to use. If you must do work across multiple compute nodes (e.g. MPI code), then you must use the --mem-per-cpu option, as this will allocate the amount specified for each of the cores you’re requesting, whether they are on one node or multiple nodes. If this parameter is omitted, you are granted 100 MB by default. Chances are good that your job will exceed this amount and be killed, so always specify how much memory you require.
#SBATCH -o myoutput_%j.out
This line specifies the file to which standard out will be written. If a relative file name is used, it will be relative to your current working directory. The %j in the filename will be substituted by the JobID at runtime. If this parameter is omitted, any output will be directed to a file named slurm-JOBID.out in the current directory.
#SBATCH -e myerrors_%j.err
This line specifies the file to which standard error will be written. Slurm submission and processing errors will also appear in the file. The %j in the filename will be substituted by the JobID at runtime. If this parameter is omitted, standard error is directed to the same file as standard out (slurm-JOBID.out by default).
#SBATCH --test-only
While not shown above, adding this option to your script will tell the scheduler to return information on what would happen if you submitted this job. This is a good and easy way to determine if your script is viable, and it gives a rough estimate of how long the job would take to be scheduled under the current queue load.
#SBATCH --account=some_lab
If you are in more than one lab, please ensure that you are charging your Fairshare to the appropriate group by using this option in all of your job scripts and specifying the lab group.
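
To make the time formats accepted by -t concrete, the sketch below converts each of the listed forms to seconds. The function name slurm_time_to_seconds is made up for this example and is not part of Slurm; it is only meant to show how the fields in each format are read.

```shell
#!/bin/bash
# Convert a Slurm time specification to seconds.
slurm_time_to_seconds() {
    local spec=$1 days=0 h=0 m=0 s=0
    if [[ ${spec} == *-* ]]; then
        # days-hours[:minutes[:seconds]]
        days=${spec%%-*}
        IFS=: read -r h m s <<< "${spec#*-}"
    else
        case ${spec} in
            *:*:*) IFS=: read -r h m s <<< "${spec}" ;;   # hours:minutes:seconds
            *:*)   IFS=: read -r m s <<< "${spec}" ;;     # minutes:seconds
            *)     m=${spec} ;;                           # minutes
        esac
    fi
    # 10# forces base 10 so values such as "08" are not read as octal.
    echo $(( 10#${days} * 86400 + 10#${h:-0} * 3600 + 10#${m:-0} * 60 + 10#${s:-0} ))
}

for spec in 10 0-00:10 90:30 02:00:00 1-12 3-00:00:00; do
    echo "${spec} = $(slurm_time_to_seconds "${spec}") seconds"
done
```

Note that a bare number is read as minutes, so -t 10 and -t 0-00:10 request the same ten minutes.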

Notifications by Email:

The scheduler can send email to you for various job states (FAIL and END being the most useful). Please bear in mind that this must be used responsibly, as one user can quickly overwhelm the mail system and affect the notifications of all users by clogging up the mail queue. Tens or even hundreds of thousands of jobs may be in flight at a given time, which is why we strongly caution below against using the ALL mail type. If you are using a metascheduler, job arrays, or just many jobs, please try to avoid adding too much burden to the email queue. Sending hundreds or thousands of emails can cause email backups, not to mention fill up your inbox.

To add mail notification to your job script, you can use the --mail-type #SBATCH option. Example:

#SBATCH --mail-type=END #This command would send an email when the job ends.

Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL (Please avoid: Equivalent to BEGIN, END, FAIL, INVALID_DEPEND, REQUEUE, and STAGE_OUT), INVALID_DEPEND (dependency never satisfied), STAGE_OUT (burst buffer stage out and teardown completed), TIME_LIMIT, TIME_LIMIT_90 (reached 90 percent of time limit), TIME_LIMIT_80 (reached 80 percent of time limit), TIME_LIMIT_50 (reached 50 percent of time limit) and ARRAY_TASKS (Please also avoid: Send emails for each array task).

Multiple type values may be specified in a comma separated list. The user to be notified is indicated with --mail-user. Unless the ARRAY_TASKS option is specified, mail notifications on job BEGIN, END and FAIL apply to a job array as a whole rather than generating individual email messages for each task in the job array.

#SBATCH --mail-user=ajk@123.com #Email to which notifications will be sent
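
Putting the two mail options together, a job script that sends mail only when the job ends or fails could be sketched as follows. The address jharvard@example.com is a placeholder, and the resource values simply repeat those from the earlier example script.

```shell
#!/bin/bash
#SBATCH -c 1                              # Number of cores
#SBATCH -t 0-00:10                        # Runtime in D-HH:MM
#SBATCH -p serial_requeue                 # Partition to submit to
#SBATCH --mem=100                         # Memory in MB
#SBATCH --mail-type=END,FAIL              # Email only when the job ends or fails
#SBATCH --mail-user=jharvard@example.com  # Placeholder address

# SLURM_JOB_ID is set by Slurm inside a job; fall back to a label elsewhere.
job_label="${SLURM_JOB_ID:-not-in-slurm}"
echo "Job ${job_label} running on $(hostname)"
```

Keeping the list to END and FAIL avoids the extra messages that ALL would generate.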

It is important to accurately request resources, especially memory

The FASRC cluster is a large, shared system that must have an accurate idea of the resources your program(s) will use so that it can effectively schedule jobs. If insufficient memory is allocated, your program may crash (often in an unintelligible way); if too much memory is allocated, resources that could be used for other jobs will be wasted. Additionally, your “fairshare”, a number used in calculating the priority of your job for scheduling purposes, can be adversely affected by over-requesting. Therefore it is important to be as accurate as possible when requesting cores (-n) and memory (--mem or --mem-per-cpu).
Many scientific computing tools can take advantage of multiple processing cores, but many cannot. A typical R script, for example, will not use multiple cores. On the other hand, RStudio, a graphical console for R, is a Java program that is improved substantially by using multiple cores. Or, you can use the Rmpi package and spawn “slaves” that correspond to the number of cores you’ve selected.
The distinction between --mem and --mem-per-cpu is important when running multi-core jobs (for single core jobs, the two are equivalent). --mem sets total memory across all cores, while --mem-per-cpu sets the value for each requested core. If you request two cores (-n 2) and 4 GB with --mem, each core will receive 2 GB RAM. If you specify 4 GB with --mem-per-cpu, each core will receive 4 GB for a total of 8 GB. A good rule of thumb is that --mem-per-cpu is for MPI jobs and --mem is for all other types.
The #SBATCH --test-only option is a good way to sanity check your scripts before submitting them. Just remember to remove it after running your test.
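
The arithmetic behind the two memory options can be checked with a short sketch. The helper names per_core_from_total and total_from_per_cpu are invented for this example, and values are in MB, matching the unit the cluster uses for memory requests.

```shell
#!/bin/bash
# Memory available per core, on average, when --mem (total, in MB) is shared by N cores.
per_core_from_total() {
    local total_mb=$1 cores=$2
    echo $(( total_mb / cores ))
}

# Total memory allocated when --mem-per-cpu (in MB) is applied to N cores.
total_from_per_cpu() {
    local per_cpu_mb=$1 cores=$2
    echo $(( per_cpu_mb * cores ))
}

echo "--mem=4096 on 2 cores: $(per_core_from_total 4096 2) MB per core"
echo "--mem-per-cpu=4096 on 2 cores: $(total_from_per_cpu 4096 2) MB in total"
```

The same 4096 value therefore leads to a request that is twice as large when it is given to --mem-per-cpu on two cores.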

Monitoring Job Progress with squeue and sacct

squeue and sacct are two different commands that allow you to monitor job activity in Slurm. sacct talks directly to the Slurm accounting database and provides both live and historic data (up to 6 months). sacct without any options will print out all the jobs you have run in the past day. sacct -j 999999 will show you a specific job. Note that sacct data is almost live, and the various accounting fields (such as memory usage) are incomplete until the job finishes. If you want current data on memory usage or other counters, use the sstat command.

sacct can provide much more detail as it has access to many of the resource accounting fields that SLURM uses. For example, to get a detailed report on the memory and CPU usage for an array job (see below for details about job arrays):

[jharvard@boslogin01 ~]$ sacct -j 44375501 --format JobID,Elapsed,ReqMem,MaxRSS,AllocCPUs,TotalCPU,State
JobID        Elapsed    ReqMem    MaxRSS     AllocCPUS  TotalCPU   State
------------ ---------- --------- ---------- ---------- ---------- ----------
44375501_[1+ 00:00:00   40000Mc              8          00:00:00   PENDING
44375501_1   2-03:50:53 40000Mc              8          2-03:50:23 COMPLETED
44375501_1.+ 2-03:50:53 40000Mc   34372176K  6          2-03:50:23 COMPLETED
44375501_1.+ 2-03:50:53 40000Mc   1236K      8          00:00.004  COMPLETED
44375501_2   1-23:47:35 40000Mc              8          1-23:47:18 COMPLETED
44375501_2.+ 1-23:47:35 40000Mc   34467196K  6          1-23:47:17 COMPLETED
44375501_2.+ 1-23:47:36 40000Mc   1116K      8          00:00.003  COMPLETED
44375501_3   1-23:32:36 40000Mc              8          1-23:32:15 COMPLETED
44375501_3.+ 1-23:32:36 40000Mc   34389040K  6          1-23:32:15 COMPLETED
44375501_3.+ 1-23:32:37 40000Mc   1224K      8          00:00.004  COMPLETED
44375501_4   1-21:59:30 40000Mc              8          1-21:59:07 COMPLETED
44375501_4.+ 1-21:59:30 40000Mc   34389044K  6          1-21:59:07 COMPLETED
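
MaxRSS values like those in the report above carry a unit suffix, which makes them awkward to compare with a request given in MB. The sketch below converts them to whole megabytes. The rss_to_mb helper is an assumption for this example and handles only integer values with a K, M, or G suffix; the sacct call is skipped when sacct is not installed.

```shell
#!/bin/bash
# Convert a MaxRSS value such as 34372176K, 1236K or 2G to whole megabytes.
rss_to_mb() {
    local value=$1 number unit
    number=${value%[KMG]}
    unit=${value#"${number}"}
    case ${unit} in
        K) echo $(( number / 1024 )) ;;
        M) echo "${number}" ;;
        G) echo $(( number * 1024 )) ;;
        *) echo "unsupported MaxRSS value: ${value}" >&2; return 1 ;;
    esac
}

echo "34372176K is about $(rss_to_mb 34372176K) MB"

# On the cluster, the same conversion can be applied to real sacct output.
if command -v sacct >/dev/null 2>&1; then
    sacct -j 44375501 --format=JobID,MaxRSS --noheader --parsable2 |
        while IFS='|' read -r jobid maxrss; do
            if [ -n "${maxrss}" ]; then
                echo "${jobid} $(rss_to_mb "${maxrss}") MB"
            fi
        done
fi
```

A peak that sits far below the request suggests the memory request could be lowered for the next run.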

The seff and seff-account commands are summary commands based off the data in sacct.

Running squeue without arguments will list all your currently running, pending, and completing jobs. If you include the -l option (for “long” output) you can get useful data, including the running state of the job.

[jharvard@boslogin01 ~]$ squeue -u jharvard -l
Thu May 31 10:59:05 2018
    JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
44768543_24 shared longseq2 mmcfee RUNNING 20:34:20 1-12:00:00 1 holy7c09106
44768543_23 shared longseq2 mmcfee RUNNING 20:34:55 1-12:00:00 1 holy7c15302
44768543_22 shared longseq2 mmcfee RUNNING 20:52:34 1-12:00:00 1 holy7c15310
44768543_10 shared longseq2 mmcfee RUNNING 23:30:38 1-12:00:00 1 holy7c05312
44768543_11 shared longseq2 mmcfee RUNNING 23:30:38 1-12:00:00 1 holy7c09211
44768518_24 shared shortseq mmcfee RUNNING 23:32:21 1-12:00:00 1 holy7c13111

Both tools provide information about the job State. This value will typically be one of PENDING, RUNNING, COMPLETED, CANCELLED, or FAILED.

PENDING Job is awaiting a slot suitable for the requested resources. Jobs with high resource demands may spend significant time PENDING.
RUNNING Job is running.
COMPLETED Job has finished and the command(s) have returned successfully (i.e. exit code 0).
CANCELLED Job has been terminated by the user or administrator using scancel.
FAILED Job finished with an exit code other than 0.
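
When scripting around these states, it can help to separate the ones that mean a job is finished from the ones that mean it is still active. The sketch below covers only the five states listed above; the is_finished helper is an assumption for this example, and the pattern for CANCELLED allows for trailing text because sacct can report who cancelled the job.

```shell
#!/bin/bash
# Return success if the given job state means the job will not run any further.
is_finished() {
    case $1 in
        COMPLETED|FAILED|CANCELLED*) return 0 ;;
        *) return 1 ;;
    esac
}

for state in PENDING RUNNING COMPLETED CANCELLED FAILED; do
    if is_finished "${state}"; then
        echo "${state}: finished"
    else
        echo "${state}: still active"
    fi
done
```

Slurm defines further states beyond these five, so a production script would need to extend the pattern list.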

See Broader Queue with showq

The showq command can be used to show what the rest of the partition looks like. Often your job is pending due to other people in the partition. The showq command then shows you an overview of all the jobs for a specific partition. showq is invoked by doing:
showq -o -p shared
Where -o orders the pending queue by priority, with the next job to be scheduled at the top. -p specifies the partition that you want to look at.

Canceling Jobs with scancel

If, for any reason, you need to cancel a job that you’ve submitted, just use the scancel command with the job ID.
scancel 9999999
If you don’t keep track of the job ID returned from sbatch, you should be able to find it with the squeue or sacct command described above.
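
One way to keep track of the job ID is to capture it at submission time. The sketch below pulls the number out of the "Submitted batch job" message that sbatch prints. The job_id_from_sbatch helper and the sample message are assumptions for illustration, and the real submission is attempted only when sbatch and runscript.sh are both present.

```shell
#!/bin/bash
# Extract the first number from sbatch's "Submitted batch job <id>" message.
job_id_from_sbatch() {
    local message=$1
    if [[ ${message} =~ ([0-9]+) ]]; then
        echo "${BASH_REMATCH[1]}"
    else
        return 1
    fi
}

if command -v sbatch >/dev/null 2>&1 && [ -f runscript.sh ]; then
    job_id=$(job_id_from_sbatch "$(sbatch runscript.sh)")
    echo "Submitted job ${job_id}; cancel it with: scancel ${job_id}"
else
    job_id=$(job_id_from_sbatch "Submitted batch job 9999999")   # sample message
    echo "Parsed sample job ID ${job_id}"
fi
```

sbatch also has a --parsable option that prints the job ID without the surrounding text, which avoids the parsing step.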

Interactive Jobs and salloc

PLEASE NOTE: If you are attempting to use salloc on FASSE, please use the FASSE VDI instead.

Though batch submission is the best way to take full advantage of the compute power of the cluster, foreground/interactive jobs can also be run. These can be useful for things like:

  • Iterative data exploration at the command line
  • RAM intensive graphical applications like MATLAB or SAS
  • Interactive “console tools” like R and iPython
  • Significant software development and compiling efforts

An interactive job differs from a batch job in two important aspects: 1) the partition to be used is the test partition (though any partition in Slurm can be used for interactive work) and, 2) jobs should be initiated with the salloc command instead of sbatch. The command salloc will start a command line shell on a compute node.

Note that you should not include /bin/bash as part of your salloc line, as it will simply execute that command and exit. Instead, simply run salloc with only your resource parameters and it will put you in an interactive session.

This command: salloc --partition test --mem 500 --time 0-06:00 will start a command line shell on the test queue with 500 MB of RAM for 6 hours; 1 core on 1 node is assumed because the core count parameter (-c 1) was left out. When the interactive session starts, you will notice that you are no longer on a login node, but rather on one of the compute nodes dedicated to this queue.

salloc --partition test --x11 --mem 4G --time 0-06:00

In this case, we’ve asked for more memory because we plan to run MATLAB which requires a larger memory footprint. The --x11 option allows XWindows to operate between the login and compute nodes. See also: Virtual Desktop (VDI)

Interactive sessions require you to be active in the session. If you go more than an hour without any kind of input, the system will assume that you have left the session and will terminate it. If you have interactive tasks that must stretch over days, we recommend you print to screen occasionally to keep the connection open.
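One way to follow the recommendation above is a small loop that prints a timestamp at a fixed interval. This is only a sketch: the function name keepalive and the 10-minute default are illustrative choices, not FASRC tooling, and it assumes that terminal output is enough to keep the session open, as the paragraph above suggests.

```shell
#!/bin/bash
# keepalive [INTERVAL_SECONDS] [COUNT]
# Print a timestamp every INTERVAL_SECONDS seconds (default 600).
# COUNT limits the number of lines printed; 0 (the default) means
# keep going until the loop is killed.
keepalive() {
    local interval="${1:-600}"
    local count="${2:-0}"
    local i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        printf 'keepalive %s\n' "$(date '+%Y-%m-%d %H:%M:%S')"
        i=$((i + 1))
        sleep "$interval"
    done
}
```

Inside an interactive session you could start it in the background with keepalive 600 & and stop it later with kill %1.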


Software – Using Modules to Access Software


Because of the diversity of projects currently supported by FAS, and because the cluster is not a single computer on which you install software directly, thousands of applications and libraries are supported on the FASRC cluster. Technically, it is impossible to include all of these tools in every user’s environment.

Search available modules here
(https://portal.rc.fas.harvard.edu/apps/modules)

The Research Computing and Informatics departments have developed an enhanced Linux module system, Helmod, based on the hierarchical Lmod module system from TACC. Helmod enables applications much the same way as Linux modules, but also prevents multiple versions of the same tool from being loaded at the same time and separates tools that use particular compilers or MPI libraries entirely.
A module load command enables a particular application in the environment, mainly by adding the application to your PATH variable and pulling in dependencies. For example, to enable the 3.4.2 version of the R package:
module load R/3.4.2-fasrc01
Once a module is loaded inside a session/shell, it is available just as though you’d just installed it.

[jharvard@boslogin01 ~]$ which R
 R: Command not found.
[jharvard@boslogin01 ~]$ module load R/3.4.2-fasrc01
[jharvard@boslogin01 ~]$ which R
/n/helmod/apps/centos7/Core/R_core/3.4.2-fasrc01/bin/R


Loading more complex modules can affect a number of environment variables including PYTHONPATH, LD_LIBRARY_PATH, PERL5LIB, etc. Modules may also load dependencies. Bear in mind, you will need to include module load statements in your SBATCH scripts. If you load a module on, say, a login node and then launch a job, that job will run on another node and in a new shell where the module has not been loaded.
To determine what has been loaded in your environment, the module list command will print all loaded modules.
The module purge command will remove all currently loaded modules. This is particularly useful if you have to run incompatible software (e.g. python 2.x or python 3.x). The module unload command will remove a specific module.
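As a rough sketch of how module purge, module load, and module list can be combined inside a batch script, the helper below starts from a clean environment, loads whatever modules it is given, and prints the result so the job's output file records what was loaded. The function name load_job_modules, the resource values, and my_analysis.R are illustrative assumptions, not part of Helmod.

```shell
#!/bin/bash
#SBATCH -c 1                # Number of cores
#SBATCH -t 0-00:10          # Runtime in D-HH:MM
#SBATCH -p test             # Partition to submit to
#SBATCH --mem=500           # Memory in MB

# Purge any inherited modules, load the requested ones, then list them.
# Returns 1 without doing anything if the module command is unavailable.
load_job_modules() {
    if ! type module >/dev/null 2>&1; then
        echo "module command not found; are you on a cluster node?" >&2
        return 1
    fi
    module purge
    module load "$@"
    module list
}

# In a real job script the call would be followed by the program to run:
# load_job_modules R/3.4.2-fasrc01 && Rscript my_analysis.R
```

Keeping the module statements in the script itself means the job does not depend on whatever happened to be loaded in the shell you submitted from.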
Finding the modules that are appropriate for your needs can be done in a couple of different ways. The module search page allows you to browse and search the list of modules that have been deployed to the cluster.
There are a number of command line options for module searching, including the module avail command for browsing the entire list of applications and the module-query command for keyword searching. But please note: the online module search is much more thorough and has additional information on each module. module avail may not show you all available options.
Though there are many modules available by default, the hierarchical Helmod system enables additional modules after loading certain key libraries such as compilers and MPI packages. The module avail command output reflects this.

[jharvard@boslogin01 ~]$ module load gcc/7.1.0-fasrc01
[jharvard@boslogin01 ~]$ module avail
---------------------------- /n/helmod/modulefiles/centos7/Core ----------------------------
ADOL-C/2.5.2-fasrc01       bzip2/1.0.6-fasrc01
julia/0.6.2-fasrc01        phyml/2014Oct16-fasrc01
ATAC-seq/0.1-fasrc02       cd-hit/4.6.4-fasrc02
julia/0.6.2-fasrc02        plink/1.90-fasrc01
Anaconda/5.0.1-fasrc01     cellranger/2.1.0-fasrc01
julia/0.6.3-fasrc01        progressiveCactus/20180313-fasrc01
Anaconda3/5.0.1-fasrc01    centos6/0.0.1-fasrc01
julia/0.6.3-fasrc02 (D)    proj/4.9.3-fasrc01
BEAST/2.4.8-fasrc01        centrifuge/1.0.3.5c51ac-fasrc02
kalign/2.0-fasrc01         proj/5.0.1-fasrc01 (D)
BaitFisher-package/e92dbf28b-fasrc01 centrifuge/1.0.3.8a9a820-fasrc01 (D)
kallisto/0.43.1-fasrc02    prokka/1.12-fasrc02
CLAPACK/3.2.1-fasrc01      clustalo/1.2.0-fasrc01
kraken/1.1-fasrc01         psmc/0.6.5-fasrc01
--More--


The module-query command supports more sophisticated queries and returns additional information for modules. If you query by the name of an application or library (e.g. hdf5), you’ll retrieve a consolidated report showing all of the modules grouped together for a particular application. The online module search is much more thorough as it will show you all available versions.

[jharvard@boslogin01 ~]$ module-query hdf5
------------------------------------------------------------------------------------------------------------
hdf5
------------------------------------------------------------------------------------------------------------
Built for: centos7
Description:
HDF5 is a data model, library, and file format for storing and managing data. It supports
an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for
high volume and complex data. HDF5 is portable and is extensible, allowing applications to
evolve in their use of HDF5. The HDF5 Technology suite includes tools and applications for
managing, manipulating, viewing, and analyzing data in the HDF5 format. HDF5 is used as a
basis for many other file formats, including NetCDF.
Versions:
hdf5/1.10.1-fasrc03..................... Core Core module for CentOS 7
hdf5/1.10.1-fasrc02..................... Comp
hdf5/1.10.1-fasrc01..................... MPI
hdf5/1.8.12-fasrc12..................... MPI
hdf5/1.8.12-fasrc09..................... Comp Compiler-specific build
hdf5/1.8.12-fasrc08..................... Core Added c++ bindings
To find detailed information about a module, enter the full name.
For example,
module-query hdf5/1.8.12-fasrc08


A query for a single module, however, will return details about that build including module load statements and build comments (if any exist).

[jharvard@boslogin01 ~]$ module-query hdf5/1.10.1-fasrc01
------------------------------------------------------------------------------------------------------------
hdf5 : hdf5/1.10.1-fasrc01
------------------------------------------------------------------------------------------------------------
Built for: centos7
Description:
HDF5 is a data model, library, and file format for storing and managing data. It supports
an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for
high volume and complex data. HDF5 is portable and is extensible, allowing applications to
evolve in their use of HDF5. The HDF5 Technology suite includes tools and applications for
managing, manipulating, viewing, and analyzing data in the HDF5 format. HDF5 is used as a
basis for many other file formats, including NetCDF.
This module can be loaded as follows:
module load gcc/7.1.0-fasrc01 openmpi/2.1.0-fasrc02 hdf5/1.10.1-fasrc01
module load gcc/7.1.0-fasrc01 mvapich2/2.3b-fasrc02 hdf5/1.10.1-fasrc01
module load intel/17.0.4-fasrc01 openmpi/2.1.0-fasrc02 hdf5/1.10.1-fasrc01
module load intel/17.0.4-fasrc01 mvapich2/2.3b-fasrc02 hdf5/1.10.1-fasrc01
This module also loads:
zlib/1.2.8-fasrc07 szip/2.1-fasrc02

For more details about the Helmod module system, check out the Software on The Cluster page (this has been updated to reflect our upgrade to CentOS7).
For more details about errors in loading modules after the O3 upgrade, check out the CentOS7 FAQ.


Remote desktop access

For a GUI native X11 interface, you can connect to the cluster using our Open OnDemand VDI system. This is more reliable and stable than X11 forwarding back to your computer. Remote desktop access is particularly useful for heavy client applications like MATLAB, Jupyter, and R Studio where the performance of X11 forwarding is decidedly poor.

Open OnDemand (aka OOD) servers are available on both the Cannon and FASSE clusters.


Using GPUs

To request a single GPU on Slurm, add #SBATCH --gres=gpu to your submission script. To request multiple GPUs, add #SBATCH --gres=gpu:n, where ‘n’ is the number of GPUs. Note that --gres specifies resources on a per-node basis, so for multinode work you only need to specify how many GPUs you need per node. For more on GPU computing, see our more in-depth GPGPU document.

Specifying GPU Type

Users who wish to specify which type of GPU to use, especially on heterogeneous partitions like gpu_requeue, can use one of two methods. The first is --constraint="<tag>", which constrains the job to run only on GPUs of a certain class. A full listing of constraints can be found below. The second is to name the specific model with --gres=gpu:<model>:1. For example, if you want an A100 with 80GB of onboard memory, specify --gres=gpu:nvidia_a100-sxm4-80gb:1.

a100

  • nvidia_a100-pcie-40gb: Nvidia A100 PCIe 40GB
  • nvidia_a100-sxm4-40gb: Nvidia A100 SXM4 40GB
  • nvidia_a100-sxm4-80gb: Nvidia A100 SXM4 80GB

h100 & h200

  • nvidia_h100_80gb_hbm3: Nvidia H100 80GB HBM3
  • nvidia_h200: Nvidia H200 140GB

mig

  • nvidia_a100_1g.5gb: Nvidia A100 1g MIG 5GB
  • nvidia_a100_1g.10gb: Nvidia A100 1g MIG 10GB
  • nvidia_a100_2g.10gb: Nvidia A100 2g MIG 10GB
  • nvidia_a100_3g.20gb: Nvidia A100 3g MIG 20GB
  • nvidia_a100_3g.39gb: Nvidia A100 3g MIG 40GB
  • nvidia_a100_4g.20gb: Nvidia A100 4g MIG 20GB
  • nvidia_a100_4g.39gb: Nvidia A100 4g MIG 40GB

v100

  • tesla_v100-pcie-16gb: Nvidia V100 PCIe 16GB
  • tesla_v100-pcie-32gb: Nvidia V100 PCIe 32GB
  • tesla_v100-sxm2-16gb: Nvidia V100 SXM2 16GB
  • tesla_v100-sxm2-32gb: Nvidia V100 SXM2 32GB
  • tesla_v100s-pcie-32gb: Nvidia V100S PCIe 32GB

a40

  • nvidia_a40: Nvidia A40 40GB

rtx

  • nvidia_rtx_a6000: Nvidia RTX A6000 PCIe 48GB

Some of the GPUs listed here were purchased by specific groups and are only available via gpu_requeue. To find out which specific types of GPU are available on a partition, run scontrol show partition <PartitionName> and look under the TRES category.
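Putting the pieces above together, a minimal GPU job script might look like the sketch below. The time and memory values and the report_gpus helper are illustrative assumptions; the partition, the --gres form, and the a100 class come from the text above. It also assumes that Slurm exports CUDA_VISIBLE_DEVICES for GPU jobs, which is the usual behavior but worth confirming on your partition.

```shell
#!/bin/bash
#SBATCH -p gpu_requeue            # Heterogeneous GPU partition named above
#SBATCH --gres=gpu:1              # One GPU per node
#SBATCH --constraint="a100"       # Only run on nodes in the a100 class
#SBATCH -t 0-00:10                # Runtime in D-HH:MM
#SBATCH --mem=4G                  # Memory for the job

# Print the GPU indices visible to this job, or "none" when the variable
# is unset (for example when the script is run outside a GPU allocation).
report_gpus() {
    echo "Visible GPUs: ${CUDA_VISIBLE_DEVICES:-none}"
}

report_gpus
```

To pin an exact model instead of a class, drop the --constraint line and use the model form of --gres, for example --gres=gpu:nvidia_a100-sxm4-80gb:1.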


Parallelization

Using Threads such as OpenMP

One of the basic methods for parallelization is to use a threading library such as pthreads or OpenMP, or an application that uses OpenMP under the hood (e.g. numpy, OpenBLAS). By default, Slurm does not know which cores to assign to which process it runs. In addition, for threaded applications you need to make sure that all the cores you request are on the same node. Below is an example script that both ensures all the cores are on the same node and lets Slurm know which process gets the cores that you requested for threading.

#!/bin/bash
#SBATCH -c 8 # Number of threads
#SBATCH -t 0-00:30:00 # Amount of time needed DD-HH:MM:SS
#SBATCH -p sapphire # Partition to submit to
#SBATCH --mem-per-cpu=100 #Memory per cpu
module load intel/21.2.0-fasrc01
srun -c $SLURM_CPUS_PER_TASK MYPROGRAM > output.txt 2> errors.txt

The most important aspect of the threaded script above is the -c option, which tells Slurm how many threads you intend to run with. If you are using OpenMP, you will want to notify it of how many threads it can use by setting OMP_NUM_THREADS before the executable:

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
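The sketch below shows where that export could sit in the threaded script, with one addition that is an assumption of this example rather than something the script above requires: a fallback to a single thread when SLURM_CPUS_PER_TASK is unset, so the script can also be tried outside of a job. The set_omp_threads name is illustrative.

```shell
#!/bin/bash
#SBATCH -c 8                  # Number of threads
#SBATCH -t 0-00:30:00         # Amount of time needed DD-HH:MM:SS
#SBATCH -p sapphire           # Partition to submit to
#SBATCH --mem-per-cpu=100     # Memory per cpu in MB

# Use the core count Slurm granted to this task; fall back to 1 thread
# when the variable is unset (e.g. when testing outside of a job).
set_omp_threads() {
    export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
}

set_omp_threads
echo "OpenMP will use ${OMP_NUM_THREADS} thread(s)"

# The program launch from the script above would follow here:
# srun -c "$SLURM_CPUS_PER_TASK" MYPROGRAM > output.txt 2> errors.txt
```

Deriving the thread count from SLURM_CPUS_PER_TASK keeps the -c request and the OpenMP setting from drifting apart when you change one of them.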

Using MPI

MPI (Message Passing Interface) is a standard that supports communication between separate processes, allowing parallel programs to simulate a large common memory space. OpenMPI and MVAPICH2 are available as modules on the cluster as well as an Intel specific library.
As described in the Helmod documentation, MPI libraries are a special class of module, called “Comp”, that is compiler dependent. To load an MPI library, load the compiler first.

module load intel/21.2.0-fasrc01 openmpi/4.1.1-fasrc01

Once an MPI module is loaded, applications built against that library are made available. This dynamic loading mechanism prevents conflicts that can arise between compiler versions and MPI library flavors.
An example MPI script with comments is shown below:

#!/bin/bash
#SBATCH -n 128 # Number of cores
#SBATCH -t 10 # Runtime in minutes
#SBATCH -p sapphire # Partition to submit to
#SBATCH --mem-per-cpu=100 # Memory per cpu in MB (see also --mem)
module load intel/21.2.0-fasrc01 openmpi/4.1.1-fasrc01
module load MYPROGRAM
srun -n $SLURM_NTASKS --mpi=pmix MYPROGRAM > output.txt 2> errors.txt

There are a number of important aspects to an MPI SLURM job.

  • MPI jobs must be run on a partition that supports MPI interconnects. sapphire, shared, test, general, unrestricted are MPI-enabled, but serial_requeue includes non-MPI resources and should be avoided.
  • Memory should be allocated with the --mem-per-cpu option instead of --mem so that memory matches core utilization.
  • The -np option for mpirun or mpiexec (when these runners are used) should use the bash variable $SLURM_NTASKS so that the correct number of cores is passed to the MPI engine at runtime.
  • If network topology and communications overhead are a concern for your code, try the --contiguous option, which ensures that all the cores you get are adjacent to each other. Use this with caution, though: finding contiguous blocks of compute is difficult, so your job will pend longer. If you do not include this option, you will be given cores on whatever nodes Slurm can find, which may be scattered across the cluster. Depending on your code this may or may not be a concern. If you don’t know offhand, test your code in both modes and verify that the boost in performance is worth the extra wait in the queue; the aggregate time of waiting plus runtime may be longer with --contiguous. The sbatch and srun documentation have more information on various fine-tuning options.
  • The application must be MPI-enabled. Applications cannot take advantage of MPI parallelization unless the source code is specifically built for it. All such applications in the Helmod module system can only be loaded if an MPI library is loaded first.
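To make the --mem-per-cpu point concrete, the arithmetic below works out the total memory requested by the example MPI script. It is a small sketch that assumes one core per task; the total_mem_mb helper is illustrative.

```shell
#!/bin/bash
# total_mem_mb NTASKS MEM_PER_CPU_MB
# Total memory in MB requested across all cores, assuming one core per task.
total_mem_mb() {
    echo $(( $1 * $2 ))
}

# The example script above asks for 128 cores at 100 MB each.
echo "Total request: $(total_mem_mb 128 100) MB"

# If you launch with mpirun instead of srun, pass the same task count:
# mpirun -np "$SLURM_NTASKS" MYPROGRAM > output.txt 2> errors.txt
```

Because the request scales with the core count, changing -n automatically changes the total memory, which is what keeps memory matched to core utilization.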

Job Arrays

SLURM allows you to submit a number of “near identical” jobs simultaneously in the form of a job array. To take advantage of this, you will need a set of jobs that differ only by an “index” of some kind.
For example, say that you would like to run tophat, a splice-aware transcript-to-genome mapping tool, on 30 separate transcript files named trans1.fq, trans2.fq, trans3.fq, etc. First, construct a SLURM batch script, called tophat.sh, using special SLURM job array variables:

#!/bin/bash
#SBATCH -J tophat # A single job name for the array
#SBATCH -c 1 # Number of cores
#SBATCH -p serial_requeue # Partition
#SBATCH --mem 4000 # Memory request (4Gb)
#SBATCH -t 0-2:00 # Maximum execution time (D-HH:MM)
#SBATCH -o tophat_%A_%a.out # Standard output
#SBATCH -e tophat_%A_%a.err # Standard error
module load tophat/2.0.13-fasrc02
tophat /n/netscratch/informatics_public/ref/ucsc/Mus_musculus/mm10/chromFa trans"${SLURM_ARRAY_TASK_ID}".fq

Then launch the batch process using the --array option to specify the indexes.
sbatch --array=1-30 tophat.sh
In the script, two types of substitution variables are available when running job arrays. The first, %A and %a, represent the job ID and the job array index, respectively. These can be used in the sbatch parameters to generate unique names. The second, SLURM_ARRAY_TASK_ID, is a bash environment variable that contains the current array index and can be used in the script itself. In this example, 30 jobs will be submitted each with a different input file and different standard error and standard out files.
More detail can be found on the SLURM job array documentation page.
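When input files are not numbered as neatly as trans1.fq, trans2.fq, and so on, a common workaround is to keep the file names in a list and use the array index to pick a line from it. The sketch below illustrates that idea; the helper name input_for_task and the file names are hypothetical.

```shell
#!/bin/bash
# input_for_task LIST_FILE INDEX
# Print line number INDEX (1-based) of LIST_FILE.
input_for_task() {
    sed -n "${2}p" "$1"
}

# Demonstration with a throwaway list. Inside an array job the index comes
# from Slurm; the default of 2 is only used when running outside of one.
list="$(mktemp)"
printf '%s\n' liver.fq kidney.fq brain.fq > "$list"
SLURM_ARRAY_TASK_ID="${SLURM_ARRAY_TASK_ID:-2}"
echo "Task ${SLURM_ARRAY_TASK_ID} would process: $(input_for_task "$list" "$SLURM_ARRAY_TASK_ID")"
rm -f "$list"
```

With a three-line list like this one, the job would be submitted with sbatch --array=1-3 so that each index maps to exactly one line.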


Checkpointing

Slurm does not automatically checkpoint, i.e. create files that your job can restart from. To protect against job failure (due to code error or node failure) and to allow your job to be broken up into smaller chunks it is always advisable to checkpoint your code so it can restart from where it left off. This is especially valuable for jobs on partitions subject to requeue, but is also just generally useful for any type of job. Checkpointing varies from code type to code type and needs to be implemented by the user as part of their code base. Some resources for checkpointing codes that do not have them built-in include Distributed MultiThreaded CheckPointing (DMTCP) and Checkpoint/Restore in Userspace (CRIU).
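For codes with no built-in checkpointing, the simplest pattern is to record the last completed unit of work in a file and skip ahead on restart. The sketch below shows that pattern for a numbered sequence of steps. It is an illustration only: run_with_checkpoint and its STOP_AFTER argument (used here to simulate a job that dies partway) are hypothetical, and real codes usually need to save more state than a step number.

```shell
#!/bin/bash
# run_with_checkpoint CHECKPOINT_FILE LAST_STEP [STOP_AFTER]
# Run steps 1..LAST_STEP, recording each completed step in CHECKPOINT_FILE.
# On restart, steps that were already recorded are skipped.
# STOP_AFTER exists only to simulate a job that is killed early.
run_with_checkpoint() {
    local ckpt="$1"
    local last="$2"
    local stop_after="${3:-$2}"
    local start=1
    local step
    if [ -f "$ckpt" ]; then
        start=$(( $(cat "$ckpt") + 1 ))
    fi
    step="$start"
    while [ "$step" -le "$last" ] && [ "$step" -le "$stop_after" ]; do
        echo "running step $step"      # placeholder for the real work
        echo "$step" > "$ckpt.tmp"     # write to a temporary file first,
        mv "$ckpt.tmp" "$ckpt"         # then rename so the file is never half-written
        step=$((step + 1))
    done
}
```

Calling run_with_checkpoint state.ckpt 5 3 runs steps 1 to 3 and stops; calling run_with_checkpoint state.ckpt 5 afterwards resumes at step 4. On a requeue partition the same script can simply be rerun after a preemption.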

Job dependencies

Many scientific computing tasks consist of serial processing steps. A genome assembly pipeline, for example, may require sequence quality trimming, assembly, and annotation steps that must occur in series. Launching each of these jobs without manual intervention can be done by repeatedly polling the controller with squeue / sacct until the State is COMPLETED. However, it’s much more efficient to let the SLURM controller handle this using the --dependency option.

[akitzmiller@boslogin01 examples]$ sbatch assemble_genome.sh
Submitted batch job 53013437
[akitzmiller@boslogin01 examples]$ sbatch --dependency=afterok:53013437 annotate_genome.sh
[akitzmiller@boslogin01 examples]$

When submitting a job, specify a combination of “dependency type” and job ID in the --dependency option. afterok is an example of a dependency type that will run the dependent job if the parent job completes successfully (state goes to COMPLETED). The full list of dependency types can be found on the SLURM doc site in the man page for sbatch. It is best not to create a chain of dependencies that is greater than 2-3 levels. Any more than that and the scheduler will become significantly slower. Dependencies should only be used if the resource requirements between each step are significantly different, or if you need to wait for an array to complete before you run a single job that processes all the array results. Be sure to think about whether you truly need dependencies or not.
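To script the two-step submission shown in the transcript above instead of copying the job ID by hand, you can pull the ID out of sbatch's confirmation message. The sketch below does this with a small helper; the name job_id_from_sbatch is illustrative, and the message is hard-coded so the example runs anywhere.

```shell
#!/bin/bash
# job_id_from_sbatch MESSAGE
# Extract the numeric job ID from a "Submitted batch job N" message.
# Prints nothing if the message has a different form.
job_id_from_sbatch() {
    echo "$1" | awk '/^Submitted batch job/ { print $4 }'
}

# Here we reuse the message from the transcript above. On the cluster you
# would capture it with: msg="$(sbatch assemble_genome.sh)"
msg="Submitted batch job 53013437"
jobid="$(job_id_from_sbatch "$msg")"

# Print the follow-up submission rather than running it.
echo "sbatch --dependency=afterok:${jobid} annotate_genome.sh"
```

sbatch also has a --parsable option that prints the job ID without the surrounding text, which removes the need for parsing when it suits your workflow.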

Job Constraints

Sometimes, especially on the requeue partitions, jobs need to be constrained to run on specific hardware. Often this is because the code was compiled for a specific architecture or because it runs more efficiently on a specific type of host. Slurm provides this functionality via the --constraint option (see the sbatch documentation for usage details). The features for constraint are defined by FASRC and fall into three broad categories: Processor, GPU, and Network. You can match against several of these, but keep in mind that the more constraints you use, the longer your job will pend, as the scheduler will find it more difficult to find nodes that fit your needs. A list of the features available on the cluster follows. You can also see the features for a specific node by running scontrol show node NODENAME.

Processor

  • amd: All AMD processors
  • intel: All Intel processors
  • avx: All processors that are AVX capable
  • avx2: All processors that are AVX2 capable
  • avx512: All processors that are AVX512 capable
  • milan: AMD Milan chips
  • genoa: AMD Genoa chips
  • skylake: Intel Skylake chips
  • sapphirerapids: Intel Sapphire Rapids
  • cascadelake: Intel Cascade Lake chips
  • icelake: Intel Ice Lake chips

GPU

To specify a GPU model (for example, an A100 with 80GB), refer to Specifying GPU Type.

  • rtxa6000: Nvidia RTX A6000 GPU
  • a40: Nvidia A40 GPU
  • v100: Nvidia V100 GPU
  • a100: Nvidia A100 GPU
  • a100-mig: Nvidia A100 GPU MIG
  • h100: Nvidia H100 GPU
  • h200: Nvidia H200 GPU

Network

  • holyhdr: Holyoke HDR Infiniband Fabric
  • holyndr: Holyoke NDR Infiniband Fabric

Troubleshooting Jobs and Resource Usage

A number of factors, including fair-share, are used for job scheduling

We use a multifactor method of job scheduling on the cluster. Job priority is assigned by a combination of fair-share and length of time a job has been sitting in the queue. You can find out the priority calculation for your jobs by using the sprio command, such as sprio -j JOBID.
You can find a description of how SLURM calculates Fair-share here. Fairshare is shared on a lab basis, so usage by any member of the lab will impact the score of the whole lab, as the lab is pulling from a common pool. Fairshare has a 3-day half-life and naturally recovers if your lab does not run any jobs. Thus it is wise to store up fairshare if you need to do significant runs, and to plan your runs accordingly in order to maintain a good fairshare score. You can learn more about your fairshare score and Slurm usage by using the sshare command, such as sshare -U, which shows your current score. Contact RC if you want to get graphs of your usage and fairshare over time.
The other factor in priority is how long you have been sitting in the queue. The longer your job sits in the queue the higher its priority grows, out to a maximum of 3 days. If everyone’s priority is equal then FIFO (first in first out) is the scheduling method. We weight the age of a job that has pended for 3 days to be equal to a fairshare score of 0.5.
We also have backfill turned on. This allows smaller jobs to sneak in while a larger, higher-priority job is waiting for nodes to free up. If your job can run in the amount of time it takes for the other job to get all the nodes it needs, SLURM will schedule you to run during that period. This means that knowing how long your code will run is very important, and the run time must be declared if you wish to leverage this feature. Otherwise the scheduler will assume you will use the maximum allowed time for the partition. The better you constrain your job in terms of CPU, Memory, and Time, the easier it will be for the backfill scheduler to find you space and let your job jump ahead in the queue.
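To get a feel for what a 3-day half-life means in practice, the sketch below computes how much weight past usage still carries after a given number of days. It assumes plain exponential decay, which is the usual meaning of a half-life; the scheduler's actual fairshare calculation has more inputs, so treat the numbers as a rough guide. The usage_weight helper is illustrative.

```shell
#!/bin/bash
# usage_weight DAYS
# Fraction of past usage that still counts after DAYS days,
# assuming exponential decay with a 3-day half-life.
usage_weight() {
    awk -v d="$1" 'BEGIN { printf "%.4f\n", exp(log(0.5) * d / 3) }'
}

for days in 0 3 6 9; do
    echo "after ${days} days: $(usage_weight "$days")"
done
```

Under this assumption, usage from nine days ago counts for only an eighth of its original weight, which is why a lab's score recovers fairly quickly once it stops running jobs.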

Troubleshooting common problems

A variety of problems can arise when running jobs on the cluster. Many are related to resource misallocation, but there are other common problems as well.

Error: Likely cause

  • JOB <jobid> CANCELLED AT <time> DUE TO TIME LIMIT: You did not specify enough time in your batch submission script. The -t option sets time in minutes or can also take D-HH:MM form (0-12:30 for 12.5 hours)
  • Job <jobid> exceeded <mem> memory limit, being killed: Your job is attempting to use more memory than you’ve requested for it. Either increase the amount of memory requested by --mem or --mem-per-cpu or, if possible, reduce the amount your application is trying to use. For example, many Java programs set heap space using the -Xmx JVM option. This could potentially be reduced. For jobs that require truly large amounts of memory (>256 Gb), you may need to use the bigmem SLURM partition. Genome and transcript assembly tools are commonly in this camp.
  • SLURM_receive_msg: Socket timed out on send/recv operation: This message indicates a failure of the SLURM controller. Though there are many possible explanations, it is generally due to an overwhelming number of jobs being submitted, or, occasionally, finishing simultaneously. If you want to figure out if SLURM is working use the sdiag command. sdiag should respond quickly in these situations and give you an idea as to what the scheduler is up to.
  • JOB <jobid> CANCELLED AT <time> DUE TO NODE FAILURE: This message may arise for a variety of reasons, but it typically indicates that the host on which your job was running can no longer be contacted by SLURM. Jobs that die from NODE_FAILURE are automatically requeued by the scheduler.
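For the time limit error, it can help to convert a runtime estimate in minutes into the D-HH:MM form mentioned above before putting it in the -t option. The helper below is a small sketch of that conversion; the name minutes_to_slurm_time is illustrative.

```shell
#!/bin/bash
# minutes_to_slurm_time MINUTES
# Convert a whole number of minutes to the D-HH:MM form accepted by -t.
minutes_to_slurm_time() {
    local total="$1"
    printf '%d-%02d:%02d\n' $((total / 1440)) $((total % 1440 / 60)) $((total % 60))
}

minutes_to_slurm_time 750     # 12.5 hours, prints 0-12:30
```

The result can be pasted directly into a script, for example #SBATCH -t 0-12:30.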

 

]]>