Cluster Usage – FASRC DOCS

Slurm Stats

Overview

When you log in to the FASRC clusters you will be greeted by Slurm Stats. Each night we pull the day's data from the scheduler and display a summary in an easy-to-read table when you log in to the cluster. This should help you understand how your jobs are performing and help you track your usage on a daily basis. Below is a description of the statistics we provide, along with recommendations on where to go for more information or to improve your performance.

The Statistics

+---------------- Slurm Stats for Aug 20 -----------------------+
|                  End of Day Fairshare                         |
|                    test_lab: 0.003943                         |
+-------------------- Jobs By State ----------------------------+
|       Total | Completed | Canceled | Failed | Out of |  Timed |
|             |           |          |        | Memory |    Out |
| CPU:     25 |         4 |        1 |     20 |      0 |      0 |
| GPU:     98 |        96 |        1 |      1 |      0 |      0 |
+---------------------- Job Stats ------------------------------+
|        | Average | Average   | Average    | Total Usage /     |
|        | Used    | Allocated | Efficiency | Ave. Wait Time    |
| Cores  |     4.3 |       5.5 |      69.4% |    133.00 CPU Hrs |
| Memory |   22.2G |     27.2G |      68.3% |                   |
| GPUS   |     N/A |       1.0 |        N/A |    100.20 GPU Hrs |
| Time   |  14.57h |    45.38h |      45.9% |             0.00h |
+---------------------------------------------------------------+

Above is what you will see when you log in to the cluster if you have run jobs in the last day. This data is pulled from the scheduler and covers jobs that finished in the 24-hour day listed. If you would like similar summary information for a longer period of time, use the seff-account command. For instance, to get the data for the last week you would run:

seff-account -u USERNAME -S 2024-08-13 -E 2024-08-20

For more detailed information on specific jobs you can use the seff and sacct commands. If you want summary plots of various statistics, please see our XDMoD instance (requires RC VPN). For fairshare usage plots, see our Cannon and FASSE Fairshare Dashboards (requires RC VPN). Below we describe the various fields and what they mean.
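For example (the job ID below is a placeholder), you might check a single finished job with:

seff 12345678
sacct -j 12345678 --format=JobID,JobName,State,Elapsed,TotalCPU,MaxRSS

seff summarizes CPU and memory efficiency for the job, while sacct lets you pick the specific accounting fields you care about.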

Fairshare

The first thing listed is the fairshare for the lab accounts that you belong to. This is as of the end of the day indicated. Lower fairshare means lower priority for your jobs on the cluster.  For more on fairshare and how to improve your score see our comprehensive fairshare document.

Job State

If you have jobs that finished on the day indicated, a breakdown of their end states is presented. Jobs are sorted first by whether or not they asked for a GPU. Next the total number of jobs in that category is given, followed by a breakdown by state. Completed jobs are those that finished cleanly with no errors that Slurm could detect (there may still be errors that your code generated internally). Canceled jobs are those terminated via the scancel command, either by yourself or by an administrator. Failed jobs are those that the scheduler detected as having a faulty exit. Out of Memory jobs are those that hit the memory limit requested in the job script. Timed Out jobs are those that hit the time limit requested in the job script.
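If you want to reproduce a similar per-job breakdown yourself for an arbitrary window, a sacct query along these lines (the dates and username are examples) lists each job's final state:

sacct -X -u USERNAME -S 2024-08-20 -E 2024-08-21 --format=JobID,JobName,Partition,State,ExitCode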

Used, Allocated, and Efficiency

For all the jobs that were not Canceled, we calculate statistics averaged over all the jobs run. These are broken down by Cores, Memory, GPUs, and Time. Average Used is the average amount of each resource actually used by a job. Average Allocated is the average amount of resources allocated in the job script. Average Efficiency is the ratio of the amount of resources Used to the amount Allocated per job, averaged over all the jobs. Ideally your jobs should use exactly as much of each resource as they request, or as close to that as possible, and hence have an Average Efficiency of 100%. In practice, some jobs use all the resources they request and others do not. Unused allocated resources mean that your code is not utilizing all the space you have set aside for it. This waste ends up driving down your fairshare, as cores, memory, and GPUs you do not use are still charged against your fairshare.

To learn which jobs are the culprits, we recommend using tools like seff-account, seff, and sacct. These tools can give you an overview of your jobs as well as more detailed information about specific jobs. We also have an in-depth guide to Job Efficiency and Optimization that goes into more detail regarding techniques for improving your efficiency.

Finally, in the case of GPUs, Slurm does not currently gather statistics on actual usage, so we cannot construct an efficiency metric. That said, if you want to learn more about how your job is performing, check out the Job Efficiency and Optimization doc as well as our GPU monitoring documentation. Tools like nvidia-smi and nvtop can be useful for monitoring your usage interactively.
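For instance, while a GPU job is running you can check its utilization interactively from the node it is running on (the node name below is just an example; you can typically only log in to a compute node while you have a job running there):

ssh holygpu8a11404       # connect to the node running your job
nvidia-smi -l 5          # print GPU utilization and memory usage every 5 seconds
nvtop                    # interactive, top-like view of GPU usage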

Total Usage

Total usage is the total number of hours allocated for CPUs and GPUs respectively. This is a measure of your total usage for the jobs that finished on the day indicated. Note that this is the total usage for a job, so a job that ran for multiple days will have all of its usage show up at once in this number, not just its usage for that day. This usage is also not weighted by the type of CPU or GPU requested, which can affect how much fairshare the usage costs. For more on how we handle usage and fairshare, see our general fairshare document.

Wait Time

The number in the lower right corner of the Job Stats table, in the Time row, is your average wait time per job. This is a useful number, as your total Time to Science (TtS) is your wait time (aka pending time) plus your run time. Wait time varies depending on the partition used, the size of the job, and the relative priority of your jobs versus other jobs in the queue. To lower wait time, investigate using a different partition, submitting to multiple partitions, resizing your job, or improving your fairshare. A deeper discussion can be found on the Job Efficiency and Optimization page.
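One simple lever is submitting to several partitions at once; Slurm will start the job in whichever listed partition can run it first (the partition names below are placeholders):

sbatch -p PARTITION1,PARTITION2 my_job.sbatch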

Job Efficiency and Optimization Best Practices

Overview

The art of High Performance Computing is really the art of getting the most out of the computational resources you have access to. This applies whether you are working on a laptop, in the cloud, or on a supercomputer. While this diversity of systems and environments may seem intimidating, in reality there are some good general rules and best practices you can use to get the most out of your code and the computer you are on.

As defined in our Glossary, the term job has a broad and a narrow sense. In the broad sense, a job is an individual run of an application, code, or script, and may be used interchangeably with those terms. This applies whether you run it from the command line, from a cronjob, or through a scheduler. In the narrow sense, a job is an individual allocation for a user by the scheduler. It is usually obvious from context which is meant.

By Job Efficiency, we mean that the parameters of the job in terms of cores, GPUs, memory, network, time, etc. (refer to the Glossary for definitions) are accurately defined and match what the job actually uses. For example, a job that asks for 100 cores but only uses 1 is not efficient. A job that asks for 100GB and uses 99GB is efficient. Efficiency is a measure of how well the user has scoped their job so that it can run in the space defined.

Finally, Job Optimization means making the job run at the maximum speed possible with the least amount of resources. For example, a poorly optimized code may only use 50% of the GPU it was allocated, whereas a well optimized one could use 100% and see acceleration commensurate with that improved usage. Similarly, a poorly optimized code may use 1TB of memory where a well optimized code may only use 100GB. Optimization is a measure of how well structured a code is numerically, both in algorithm and implementation, so that it can get to the solution in the fastest, most accurate, and most economical way.

Efficiency and Optimization are thus two sides of the same coin. Efficiency is about accurately defining the resources that you will use and optimization is about reducing that usage. Both have the goal of getting the most out of the resources the job is using.

Architecture

[Figure: Schematic showing how cores, memory, and nodes are arranged on the cluster.]

Before we get into Job Efficiency and Optimization, we should first discuss general cluster architecture. Supercomputers (aka clusters) are essentially many computers of similar type networked together by a high-speed interconnect so that they can act in unison and share a common computational environment. The fundamental building block of a cluster is a node. Each node is composed of a number of cores that all talk to the same block of memory. GPU nodes have, in addition to cores and memory, GPUs that can be used for specialized workflows such as machine learning. The nodes are strung together with a network, typically InfiniBand, and a scheduler is put in front to decide which jobs get which resources.

CPU/GPU Type

Typically a cluster is made up of a uniform set of hardware, but that is not necessarily the case. At FASRC we run a variety of hardware spanning multiple generations and vendors. These different types of CPU and GPU have different performance characteristics and features that may impact your job. We will talk about this later, but being cognizant of which hardware your job works best on is important for efficient and optimal use of the cluster. At FASRC we split up our partitions such that each partition has a uniform set of hardware, unless otherwise noted (e.g. gpu_requeue and serial_requeue). A comprehensive list of available hardware can be found in the Job Constraints section of the Running Jobs page. You can learn more about a specific node's hardware by running scontrol show node NODENAME.
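For example, to inspect the hardware behind a specific node or an entire partition (the names are placeholders), you can use:

scontrol show node NODENAME                   # details for a single node
sinfo -N -p PARTITION -o "%n %c %m %f"        # node name, cores, memory (MB), and feature flags for each node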

Network Topology

How the nodes are interconnected over InfiniBand is known as the topology. Topology becomes very important as you run larger and larger jobs; more on that later. At FASRC we generally follow a fat-tree InfiniBand topology, with a hierarchy of switches and adjacent nodes being close to each other network-wise.

We name our nodes after their location in the data center, which makes it easy to figure out which nodes are next to each other both in space and on the network. Our naming convention for nodes is Datacenter / GPU node or not / Row / Pod / Rack / Chassis / Blade. For example, the node holy7c04301 is in our Holyoke Data Center in Row 7, Pod C, Rack 04, Chassis 3, Blade 01; the adjacent nodes are holy7c04212 (the last blade on the chassis below Chassis 3, as there are 12 blades per chassis for this hardware) and holy7c04302. Another example is holygpu8a11404, a GPU node in our Holyoke Data Center in Row 8, Pod A, Rack 11, Chassis 4, Blade 4; the adjacent nodes are holygpu8a11403 and holygpu8a11501 (the first blade on the chassis above Chassis 4, as there are only 4 blades per chassis for this hardware). You can see the full topology of the cluster with scontrol show topology.

Job Efficiency

The first step in improving job efficiency is understanding your job as it exists today. Understanding means you have a good handle on the resource needs and characteristics of your job, and thus you are able to accurately allocate resources to it, thereby improving efficiency. As a general rule, you should always understand the jobs you run, regardless of size. This knowledge is beneficial both for right-sizing your requested resources and for noticing any pitfalls that may occur when scaling the job up.

There are two ways of learning about your job. The first is to have a fundamental understanding of the job you are running. Based on your knowledge of the algorithm, code, job script, and cluster architecture, you know what you should request for core count, GPU count, memory, time, and storage. Knowing your code at this level allows you to make the most accurate estimates of what you will use.

While a full understanding of your job is ideal, it is often not possible. You may not control the code base, you may just be getting started, or you may not have time to develop a deep understanding of the job. Even in cases where you have a good theoretical understanding of your job, you need to confirm that knowledge with hard data. The second method, then, is to test your job empirically and find out what the best job parameters are. Simply take an example that you know will be akin to what you will run in production and run it as a test job. Once the test job is done, check how it performed. Then repeat, changing the job parameters until you have a good understanding of how your job performs in different situations.

That’s the rough sketch, but the details are a bit different depending on what you want to understand. Below are some methods for finding out how much memory, cores, gpus, time, and storage your job will need. These may not cover every job but should work for most situations.

Memory

Memory on the cluster is doled out in two different ways, either by node (--mem) or by core (--mem-per-cpu). If your job exceeds its memory request, you will see an error containing Out of Memory or oom. This indicates that the scheduler terminated your job for exceeding its memory request. You will need to increase your memory request to give your job more space.
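A minimal sketch of the two request styles in a job script (the values are placeholders; use one style or the other, not both):

#SBATCH --mem=8G             # total memory per node for the job
#SBATCH --mem-per-cpu=4G     # alternative: memory per allocated core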

Here is a test plan for figuring out how much memory you should request:

  1. Come up with an initial guess as to how much memory your job will require. A good first guess is usually 2-3 times the size of any data files you are reading in, or 2-3 times the size of the data you will be generating. If you do not know either of those, then a safe initial guess is 4GB. Most of the cluster has 4GB per core, so it's a good initial guess that will allow you to get through the scheduler quickly.
  2. Run a test job on the test partition with your guess (a sketch of such a test script is shown after this list).
  3. Check the result of your run using seff or sacct. (Note: GPU stats (usage and onboard GPU memory) are not given by either of these commands, only CPU usage and CPU memory)
  4. If your job ran out of memory, double the amount and return to step 2. If it ran properly (i.e., no out-of-memory error), look at how much memory your job actually used and update your request to match, plus an additional 10% buffer; the scheduler samples memory usage every 30 seconds, so it may have missed short-term memory spikes.
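A minimal sketch of such a test script, with placeholder values for the memory guess, time, and program:

#!/bin/bash
#SBATCH -p test              # FASRC test partition
#SBATCH -c 1
#SBATCH --mem=4G             # initial guess from step 1
#SBATCH -t 0-01:00
./my_program input.dat       # placeholder for your actual code

After it finishes, seff JOBID reports the memory actually utilized and the memory efficiency, which you can compare against the request.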

Every time you change a parameter in your code, you should check how the memory usage changes. Some parameters will not change the memory usage at all; others will change it dramatically. If you do not know whether a parameter will change the memory usage, run a test to see how it behaves.

If you are working to scale up a job, it is good to understand how your memory usage will scale as the job grows. For example, say you are running a three-dimensional code and you increase the resolution of the box you are simulating by a factor of 2. Your memory usage will then grow by a factor of 8, because each dimension grew by a factor of 2. Likewise, if you are running a simulation that ingests data, memory will likely scale linearly with the amount of data you ingest. Testing by increments is the best way to validate how your memory usage will grow in your situation.

One important warning is to make sure to request the right amount of memory for each type of job. When your job runs, the scheduler blocks off a segment of memory for you, regardless of whether you actually use it. If your job asks for 100GB but only uses 1GB, the scheduler still reserves 100GB and your fairshare is charged for that 100GB. In addition, if you had asked for 1GB your job may have fit better into the gaps in the scheduler, as 1GB of memory is easier to find than 100GB. Efficient use of the cluster means selecting the right amount of memory for whatever job you are running at that time. A quick way to spot jobs with incorrect memory settings is the seff-account command, which will plot a histogram of your job memory efficiency over a specified period.

Cores

Slurm does not automatically parallelize jobs. Even if you ask for thousands of cores, if your job is not set up to run in parallel it will just run on a single core and the other cores will remain idle. Thus, when in doubt about your code, err on the side of asking for a single core, then check the code's documentation or contact the primary author to find out whether it is parallel and what method it uses.

Broadly, parallel applications fall into two categories: thread based and rank based. Thread-based parallelism relies on a shared memory space and thus is constrained to a single node. This includes things like OpenMP, pthreads, and Python multiprocessing. Rank-based parallelism relies on individual processes, each with its own dedicated memory space, which communicate with each other to share information. The main example of this is MPI (Message Passing Interface). It is important to understand which method your job uses, as that makes a difference in how you ask for resources in Slurm and how many cores you can reasonably ask for.
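As a rough sketch, the resource request differs between the two models (the application names and counts below are placeholders):

# Thread-based (e.g. OpenMP): one task with several cores on a single node
#SBATCH -N 1
#SBATCH -c 8
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_threaded_app

# Rank-based (MPI): many tasks, which may span nodes
#SBATCH -n 64
srun ./my_mpi_app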

Once you figure out whether your code is thread based or rank based, you can do a scaling test to see how your code behaves as you add more cores. There are two types of scaling tests, and they test slightly different parts of your code. The first type is called strong scaling. In this test, you keep the size of the problem the same while increasing the number of cores you use. In an ideal world your job should go twice as fast every time you double the number of cores. Most codes, though, do not have ideal scaling. Instead, various inefficiencies in the algorithm or the size of the job itself mean that there is a point of diminishing returns where adding more cores does not improve speed. A typical strong scaling plot looks like:

[Figure: Strong scaling plot (log-log), with the ideal scaling line in red and experimental data in black. The experimental curve bends away from the ideal line at the point where scaling becomes inefficient.]

In this example, the user would not want to run their code with more than 256 cores because after that point adding more cores has diminishing returns with respect to improving performance.

The second type is called weak scaling. In this test you increase the size of the job in proportion to the number of cores asked for. So if you double the cores, you double the job size. Job size in this case is the total amount of computational work your job does; for instance, you might double the amount of data ingested (assuming your code's computational needs increase linearly with the data ingested) or double one of the dimensions of a multidimensional grid. In an ideal world, your job should take the same amount of time to run if the job size grows linearly with the core count. Most codes, though, do not have this ideal scaling. Instead, various communication inefficiencies or nonlinear growth in processing time can impact the performance of the job, so adding more cores becomes inefficient beyond a certain point. A typical plot for weak scaling looks like:

 

[Figure: Weak scaling plot (log-linear), with the ideal scaling line in red and experimental data in black. The experimental curve bends away from the ideal line at the point where scaling becomes inefficient.]

In this example, the user would not want to run this job with more than about 1000 cores as, after that point, the run time grows substantially from the ideal.

Besides these more robust scaling tests, you can get a quick view of your job's core usage efficiency using the seff or seff-account command. These commands take the ratio of two numbers. The first is how much time the job actually spent computing on its cores (t_cpu), known as the CPU time. Note that for historical reasons CPU and core are used interchangeably with respect to CPU time; regardless of the name, what is meant is the amount of time the system detects as being spent computing on a specific set of cores. The second number is your elapsed run time multiplied by the number of cores (t_elapsed*n). If your job scales perfectly, your CPU efficiency (t_cpu/(t_elapsed*n)) will be 100%. If it is less than 100%, that ratio is roughly the factor by which you should reduce your core request. So if your job uses 8 cores but has an efficiency of 50% in seff, you should reduce your request to 4 cores instead. This is also a quick way to check whether your job is scaling at all: if you see your job using only one core, then either your job is not parallel or something is wrong, and you need to investigate why it is not scaling.
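As a worked example of that ratio (the numbers are made up):

CPU efficiency = t_cpu / (t_elapsed * n)
               = 40 CPU-hours / (10 hours * 8 cores)
               = 50%  ->  consider requesting about 4 cores for the next run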

Time to Science (TtS)

With these two tests you can figure out the maximum number of cores you should ask for. That said, even if your code scales perfectly, you will probably not want to ask for the maximum number of cores you can. The reason is that the more cores you ask for, the longer your job will pend in the queue waiting for resources to become available. Time to Science (TtS) is the amount of time your job pends plus the amount of time your job runs. You want to minimize both. Counterintuitively, asking for fewer cores may mean your job pends for a substantially shorter time, enough to make up for the loss in run speed.

As an illustration, say your code scales perfectly and your job of 256 cores will take 1 day to run. However, it turns out that you will spend 2 days pending in the queue waiting for your job to launch, so your total TtS is 3 days. After more investigation you find that if you ask for 128 cores, your job will take 2 days to run but the scheduler will be able to launch it in 4 hours, leading to a TtS of about 2.2 days. The 128-core job was "faster" than the 256-core job simply because it fit better in the scheduler at that moment.

It should be noted that the scheduler state is fluid, so you should inspect the queue before submitting. You can check when your job would be scheduled to run by adding the --test-only flag to your sbatch line; the scheduler will print back when it thinks the job would execute, without actually submitting it. This is a good way of right-sizing your job.
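For example (the script name is a placeholder):

sbatch --test-only my_job.sbatch    # validates the script and prints an estimated start time without submitting the job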

Topology

For certain codes, the layout on the node (i.e. which cores on which CPU) and on the cluster (i.e. where the nodes are located relative to each other) matters. In these cases the topology of the run is critical to getting peak speed out of the job. Without deep knowledge of the code base it is hard to know whether your code is one of these codes, and in most cases it is not.

In cases where the topology of the run matters, Slurm provides a number of options to require the scheduler to give you a certain layout for the job. Both the sbatch and srun commands have options to this effect. Note that the more constraints you add to a job, the longer it will take the scheduler to find resources for it. You should set the minimum necessary restrictions on a job to give the scheduler maximum flexibility. As before, you may see a significant speed-up if given the right topology, but if it comes at the cost of waiting significantly longer to run, your TtS may not improve and may even get worse.
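As an illustrative sketch, Slurm placement options such as --switches and --contiguous can constrain the layout; the values below are examples, and each added constraint can increase how long the job pends:

#SBATCH --nodes=4
#SBATCH --switches=1@02:00:00   # prefer nodes on a single switch, waiting up to 2 hours for that layout
#SBATCH --contiguous            # request consecutively numbered nodes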

GPUs

Many of the same rules that apply to cores also apply to GPUs. For most codes, your job will use a single GPU. If your code uses multiple GPUs, you can follow the same process as above for cores to see how your code scales. Note that GPU efficiency is not currently recorded in Slurm, so you will want to use other tools, like DCGM and nvtop, to get statistics on how your job is doing.
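A typical single-GPU request looks something like this (the partition name is a placeholder):

#SBATCH -p gpu
#SBATCH --gres=gpu:1    # request one GPU; only scale this up if your code actually uses multiple GPUs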

Time

It should be stated up front that Slurm does not charge you fairshare for time you do not use. If you ask for 3 days and only use 2 hours, the scheduler will only charge you for the 2 hours you actually used. This is different from Memory, Cores, and GPUs, where you are charged for allocating those resources whether you use them or not, since the scheduler had to block them off for you and could not give them to anyone else.

Accurately estimating time is important not for the sake of fairshare but for the sake of scheduling. The scheduler only knows what you tell it; if you tell it that a job takes 3 days, it will assume the job takes 3 days even if it really takes 2 hours. Thus, when the scheduler considers the job for scheduling, it will look for an allocation block as long as the time you requested. A more accurate time estimate means the scheduler can fit your job into tighter spots in the giant game of Tetris it is playing. Taking our previous example, there may be no spot right now for a 3-day job, but a 2-hour job may run immediately because there is a gap the scheduler can fill while waiting to schedule a large high-priority job. This behavior is called backfill, and it is one of the two loops the scheduler engages in when scheduling. Leveraging the backfill loop is important, as it is the main method through which low-priority jobs, even those with zero fairshare, get scheduled. You can leapfrog ahead of higher-priority jobs because your job happens to fill a gap.

Assuming you are running on the same hardware (see the next section for considerations regarding different types of hardware), you can reliably predict the runtime for certain classes of jobs. Simply run a test job and look at how long it took using sacct or seff. If you run many jobs, you can use seff-account to see the distribution of run times. Once you have the runtime, round it up to the nearest hour and add a buffer; that should cover most situations. Run times can vary for various reasons, but typically not by more than 10%, so if your job takes 10 hours, ask for around 12 hours.
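For instance, if sacct shows a representative run taking a bit over 10 hours, a reasonable request for the next run might be:

#SBATCH -t 12:00:00    # about 10 hours observed, rounded up with a buffer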

Finally, a word about minimum run times. As described above, your goal is to minimize Time to Science (TtS). You may naively think that asking for very short amounts of time would decrease TtS even more, but this is incorrect. The scheduler takes time to schedule jobs no matter how small your job is. To put it bluntly, you do not want the scheduler doing more work to schedule your job than your job itself is doing. For very short jobs the scheduler can get into a thrashing state where it schedules a job, the job exits immediately, and the scheduler has to fill that slot again, similar to trying to fill a tub with the drain open. To prevent this, we require jobs to run for at least 10 minutes; ideally jobs should last an hour or longer. Thus, when you are doing work on the cluster, try to batch your work in increments longer than 10 minutes and ideally longer than an hour. This helps the scheduler and keeps your TtS as short as possible.

Hardware

For similar job types, the run time is usually about the same, with the important caveat that you need to run on the same hardware. Different types of cores and GPUs have different capabilities and speeds. It is important to know how your job behaves as you switch between them. We have a table of relative speeds on the Fairshare page. Note that the table only applies if your code is fully utilizing the hardware in question (more on that in the optimization section); you should always test your code to see how it actually performs, as certain CPU and GPU types may work better for your code than others despite what the officially advertised benchmarks say. While we generally validate vendor-advertised performance numbers, they only apply to heavily optimized codes designed for those specific chips, so your code's speed may vary substantially.

If you are submitting to gpu_requeue or serial_requeue, you will notice that your run times vary quite a bit. This is because gpu_requeue and serial_requeue are mosaic partitions with a wide variety of hardware and thus a wide variety of performance. In cases like these you can either be very specific about which type of hardware you want using the --constraint option, or simply increase your time estimate to the maximum you expect it to take on the slowest hardware available. A good rule of thumb is a factor of three variance in speed: if your job takes 3 hours on most hardware, give it 9 hours on serial_requeue, as you may end up on a substantially slower host.
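For example, either of the following approaches could be used on serial_requeue (the feature name is illustrative; see the Running Jobs page for the actual constraint names):

# Option A: pin the job to one class of hardware
#SBATCH -p serial_requeue
#SBATCH --constraint=intel

# Option B: accept any hardware, but pad the time limit
#SBATCH -p serial_requeue
#SBATCH -t 09:00:00             # 3x the 3-hour run time seen on typical hardware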

Storage

The final thing that can impact Job Efficiency is the storage you use. Nothing drags down a fast code faster than slow IO (Input/Output). To select the right storage, please read our Data Storage Workflow page. In general, for jobs you will want to use either Global Scratch or Local Scratch. If your job is IO heavy (i.e. it is constantly talking to the storage), Local Scratch is strongly preferred. Please also see the Data Storage Workflow page for how to best lay out your file structure, as file structure layout can impact job performance as well.

Job Optimization

Now that we have dealt with Job Efficiency, the next thing to look at is Job Optimization; after all, the only way to improve your Time to Science (TtS) and increase your code's capability after properly structuring your job is to improve the code itself. Job optimization can be very beneficial but can also take significant time. In general there are three methods to optimize your code, each taking a different amount of time.

  1. Compiler Version, Library Version, Containers, and Optimization: Compilers, libraries, and containers are code as well and thus subject to improvement. Simply changing or updating your compiler, libraries, or container can sometimes lead to dramatic increases in performance. In addition, compilers have optimization flags that will automatically optimize your code. This is the fastest way to get optimized code, as all the work is already done for you; you just need to select the right compiler, libraries, container, and options.
  2. Partial Code Rewrite: Looking through the code as it exists now and reworking portions of it can create speedups. This process consists of finding the places where the code spends significant time and refactoring them, either by updating the logic, replacing the numerical method, or substituting an optimized library. This can take a few weeks to months but can give substantial increases in speed. However, this method cannot fix basic structural problems with the code.
  3. Full Code Rewrite: This can take a significant amount of time depending on the complexity of the code (for large codes, up to six months to a year) but is the best way to optimize your code. It will allow you to fundamentally understand how your code operates and fix any major structural problems, resulting in transformative increases in performance. If you go this route, try the other two options first, as doing them will give you a good understanding of the quirks of your code. You should also do a cost-benefit analysis to figure out whether the time spent is worth the potential gains, and make sure the project has a firm end goal in mind. If your code needs continual improvement, it may be time to hire a Research Software Engineer to do that very important and necessary code development work.

Regardless of the method, you will need to grow more acquainted with your code, its numerical methods, and how it interacts with the underlying hardware. While there are some generalized rules and things to look for when optimizing code, in the end it will depend on you turning your code from a blackbox into something you understand at a fundamental level. This is also where learning how to use various debuggers and code inspectors can be very beneficial as they can help identify which portions of the code to focus on.

Important final note: Always reconfirm the results of your code whenever you change your optimization. This goes for any change to your code, but especially when you recompile with different compilers, optimization levels, libraries, etc. You should have a standard battery of tests with known results that you can run to confirm that the results did not change, or that any changes are acceptable. Optimization can change the numerical methods and order of operations, leading to numerical drift. Sometimes that drift is fine, as it sits at the edge of the mantissa; sometimes, though, those small changes can build up and lead to substantial changes in results. Even if your code is confirmed as working, always maintain a healthy suspicion of your results and engage in independent verification, as bugs and faults can produce results that look legitimate but were arrived at by faulty logic or code.

Below we are going to give some general rules regarding optimization as well as suggestions as to different ways to go about it.

Compiler Optimization

Compiler optimization means letting the compiler look through your code for things it can improve automatically. Maybe it will change the memory layout to make it more optimal, maybe it will notice that you are using a certain numerical technique and substitute a better one, maybe it will change the order of operations to improve speed. Regardless of what it tries, compiler optimization relies on the authors of the compiler and their deep knowledge of numerical methods and the underlying hardware to get improved speed. Compiler optimization mainly applies to code compiled from C, C++, or Fortran, but higher-level codes like Python and R, which lean on libraries written in C, C++, and Fortran, can also benefit. Thus, if you want to really optimize your Python or R code, getting those underlying libraries built in an optimized way can lead to speedups.

There is a generally agreed-upon standard for most compilers with regard to levels of optimization. After all, not all optimizations are numerically safe or will produce gains in speed; some may in fact slow things down. As such, when using compiler optimization, test your code's speed and accuracy at different levels of optimization and with different compilers. Each compiler has a different implementation of the standard, and some are better for certain things than others. It is also worth reading the documentation for the compiler's optimization levels to see what is included. A good exercise is to take each individual flag that makes up an optimization level and test whether it speeds up your code and whether it introduces numerical issues.

The standard code optimization (-O) levels are:

-O0: No optimization at all. The compiler builds your code as written with no optimization work done. If you turn on debugging, this is typically the level your code defaults to.

-O1: Numerically safe optimization. This level of optimization is guaranteed to be numerically stable and safe. No corners are cut, no compromises in numerical precision are made, nothing is reordered.

-O2: Mostly numerically safe optimization. This level of optimization is the default level for most compilers. At this level, in most cases, the optimizations made are numerically okay. Generally there is no sacrifice of numerical precision, though loops may be unrolled and reordered to make things more efficient.

-O3: Heavily optimized. This level of optimization takes the approach of trying to include every possible optimization whether numerically safe or not.

As you can see the various levels of optimization make certain assumptions about how numerically safe it is trying to be. Given this, you should always test your code to make sure that it runs as it should after compilation and does not produce errant results.
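As a sketch of how these levels are selected at compile time (the compiler, file names, and extra flags are placeholders; consult your compiler's documentation):

gcc -O0 -g -o my_app my_app.c    # debug build, no optimization
gcc -O2 -o my_app my_app.c       # typical production default
gcc -O3 -o my_app my_app.c       # aggressive; re-verify numerical results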

One other common optimization is to leverage special features found in different chipsets. Each generation of CPU has different features built into it that you can leverage. Some example features are SSE (Streaming SIMD Extensions), FMA (Fused Multiply Add), and AVX (Advanced Vector Extensions). If your code is architected to use them, you can gain substantial speed by enabling these optimizations. There are three ways to do this (illustrative compiler invocations follow the list):

  1. Turn on each feature individually: This allows you to pick and choose which you want and makes your compiled code portable across different chipsets.
  2. Specify the chipset you are building for: Compilers include flags that allow you to target a specific type of chip and include all the relevant optimizations for it. This approach works well if you have a uniform set of hardware to run on, or if you are not sure which features your code will leverage. Note that your code may not work on other chipsets.
  3. Have the compiler autodetect what chipset you are using: Compilers usually have a flag (i.e. -xHost) that will detect the chipset you are currently on and then build specifically for that. To do this properly you will need to make sure you are on the node that is of the same type that you will run your code on. In addition your code will not be portable.
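Illustrative compiler invocations for the three approaches (the flags and chipset name are examples for GCC and the Intel compiler; check your compiler's documentation for the exact spellings):

gcc -O2 -mavx2 -mfma -o my_app my_app.c             # 1. enable individual features
gcc -O2 -march=skylake-avx512 -o my_app my_app.c    # 2. target a specific chipset
icc -O2 -xHost -o my_app my_app.c                   # 3. build for the chip the compiler is running on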

It is worth noting again that not all optimizations are safe or beneficial. Heavily optimized code can lead to substantial bloat in memory usage with little material gain. Numerical issues may occur if the compiler makes bad assumptions about your code. You should only use up to the level of optimization that is stable and beneficial and no higher. If an optimization has no impact on your code performance, it is best to leave it off.

Important final note: Always remove debugging flags and options when running in production. Debugging flags will disable optimization even if you tell the compiler to optimize, as the debugging flag overrides the optimization flag. Before going to production, remove debugging flags, recompile, and test your code for accuracy and performance.

Languages, Numerical Method, and Libraries

Selecting the correct language, numerical method, and libraries is an important part of code optimization. You always want to select the right tool for the job. For some situations Python is good enough; for others you really need Fortran. An improved numerical method may give enormous speedups at the cost of increased memory, or vice versa. Swapping out code you wrote for a library maintained by a domain expert may be faster, or a more integrated code may end up being quicker.

With languages, you are usually locked into a specific one unless you do a complete code rewrite. As such, you should learn the quirks of the language you are using and make sure your code conforms to the best standards and practices for that language. If you are looking to rewrite your code, consider changing the language you are using; a different language may lead to more speedups in the future. As a general rule, languages that are closer to the hardware (such as C or Fortran) can be made to go faster, but they are also trickier to use.

For numerical methods you will want to stay abreast of the current literature in your field and the field that generated the relevant numerical technique (e.g. matrix multiplication, sorting). Even small changes to a numerical technique can add up to large gains in speed. They can also dramatically impact memory utilization. Simplicity is also important, as in many cases a simpler method is faster just by dint of having to do less math and logic. This is not true in all cases though, so be sure to test and verify.

Libraries are another important tool in the toolbox. By using a library you leverage someone else's time and experience to write optimized code. This saves you from having to debug and optimize the code yourself; you simply plug in the library and go. Libraries can still have flaws, though, so keep them up to date and test them. If you do find flaws, report them to the library's developers so that everyone benefits from the improvement you suggest. One other caution with libraries is that sometimes it is better to inline the code rather than call a library, as the gains from using the library may not outweigh the cost of accessing it. Libraries will not automatically make your code faster; rather, they are a tool you can use to potentially gain more speed and efficiency.

Containers

Probably the ultimate form of library is a well-maintained container. Well optimized containers have the advantage of providing a highly customized stack of optimized libraries that allows the code to get near its peak performance. Containers are powerful tools for especially complex software stacks, as the container can provide optimization for each individual element of the stack and ensure that all the various versions interoperate properly. Containers are not free, though, and do have some performance overhead due to the abstraction layer between the software in the container and the system hardware. For absolute peak performance you would build your software stack outside the container in the native environment. In most cases, though, the performance penalties are minimal and substantially outweighed by the gains of using a well-maintained and well-optimized software stack that you do not have to build yourself. For best performance look for containers provided by the hardware vendor. For instance, Nvidia provides a well-curated list of containers built for its GPUs. The vendor typically has the best knowledge of the internals of its hardware and thus knows how to get the most out of it. Containers provided by primary code authors are also good sources, as the code author has the best knowledge of the internals of their code base and how best to run it.

Containers can also be handy for users dealing with operations or code bases involving many files. Including these files in the container effectively hides them from the underlying storage, which treats the container as one large file rather than lots of smaller ones. Filesystems generally behave best when interacting with single large files, as traversing between files is expensive, especially when there are many files to deal with. Thus, if your workflow uses software made up of many files, like a Python/Anaconda/Mamba environment, or your code does IO on many files, consider putting them inside a container.

Other General Rules

Here are some rules that did not fit into other sections but are things you can look for when optimizing your code.

  1. Remove Debugging Flags and Options: We cannot emphasize this enough. Production code should not be run in debug mode, as it will slow things down substantially.
  2. Use the latest compilers and libraries: Implied above, but one of the first things to try is updating your compiler and library versions to see if the various improvements to those codes improve your code performance.
  3. Leave informative comments in your code: Comments are free and having good comments can help you understand your code and improve it. A very good practice is to cite the paper and specific equation or analysis you are using so you can find the original context.
  4. Make sure your loops are appropriately ordered for your arrays: Different languages have different array ordering as to which index is fastest to traverse in memory (for instance, Fortran orders its arrays with the first index fastest; in C it is the opposite). Be aware of this and arrange your arrays and loops appropriately.
  5. Avoid if statements buried in loops: if statements are not free and cost time to execute, so it is best to execute them once, outside the loop, rather than on every iteration.
  6. Use temporary variables to hold constants: Multiplications are faster than divisions or exponents. Thus, instead of pi/2, use 0.5*pi; instead of 5^2, use 5*5. In addition, if you have a complicated coefficient you are multiplying or dividing by repeatedly, consider calculating that coefficient once and storing it as a temporary variable. If your coefficients are related to each other by some constant value, consider making that a constant as well. For instance, if you are always using 4*pi/3, store that as a variable and use it wherever it appears.
  7. Use the right type, size, and level of precision for variables: Integer math is faster than floating point math. Single precision math is faster than double precision. 4 byte integers use up less space than 8 byte integers. Select the size and type necessary for the numerical precision and accuracy you require and no larger.
  8. If you have a heavy arithmetic section, consider using small temporary arrays for the data you are manipulating: Long strings of math in a single line are hard for the compiler to optimize and are prone to mistakes. Consider breaking them up into smaller chunks that eventually sum up to the total value you need. Be careful of round-off error and order-of-operations issues when doing this.
  9. Lower your cache miss rate: CPUs and GPUs are built with onboard memory (typically called cache); you should try to keep your processing in this onboard memory and only go out to main memory when necessary. Cache is faster to access and generally small, so working in smaller chunks that reuse data makes it more likely the data stays in cache.
  10. Be aware of first touch rule for memory allocation: Memory is typically allocated on an at-need basis, and the further the code needs to search in memory, the worse the performance. Allocate frequently used arrays and variables first.
  11. Reduce memory footprint: As a general rule you want to keep your memory usage to the bare minimum you need. The more temporary arrays and variables you keep the more memory bloat your code will have.
  12. Avoid over abstraction: Pointers are useful, but pointers to pointers to pointers are not. It makes it hard for the compiler to optimize and for you (and anyone that uses your code) to follow the code.
  13. Be specific and well defined: A well structured code is easier to optimize. Declare all your variables up front, allocate your arrays as soon as you can, do not leave the variable types ambiguous.
  14. Work in Memory and Not on Disk: Accessing storage, no matter how fast, takes far more time than accessing memory. Try to only read and write to storage when necessary. If possible spin off a separate process to handle reading and writing to disk so that your main process can continue work.
  15. Avoid large numbers of files: It is better from an IO performance standpoint to have a smaller number of large files on disk rather than many small files. Bundle your data together into larger files that you read from or write to all at once.
  16. Include Restarts/Checkpoints: Include the ability for your code to pick up from where it left off by writing restart/checkpoint data to disk. The restart data should be only what is sufficient to pick up from where your calculation left off. This will allow you to recover from crashes and leverage the requeue partitions. Restarts will also allow you to use partitions with shorter time limits to bridge yourself to a longer run (e.g. using ten 3 day runs to accumulate a 30 day run).

Parallelization

There are limits to how fast you can make any single code run in serial. Once this limit is hit, parallelization needs to be considered. Sometimes this parallelization is trivial, such as launching thousands of jobs at once, each with different parameters, to do a parameter sweep (this is known as an embarrassingly parallel workflow). However, if your code needs to be tightly coupled, then other methods of parallelism will need to be considered. The three main methods of parallelization are:

  1. SIMD: Single Instruction Multiple Data
  2. Thread: Shared Memory
  3. Rank: Distributed Memory

Regardless of which method you use, the general rule is to make sure as much of your code as possible is parallelized and that communication and computation overlap with each other. It is also possible to use SIMD in conjunction with Threads and Ranks; this is known as the hybrid approach. Hybrid codes can be very powerful and can scale up to the largest supercomputers in the world.

Some libraries and codes (for example MATLAB, PETSc, OpenFOAM, Python multiprocessing, HDF5) already include parallelization. Check the documentation and/or ask the developer whether the code can parallelize and which method it uses. Once you know that, you will be able to get the most out of the built-in parallelization.

SIMD (Single Instruction Multiple Data)

Most processors have multiple channels that can execute a specific instruction simultaneously on a stream of data. This is built into the chipset itself, and compilers will automatically optimize code to leverage this behavior. You can also intentionally design your code to leverage it better, depending on which specific compiler and instruction set (such as AVX) you are using.

Threads

Threading achieves parallelism by running multiple computational streams (threads) over a shared memory space to accomplish specific instructions. Thread-based parallelism is typically fairly easy to implement, as it requires no complex interprocess communication; all changes to memory are immediately visible to every thread. Typically all the coder needs to do is indicate which loops and sections can be threaded, and the compiler takes care of the rest. Examples are OpenMP, OpenACC, pthreads, and CUDA.

Rank

Rank-based parallelism is the most powerful but also the most technically demanding type of parallelism. Each process has its own memory space, and the user has to manage interprocess communication themselves. The key is making sure that communication bottlenecks are minimal and, where they exist, overlapping them with computation so they do not slow down code execution. The industry standard for doing this is MPI (Message Passing Interface).

Profiling

Knowing where to focus your time for optimizing your code is important. You will gain the most speed by optimizing the part of your code that is currently occupying the most execution time, or using the most memory.  To figure this out you need to profile your code.

The easiest and most immediate way is to use print statements combined with printing how much time each section takes. Most languages have methods for printing timestamps or calculating elapsed time; use those with judicious print statements and you can quickly find out where your code spends most of its time. In general you should instrument your code to give you overall timing estimates, especially if your code works in some sort of large loop (e.g. taking time steps in a fluid dynamics simulation). Print statements are the quickest and easiest way to get information about your code.
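At the coarsest level, you can also get whole-run timing from the shell without touching the code at all (the program name is a placeholder):

time ./my_app input.dat    # reports real (wall-clock), user, and sys time for the run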

Besides print statements, various profilers exist that you can use to inspect your code. Profilers will give you far more information about your code, as well as suggestions as to where it could be improved. They can give you very precise timing and tell you which cache/memory level your code is touching. All of this rich information can be valuable for zeroing in on particularly small sections of code or subtle issues that may be causing dramatic slowdowns.

Below is a list of profilers you can use:

  • VTune: Intel’s profiler
  • NSight: Nvidia’s profiler
  • DCGM: Data Center GPU Manager from Nvidia
  • top: Not really a profiler but a useful system utility for monitoring live job performance.
  • nvtop: Similar to top but for GPUs.

 

FASRC Guidelines for OpenAI Key and Harvard Agreement

FASRC allows the use of OpenAI on the Cannon cluster. Our users are free to install the tool locally under their profile on the cluster and provide it with data. However, we ask our users to be aware of the initial guidelines that the University has put forward for the use of such tools at Harvard. (Note: Chrome sometimes does not work for these HUIT websites, especially if you are not on the university VPN; in that case, access them using Firefox.)

OpenAI on the cluster

Following is a set of guidelines that we have put together for interested parties to ensure safe usage of OpenAI tools on the cluster and of the data provided to them.

There are two ways to get started on the process of getting your OpenAI account attached to Harvard's enterprise agreement.

Option 1 (needs a credit card) – Best executed by the PI

  1. Create a lab/PI/school-based OpenAI account using an email address and password of your choosing that can be easily distributed to other members of the group.
  2. This will generate an OpenAI API key that needs to be stored safely. The key will be required by other members of the account to install and use OpenAI on the cluster.
  3. After creating the account, go to your profile and click on Settings.
  4. Look for Organization ID on that page.
  5. Copy that ID and send an email using the template below to ithelp@harvard.edu or generativeAI@fas.harvard.edu to get this associated with the enterprise agreement that Harvard has with OpenAI. It could take up to a week for the association to take place. 
  6. This ensures that whatever data is provided to OpenAI stays within that agreement and is not made public by the company.
  7. Attach a credit card to this account that will be used for billing. This could be the PI's or division's C-card.
  8. Once the newly created OpenAI account has been associated with Harvard’s agreement and a credit card has been attached to it, the PI or the manager of this OpenAI account can now add members/students by going to the Settings page and inviting new members to the group.
  9. The OpenAI API key will also have to be shared with new members.
  10. At that point, any member of that account is ready to install OpenAI on the cluster using the instructions at https://github.com/fasrc/User_Codes/tree/master/AI/OpenAI (see the note on key handling after this list).
  11. The member will also need to be made aware of the data classification level that is attached to a certain GAI tool and its subsequent use on the cluster. 
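One simple way to handle the shared key on the cluster, sketched here with a placeholder value, is to export it as an environment variable rather than hard-coding it in scripts (OPENAI_API_KEY is the variable name the OpenAI Python library reads by default):

export OPENAI_API_KEY="sk-xxxxxxxxxxxx"   # placeholder value; keep the real key out of shared or version-controlled files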

Note: FASRC has tested using the OpenAI key for installing and using the corresponding software with Option# 1.

Email Template

Subject: Request to Associate OpenAI Organization ID with Harvard Enterprise Agreement

Dear HUIT GenAI Support Team,

I am requesting assistance in associating my OpenAI Organization ID with Harvard University’s Enterprise Agreement (EA) with OpenAI. 

As a Harvard affiliate, I want to utilize the OpenAI APIs, which offer increased API rate limits and the ability to use level 3 data, under the terms of the EA established by Harvard.

Organization ID:

  • My Organization ID is [Your Organization ID Here]. You can find this ID in the OpenAI API portal under Settings -> Organization (platform.openai.com).

I understand that once my Organization ID is submitted, the association with the Enterprise Agreement could take up to a week to be confirmed by OpenAI. Additionally, I acknowledge that API-related billing will be charged to the credit card on record with my OpenAI account, as OpenAI does not support PO/invoice billing.

Regards,

<Name>

Option 2 (HUIT recommended – needs 33-digit billing code) – Best executed by the PI

Individual Account:

  1. Go to Harvard’s API portal.
  2. Click on either of the two options: 
    1. AI Services – OpenAI Direct API
    2. Or AI Services – OpenAI via Azure
  3. Follow the instructions given on the corresponding page. 
  4. For example, for either option, you will have to fill out the HUIT billing form for new customers (behind HarvardKey – Harvard University Authentication Service) to obtain a customer account number, and then register the app by following the API portal’s Guides – Register an App | prod-common-portal.
  5. Once the app is registered, you should be able to receive the corresponding API key (not the OpenAI key) from HUIT. 
  6. This API key is already associated with the enterprise agreement that Harvard has with OpenAI, so there is no need to request the association by emailing HUIT as described in Option 1.
  7. Use this API key to install OpenAI on the cluster using the instructions on https://github.com/fasrc/User_Codes/tree/master/AI/OpenAI 

Team Account:

This feature on the API portal allows developer teams to “own” an API consumer app, instead of individuals. See Guides – Create a Team | prod-common-portal

Following are the steps a PI can take to create a team and add members to it so that they can access the API key associated with the “team” account. The owner can manage the team (add new people as time goes on, or remove them).  Each member of the team can log into the portal to access the API key for their app. 

Please be sure to enter the email addresses carefully in the team setup; they should be the addresses associated with each member’s HarvardKey.

  1. Create the team and list each developer.
  2. Register the app and select the desired team as the app owner.
  3. In the app registration, select the APIs you want access to.

Note: FASRC has not verified installing the OpenAI software on the cluster using an API key obtained via Option 2.

Reminder: All our users are allowed to work with data classified as Level 2 on Cannon and Level 3 on FASSE. The member will also need to be made aware of the data classification level that is attached to a certain GAI tool and its subsequent use on the cluster. The university has guidelines on what sort of data can and cannot be exposed to third party systems. Please see: Guidelines for Using ChatGPT and other Generative AI tools at Harvard.

Resources:

  1. AI @ The FAS 
  2. Generative AI @ Harvard 
  3. Generative Artificial Intelligence (AI) Guidelines | Harvard University Information Technology  
  4. Comparison of Generative AI Tools at Harvard
  5. HUIT AI Services – OpenAI Direct API
  6. Guides – Create a Team | prod-common-portal
VSCode Remote Development via SSH and Tunnel https://docs.rc.fas.harvard.edu/kb/vscode-remote-development-via-ssh-or-tunnel/ Wed, 10 Apr 2024 16:35:35 +0000 https://docs.rc.fas.harvard.edu/?post_type=epkb_post_type_1&p=26927 This document provides the steps needed to set up a remote connection between your local VS Code and the Cannon cluster using two approaches: SSH and Tunnel. Either approach can be used to carry out remote development work on the cluster with VS Code, integrating your local environment with cluster resources.

Important:
We encourage all our users to utilize the Open On Demand (OOD) web interface of the cluster to launch VS Code when remote development work is not required. The instructions to launch VS Code using the Remote Desktop App are given here.

Prerequisites

  1. A recent version of VS Code installed on your local machine.
  2. Remote Explorer and Remote SSH extensions installed, if not already present by default.

FASRC Recommendation

Based on our internal evaluation of the three approaches described below, and on interaction with our user community, we recommend launching VSCode on a compute node using Approach I: Remote – Tunnel via batch job rather than the other two. The Remote – Tunnel via batch job approach submits a batch job to the scheduler on the cluster, thereby providing resilience toward network glitches that could disrupt a VSCode session on a compute node launched using Approach II or III.

Note: We limit our users to a maximum of 5 login sessions, so be aware of the number of VSCode instances you spawn on the cluster.


Approach I: Remote – Tunnel via batch job

Note: The method described here and in Approach II will launch a single VS Code session at a time for a user on the cluster. The Remote – Tunnel approaches do not support concurrent sessions on the cluster for a user. 

In order to establish a remote tunnel between your local machine and the cluster as an sbatch job, execute the following steps.

  1. Copy the vscode.job script below.
    vscode.job script:

    #!/bin/bash
    #SBATCH -p test         # partition. Remember to change to a desired partition
    #SBATCH --mem=4g        # memory in GB
    #SBATCH --time=04:00:00 # time in HH:MM:SS
    #SBATCH -c 4            # number of cores

    set -o errexit -o nounset -o pipefail
    MY_SCRATCH=$(TMPDIR=/scratch mktemp -d)
    echo $MY_SCRATCH

    #Obtain the tarball and untar it in $MY_SCRATCH location to obtain the
    #executable, code, and run it using the provider of your choice
    curl -Lk 'https://code.visualstudio.com/sha/download?build=stable&os=cli-alpine-x64' | tar -C $MY_SCRATCH -xzf -
    VSCODE_CLI_DISABLE_KEYCHAIN_ENCRYPT=1 $MY_SCRATCH/code tunnel user login --provider github
    #VSCODE_CLI_DISABLE_KEYCHAIN_ENCRYPT=1 $MY_SCRATCH/code tunnel user login --provider microsoft

    #Accept the license terms & launch the tunnel
    $MY_SCRATCH/code tunnel --accept-server-license-terms --name cannontunnel

    The vscode.job script above authenticates with the github provider. If you would like to authenticate with your Microsoft account (HarvardKey) instead, comment out the github login line and uncomment the microsoft one.

  2. Submit the job from a private location (somewhere that only you have access to, for example your $HOME directory) from which others can’t see the log file.

    $ sbatch vscode.job
  3. Look at the end of the output file
    $ tail -f slurm-32579761.out
    ...
    To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code ABCDEFGH to authenticate.
    Open a web browser, enter the URL, and the code. After authentication, wait a few seconds to a minute, and print the output file again:
    $ tail slurm-32579761.out
    *
    * Visual Studio Code Server
    *
    * By using the software, you agree to
    * the Visual Studio Code Server License Terms (https://aka.ms/vscode-server-license) and
    * the Microsoft Privacy Statement (https://privacy.microsoft.com/en-US/privacystatement).
    *
    Open this link in your browser https://vscode.dev/tunnel/cannon/n/home01/jharvard/vscode
  4. Now, you have two options
    1. Use the web client by opening the vscode.dev link from the output above in a web browser.
    2. Use vscode local client — see below

Using vscode local client (option #2)

  1. In your local vscode (in your own laptop/desktop), add the Remote Tunnel extension (ms-vscode.remote-server)
    1. On the local VSCode, install Remote Tunnel extension
    2. Click on the VS Code Account menu and choose “Turn on Remote Tunnel Access”.
  2. Connect to the cluster:
    1. Click on the bottom right corner
    2. Options will appear on the top text bar
    3. Select “Connect to Tunnel…”
  3. Then choose the authentication method that you used in vscode.job, microsoft or github
  4. Click on the Remote Explorer icon and pull up the Remote Tunnel drop-down menu
  5. Click on cannontunnel to get connected to the remote machine either in the same VS Code window (indicated by ->) or a new one (icon beside ->).
    Prior to clicking, make sure you see: Remote -> Tunnels -> cannontunnel running
  6. Finally, when you get vscode connected, you can also open a terminal on vscode that will be running on the compute node where your submitted job is running.

Enjoy your work using your local VSCode on the compute node.
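
When you are finished, it is good practice to stop the tunnel by cancelling the batch job that backs it. A minimal sketch follows; the job ID is illustrative, so use the one reported by squeue or in your slurm-<jobid>.out file name.

# List your running jobs to find the tunnel job
squeue -u $USER

# Cancel the tunnel job by its job ID
scancel 32579761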


Approach II: Remote – Tunnel interactive

In order to establish a remote tunnel between your local machine and the cluster as an interactive job, execute the following steps. Remember to replace <username> with your FASRC username.

  1. ssh <username>@login.rc.fas.harvard.edu

  2. curl -Lk 'https://code.visualstudio.com/sha/download?build=stable&os=cli-alpine-x64' --output vscode_cli.tar.gz

  3. tar -xzf vscode_cli.tar.gz

  4. An executable, code, will be generated in your current working directory. Either keep it in your $HOME or move it to your LABS folder, e.g.
    1. mv code /n/holylabs/LABS/rc_admin/Everyone/
  5. Add the path to your ~/.bashrc so that the executable is always available to you regardless of the node you are on, e.g.,
    1. export PATH=/n/holylabs/LABS/rc_admin/Everyone:$PATH
  6. Save ~/.bashrc, and on the terminal prompt, execute the command: source ~/.bashrc
  7. Go to a compute node, e.g.: salloc -p gpu_test --gpus 1 --mem 10000 -t 0-01:00
  8. Execute the command: code tunnel
  9. Follow the instructions on the screen and log in using either your GitHub or Microsoft account, e.g.: GitHub Account
  10. To grant access to the server, open the URL https://github.com/login/device and copy-paste the code given on the screen
  11. Name the machine, e.g.: cannoncompute
  12. Open the link that appears in your local browser and follow the authentication process as mentioned in steps# 3 & 4 of https://code.visualstudio.com/docs/remote/tunnels#_using-the-code-cli
  13. Once the authentication is complete, you can either open the link that appears on the screen on your local browser and run VS Code from there or launch it locally as mentioned below.
  14. On the local VSCode, install Remote Tunnel extension
  15. Click on the VS Code Account menu and choose “Turn on Remote Tunnel Access”.
  16. Click on cannoncompute to get connected to the remote machine either in the same VS Code window (indicated by ->) or a new one (icon beside ->). Prior to clicking, make sure you see:
    Remote -> Tunnels -> cannoncompute running
The remote tunnel access should be on and the tunnel should come up as running prior to starting the work on the compute node.

Note: Every time you access a compute node, the executable, code, will be in your path. However, you will have to repeat step#10 before executing step#16 above in order to start a fresh tunnel. 


Approach III: Remote – SSH

In order to connect remotely to the cluster using VS Code, you need to edit the SSH configuration file on your local machine.

  • For Mac OS and Linux users, the file is located at ~/.ssh/config. If it’s not there, then create a file with that name.
  • For Windows users, the file is located at C:\Users\<username>\.ssh\config. Here, <username> refers to your local username on the machine. Same as above, if the file is not present, then create one.

There are two ways to get connected to the cluster remotely:

  1. Connect to the login node using VS Code.
    Important: This connection must be used for writing &/or editing your code only. Please do not use this connection to run Jupyter notebook or any other script directly on the login node.
  2. Connect to the compute node using VS Code.
    Important: This connection can be used for running notebooks and scripts directly on the compute node. Avoid using it for writing &/or editing your code, as that is non-compute work which can be carried out from the login node.

SSH configuration file

Login Node

Adding the following to your SSH configuration file will let you connect to the login node of the cluster with the Single Sign-On option enabled. The name of the Host here is chosen to be cannon but you can name it whatever you like, e.g., login or something else. In what follows, replace <username> with your FASRC username.

For Mac:

Host cannon
User <username>
HostName login.rc.fas.harvard.edu
ControlMaster auto
ControlPath ~/.ssh/%r@%h:%p

For Windows:

The SSH ControlMaster option for single sign-on is not supported on Windows. Hence, Windows users can only establish a connection to the login node by either disabling the ControlMaster option or omitting it entirely from the SSH configuration file, as shown below:

Host cannon
User <username>
HostName login.rc.fas.harvard.edu
ControlMaster no
ControlPath none

or

Host cannon
User <username>
HostName login.rc.fas.harvard.edu

Compute Node

In order to connect to the compute node of the cluster directly, execute the following two steps on your local machine:

Note: Establishing a remote SSH connection to a compute node via VSCode works only for Mac OS. For Windows users, this option is not supported and we recommend they utilize the Remote-Tunnel Approaches I or II for launching VSCode on a compute node.

  1. Generate a pair of public and private SSH keys for your local machine, if you have not done so previously, and add the public key to the login node of the cluster:
    In the ~/.ssh folder of your local machine, see if id_ed25519.pub is present. If not, then generate private and public keys using the command:

    ssh-keygen -t ed25519

    Then submit the public key to the cluster using the following command:

    ssh-copy-id -i ~/.ssh/id_ed25519.pub <username>@login.rc.fas.harvard.edu

    This will append your local public key to ~/.ssh/authorized_keys in your home directory ($HOME) on the cluster so that your local machine is recognized.

  2. Add the following to your local ~/.ssh/config file, replacing <username> with your FASRC username. Make sure that the portion for connecting to the login node from above is also present in your SSH configuration file. You can edit the name of the Host to whatever you like or keep it as compute. Two ProxyCommand examples are shown here to demonstrate how ProxyCommand can be used to launch a job on a compute node of the cluster with a desired configuration of resources through the salloc command. Uncommenting the first one will launch a job on the gpu_test partition of the Cannon cluster, whereas uncommenting the second one will launch it on the test partition.

Host compute
UserKnownHostsFile=/dev/null
ForwardAgent yes
StrictHostKeyChecking no
LogLevel ERROR
# substitute your username here
User <username>
RequestTTY yes
# Uncomment the command below to get a GPU node on the gpu_test partition. Comment out the 2nd ProxyCommand
#ProxyCommand ssh -q cannon "salloc --immediate=180 --job-name=vscode --partition gpu_test --gres=gpu:1 --time=0-01:00 --mem=4GB --quiet /bin/bash -c 'echo $SLURM_JOBID > ~/vscode-job-id; nc \$SLURM_NODELIST 22'"

# Uncomment the command below to get a non-GPU node on the test partition. Comment out the 1st ProxyCommand
ProxyCommand ssh -q cannon "salloc --immediate=180 --job-name=vscode --partition test --time=0-01:00 --mem=4GB --quiet /bin/bash -c 'echo $SLURM_JOBID > ~/vscode-job-id; nc \$SLURM_NODELIST 22'"

Note: Remember to change the Slurm directives, such as --mem, --time, --partition, etc., in the salloc command based on your workflow and how you plan to use the VSCode session on the cluster. For example, if the program you are trying to run needs more memory, request that amount of memory with the --mem flag in the salloc command prior to launching the VSCode session; otherwise the job could fail with an Out Of Memory error.

Important: Make sure to pass the name of the Host used for the login node to the ProxyCommand for connecting to a compute node. For example, here we have named the Host cannon for connecting to the login node, and that same name, cannon, is passed to ProxyCommand ssh -q to establish the connection to a compute node. Passing any other name to ProxyCommand ssh -q will result in the connection not being established.

SSH configuration file with details for establishing connection to the login (cannon) and compute (vscode/compute) node.

Once the necessary changes have been made to the SSH configuration file, open VS Code on your local machine and click on the Remote Explorer icon on the bottom left panel. You will see two options listed under SSH – cannon and compute (or whatever name you chose for the Host in your SSH configuration file).

Option to connect to the login (cannon) or compute (vscode) node under SSH after clicking on the Remote Explorer icon.

Connect using VS Code

Login Node

Click on the cannon option and select whether you would like to continue in the same window (indicated by ->) or open a new one (icon next to ->). Once selected, enter your 2FA credentials on the VS Code’s search bar when prompted. For the login node, a successful connection would look like the following.

Successful connection to the login node showing $HOME under Recent, nothing in the output log, and the Status bar on the lower left corner would show SSH:cannon.


Compute Node

In order to establish a successful connection to Cannon’s compute node, we need to be mindful that VS Code requires two connections to open a remote window (see the section “Connecting to systems that dynamically assign machines per connection” in VS Code’s Remote Development Tips and Tricks). Hence, there are two ways to achieve that.

Option 1

First, open a connection to cannon in a new window on VS Code by entering your FASRC credentials and then open another connection to compute/vscode on VS Code either as a new window or continue in the current window. You will not have to enter your credentials again to get connected to the compute node since the master connection is already enabled through the cannon connection that you initiated earlier on VS Code.

Successful connection to the compute node with the Status bar showing the name of the host it is connected to and under SSH, “connected” against that name.
Option 2

If you don’t want to open a new connection to cannon, then open a terminal on your local machine and type the following command, as mentioned in our Single Sign-on document, and enter your FASRC credentials to establish the master connection first.

ssh -CX -o ServerAliveInterval=30 -fN cannon

Then open VS Code and directly click on compute/vscode to get connected to the compute node. Once a successful connection is established, you should be able to run your notebook or any other script directly on the compute node using VS Code.

Note: If you have a stale SSH connection to cannon running in the background, it could cause problems. The stale session can be killed in the following manner.

$ ssh -O check cannon
Master running (pid=#)
$ ssh -O exit cannon
Exit request sent.
$ ssh -O check cannon
Control socket connect(<path-to-connection>): No such file or directory

Add Folders to Workspace on VSCode Explorer

Once you are able to successfully launch a VSCode session on the cluster, using one of the approaches mentioned above, you might need to access various folders on the cluster to execute your workflow. One can do that using the Explorer feature of VSCode. However, on the VSCode remote instance, when you click on Explorer  -> Open Folder, it will open $HOME, by default, as shown below.


In order to add another folder to your workspace, especially housed in locations such as netscratch, holylabs, holylfs, etc., do the following:

  1. Type >add in the VSCode Search-Welcome bar and choose Workspaces: Add Folders to Workspace.... See below:
  2. If you would like to add your folder on netscratch or holylabs or some such location, first open a terminal on the remote instance of VSCode and type that path. Copy the entire path and then paste it on the Search-Welcome bar. See below:

    Do not start typing the path in the Search-Welcome bar; copy-paste the full path instead, otherwise VS Code may hang while attempting to list all the subdirectories of that location, e.g., /n/netscratch.

  3. Click OK to add that folder to your workspace.
  4. On the remote instance, you will be prompted to answer whether you trust the authors of the files in this folder and then to Reload. Go ahead and click yes (if you truly trust them), and let the session reload.
  5. Now you will be able to see your folder listed under Explorer as Untitled (Workspace).
  6. This folder will be available in your workspace for as long as the current session is active. For new sessions, repeat steps #1-4 to add the desired folder(s) to your workspace.

 Troubleshooting VSCode Connection Issues

  1. Make sure that you are on VPN to get a stable connection.
  2. Match the name being used in the SSH command to what was declared under “Host” in your SSH config file for the login node.
  3. Make sure that the --mem flag has been used in the ProxyCommand in your SSH config file and that enough memory is being allocated to your job. If you already have it, then try increasing it to see if that works for you.
  4. Open a terminal and try connecting to a login and compute node (if on Mac) by typing: ssh <Host> (replace Host with the corresponding names used for login and compute nodes). If you get connected, then your SSH configuration file is set properly.
  5. Consider commenting out conda initialization statements in ~/.bashrc to avoid possible issues caused by that initialization.
  6. Delete the bin folder from .vscode-server or .vscode and/or remove the Cache, CachedData, CachedExtensionsVSIXs, Code Cache, etc. folders. On Cannon you can find these under $HOME/.vscode/data and/or $HOME/.vscode-server/data/ (a minimal cleanup sketch is shown after this list).
  7. Make sure that there are no lingering SSO connections. See the Note at the end of Approach III – Remote SSH section.
  8. Failed to parse remote port: Try removing .lockfiles.
  9. Try Approach I – Remote Tunnel via batch job, which runs the Remote Tunnel as an sbatch job in the background so that your work is not disrupted by network glitches.
  10. If you continue to have problems, consider coming to our office hours to troubleshoot this live.
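
For item 6 above, a minimal cleanup sketch is shown below. The paths follow the locations mentioned in that item; the exact set of cache folders present can differ, so list the directories first and remove only what is actually there.

# See what VS Code has cached on the cluster
ls $HOME/.vscode-server/data $HOME/.vscode/data 2>/dev/null

# Remove the server binaries and caches so they are re-downloaded on the next connection
rm -rf $HOME/.vscode-server/bin
rm -rf "$HOME/.vscode-server/data/Cache" "$HOME/.vscode-server/data/CachedData"
rm -rf "$HOME/.vscode-server/data/CachedExtensionsVSIXs" "$HOME/.vscode-server/data/Code Cache"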

Connecting with WinSCP https://docs.rc.fas.harvard.edu/kb/winscp/ Mon, 15 Aug 2022 20:31:19 +0000 https://docs.rc.fas.harvard.edu/?post_type=epkb_post_type_1&p=25481 While it is somewhat less forgiving than its command-line sibling, WinSCP can be used to connect to the FASRC cluster.

Note: The following instructions show version 5.21

Create Login Site Entry

You will first need to create a new Login entry.

Protocol: SCP
Host Name: login.rc.fas.harvard.edu
Port: 22
Username: Your FASRC username (no @fas)
Password: Leave blank so that you are prompted

Change Authentication Options

Before saving, click the Advanced button

In the Advanced settings, un-check “Attempt authentication using Pageant”.
“Attempt keyboard interactive authentication” and “Respond with a password to the first prompt” should both be checked.

Connect – Password Prompt

Click OK and then click Save.

Click Login on your new entry. You should see a prompt like the following.
Enter your FASRC password.

Connect – Two-Factor Prompt

You should then be prompted for your OpenAuth two-factor code.

Click OK.

scrontab https://docs.rc.fas.harvard.edu/kb/scrontab/ Tue, 23 Mar 2021 15:35:25 +0000 https://docs.rc.fas.harvard.edu/?post_type=epkb_post_type_1&p=23871 scrontab can be used to define a number of recurring batch jobs to run on the cluster at a scheduled interval.  Much like its namesake, crontab, the scrontab command maintains entries in a file that are executed at specified times or intervals.  Simply type scrontab -e from any cluster node and add your job entries, one per line, in the editor window that appears.  Then save and exit the editor the same way you would exit vim[1].

Entries use the same format as cron.  For an explanation on crontab entry formats, see the wikipedia page for cron.

This example scrontab entry runs a jobscript in user jharvard’s home directory called runscript.sh at 23:45 (11:45 PM) every Saturday:

45 23 * * 6 /n/home01/jharvard/runscript.sh

Note: Your home directory may be located in a different location.  To find the directory you are in, type pwd at the command line.

For more information on scrontab, type man scrontab on any cluster node to see the scrontab manual page.
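
You can also attach Slurm options to an entry by placing #SCRON comment lines directly above it. This is a minimal sketch; the partition, time, and memory values are illustrative, and the set of supported options depends on the Slurm version (see man scrontab):

#SCRON -p serial_requeue
#SCRON -t 0-00:30
#SCRON --mem=2G
45 23 * * 6 /n/home01/jharvard/runscript.sh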

[1] For those who prefer a different editor, you can precede the scrontab command with the EDITOR variable.  For example, if you want to use nano, you could invoke scrontab like so:

EDITOR=nano scrontab -e

Job efficiency using seff command https://docs.rc.fas.harvard.edu/kb/how-do-i-figure-out-how-efficient-my-job-is/ Mon, 03 Feb 2020 01:03:07 +0000 https://docs.rc.fas.harvard.edu/?post_type=epkb_post_type_1&p=22980 You can see your job efficiency by using seff. For example:

[user@boslogin01 home]# seff 1234567
Job ID: 1234567
Cluster: [cluster name] User/Group: user/user_lab
State: COMPLETED (exit code 0)
Nodes: 8
Cores per node: 64
CPU Utilized: 37-06:17:33
CPU Efficiency: 23.94% of 155-16:02:08 core-walltime
Job Wall-clock time: 07:17:49
Memory Utilized: 1.53 TB (estimated maximum)
Memory Efficiency: 100.03% of 1.53 TB (195.31 GB/node)

In this job you can see that the user used 512 cores (8 nodes with 64 cores each) and that the job ran for about 7.3 hours. However, the CPU time utilized is roughly 894 hours, which is close to 128*7 hours, i.e. only about 25% of the compute actually requested (512 cores * ~7.3 hours). If your code is scaling effectively, CPUTime (CPU Utilized) = NCPUS * Elapsed (Wall-clock time); if it is not, those numbers will diverge. The best way to test this is to do some scaling tests. There are two styles you can do. Strong scaling is where you leave the problem size the same but increase the number of cores; if your code scales well, it should take less time in proportion to the number of cores you use. The other is weak scaling, where the amount of work per core remains the same but you increase the number of cores, so the size of the job scales proportionally to the number of cores. If your code scales in this case, the run time should remain the same.

Typically most codes have a point where the scaling breaks down due to inefficiencies in the code. Thus beyond that point there is no benefit to increasing the number of cores you throw at the problem. That’s the point you want to look for.  This is most easily seen by plotting the log of the number of cores vs. the log of the runtime.
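
A simple way to run a strong-scaling test is to submit the same problem at several core counts and then compare the elapsed times with seff or sacct. The sketch below assumes a hypothetical executable my_app and input file input.dat; the partition and time limit are illustrative.

# Submit the same problem at increasing core counts
for n in 2 4 8 16 32 64; do
    sbatch -p test -n "$n" -t 0-02:00 --job-name="scale_${n}" \
        --wrap="srun -n $n ./my_app input.dat"
done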

The other factor that is important in a scheduling environment is that the more cores you ask for the longer your job will pend for as the scheduler has to find more room for you. Thus you need to find the sweet spot where you minimize both your runtime and how long you pend in the queue for. For example it may be the case that if you asked for 32 cores your job would take a day to run but pend for 2 hours, but if you ask for 64 cores your job would take half a day to run but would pend for 2 days. Thus it would have been better to ask for 32 cores even though the job is slower.

We also have an array-capable variant of seff called seff-array, which makes it easy to do this analysis for array jobs.  There is also seff-account, which is good for providing summary information for all your jobs over a period of time.

SSH key error, DNS spoofing message https://docs.rc.fas.harvard.edu/kb/ssh-key-error/ Thu, 30 Jan 2020 18:58:21 +0000 https://docs.rc.fas.harvard.edu/?post_type=epkb_post_type_1&p=22895 Whenever nodes are significantly updated (for instance, the May 2018 upgrade to CentOS 7 and the June 2023 upgrade to Rocky 8), the SSH host key fingerprint is likely to change. Since you have already stored the old fingerprint locally, you will receive a key mismatch error like:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!        @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!

Mac/Linux

To fix this, you will need to remove the key in question from your computer’s local known_hosts file. If you are on a Mac or Linux, you can use the following command from a terminal window on your computer.

ssh-keygen -R login.rc.fas.harvard.edu

If the error was for a specific node, replace ‘login.rc.fas.harvard.edu’ with the full name of that host.

You can now log into the node and will receive an all new request to store the new SSH key.

The example above assumes that your username on your local machine (jharvard, in this case) matches your cluster account username. If this is not the case, you will have to log in with your username explicitly, such as: ssh jharvard@login.rc.fas.harvard.edu

Please note that there are several nodes behind the ‘login.rc.fas.harvard.edu’ hostname, so you may receive the above more than once. Answering yes will allow you to continue.

Alternately, if you primarily only interact with the cluster, you may find it easiest to simply remove the known_hosts file and let it be created from scratch at next login. Mac and Linux users can do so from a terminal on their computer with the following command:

rm ~/.ssh/known_hosts


Windows/PuTTY

PuTTY may prompt you to update the key in place, or it may require updating a registry entry to correct this. If the latter, you will need to remove the known_hosts from the registry:

  1. Open ‘regedit.exe’ by doing a search, or by pressing “Windows Key + R”, typing “regedit”, and hitting enter, or try opening C:\Windows\System32\regedt32.exe
  2. Find HKEY_CURRENT_USER\Software\[your username here]\PuTTY\SshHostKeys
  3. Remove all keys or find and delete the individual key you need to remove
  4. Restart your computer; changes won’t take effect until after a restart.


VDI/OOD

You may also see the error when opening a terminal in VDI/OOD because internally it uses ssh from your FASRC account.  You can clear known_hosts on your FASRC account by:

  1. Log in from your local computer using one of the methods here: https://docs.rc.fas.harvard.edu/kb/terminal-access/
  2. ssh-keygen -R login.rc.fas.harvard.edu (or follow the Mac/Linux instructions above)
Fairshare and Job Accounting https://docs.rc.fas.harvard.edu/kb/fairshare/ Wed, 16 Oct 2019 14:49:19 +0000 https://www.rc.fas.harvard.edu/?page_id=22014  

Summary

In order to ensure that all research labs get their fair share of the cluster and to account for differences in hardware being used, we utilize Slurm’s built-in job accounting and fairshare system. Every lab has a base Share of the community-wide system, which is governed by the Gratis Share purchased by the Faculty of Arts and Science and distributed equally to all labs. In addition, Shares purchased by individual labs by buying hardware are added to their base Share. The Fairshare score of a lab is then calculated based off of their Share versus the amount of the cluster they have actually used. This Fairshare score is then utilized to assign priority to their jobs relative to other users on the cluster. This keeps individual labs from monopolizing the resources, which would otherwise be unfair to labs that have not used their fair share for quite some time. Currently, we account for the fraction of the compute node used with CPU, GPU, and Memory usage using Slurm’s Trackable RESources (TRES).

What is Fairshare?

Fairshare is a portmanteau that pretty much expresses what it is. Essentially fairshare is a way of ensuring that users get their appropriate portion of a system. Sadly this term is also used confusingly for different parts of fairshare. This includes what fraction of the system users get, the score that the system assigns for users based off of your usage, and the priority that users are assigned based off of their usage. For the sake of the discussion below, we will use the following terms. Share is the portion of the system users have been granted. Usage is the amount of the system users have actually used. Fairshare score is the value the system calculates based off of user’s usage. Priority score is the priority assigned based off of the user’s fairshare score.

While Fairshare may seem complex and confusing, it is actually quite logical once you think about it. The scheduler needs some way to adjudicate who gets what resources. Different groups on the cluster have been granted different resources for various reasons. In order to serve the great variety of groups and needs on the cluster a method of fairly adjudicating job priority is required. This is the goal of Fairshare. Fairshare allows those users who have not fully used their resource grant to get higher priority for their jobs on the cluster, while making sure that those groups that have used more than their resource grant do not overuse the cluster. The cluster is a limited resource and Fairshare allows us to ensure everyone gets a fair opportunity to use it regardless of how big or small the group is.

Trackable RESources (TRES)

Slurm Trackable RESources (TRES) allows the scheduler to charge back users for how much they have used different features on the cluster. This is important as the usage of the cluster factors into the Fairshare calculation. These TRES charge backs vary from partition to partition. You can see what the TRES charge back is by running scontrol show partition <partitionname> .
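
For example, to see the billing weights configured for a hypothetical partition named test, you could run something like the following; if billing weights are set for that partition, a TRESBillingWeights entry will appear in the output.

# Show the TRES billing weights configured for a partition
scontrol show partition test | grep -i TRESBillingWeights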

On Cannon we set TRES for CPU, GPU, and Memory usage. For most partitions we charge back for CPUs and GPUs based off of the type being used. We normalize TRES to 1.0 for Intel Cascade Lake chips. For other chips we calculate the TRES by taking the theoretical peak Floating Point OPerations (FLOPs) for a single core of that CPU (or entire GPU) and dividing it by the theoretical peak for the Intel Cascade Lake chips. With this weighting we end up with the following TRES per core:

Processor Type          TRES
Intel Skylake            0.5
AMD Milan                0.5
AMD Genoa                0.6
Intel Sapphire Rapids    0.6
Intel Cascade Lake       1.0
Intel Ice Lake           1.15
Nvidia A40              10
Nvidia V100             75
Nvidia A100            209.1
Nvidia H100            546.9

It may seem to be a penalty to charge more for the Cascade Lake than the Sapphire Rapids, but it really is not in the end. The reason is that jobs running on the Cascade Lake cores will run roughly 40% faster than on the Sapphire Rapids chips. Thus the actual charge back to the user should be the same on a per job basis; it’s just a question of picking the right resource for the job you are running.

In the case of memory we set the TRES based off of the following formula: NumCore*CoreTRES/TotalMem, where NumCore is the number of cores per node, CoreTRES is the TRES score for that type of core, and TotalMem is the total available memory for the node. The reason we weight memory like this is that if a user uses up all the memory on the node, the scheduler cannot schedule another job on that node even if there are available cores. The opposite is also true: if all the cores are used up, the scheduler cannot schedule another job there even if there is free memory. Thus memory and CPU are exhaustible resources that impact each other. The above weighting allows us to ensure that memory costs the same as the CPUs on a given node. For instance, let’s say you have a node that has 128 GB of RAM and 32 Intel Cascade Lake cores. In this case every 4 GB of RAM used should be equivalent to a single core being used. Thus we should charge a TRES of 1.0 for 4 GB used, or 0.25 for every GB used. In the case of an Intel Sapphire Rapids node with 32 cores and 128 GB of RAM, you have the same scenario but the Sapphire Rapids cores are worth 40% less, so the memory is also worth 40% less, i.e. 0.15 for every GB used.

There are two exceptions to the above TRES rules: the requeue partitions, such as serial_requeue and gpu_requeue, and the test partitions. For the requeue partitions, jobs can be interrupted by higher priority jobs at any time, which means that there could be a loss of computation time. This is especially true for jobs that are not able to snapshot their progress and restart from where they left off. Studies have shown that to make this type of model break even in terms of cost you need to charge back roughly half of what you normally would. So for the requeue partitions we charge a flat rate of 0.5 for CPU, 104.6 for GPU, and 0.125 per GB for Memory. Since the requeue partitions contain all our hardware, users can get access to normally very high cost CPUs and GPUs for cheaper. Thus if a user needs to run a lot of jobs, the best way to optimize throughput and usage is to build their jobs to leverage the cheap resources in the requeue partitions. One should be aware, though, that the available cores in these partitions vary wildly depending on how active any given primary partition is.

The other exception is the test partitions, such as test and gpu_test. These partitions are exempted from normal fairshare accounting. This allows users to use these partitions for interactive work, code development, and workflow testing prior to running on the production partitions without fear of exhausting their allocation.

Shares

On Cannon each user is associated with their primary group. This lab group is what is called an Account in Slurm. Users belong to Accounts, and Accounts have Shares granted to them. These Shares determine how much of the cluster that group has been granted. When users run jobs, their usage is charged back against the Account (i.e. lab) they belong to.

Shares granted an Account come in three types that are summed together. The first type is the Gratis Share. This Gratis Share is the Share given to all labs that are part of the cluster owing to the investment that Research Computing, via the Faculty of Arts and Sciences, has made in Cannon. This Gratis Share is calculated by summing the CPU and GPU TRES for all the nodes in the public partitions, excepting the requeue partitions, and then dividing by the total number of Accounts on Cannon. Thus the Gratis Share roughly corresponds to the number of cores each group has been granted. Currently the Gratis Share is set to 200 for Cannon and 100 for FASSE.

The second type of Share is Lab Share. This Share is the Share given to those Labs who have purchased hardware for their own lab. The CPU and GPU TRES from that purchased hardware is summed and added to the Gratis Share for that Lab’s Account.

The third type of Share is Communal Partition Share. This Communal Partition Share is the Share given to labs who have gone in with other labs and have purchased hardware to be used in common by the group of labs (e.g. a partition for the entire department, or for a school, or a collaboration of labs). In these cases the CPU and GPU TRES is summed and then divided amongst the labs, per their discretion, and added to the Lab’s Account.

Thus the total Share an Account has is simply the addition of all of these types of Share. This Share is global to the whole cluster. So whether the Lab is running on their own dedicated partitions or on the public partitions, their Share is the same. The Share is simply the portion of the entire system they have been granted, and can be moved around as needed by the Lab to any of the resources available to them on the cluster.

Fairshare Score

Probably the easiest way to walk through how a Lab’s Fairshare Score is calculated is to explain what the Slurm tool sshare displays. This tool shows you all the components of your Fairshare calculation. Here is an example:

[root@holyitc01 ~]# sshare --account=test_lab -a
Account               User   RawShares  NormShares    RawUsage  EffectvUsage  FairShare
--------------------  -----  ---------  ----------  ----------  ------------  ---------
test_lab                           244    0.001363    45566082      0.000572   0.747627
test_lab              user1     parent    0.001363     8202875      0.000572   0.747627
test_lab              user2     parent    0.001363      248820      0.000572   0.747627
test_lab              user3     parent    0.001363      163318      0.000572   0.747627
test_lab              user4     parent    0.001363    18901027      0.000572   0.747627
test_lab              user5     parent    0.001363    18050039      0.000572   0.747627

The Account we are looking at is test_lab. The first line of the sshare output shows the summary for the whole lab, while the subsequent lines show the information for each user. test_lab has been granted 244 RawShares. Each user of that lab has a RawShare of parent, which means that all the users pull from the total Share of the Account and do not have their own individual subShares of the Account Share. Thus all users in this lab have full access to the full Share of the Account.

The next column after RawShares is NormShares. NormShares is simply the Account’s RawShares divided by the total number of RawShares given out to all Accounts on the cluster. Essentially NormShares is the fraction of the cluster the account has been granted, in this case about 0.136%. Given the way we give out RawShares on Cannon, the total number of RawShares across all Accounts is equivalent to the total CPU TRES on Cannon, so test_lab’s 244 RawShares correspond to roughly the equivalent of 244 Cascade Lake cores.

Following NormShares we have RawUsage. RawUsage is the amount of TRES-sec the Account/User has used. Thus if a user used a single Cascade Lake core for one second, the user’s account would be charged 1 TRES-sec in RawUsage. This RawUsage is also attenuated by the halflife that is set for the cluster, which is currently 3 days. Thus work done in the last 3 days counts at full cost, work done 6 days ago costs half, work done 9 days ago one fourth, and so on. So RawUsage is the aggregate of the Account’s past usage with this halflife weighting factor. The RawUsage for the Account is the sum of the RawUsage for each user, thus sshare is an effective way to figure out which users have contributed the most to the Account’s score.

A quick aside, it should be noted that RawUsage is the sum of all usage including: failed jobs, jobs that are requeued, jobs that ran on nodes that failed, etc.  That usage is still counted as part of RawUsage.  The reason for this is that it is up to the user to effectively use the time and resources allocated by the scheduler even if that time is cut short for various reasons.  We highly recommend users test and verify their codes before running.  Users should also ensure their code has checkpointing enabled so that jobs can restart from where they left off in case of node failure.  These steps will minimize the effect of various failures on a user’s usage.

The next column is EffectvUsage. EffectvUsage is the Account’s RawUsage divided by the total RawUsage for the cluster. Thus EffectvUsage is the percentage of the cluster the Account has actually used. In this case, the test_lab has used 0.057% of the cluster.

Finally, we have the Fairshare score. The Fairshare score is calculated using the following formula:

f = 2^(-EffectvUsage/NormShares)

From this one can see that there are five basic regimes for this score, which are as follows:

1.0: Unused. The Account has not run any jobs recently.

1.0 > f > 0.5: Underutilization. The Account is underutilizing their granted Share. For example, when f=0.75 a lab has recently underutilized their Share of the resources 1:2

0.5: Average utilization. The Account on average is using exactly as much as their granted Share.

0.5 > f > 0: Over-utilization. The Account has overused their granted Share. For example, when f=0.25 a lab has recently overutilized their Share of the resources 2:1

0: No share left. The Account has vastly overused their granted Share. If there is no contention for resources, the jobs will still start.
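
As a quick sanity check of the formula against the sshare output above, you can plug in test_lab's EffectvUsage and NormShares on the command line; the result should come out close to the FairShare column (the small difference is just rounding in the displayed values).

# f = 2^(-EffectvUsage/NormShares) for test_lab
awk 'BEGIN { print 2^(-0.000572/0.001363) }'
# prints roughly 0.7476, matching the FairShare of 0.747627 reported by sshare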

Since the usage of the cluster varies, the scheduler does not stop Accounts from using more than their granted Share. Instead, the scheduler wants to fill idle cycles, so it will take whatever jobs it has available. Thus an Account is essentially borrowing computing resource time in the future to use now. This will continue to drive down the Account’s Fairshare score, but allow jobs for the Account to still start. Eventually, another Account with a higher Fairshare score will start submitting jobs and that lab’s jobs will have a higher priority because they have not used their granted Share. Fairshare only recovers as a lab reduces the workload to allow other Accounts to run. The half-life helps to expedite this recovery.

Given this behavior of Fairshare, Accounts can also bank time for large computations that are beyond their average Share. For instance, say the Lab knows it has a large parallel run to do, or alternatively a deadline to meet. The Lab can, in preparation for this, not run for several weeks. This will drive up their Fairshare as they will have not used their fraction of the cluster for that time period. This banked capacity can then be expended for a large run or series of runs. On the other hand, to continue the financial analogy, a group that has exhausted their Fairshare is in debt to the scheduler as they have used up far more than their granted Share. Thus they have to wait for that debt to be paid off by not running, which allows their Fairshare to recover. Again, when there is no contention for resources, even jobs with low Fairshare scores will continue to start.

Job Priority

Now that we have discussed Fairshare we can now discuss how an individual job’s priority is calculated. Job Priority is an integer number that adjudicates the position of a job in the pending queue relative to other jobs. There are two components of Job Priority on Cannon. The first is the FairShare score multiplied by a weighting factor to turn it into an integer, in this case 20,000,000. A Fairshare of 1 would give a priority of 20,000,000, while a Fairshare of 0.5 would give a value of 10,000,000. We pick large numbers so we have resolution to break ties between Accounts that are close in Fairshare score. This Fairshare Priority evolves dynamically as the Fairshare of the Account changes over time.

The second component is Job Age. This priority accrues over time gaining a maximum value at 3 days on Cannon and 7 days on FASSE. As the job sits in the queue waiting to be scheduled, its priority is gradually increasing due to the Job Age. The maximum possible value for Job Age is 10,000,000. Thus a job that has been sitting for 3.5 days would have a Job Age Priority of 5,000,000. We set the Job Age Priority to a maximum of 10,000,000 so that a job from an Account with a Fairshare of 0 but has been pending for 3 days on Cannon would have the same priority as a job that was just submitted from an Account that has a Fairshare of 0.5. Thus even jobs from Accounts that have low Fairshare will schedule eventually due to the growth in their Job Age Priority.

These two components are summed together to make up an individual Job’s Priority. You can see this calculation for specific jobs by using the sprio command. In addition you can see the Pending queue of a specific partition ordered by job priority by using showq -o -p <partitionname>.
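
For example (the job ID and partition name below are illustrative):

# Show the priority components (fairshare, job age, etc.) for a specific pending job
sprio -j 12345678

# Show the pending queue for a partition ordered by job priority
showq -o -p test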

Nice

Slurm provides a way for users to adjust their own priority by defining a nice value.  Similar to the unix nice command, this flag allows users to deprioritize certain jobs.  Jobs that are deprioritized should have higher nice values than those that are more important.  Values for nice can run between 0 and 2147483645; negative values are not allowed.
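
For example, to mark a batch job as lower priority than your other work, you could submit it with the --nice flag; the value and script name below are illustrative.

# Deprioritize this job relative to your other jobs
sbatch --nice=10000 my_low_priority_job.sh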

Multiple Accounts

While most users are fine with having one Account they are associated with, some users do work for multiple Accounts. Slurm does have the ability to associate users with multiple Accounts, which allows users to charge back individual jobs to individual Accounts. Contact Research Computing if you are interested in this feature.

Historic Data

Research Computing keeps track of historic data for usage and Fairshare score. You can see your historic usage by going to the Cannon and FASSE Lab Fairshare pages and selecting the lab you belong to (note: you must be on the FASRC VPN to see it).

scalc

scalc is a calculator available on the cluster for figuring out various questions about fairshare. It includes a calculator for projecting a new Fairshare score based on a new RawShare, a calculator for figuring out how long it will take to restore fairshare, and a calculator for figuring out how much a set of jobs will cost in terms of cluster utilization and fairshare. When asked to enter an account name, please enter your lab group name (e.g., jharvard_lab). If you have additional calculations that you would like to see, contact us.

FAQ

Q: My lab’s fairshare is low, what can I do?

There are several things that can be done when your fairshare is low:

  1. Do not run jobs: Fairshare recovers via two routes.  The first is via your group not running any jobs and letting others use the resource.  That allows your fractional usage to decrease which in turn increases your fairshare score.  The second is via the half-life we apply to fairshare which ages out old usage over time.  Both of these methods require no action, only inaction, on the part of your group.  Thus to recover your fairshare simply stop running jobs until your fairshare reaches the level you desire.  Be warned this could take several weeks to accomplish depending on your current usage.
  2. Be patient: This is a corollary to the previous point but applies if you want to continue to run jobs.  Even if your fairshare is low, your job gains priority by sitting in the queue.  The longer it sits the higher priority it gains.  So even if you have very low fairshare your jobs will eventually run; it just may take several days to accomplish.
  3. Leverage Backfill: Slurm runs two scheduling loops.  The first loop is the main loop, which simply looks at the top of the priority chain for the partition and tries to schedule that job.  It will schedule jobs until it hits a job it cannot schedule and then it restarts the loop.  The second loop is the backfill loop.  This loop looks through jobs further down in the queue and asks whether it can schedule a job now without interfering with the start time of the top priority job.  Think of it as the scheduler playing a giant game of three-dimensional Tetris, where the dimensions are number of cores, amount of memory, and amount of time.  If your job will fit in the gaps that the scheduler has, it will put your job in that spot even if it is low priority.  This requires you to be very accurate in specifying the core, memory, and time usage of your job.  The better constrained your job is, the more likely the scheduler is to fit you into these gaps.  The seff and seff-account utilities are great ways of figuring out your job performance. See also our page on improving Job Efficiency.
  4. Leverage Requeue: The requeue partitions are cheaper to run in and have a lot of capacity.  You are more likely to find your job pending for a shorter time, even with low fairshare, in those partitions than in the higher demand non-requeue partitions.
  5. Plan: Better planning and knowledge of your historic usage can help you better budget your time on the cluster.  The cluster is not an infinite resource.  You have been allocated a slice of the cluster, thus it is best to budget your usage so that you can run high priority jobs when you need to.  We at FASRC are happy to consult with you as to how to best budget your usage.  Tools like scalc, seff, seff-array, and the historic usage graphs are invaluable assets for this.  Beyond that, doing analysis of your code efficiency and memory usage will help dramatically.  Most users vastly overestimate how much memory their job actually needs, dragging down their fairshare score over time.  Trimming these excess requests makes for more efficient usage.  Increasing code efficiency by taking time to optimize your code base can also be very beneficial as better, more efficient algorithms mean lower usage and therefore better fairshare.
  6. Purchase: If your group has persistent high demand that cannot be met with your current allocation, serious consideration should be given to purchasing hardware for the cluster.  This is not an immediate solution to the problem as it takes time for hardware to be built and installed.  That said once the hardware arrives your Share will be increased and your fairshare will improve commensurately.  Please contact FASRC for more information if you wish to purchase hardware for the cluster.
Command line access with Terminal (login nodes) https://docs.rc.fas.harvard.edu/kb/terminal-access/ Fri, 31 Aug 2018 11:13:08 +0000 https://www.rc.fas.harvard.edu/?page_id=18944 Preface

This document describes how to get access to the cluster from the command line. Once you have that access you will want to go to the Running Jobs page to learn how to interact with the cluster.

Do not run your jobs or heavy applications such as MATLAB or Mathematica on the login server. Please use an interactive session or job for all applications and scripts beyond basic terminals, editors, etc. The login servers are a shared, multi-user resource. For graphical applications please use Open OnDemand.

Please note: If you did not request cluster access when signing up, you will not be able to log into the cluster or login node as you have no home directory. You will simply be asked for your password over and over. See this doc for how to add cluster access as well as additional groups.

A Note On Shells for Advanced Users: The FASRC cluster uses BASH for the global environment. If you wish to use an alternate shell, please be aware that many things will not work as expected and we do not support or troubleshoot shell issues. We strongly encourage you to stick with BASH as your cluster shell. The module system assumes you are using bash.

Login Nodes

When you ssh to the cluster at login.rc.fas.harvard.edu you get connected to one of our login nodes. Login nodes are split between our Boston and Holyoke datacenters. If you want to target a specific datacenter you can specify either boslogin.rc.fas.harvard.edu (Boston) or holylogin.rc.fas.harvard.edu (Holyoke). You can also connect to a specific login node by connecting to a specific host name. Login nodes do not require VPN to access and are accessible worldwide.

Login nodes are your portal into the cluster and are a shared, multi-user resource. As mentioned above, they are not intended for production work but rather as a gateway. Users should submit jobs to the cluster for production work. For interactive work you should spawn an interactive job on a compute node. If you need graphical support we highly recommend using Open OnDemand.

We limit users to 1 core and 4GB of memory per session and a maximum of 5 sessions per user. Users abusing the login nodes may have their login sessions terminated. In order to clear out stale sessions the login nodes are rebooted as part of our monthly maintenance.

If you need more than 5 sessions, consider adapting your workflow to rely more on submitting batch jobs to the cluster rather than interactive sessions, as the cluster is best utilized when users submit work in an asynchronous fashion.  Using Open OnDemand is also a good option as it gives you a traditional desktop on the cluster with the ability to open multiple terminals on a dedicated compute node.  There are also tools like screen or tmux which allow one login session to expand into multiple subscreens.
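
For example, with tmux you can keep several shells inside a single login session (the session name below is illustrative):

# Start a named tmux session; open additional windows/panes inside it as needed
tmux new -s work

# Detach with Ctrl-b d, then reattach to the same session later
tmux attach -t work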

Connecting via SSH

For command line access to the cluster, connect to login.rc.fas.harvard.edu using SSH (Secure SHell). If you are running Linux or Mac OSX, simply open a terminal and type ssh USERNAME@login.rc.fas.harvard.edu, where USERNAME is the name you were assigned when you received your account (example: jharvard – but not jharvard@fasrc, that is only necessary for VPN). If you are on Windows, see below for SSH client options.

Once connected, enter the password you set after receiving your account confirmation email. When prompted for the Verification code, use the current 6-digit OpenAuth token code.

ssh jharvard@login.rc.fas.harvard.edu

An image showing a terminal window logging into login.rc.fas.harvard.edu. The user enters password and openauth code (java openauth token generator shown overlaid on terminal window)

To avoid login issues, always supply your username in the ssh connection as above, since omitting this will cause your local login name at your terminal to be passed to the login nodes.

SSH Clients

MAC/LINUX/UNIX

If you’re using a Mac, the built-in Terminal application (in Applications -> Utilities) is very good, though there are replacements available (e.g. iTerm2).

On Linux distributions, a terminal application is provided by default. For Linux users looking for the iTerm2-like experience, Tilix is a popular option.

WINDOWS Clients

If you’re using Windows, you will need to decide what tool to use to SSH to the cluster. Each app behaves differently, but includes some way to enter the server (login.rc.fas.harvard.edu) and select a protocol (SSH). Since there’s no one app and many are used by our community, some suggestions follow.

Terminal

Windows 10+ has ssh built into its standard terminal.

Windows Subsystem for Linux (WSL)

Windows 10+ has the ability to start a miniature Linux environment using your favorite flavor of Linux. From the environment you can use all the normal Linux tools, including ssh. See the Windows Subsystem for Linux documentation for more.

PuTTY

PuTTY is a commonly used terminal tool. After a very simple download and install process, just run PuTTY and enter login.rc.fas.harvard.edu in the Host Name box. Just click the Open button and you will get the familiar password and verification code prompts. PuTTY also supports basic X11 forwarding.

Git BASH

For Windows 10 users Git BASH (part of Git for Windows) is available. It brings not only a Git interface, but BASH shell integration to Windows. You can find more info and download it from gitforwindows.org

MobaXterm

MobaXterm provides numerous remote connection types, including SSH and X11. You can find out more and download it from mobaxterm.mobatek.net. There are free and paid versions, and MobaXterm supports X11 forwarding.

XMing (standalone)

XMing is an X11/X Windows application and is a bit more complex. But it’s mentioned here as we do have users who use it for connecting to the cluster. You can find more info at www.straightrunning.com

