MPI (Message Passing Interface) & OpenMPI

Introduction

The Message Passing Interface (MPI) library allows processes in your parallel application to communicate with one another by sending and receiving messages. There is no default MPI library in your environment when you log in to the cluster. You need to choose the desired MPI implementation for your applications, which is done by loading an appropriate MPI module. Currently, the available MPI implementations on our cluster are OpenMPI and MPICH. For both implementations, the MPI libraries are compiled and built with either the Intel compiler suite or the GNU compiler suite. These are organized in software modules.

Installation

MPI comes in many forms; we’ll list a few here. See also User_Codes/Parallel_Computing/MPI.

mpi4py with Python

To use mpi4py you need to load an appropriate Python software module. We have the Anaconda Python distribution from Continuum Analytics. In addition to mpi4py, it includes hundreds of the most popular packages for large-scale data processing and scientific computing.

You can load python in your user environment by running in your terminal:

module load python/2.7.14-fasrc01

For example code, see Parallel_Computing/Python/mpi4py
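As a hedged sketch (the compiler/MPI module versions are the examples from the next section, the python module and mycode.py are placeholders, and mpi4py must be available in the Python environment you load), a batch job running an mpi4py script might look like:

#!/bin/bash
#SBATCH -p test                  # partition (illustrative)
#SBATCH -n 8                     # number of MPI ranks
#SBATCH -t 0-00:30               # run time (D-HH:MM)
#SBATCH --mem-per-cpu=4G         # memory per rank

# load an MPI implementation plus a Python environment providing mpi4py
module load gcc/13.2.0-fasrc01 openmpi/5.0.2-fasrc01
module load python

# launch one Python process per MPI rank
mpirun -np $SLURM_NTASKS python mycode.py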

OpenMPI with GNU Compiler

If you want to use OpenMPI compiled with the GNU compiler you need to load appropriate compiler and MPI modules. Below are some possible combinations, check module spider MODULENAME to get a full listing of possibilities.

# GCC + OpenMPI, e.g.,
module load gcc/13.2.0-fasrc01 openmpi/5.0.2-fasrc01

# GCC + Mpich, e.g.,
module load gcc/13.2.0-fasrc01 mpich/4.2.0-fasrc01

# Intel + OpenMPI, e.g.,
module load intel/24.0.1-fasrc01 openmpi/5.0.2-fasrc01

# Intel + Mpich, e.g.,
module load intel/24.0.1-fasrc01 mpich/4.2.0-fasrc01

# Intel + IntelMPI (IntelMPI runs mpich underneath), e.g.
module load intel/24.0.1-fasrc01 intelmpi/2021.11-fasrc01

For reproducibility and consistency it is recommended to use the complete module name with the module load command, as illustrated above. Modules on the cluster get updated often, so check if there are more recent ones. The modules are set up so that you can only have one MPI module loaded at a time; if you try loading a second one it will automatically unload the first. This is done to avoid dependency collisions.

There are four ways you can set up your MPI on the cluster:

  • Put the module load command in your startup files.
    Most users will find this option most convenient. You will likely only want to use a single version of MPI for all your work. This method also works with all MPI modules currently available on the cluster.

  • Load the module in your current shell.
    For the current MPI versions you do not need to have the module load command in your startup files. If you submit a job the remote processes will inherit the submission shell environment and use the proper MPI library. Note this method does not work with older versions of MPI.

  • Load the module in your job script.
If you will be using different versions of MPI for different jobs, then you can put the module load command in your job script (see the sketch after this list). You need to ensure your script can execute the module load command properly.

  • Do not use modules and set environment variables yourself.
    You obviously do not need to use modules but can hard code paths. However, these locations may change without warning so you should set them in one location only and not scatter them throughout your scripts. This option could be useful if you have a customized local build of MPI you would like to use with your applications.
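Putting this together, here is a hedged sketch of a batch script that loads modules inside the job script (the third option above). The partition, node counts, and module versions are illustrative, and hello_mpi is a hypothetical binary built beforehand with the matching MPI wrapper compiler (e.g., mpicc -O2 -o hello_mpi hello_mpi.c):

#!/bin/bash
#SBATCH -p test                  # partition (illustrative)
#SBATCH -N 2                     # nodes
#SBATCH --ntasks-per-node=8      # MPI ranks per node
#SBATCH -t 0-01:00               # run time (D-HH:MM)
#SBATCH --mem-per-cpu=4G         # memory per rank

# load the same compiler + MPI pairing used to build the binary
module load gcc/13.2.0-fasrc01 openmpi/5.0.2-fasrc01

# launch the MPI program across all allocated ranks
mpirun -np $SLURM_NTASKS ./hello_mpi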

Examples

For associated MPI examples, head over to  User_Codes/Parallel_Computing/MPI.

  • Example 1: Monte-Carlo calculation of π
  • Example 2: Integration of x² over the interval [0, 4] with 80 integration points and the trapezoidal rule
  • Example 3: Parallel Lanczos diagonalization with reorthogonalization and MPI I/O

Job Efficiency and Optimization Best Practices

Overview

The art of High Performance Computing is really the art of getting the most out of the computational resources you have access to. This applies to working on a laptop, to working in the cloud, or working on a supercomputer. While this diversity of different systems and environments may seem intimidating, in reality there are some good general rules and best practices that you can use to get the most out of your code and the computer you are on.

As defined in our Glossary, the term job has a broad and a narrow sense. In the broad sense, a job is an individual run of an application, code, or script, and may be used interchangeably with those terms; this holds whether you run it from the command line, from a cron job, or through a scheduler. In the narrow sense, a job is an individual allocation for a user by the scheduler. It is usually obvious from the context which is meant.

By Job Efficiency, we mean that the parameters of the job in terms of CPUs, GPUs, memory, network, time, etc. (refer to the Glossary for definitions) are accurately defined and match what the job actually uses. As an example, a job that asks for 100 cores but only uses 1 is not efficient. A job that asks for 100GB and uses 99GB is efficient. Efficiency is a measure of how well the user has scoped their job so that it can run in the space defined.

Finally Job Optimization means to make the job run at the maximum speed possible with the least amount of resources used. For example, a poorly optimized code may only use 50% of the gpu it was allocated, whereas a well optimized one could use 100% and see acceleration commensurate with that improved usage. Similarly, a poorly optimized code may use 1TB of memory, but a well optimized code may only use 100GB. Optimization is a measure of how well structured a code is numerically, both in terms of algorithm and implementation, so that it can get to the solution in the fastest, most accurate, and most economical way.

Efficiency and Optimization are thus two sides of the same coin. Efficiency is about accurately defining the resources that you will use and optimization is about reducing that usage. Both have the goal of getting the most out of the resources the job is using.

Architecture

Schematic showing how cores, memory, and nodes are arranged on the cluster.

Before we get into talking about Job Efficiency and Optimization, we should first discuss general cluster architecture. Supercomputers (aka clusters) are essentially a bunch of computers of similar type networked together by a high-speed interconnect so that they can act in unison with each other and have a common computational environment. The fundamental building block of a cluster is a node. Each node is composed of a bunch of cores that all talk to the same block of memory. GPU nodes have, in addition to cores and memory, GPUs, which can be used for specialized workflows such as machine learning. The nodes are then strung together with a network, typically Infiniband, and then a scheduler is put in front to handle which jobs get which resources.

CPU/GPU Type

Typically a cluster will be made up of a uniform set of hardware. However that is not necessarily the case. At FASRC we run a variety of hardware, spanning multiple generations and vendors. These different types of CPU and GPU have various performance characteristics and features that may have impacts on your job. We will talk about this later but being cognizant of what hardware your job works best on will be important for efficient and optimal use of the cluster.  At FASRC we split up our partitions such that each partition has a uniform set of hardware, unless otherwise noted (e.g. gpu_requeue and serial_requeue). A comprehensive list of available hardware can be found in the Job Constraints section of the Running Jobs page. You can learn more about a specific node’s hardware by running scontrol show node NODENAME.

Network Topology

How the nodes are interconnected on the Infiniband is known as the topology.  Topology becomes very important as you run larger and larger jobs, more on that later. At FASRC we generally follow a Fat Tree Infiniband topology, with a hierarchy of switching and adjacent nodes being close to each other network-wise.

We name our nodes after their location in the datacenter, which makes it easy to figure out which nodes are next to each other both in terms of space and in terms of network. Our naming convention for nodes is Datacenter / GPU or not / Row / Pod / Rack / Chassis / Blade. So for example the node holy7c04301 is at our Holyoke Data Center in Row 7 Pod C Rack 04 Chassis 3 Blade 01; the adjacent nodes would be holy7c04212 (which is the last blade on the chassis below Chassis 3, as there are 12 blades per chassis for this hardware) and holy7c04302. Another example is holygpu8a11404: this GPU node is in our Holyoke Data Center in Row 8 Pod A Rack 11 Chassis 4 Blade 4, and the adjacent nodes would be holygpu8a11403 and holygpu8a11501 (which is the next blade on the chassis above Chassis 4, as there are only 4 blades per chassis for this hardware). You can see the full topology of the cluster by doing scontrol show topology.

Job Efficiency

The first step in improving job efficiency is understanding your job as it exists today. Understanding means you have a good handle on the resource needs and characteristics of your job, and thus you are able to accurately allocate resources to it, thereby improving efficiency. As a general rule, you should always understand the jobs you run, regardless of size. This knowledge is beneficial both for right-sizing your requested resources and for noticing any pitfalls that may occur when scaling the job up.

There are two ways of learning about your job. The first is to have a fundamental understanding of the job you are running. Based on your knowledge of the algorithm, code, job script, and cluster architecture, you know what you should request for core count, gpu count, memory, time, and storage. Knowing your code at this level will allow you to make the most accurate estimates for what you will use.

While a full understanding of your job is ideal, often it is not possible. You may not control the code base, you may just be getting started, or you may not have time to obtain a deep understanding of the job. Even in cases where you have a good theoretical understanding of your job, you need to confirm that knowledge with hard data. In that case, the second method is to test your job empirically and find out what the best job parameters are. Simply take an example that you know will be akin to what you will run in production and run it as a test job. Once the test job is done, check to see how it performed. You then repeat, changing the job parameters until you have a good understanding of how your job performs in different situations.

That’s the rough sketch, but the details are a bit different depending on what you want to understand. Below are some methods for finding out how much memory, cores, gpus, time, and storage your job will need. These may not cover every job but should work for most situations.

Memory

Memory on the cluster is doled out in two different ways, either by node (--mem) or by core (--mem-per-cpu). If your job exceeds its memory request, you will see an error either containing Out of Memory or oom. This indicates that the scheduler terminated your job for exceeding your memory request. You will need to increase your memory allocation to give your job more space.

Here is a test plan for figuring out how much memory you should request:

  1. Come up with an initial guess as to how much memory your job will require. A good first guess is usually 2-3 times the size of any data files you are reading in, or 2-3 times the size of the data you will be generating. If you do not know either of those then a safe initial guess is 4GB. Most of the cluster has 4GB per core, so it's a good initial guess that will allow you to get through the scheduler quickly.
  2. Run a test job on the test partition with your guess.
  3. Check the result of your run using seff or sacct. (Note: GPU stats (usage and onboard GPU memory) are not given by either of these commands, only CPU usage and CPU memory)
  4. If your job ran out of memory then double the amount and return to step 2. If it ran properly (i.e., no out of memory error), then look at how much your job actually used and update your request to match, with an additional 10% buffer, as the scheduler samples memory usage every 30 seconds and may have missed short-term memory spikes. (A minimal example of this loop follows the list.)
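A hedged sketch of that test loop (the partition, memory request, and job script name my_test_job.sh are illustrative; seff takes the numeric job ID reported by sbatch):

# submit a test job with your initial memory guess
sbatch -p test -t 0-01:00 --mem=4G my_test_job.sh

# after it finishes, check what it actually used
seff <jobid>
sacct -j <jobid> --format=JobID,State,Elapsed,ReqMem,MaxRSS

# if it died with an out-of-memory error, double --mem and resubmit;
# otherwise set --mem to roughly 110% of the reported MaxRSS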

Every time you change a parameter in your code, you should check to see how the memory changes. Some parameters will not change the memory usage at all. Others will change it dramatically. If you do not know whether a parameter will change the memory usage, run a test to see how it behaves.

If you are working to scale up a job, it's good to understand how your memory usage will scale as the job grows. For example, say you are running a three-dimensional code and you increase the resolution of the box you are simulating by a factor of 2. That means that your memory usage will grow by a factor of 8, because each dimension grew by a factor of 2. Likewise, if you are running a simulation that ingests data, memory will likely scale linearly with the amount of data you ingest. Testing by increments is the best way to validate how your memory usage will grow in each situation.

One important warning is to make sure to use the correct memory for each type of job. When your job runs, the scheduler blocks off a segment of memory for you to use, regardless of whether you actually use it. If your job asks for 100GB but only uses 1GB, the scheduler will give you 100GB and your fairshare will also be charged for that 100GB. In addition, if you had asked for 1GB your job may have been better able to fit into the gaps in the scheduler, as 1GB of memory is easier to find than 100GB. Efficient use of the cluster means selecting the right amount of memory for whatever job you are running at that time. A quick way to spot jobs with incorrect memory settings is to use the seff-account command, which will plot a histogram of your job memory efficiency over a specified period.

Cores

Slurm does not automatically parallelize jobs. Even if you ask for thousands of cores, if your job is not set up to run in parallel, it will just run on a single core and the other cores will remain idle. Thus when in doubt about your code, err on the side of asking for a single core, and then check the code's documentation or contact the primary author to find out whether it is parallel and what method it uses.

Broadly, parallel applications fall into two categories: thread based and rank based. Thread based parallelism relies on a shared memory space and thus is constrained to a single node; this includes things like OpenMP, pthreads, and Python multiprocessing. Rank based parallelism relies on individual processes that have their own dedicated memory space and communicate with each other to share information; the main example of this is MPI (Message Passing Interface). It is important to understand which method your job uses, as that makes a difference in how you ask for resources in Slurm and in how many cores you can reasonably ask for.

Once you figure out whether your code is thread based or rank based, you can then do a scaling test to figure out how your code behaves as you add more cores. There are two types of scaling tests you can do, and they test slightly different parts of your code. The first type is called strong scaling. In this test, you keep the size of the problem the same while increasing the number of cores you use. In an ideal world your job should go twice as fast every time you double the number of cores you use. Most codes, though, do not have ideal scaling. Instead, various inefficiencies in the algorithm or the size of the job itself mean that there is a point of diminishing returns where adding more cores does not improve speed. Typically when you plot a chart of strong scaling you will see:

Strong Scaling Plot (log-log): the ideal scaling line is in red and the experimental data in black; the experimental curve bends away from the ideal at the point where scaling becomes inefficient.

In this example, the user would not want to run their code with more than 256 cores because after that point adding more cores has diminishing returns with respect to improving performance.

The second type is called weak scaling. In this test you increase the size of the job proportionally to the number of cores asked for. So if you double the cores, you would double the job size. Job size in this case is the amount of total computational work your job does. For instance, you might double the amount of data ingested (assuming your code's computational needs increase linearly as you increase the data ingested) or double one of the dimensions of a multidimensional grid. In an ideal world, your job should take the same amount of time to run if the job size grows linearly with the core count. Most codes, though, do not have this ideal scaling. Instead, various communication inefficiencies or nonlinear growth in processing time can impact the performance of the job, and thus adding more cores would be inefficient beyond a certain point. A typical plot for weak scaling looks like:

 

Weak Scaling Plot (log-linear): the ideal scaling line is in red and the experimental data in black; the experimental curve bends away from the ideal at the point where scaling becomes inefficient.

In this example, the user would not want to run this job with more than about 1000 cores as, after that point, the run time grows substantially from the ideal.
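One hedged way to generate the data for such scaling plots (scale_test.sbatch is a hypothetical job script that runs your problem and reads the core count from $SLURM_NTASKS; partition, time, and memory are illustrative):

# strong scaling: same problem size, increasing core counts
for n in 1 2 4 8 16 32 64 128 256; do
    sbatch -p test -n "$n" -t 0-02:00 --mem-per-cpu=4G \
           --job-name="scale_$n" scale_test.sbatch
done

# once the jobs finish, compare elapsed times, e.g.:
sacct --name=scale_256 --format=JobName,AllocCPUS,Elapsed,TotalCPU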

Besides these more robust scaling tests, you can get a quick view of your job's core usage efficiency by using the seff or seff-account command. Those commands take the ratio of two numbers. The first number is how much time the job actually spent computing on its cores (t_cpu); this is the CPU time the system records for your job. Note that for historical reasons CPU and core are used interchangeably with respect to CPU time; regardless of the name, what is meant is the amount of time the system detects as being spent computing on a specific set of cores. The second number is your elapsed run time multiplied by the number of cores (t_elapsed*n). If your job scales perfectly, your CPU efficiency (t_cpu/(t_elapsed*n)) will be 100%. If it is less than 100%, that ratio tells you roughly how much to reduce your core request by. So if your job uses 8 cores but seff reports an efficiency of 50%, you should reduce your ask to 4 cores instead. This is also a good way to check quickly whether your job is scaling at all: if you see your job only using one core, then either your job is not parallel or something is wrong, and you need to investigate why it is not scaling.

Time to Science (TtS)

With these two tests you can figure out the maximum number of cores you should ask for. That said, even if your code scales perfectly you will probably not want to ask for the maximum number of cores you can. The reason is that the more cores you ask for, the longer your job will pend in the queue waiting for resources to become available. Time to Science (TtS) is the sum of the amount of time your job pends plus the amount of time your job runs; you want to minimize both. Counterintuitively, it may be the case that asking for fewer cores will mean your job pends for a substantially shorter time, enough to make up for the loss in the run's speed.

As an illustration, say your code scales perfectly and your job of 256 cores will take 1 day to run. However, it turns out that you will be spending 2 days pending in the queue waiting for your job to launch, thus your total TtS is 3 days. After more investigation you find out that if you ask for 128 cores, your job will take 2 days to run but the scheduler will be able to launch it in 4 hours, leading to a TtS of about 2.2 days. You can see that the 128 core job was “faster” than the 256 core job, simply due to the fact that the 128 core job fit better in the scheduler at that moment.

It should be noted that the scheduler state is fluid, and thus you should inspect the queue before submitting. You can test when your job would be scheduled to run by adding the --test-only flag to your sbatch line; the scheduler will then print back when it thinks the job would execute, without actually submitting it. This is a good way of right-sizing your job.
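For example (the script name and resource requests are illustrative):

# ask the scheduler to estimate a start time without submitting the job
sbatch --test-only -p test -n 256 -t 1-00:00 my_job.sbatch

# compare with a smaller request to see which would start sooner
sbatch --test-only -p test -n 128 -t 2-00:00 my_job.sbatch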

Topology

For certain codes, layout on the node (i.e. which cores on which CPU) and cluster (i.e. where the nodes are located relative to each other) matters. In these cases the topology of the run is critical to getting the peak speed out of the job. Without deep knowledge of the code base, it’s hard to know if your code is one of these codes, and in most cases your code is not.

In cases where the topology of the run matters, Slurm provides a number of options to require the scheduler to give you a certain layout for the job. Both the sbatch and srun commands have options to this effect. Note that the more constraints you add to a job the longer it will take the scheduler to find resources for your job to use. One should set the minimum necessary restrictions on a job to give the scheduler maximum flexibility. As before, it may be the case that you may see a significant speed up if given the right topology but if it comes at the cost of having to wait significantly longer to run, your TtS may actually not improve or even get worse.

GPUs

Many of the same rules that apply to cores also apply to GPUs. For most codes, your job will use a single GPU. If your code uses multiple GPUs, then you can follow the same process as above for cores to see how your code scales. Note that currently GPU efficiency is not recorded in Slurm. As such, you will want to use other tools like DCGM and nvtop to get statistics on how your job is doing.

Time

It should be stated upfront that Slurm does not charge you fairshare for time you do not use. If you ask for 3 days and only use 2 hours, the scheduler will only charge you for the 2 hours you actually used. This is different from Memory, Cores, and GPUs, where you will be charged for allocating those resources whether you use them or not, as the scheduler had to block them off for you and could not give them to anyone else.

Accurately estimating time is important not for the sake of fairshare, but rather for the sake of scheduling. The scheduler only knows what you tell it; if you tell it that a job takes 3 days, it will assume it takes 3 days even if it really takes 2 hours. Thus when the scheduler considers the job for scheduling, it will look for an allocation block matching the length of the job you request. A more accurate time estimate means that the scheduler can fit your job into tighter spots in the giant game of Tetris it is playing. Taking our previous example, it may be that there are no spots right now for a 3 day job, but a 2 hour job may run immediately because there is a gap that the scheduler can fit it into while waiting to schedule a large high priority job. This behavior is called backfill, and is one of the two loops the scheduler engages in when scheduling. Leveraging the backfill loop is important as it is the main method through which low priority jobs, even those with zero fairshare, get scheduled. You can leapfrog ahead of higher priority jobs because your job happens to fill a gap.

Assuming you are running on the same hardware (for considerations regarding different types of hardware see the next section), you can reliably predict the runtime for certain classes of jobs. Simply run a test job and then look at how long it took using sacct or seff. If you run a bunch of jobs, you can use seff-account to see the distribution of run times. Once you have the runtime, round it up to the nearest hour and that should cover most situations. Run times can vary for various reasons, but typically not by more than 10%, so if your job takes 10 hours, you should ask for around 12 hours.

Finally, a word about minimum run times. As described above, your goal is to minimize Time to Science (TtS). You may naively think that asking for very short amounts of time would decrease TtS even more, but this is incorrect. The scheduler takes time to actually schedule jobs, no matter how small your job is. To put it bluntly, you do not want the scheduler doing more work to schedule your job than your actual job is doing. For super short jobs the scheduler can get into a thrashing state where it schedules a job, the job exits immediately, and then the scheduler has to fill that slot again, similar to trying to fill a tub with the drain open. To prevent this, we require jobs to run for at least 10 minutes; ideally jobs would last for an hour or longer. Thus when you are doing work on the cluster, try to batch your work into increments longer than 10 minutes and ideally longer than an hour. This will help the scheduler, and make sure your TtS is as short as possible.
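One hedged way to do that batching inside a single job (process_case.sh and case_list.txt are hypothetical; the resource requests are illustrative):

#!/bin/bash
#SBATCH -p test
#SBATCH -c 1
#SBATCH -t 0-02:00
#SBATCH --mem=4G

# run many short tasks back to back inside one allocation,
# instead of submitting each as its own sub-10-minute job
while read -r case; do
    ./process_case.sh "$case"
done < case_list.txt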

Hardware

For similar job types, the run time is usually the same, with the important caveat that you need to run on the same hardware. Different types of cores and GPUs have different capabilities and speeds. It is important to know how your job behaves as you switch between them. We have a table of relative speeds on the Fairshare page. It should be noted that that table only applies if your code is fully utilizing the hardware in question (more on that in the optimization section); you should always test your code to see how it actually performs, as certain CPU and GPU types may work better for your code than others despite what the officially advertised benchmarks say. While we generally validate vendor advertised performance numbers, they only apply to heavily optimized codes designed for those specific chips, so your code speed may vary substantially.

If you are submitting to gpu_requeue or serial_requeue you will notice that your run times vary quite a bit. This is because gpu_requeue and serial_requeue are mosaic partitions with a wide variety of hardware and thus a wide variety of performance. In cases like that you can either be very specific about which type of hardware you want using the --constraint option, or you can simply increase your time estimate to be the maximum you expect it to take on the slowest hardware available. A good rule of thumb is a factor of three variance in speed. So if your job takes 3 hours on most hardware, give it 9 hours on serial_requeue, as you may end up on a substantially slower host.
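A hedged example of the first approach, added to a job script (the feature tag shown is illustrative; see the Job Constraints section of the Running Jobs page for the actual tags):

#SBATCH -p serial_requeue
#SBATCH --constraint="intel"     # restrict the job to nodes carrying this feature tag
#SBATCH -t 0-09:00               # or, without a constraint, budget for the slowest hardware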

Storage

The final thing that can impact Job Efficiency is the storage you use. Nothing can drag down a fast code faster than slow IO speed (Input/Output). To select the right storage, please read our Data Storage Workflow page. In general, for jobs you will want to use either Global Scratch or Local Scratch. If your job is IO heavy (i.e. it is constantly talking to the storage), Local Scratch is strongly preferred. Please also see the Data Storage Workflow page for how to best lay out your file structure, as file structure layout can impact job performance as well.

Job Optimization

Now that we have dealt with Job Efficiency, the next thing to look at is Job Optimization; after all, once your job is properly structured, the only way to further improve your Time to Science (TtS) and increase your code's capability is to improve the code itself. Job optimization can be very beneficial but can also take significant time. There are in general three methods to optimize your code, each taking a different amount of time.

  1. Compiler Version, Library Version, Containers and Optimization: Compilers, Libraries, and Containers are code as well and thus subject to improvement. Simply changing or updating your compiler, libraries, and container can sometimes lead to dramatic increases in performance. In addition compilers have different optimization flags you can use that will automatically optimize your code. This option is the fastest way to get optimized code as all the work is already done for you, you just need to select the right compiler, libraries, container, and options.
  2. Partial Code Rewrite: Looking through the code as it exists now and reworking portions can create speed ups. The process of reworking your code consists of finding places where the code is spending significant time and then refactoring the code; either by updating the logic, replacing the numerical method, or by substituting an optimized library. This process can take a few weeks to months but can give substantial increases in speed. However, this method cannot fix the basic structural problems with the code.
  3. Full Code Rewrite: This could take a significant amount of time depending on the complexity of the code (for large codes this can take six months to a year to complete) but is the best way to optimize your code. It will allow you to fundamentally understand how your code operates and fix any major structural problems, resulting in transformative increases in performance. If you go this route you should try the other two options first, as doing them will give you a good understanding of the quirks of your code. You should also do a cost-benefit analysis to figure out if the time spent is worth the potential gains. Make sure the project has a firm end goal in mind. If your code needs continual improvement, it may be time to hire a Research Software Engineer to do that very important and necessary code development work.

Regardless of the method, you will need to grow more acquainted with your code, its numerical methods, and how it interacts with the underlying hardware. While there are some generalized rules and things to look for when optimizing code, in the end it will depend on you turning your code from a blackbox into something you understand at a fundamental level. This is also where learning how to use various debuggers and code inspectors can be very beneficial as they can help identify which portions of the code to focus on.

Important final note: Always reconfirm the results of your code whenever you change your optimization. This goes for any changes to your code, but especially when you recompile with different compilers, optimization levels, libraries, etc. You should have a standard battery of tests with known results that you can run to reconfirm that the results did not change, or that any changes are acceptable. Optimization can change the numerical methods and order of operations, leading to numerical drift. Sometimes that drift is fine, as it's at the edge of the mantissa. Sometimes, though, those changes at the edge of the mantissa can build up and lead to substantial changes in results. Even if your code is confirmed as working, always have a healthy suspicion of your code results and engage in independent verification, as code bugs and faults can produce results that look legitimate but were arrived at by faulty logic or code.

Below we are going to give some general rules regarding optimization as well as suggestions as to different ways to go about it.

Compiler Optimization

Compiler optimization means letting the compiler look through the code for things it can improve automatically. Maybe it will change the memory layout to make it more optimal, maybe it will notice that you are using a certain numerical technique and substitute in a better one, maybe it will change the order of operations to improve numerical speed. Regardless of what it tries, compiler optimization relies on the authors of the compiler and their deep knowledge of numerical methods and the underlying hardware to get improved speed. Compiler optimization mainly applies to those compiling from C, C++, or Fortran, but higher-level codes like Python and R, which lean on libraries written in C, C++, and Fortran, can also benefit. Thus if you want to really optimize your Python or R code, getting those underlying libraries built in an optimized way can lead to speed ups.

There is a generally agreed upon standard for most compilers with regard to level of optimization. After all not all optimizations are numerically safe, or will produce gains in speed. Some may in fact slow things down. As such when using compiler optimization test your code speed and accuracy at different levels of optimization, and with different compilers. Each compiler has a different implementation of the standard, some are better for certain things than others. It is also worth reading the documentation for the compiler optimization levels to see what is included. A good exercise is to take each individual flag that makes up an optimization level and test to see if it speeds up your code and if it introduces numerical issues.

The standard code optimization (-O) levels are:

-O0: No optimization at all. The compiler compiles your code as is, with no optimization work done. If you turn on debugging, this is typically the level your code will default to.

-O1: Numerically safe optimization. This level of optimization is guaranteed to be numerically stable and safe. No corners are cut, no compromises in numerical precision are made, nothing is reordered.

-O2: Mostly numerically safe optimization. This level of optimization is the default level for most compilers. At this level, in most cases, the optimizations made are numerically okay. Generally there is no sacrifice of numerical precision, though loops may be unrolled and reordered to make things more efficient.

-O3: Heavily optimized. This level of optimization takes the approach of trying to include every possible optimization whether numerically safe or not.

As you can see the various levels of optimization make certain assumptions about how numerically safe it is trying to be. Given this, you should always test your code to make sure that it runs as it should after compilation and does not produce errant results.
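A hedged way to compare levels (my_code.c and test_input.dat are placeholders; the same idea applies to gfortran, icx, and other compilers):

# build the same code at several optimization levels and time a reference run
for level in O0 O1 O2 O3; do
    gcc -"$level" -o my_code_"$level" my_code.c
    echo "== -$level =="
    /usr/bin/time -f "%e seconds" ./my_code_"$level" test_input.dat
done

# also compare each build's numerical output against a trusted reference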

One other common optimization is to leverage special features found in different chipsets. Each generation of CPU has different features built into it that you can leverage. Some example features are SSE (Streaming SIMD Extensions), FMA (Fused Multiply Add), and AVX (Advanced Vector Extensions). If your code is architected to use them, you can gain substantial speed by enabling these optimizations. There are three ways to do this (a sketch follows the list below):

  1. Turn on each feature individually: This allows you to pick and choose which you want and makes your compiled code portable across different chipsets.
  2. Specify the chipset you are building for: Compilers include flags that allow you to target a specific type of chip and include all the relevant optimizations for it. This approach works well if you have a uniform set of hardware you are running on, or if you are not sure what features your code will leverage. Note your code will not work on other chipsets.
  3. Have the compiler autodetect what chipset you are using: Compilers usually have a flag (e.g., -xHost for the Intel compilers) that will detect the chipset you are currently on and then build specifically for that. To do this properly you will need to make sure you are on a node of the same type that you will run your code on. In addition, your code will not be portable.
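A hedged illustration of the three approaches using GNU and Intel compiler flags (flag availability and the chipset name shown vary by compiler version and hardware):

# 1. enable individual features by hand
gcc -O2 -mavx2 -mfma -o my_code my_code.c

# 2. target a named chipset
gcc -O2 -march=skylake-avx512 -o my_code my_code.c

# 3. autodetect the chipset of the node you are compiling on
gcc -O2 -march=native -o my_code my_code.c     # GNU
icx -O2 -xHost -o my_code my_code.c            # Intel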

It is worth noting again that not all optimizations are safe or beneficial. Heavily optimized code can lead to substantial bloat in memory usage with little material gain. Numerical issues may occur if the compiler makes bad assumptions about your code. You should only use up to the level of optimization that is stable and beneficial and no higher. If an optimization has no impact on your code performance, it is best to leave it off.

Important final note: Always remove debugging flags and options when running in production. Debugging flags can disable or limit optimization even if you tell the compiler to optimize, as debugging options can override the optimization flags. Before going to production, remove debugging flags, recompile, and test your code for accuracy and performance.

Languages, Numerical Method, and Libraries

Selecting the correct language, numerical method, and libraries are important parts of code optimizations. You always want to select the right tool for the job. For some situations Python is good enough, for others you really need Fortran. An improved numerical method may give enormous speed ups but at the cost of increased memory, or vice versa. Swapping out code you wrote for a library maintained by a domain expert may be faster, or having a more integrated code may end up being quicker.

With languages, you are usually locked into a specific one unless you do a complete code rewrite. As such, you should learn the quirks of the language you are using and make sure your code conforms to the best standards and practices for that language. If you are looking to rewrite your code, then consider changing which language you are using. It may be that a different language may lead to more speed ups in the future. As a general rule, languages that are closer to the hardware (things like C or Fortran) can be made to go faster, but they also are trickier to use.

For numerical methods you will want to stay abreast of the current literature in your field and the field that generated the relevant numerical technique (e.g. matrix multiplication, sorting). Even small changes to a numerical technique can add up to large gains in speed. They can also dramatically impact memory utilization. Simplicity is also important, as in many cases a simpler method is faster just by dint of having to do less math and logic. This is not true in all cases though, so be sure to test and verify.

Libraries are another important tool in the toolbox. By using a library you leverage someone else's time and experience to write optimized code. This saves you from having to debug and optimize the code yourself; you simply plug in the library and go. Libraries can still have flaws though, so you want to make sure you keep up to date and test. If you do find flaws you should contribute back to the community (i.e., report to the library's developers) so that everyone benefits from the improvement you suggest. One other caution with libraries is that sometimes it is better to inline the code rather than call out to a library, as the gains from using the library may not outweigh the cost of accessing it. Libraries will not automatically make your code faster, but rather are a tool you can use to potentially get more speed and efficiency.

Containers

Probably the ultimate form of library is a well maintained container. Well optimized containers have the advantage of providing a highly customized stack of optimized libraries that will allow the code to get near its peak performance. Containers are powerful tools for especially complex software stacks, as the container can provide optimization for each individual element of the stack and ensure that all the various versions interoperate properly. Containers are not free though and do have some performance overhead, due to the abstraction layer between the software in the container and the system hardware. For peak performance you will want to build your software stack outside of the container in the native environment. In most cases, though, the performance penalties are minimal and substantially outweighed by the performance gains of using a well maintained and well optimized software stack that you do not have to build yourself. For best performance look for containers that are provided by the hardware vendor. For instance, Nvidia provides a well curated list of containers built for its GPUs. The vendor typically has the best knowledge of the internals of their hardware and thus will know how to get the most out of it. Containers provided by primary code authors are also good sources, as the code author will have the best knowledge of the internals of their code base and how to best run it.

Containers can also be handy for users dealing with operations or code bases involving many files. By including these files in the container itself, you effectively hide them from the underlying storage: the storage treats the container as one large single file rather than lots of smaller files. Filesystems in general behave best when interacting with single large files, as traversing between files is expensive, especially when there are many files to deal with. Thus if your workflow either uses software made up of many files, like a Python/Anaconda/Mamba environment, or engages in IO across many files, consider putting them inside a container.

Other General Rules

Here are some rules that did not fit into other sections but are things you can look for when optimizing your code.

  1. Remove Debugging Flags and Options: Cannot emphasize this enough. Production code should not be run in debug mode as it will slow things down substantially.
  2. Use the latest compilers and libraries: Implied above, but one of the first things to try is updating your compiler and library versions to see if the various improvements to those codes improve your code performance.
  3. Leave informative comments in your code: Comments are free and having good comments can help you understand your code and improve it. A very good practice is to cite the paper and specific equation or analysis you are using so you can find the original context.
  4. Make sure your loops are appropriately ordered for your arrays: Different languages have different array ordering as to which index is fastest to traverse in memory (for instance Fortran orders its arrays with the first index being fastest, in C it is the opposite). Be aware of this and arrange your arrays and loops appropriately.
  5. Avoid if statements buried in loops: if statements are not free and cost time to execute; it is best to execute them once outside the loop rather than on every iteration.
  6. Use temporary variables to hold constants: Multiplications are faster than divisions or exponents. Thus, instead of pi/2, use 0.5*pi; instead of 5^2, use 5*5. In addition, if you have a complicated coefficient you are multiplying or dividing by repeatedly, consider calculating that coefficient once and storing it as a temporary variable. If your coefficients are related to each other by some constant value, also consider making that a constant. For instance, if you are always using 4*pi/3, store that as a variable, and then use the variable wherever it appears.
  7. Use the right type, size, and level of precision for variables: Integer math is faster than floating point math. Single precision math is faster than double precision. 4 byte integers use up less space than 8 byte integers. Select the size and type necessary for the numerical precision and accuracy you require and no larger.
  8. If you have a heavy arithmetic section consider using small temporary arrays for the data you are manipulating: Long strings of math in a single line are hard for the compiler to optimize, and also invite mistakes. Consider breaking them up into smaller chunks that eventually sum up to the total value you need. Be careful of round-off error and order of operations issues with this.
  9. Lower your cache miss rate: CPUs and GPUs are built with onboard memory (typically called cache); you should try to keep your processing in this onboard memory and only go out to main memory when necessary. Cache is faster to access and generally small, so doing things in smaller chunks that reuse data makes it more likely your working set stays in cache.
  10. Be aware of first touch rule for memory allocation: Memory is typically allocated on an at-need basis, and the further the code needs to search in memory, the worse the performance. Allocate frequently used arrays and variables first.
  11. Reduce memory footprint: As a general rule you want to keep your memory usage to the bare minimum you need. The more temporary arrays and variables you keep the more memory bloat your code will have.
  12. Avoid over abstraction: Pointers are useful, but pointers to pointers to pointers are not. It makes it hard for the compiler to optimize and for you (and anyone that uses your code) to follow the code.
  13. Be specific and well defined: A well structured code is easier to optimize. Declare all your variables up front, allocate your arrays as soon as you can, do not leave the variable types ambiguous.
  14. Work in Memory and Not on Disk: Accessing storage, no matter how fast, takes far more time than accessing memory. Try to only read and write to storage when necessary. If possible spin off a separate process to handle reading and writing to disk so that your main process can continue work.
  15. Avoid large numbers of files: It is better from an IO performance standpoint to have a small number of large files on disk rather than many small files. Bundle your data together into larger files that you read from or write to all at once.
  16. Include Restarts/Checkpoints: Include the ability for your code to pick up from where it left off by writing restart/checkpoint data to disk. The restart data should be only what is sufficient to pick up from where your calculation left off. This will allow you to recover from crashes and leverage the requeue partitions. Restarts will also allow you to use partitions with shorter time limits to bridge yourself to a longer run (e.g. using ten 3 day runs to accumulate a 30 day run). A minimal sketch of a self-resubmitting, checkpointed job follows this list.
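Here is that hedged sketch for item 16, assuming a hypothetical application my_sim that periodically writes checkpoint.dat and creates finished.flag when the full calculation is done (partition, cores, and memory are illustrative; the script is assumed to be saved as my_sim.sbatch):

#!/bin/bash
#SBATCH -p shared
#SBATCH -c 16
#SBATCH -t 3-00:00
#SBATCH --mem=64G

# restart from the latest checkpoint if one exists
if [ -f checkpoint.dat ]; then
    ./my_sim --restart checkpoint.dat
else
    ./my_sim
fi

# if the calculation has not finished, chain another 3-day segment
if [ ! -f finished.flag ]; then
    sbatch my_sim.sbatch
fi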

Parallelization

There are limits to how fast you can make any single code run in serial. Once this limit is hit, parallelization needs to be considered. Sometimes this parallelization is trivial, such as launching thousands of jobs at once, each with different parameters, to do a parameter sweep (this is known as an embarrassingly parallel workflow). However, if your code needs to be tightly coupled then other methods of parallelism will need to be considered. The three main methods of parallelization are:

  1. SIMD: Single Instruction Multiple Data
  2. Thread: Shared Memory
  3. Rank: Distributed Memory

Regardless of which method you use, the general rule is that you want to make sure as much of your code is parallelized as possible and that communication and computation are overlapped with each other. It is also possible to use SIMD in conjunction with Threads in conjunction with Ranks; this is known as the hybrid approach. Hybrid codes can be very powerful and can scale up to the largest supercomputers in the world.

Some libraries and codes (for example MATLAB, PETSc, OpenFOAM, Python Multiprocessing, HDF5) already have parallelization included. Check the documentation and/or ask the developer whether the code can parallelize and what method it uses. Once you know that, you will be able to get the most out of the built-in parallelization.

SIMD (Single Instruction Multiple Data)

Most processors have multiple channels that can execute a specific command simultaneously on a stream of data. This is built into the chipset itself, and compilers will automatically optimize code to leverage this behavior. You can intentionally design your code to better leverage it depending on which specific compiler and instruction set you are using (such as AVX).

Threads

Threading achieves parallelism by having a shared memory space and then running multiple computational streams (threads) across it to accomplish specific instructions. Thread based parallelism is typically fairly easy to accomplish, as it requires no complex interprocess communications; all the changes to memory are readily visible to each thread. Typically all the coder needs to do is indicate which loops and sections can be threaded, and the compiler takes care of the rest. Examples are OpenMP, OpenACC, Pthreads, and CUDA.

Rank

Rank based parallelism is the most powerful but also the most technically demanding type of parallelism. Each process has its own memory space, and the user has to manage inter-process communication themselves. The key here is making sure that communication bottlenecks are minimal and, where they exist, overlapping them with computation so they do not slow down code execution. The industry standard for doing this is called MPI (Message Passing Interface).

Profiling

Knowing where to focus your time for optimizing your code is important. You will gain the most speed by optimizing the part of your code that is currently occupying the most execution time, or using the most memory.  To figure this out you need to profile your code.

The easiest and most immediate way is to use print statements combined with printing how much time each section takes. Most languages have methods for printing out time stamps or calculating elapsed time; simply use those methods with judicious use of print statements and you can quickly find out where your code is spending most of its time. Generally you should instrument your code to give you overall timing estimates, especially if your code works in some sort of large loop (such as taking time steps in a fluid dynamics solver). Print statements are the quickest and easiest way to get information on your code.
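A hedged shell-level version of the same idea (my_app, preprocess.sh, and input.dat are placeholders); for per-section timing inside the code itself you would use your language's own timer calls:

# coarse whole-run timing plus peak memory for a single executable
/usr/bin/time -v ./my_app input.dat

# quick timestamps around the stages of a workflow script
echo "start preprocessing: $(date +%T)"
./preprocess.sh
echo "start solver:        $(date +%T)"
./my_app input.dat
echo "done:                $(date +%T)"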

Besides print statements, various profilers exist that you can use to inspect your code. Profilers will give you far more information about your code, as well as suggestions as to where your code could be improved. They can give you super precise timing for your code as well as inform you what cache/memory level your code is touching. All of this rich information can be valuable for dialing in on particularly small sections of code or subtle issues that may be causing dramatic slow downs.

Below is a list of profilers you can use:

  • VTune: Intel’s profiler
  • NSight: Nvidia’s profiler
  • DCGM: Data Center GPU Manager from Nvidia
  • top: Not really a profiler but a useful system utility for monitoring live job performance.
  • nvtop: Similar to top but for GPUs.

 

Singularity

Introduction

Singularity enables users to have full control of their operating system (OS) environment. This allows a non-privileged user (i.e., one without root, sudo, or administrator rights) to “swap out” the Linux operating system and environment on the host machine (i.e., the cluster’s OS) for another Linux OS and computing environment that they can control (i.e., the container’s OS). For instance, the host system runs Rocky Linux but your application requires CentOS or Ubuntu Linux with a specific software stack. You can create a CentOS or Ubuntu image containing your software and its dependencies, and run your application on that host in its native CentOS or Ubuntu environment. Singularity leverages the resources of the host system, such as the high-speed interconnect (e.g., InfiniBand), high-performance parallel file systems (e.g., the Lustre /n/netscratch and /n/holylfs filesystems), GPUs, and other resources (e.g., licensed Intel compilers).

Note for Windows and MacOS: Singularity only supports Linux containers. You cannot create images that use Windows or MacOS (this is a restriction of the containerization model rather than Singularity).

Docker/Podman vs. Singularity

Podman (a Docker-compatible software tool for managing containers) is also supported on FASRC clusters. There are some important differences between Docker/Podman and Singularity:

  • Singularity allows running containers as a regular cluster user.
  • Docker/Podman and Singularity have their own container formats.
  • Docker/Podman (Open Container Initiative) containers may be imported and run via Singularity (see the example below).
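For instance, a hedged example of converting a public Docker Hub image into a Singularity SIF file (the image name is illustrative):

# convert a Docker Hub image into a Singularity SIF file
singularity pull docker://ubuntu:22.04

# then use the resulting image like any other SIF file
singularity exec ubuntu_22.04.sif cat /etc/os-release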

Singularity, SingularityCE, Apptainer

SingularityCE (Singularity Community Edition) and Apptainer are branches/children of the deprecated Singularity. SingularityCE is maintained by Sylabs while Apptainer is maintained by the Linux Foundation. By and large the two are interoperable with slightly different feature sets. The cluster uses SingularityCE, which we will refer to in this document as Singularity.

Singularity Glossary

  • SingularityCE or Apptainer or Podman or Docker: the containerization software
    • as in “SingularityCE 3.11” or “Apptainer 1.0”
  • Image: a compressed, usually read-only file that contains an OS and specific software stack
  • Container
    • The technology, e.g. “containers vs. virtual machines”
    • An instance of an image, e.g. “I will train a model using a Singularity container of PyTorch.”
  • Host: computer/supercomputer where the image is run

Singularity on the cluster

To use Singularity on the cluster one must first start an interactive, Open OnDemand, or batch job. Then you simply run singularity:

[jharvard@holy2c04309 ~]$ singularity --version
singularity-ce version 4.2.2-1.el8

SingularityCE Documentation

The SingularityCE User Guide has the latest documentation. You can also see the most up-to-date help on SingularityCE from the command line:

[jharvard@holy2c04309 ~]$ singularity --help

Linux container platform optimized for High Performance Computing (HPC) and
Enterprise Performance Computing (EPC)

Usage:
  singularity [global options...]

Description:
  Singularity containers provide an application virtualization layer enabling
  mobility of compute via both application and environment portability. With
  Singularity one is capable of building a root file system that runs on any
  other Linux system where Singularity is installed.

Options:
  -c, --config string   specify a configuration file (for root or
                        unprivileged installation only) (default
                        "/etc/singularity/singularity.conf")
  -d, --debug           print debugging information (highest verbosity)
  -h, --help            help for singularity
      --nocolor         print without color output (default False)
  -q, --quiet           suppress normal output
  -s, --silent          only print errors
  -v, --verbose         print additional information
      --version         version for singularity

Available Commands:
  build       Build a Singularity image
  cache       Manage the local cache
  capability  Manage Linux capabilities for users and groups
  completion  Generate the autocompletion script for the specified shell
  config      Manage various singularity configuration (root user only)
  delete      Deletes requested image from the library
  exec        Run a command within a container
  help        Help about any command
  inspect     Show metadata for an image
  instance    Manage containers running as services
  key         Manage OpenPGP keys
  oci         Manage OCI containers
  overlay     Manage an EXT3 writable overlay image
  plugin      Manage Singularity plugins
  pull        Pull an image from a URI
  push        Upload image to the provided URI
  remote      Manage singularity remote endpoints, keyservers and OCI/Docker registry credentials
  run         Run the user-defined default command within a container
  run-help    Show the user-defined help for an image
  search      Search a Container Library for images
  shell       Run a shell within a container
  sif         Manipulate Singularity Image Format (SIF) images
  sign        Add digital signature(s) to an image
  test        Run the user-defined tests within a container
  verify      Verify digital signature(s) within an image
  version     Show the version for Singularity

Examples:
  $ singularity help <command> [<subcommand>]
  $ singularity help build
  $ singularity help instance start


For additional help or support, please visit https://www.sylabs.io/docs/

Working with Singularity Images

Singularity uses a portable, single-file container image format known as the Singularity Image Format (SIF). You can scp or rsync these to the cluster as you would do with any other file. See Copying Data to & from the cluster using SCP or SFTP for more information. You can also download them from various container registries or build your own.

When working with images you can run them interactively, execute commands inside them, use them in batch jobs, and build your own, as described in the sections below.

For more examples and details, see the SingularityCE quick start guide.

Working with Singularity Images Interactively

Singularity syntax

singularity <command> [options] <container_image.sif>
Commands
  • shell: run an interactive shell inside the container
  • exec: execute a command
  • run: launch the runscript
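
For instance, with a hypothetical image my_image.sif, the three commands above would be used as follows:

singularity shell my_image.sif        # interactive shell inside the container
singularity exec my_image.sif ls /    # run a single command
singularity run my_image.sif          # execute the container's runscript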

For this example, we will use the laughing cow Singularity image from Sylabs library.

First, request an interactive job (for more details about interactive jobs on Cannon and FASSE, see the respective documentation) and download the laughing cow lolcow_latest.sif Singularity image:

# request interactive job
[jharvard@holylogin01 ~]$ salloc -p test -c 1 -t 00-01:00 --mem=4G

# pull image from Sylabs library
[jharvard@holy2c02302 sylabs_lib]$ singularity pull library://lolcow
INFO:    Downloading library image
90.4MiB / 90.4MiB [=====================================] 100 % 7.6 MiB/s 0s

shell

With the shell command, you can start a new shell within the container image and interact with it as if it were a small virtual machine.

Note that the shell command does not source ~/.bashrc and ~/.bash_profile. This makes the shell command useful when you do not want the customizations in your ~/.bashrc and ~/.bash_profile to be sourced inside the Singularity container.

# launch container with shell command
[jharvard@holy2c02302 sylabs_lib]$ singularity shell lolcow_latest.sif

# test some linux commands within container
Singularity> pwd
/n/holylabs/LABS/jharvard_lab/Users/jharvard/sylabs_lib
Singularity> ls -l
total 95268
-rwxr-xr-x 1 jharvard jharvard_lab  2719744 Mar  9 14:27 hello-world_latest.sif
drwxr-sr-x 2 jharvard jharvard_lab     4096 Mar  1 15:21 lolcow
-rwxr-xr-x 1 jharvard jharvard_lab 94824197 Mar  9 14:56 lolcow_latest.sif
drwxr-sr-x 2 jharvard jharvard_lab     4096 Mar  1 15:23 ubuntu22.04
Singularity> id
uid=21442(jharvard) gid=10483(jharvard_lab) groups=10483(jharvard_lab)
Singularity> cowsay moo
 _____
< moo >
 -----
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

# exit the container
Singularity> exit
[jharvard@holy2c02302 sylabs_lib]$

exec

The exec command allows you to execute a custom command within a container by specifying the image file. For instance, to execute the cowsay program within the lolcow_latest.sif container:

[jharvard@holy2c02302 sylabs_lib]$ singularity exec lolcow_latest.sif cowsay moo
 _____
< moo >
 -----
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
[jharvard@holy2c02302 sylabs_lib]$ singularity exec lolcow_latest.sif cowsay "hello FASRC"
 _____________
< hello FASRC >
 -------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

run

Singularity containers may contain runscripts. These are user defined scripts that define the actions a container should perform when someone runs it. The runscript can be triggered with the run command, or simply by calling the container as though it were an executable.

Using the run command:

[jharvard@holy2c02302 sylabs_lib]$ singularity run lolcow_latest.sif
 _____________________________
< Thu Mar 9 15:15:56 UTC 2023 >
 -----------------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

Running the container as though it were an executable file:

[jharvard@holy2c02302 sylabs_lib]$ ./lolcow_latest.sif
 _____________________________
< Thu Mar 9 15:17:06 UTC 2023 >
 -----------------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

To view the runscript of a Singularity image:

[jharvard@holy2c02302 sylabs_lib]$ singularity inspect -r lolcow_latest.sif

#!/bin/sh

    date | cowsay | lolcat

GPU Example

First, start an interactive job in the gpu or gpu_test partition and then download the Singularity image.

# request interactive job on gpu_test partition
[jharvard@holylogin01 gpu_example]$ salloc -p gpu_test --gres=gpu:1 --mem 8G -c 4 -t 60

# build singularity image by pulling container from Docker Hub
[jharvard@holygpu7c1309 gpu_example]$ singularity pull docker://tensorflow/tensorflow:latest-gpu
INFO:    Converting OCI blobs to SIF format
INFO:    Starting build...
Getting image source signatures
Copying blob 521d4798507a done
Copying blob 2798fbbc3b3b done
Copying blob 4d8ee731d34e done
Copying blob 92d2e1452f72 done
Copying blob 6aafbce389f9 done
Copying blob eaead16dc43b done
Copying blob 69cc8495d782 done
Copying blob 61b9b57b3915 done
Copying blob eac8c9150c0e done
Copying blob af53c5214ca1 done
Copying blob fac718221aaf done
Copying blob 2047d1a62832 done
Copying blob 9a9a3600909b done
Copying blob 79931d319b40 done
Copying config bdb8061f4b done
Writing manifest to image destination
Storing signatures
2023/03/09 13:52:18  info unpack layer: sha256:eaead16dc43bb8811d4ff450935d607f9ba4baffda4fc110cc402fa43f601d83
2023/03/09 13:52:19  info unpack layer: sha256:2798fbbc3b3bc018c0c246c05ee9f91a1ebe81877940610a5e25b77ec5d4fe24
2023/03/09 13:52:19  info unpack layer: sha256:6aafbce389f98e508428ecdf171fd6e248a9ad0a5e215ec3784e47ffa6c0dd3e
2023/03/09 13:52:19  info unpack layer: sha256:4d8ee731d34ea0ab8f004c609993c2e93210785ea8fc64ebc5185bfe2abdf632
2023/03/09 13:52:19  info unpack layer: sha256:92d2e1452f727e063220a45c1711b635ff3f861096865688b85ad09efa04bd52
2023/03/09 13:52:19  info unpack layer: sha256:521d4798507a1333de510b1f5474f85d3d9a00baa9508374703516d12e1e7aaf
2023/03/09 13:52:21  warn rootless{usr/lib/x86_64-linux-gnu/gstreamer1.0/gstreamer-1.0/gst-ptp-helper} ignoring (usually) harmless EPERM on setxattr "security.capability"
2023/03/09 13:52:54  info unpack layer: sha256:69cc8495d7822d2fb25c542ab3a66b404ca675b376359675b6055935260f082a
2023/03/09 13:52:58  info unpack layer: sha256:61b9b57b3915ef30727fb8807d7b7d6c49d7c8bdfc16ebbc4fa5a001556c8628
2023/03/09 13:52:58  info unpack layer: sha256:eac8c9150c0e4967c4e816b5b91859d5aebd71f796ddee238b4286a6c58e6623
2023/03/09 13:52:59  info unpack layer: sha256:af53c5214ca16dbf9fd15c269f3fb28cefc11121a7dd7c709d4158a3c42a40da
2023/03/09 13:52:59  info unpack layer: sha256:fac718221aaf69d29abab309563304b3758dd4f34f4dad0afa77c26912aed6d6
2023/03/09 13:53:00  info unpack layer: sha256:2047d1a62832237c26569306950ed2b8abbdffeab973357d8cf93a7d9c018698
2023/03/09 13:53:15  info unpack layer: sha256:9a9a3600909b9eba3d198dc907ab65594eb6694d1d86deed6b389cefe07ac345
2023/03/09 13:53:15  info unpack layer: sha256:79931d319b40fbdb13f9269d76f06d6638f09a00a07d43646a4ca62bf57e9683
INFO:    Creating SIF file...

Run the container with GPU support, see available GPUs, and check if tensorflow can detect them:

# run the container
[jharvard@holygpu7c1309 gpu_example]$ singularity shell --nv tensorflow_latest-gpu.sif
Singularity> nvidia-smi
Thu Mar  9 18:57:53 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   35C    P0    25W / 250W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:2F:00.0 Off |                    0 |
| N/A   36C    P0    23W / 250W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   35C    P0    25W / 250W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  On   | 00000000:D8:00.0 Off |                    0 |
| N/A   33C    P0    23W / 250W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

# check if `tensorflow` can see GPUs
Singularity> python
Python 3.8.10 (default, Jun 22 2022, 20:18:18)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from tensorflow.python.client import device_lib
2023-03-09 19:00:15.107804: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> print(device_lib.list_local_devices())
2023-03-09 19:00:20.010087: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-09 19:00:24.024427: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /device:GPU:0 with 30960 MB memory:  -> device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:06:00.0, compute capability: 7.0
2023-03-09 19:00:24.026521: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /device:GPU:1 with 30960 MB memory:  -> device: 1, name: Tesla V100-PCIE-32GB, pci bus id: 0000:2f:00.0, compute capability: 7.0
2023-03-09 19:00:24.027583: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /device:GPU:2 with 30960 MB memory:  -> device: 2, name: Tesla V100-PCIE-32GB, pci bus id: 0000:86:00.0, compute capability: 7.0
2023-03-09 19:00:24.028227: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /device:GPU:3 with 30960 MB memory:  -> device: 3, name: Tesla V100-PCIE-32GB, pci bus id: 0000:d8:00.0, compute capability: 7.0

... omitted output ...

incarnation: 3590943835431918555
physical_device_desc: "device: 3, name: Tesla V100-PCIE-32GB, pci bus id: 0000:d8:00.0, compute capability: 7.0"
xla_global_id: 878896533
]

Running Singularity Images in Batch Jobs

You can also use Singularity images within a non-interactive batch script as you would any other command. If your image contains a runscript, you can use singularity run to execute it in the job. You can also use singularity exec to execute arbitrary commands (or scripts) within the image.

Below is an example batch-job submission script using the laughing cow lolcow_latest.sif to print out information about the native OS of the image.

File singularity.sbatch:

#!/bin/bash
#SBATCH -J singularity_test
#SBATCH -o singularity_test.out
#SBATCH -e singularity_test.err
#SBATCH -p test
#SBATCH -t 0-00:10
#SBATCH -c 1
#SBATCH --mem=4G

# Singularity command line options
singularity exec lolcow_latest.sif cowsay "hello from slurm batch job"

Submit a slurm batch job:

[jharvard@holy2c02302 jharvard]$ sbatch singularity.sbatch

Upon job completion, the standard output is located in the file singularity_test.out:

[jharvard@holy2c02302 jharvard]$ cat singularity_test.out
  ____________________________
< hello from slurm batch job >
 ----------------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

GPU Example Batch Job

File singularity_gpu.sbatch (be sure to include the --nv flag after singularity exec):

#!/bin/bash
#SBATCH -J singularity_gpu_test
#SBATCH -o singularity_gpu_test.out
#SBATCH -e singularity_gpu_test.err
#SBATCH -p gpu
#SBATCH --gres=gpu:1
#SBATCH -t 0-00:10
#SBATCH -c 1
#SBATCH --mem=8G

# Singularity command line options
singularity exec --nv lolcow_latest.sif nvidia-smi

Submit a slurm batch job:

[jharvard@holy2c02302 jharvard]$ sbatch singularity_gpu.sbatch

Upon job completion, the standard output is located in the file singularity_gpu_test.out:

$ cat singularity_gpu_test.out
Thu Mar  9 20:40:24 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   35C    P0    25W / 250W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Accessing Files

Files and directories on the cluster are accessible from within the container. By default, directories under /n, $HOME, $PWD, and /tmp are available at runtime inside the container.

See these variables on the host operating system:

[jharvard@holy2c02302 jharvard]$ echo $PWD
/n/holylabs/LABS/jharvard_lab/Lab/jharvard
[jharvard@holy2c02302 jharvard]$ echo $HOME
/n/home01/jharvard
[jharvard@holy2c02302 jharvard]$ echo $SCRATCH
/n/netscratch

The same variables within the container:

[jharvard@holy2c02302 jharvard]$ singularity shell lolcow_latest.sif
Singularity> echo $PWD
/n/holylabs/LABS/jharvard_lab/Lab/jharvard
Singularity> echo $HOME
/n/home01/jharvard
Singularity> echo $SCRATCH
/n/netscratch

You can specify additional directories from the host system so that they are accessible from inside the container. This process is called bind mounting and is done with the --bind option.

For instance, first create a file hello.dat in the /scratch directory on the host system. Then you can access it from within the container by bind mounting /scratch to the /mnt directory inside the container:

[jharvard@holy2c02302 jharvard]$ echo 'Hello from file in mounted directory!' > /scratch/hello.dat
[jharvard@holy2c02302 jharvard]$ singularity shell --bind /scratch:/mnt lolcow_latest.sif
Singularity> cd /mnt/
Singularity> ls
cache  hello.dat
Singularity> cat hello.dat
Hello from file in mounted directory!

If you don’t use the --bind option, the file will not be available in the directory /mnt inside the container:

[jharvard@holygpu7c1309 sylabs_lib]$ singularity shell lolcow_latest.sif
Singularity> cd /mnt/
Singularity> ls
Singularity>

Submitting Jobs Within a Singularity Container

Note: Submitting jobs from within a container may or may not work out of the box. This is due to possible environmental variable mismatch, as well as operating system and image library issues. It is important to validate that submitted jobs are properly constructed and operating as expected. If possible it is best to submit jobs outside the container in the host environment.

If you would like to submit Slurm jobs from inside the container, you can bind the directories where the Slurm executables are located. The environmental variable SINGULARITY_BIND stores the directories of the host system that are accessible from inside the container. Thus, Slurm commands can be made accessible by adding the following code to your Slurm batch script before the singularity execution:

export SINGULARITY_BIND=$(tr '\n' ',' <<END
/etc/nsswitch.conf
/etc/slurm
/etc/sssd/
/lib64/libnss_sss.so.2:/lib/libnss_sss.so.2
/slurm
/usr/bin/sacct
/usr/bin/salloc
/usr/bin/sbatch
/usr/bin/scancel
/usr/bin/scontrol
/usr/bin/scrontab
/usr/bin/seff
/usr/bin/sinfo
/usr/bin/squeue
/usr/bin/srun
/usr/bin/sshare
/usr/bin/sstat
/usr/bin/strace
/usr/lib64/libmunge.so.2
/usr/lib64/slurm
/var/lib/sss
/var/run/munge:/run/munge
END
)
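
With those bind paths exported, submitting a job from inside the container might look like the following minimal sketch (the script name my_job.sbatch is a hypothetical placeholder):

# the SINGULARITY_BIND export above must run first, in the same shell or batch script
singularity exec lolcow_latest.sif sbatch my_job.sbatch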

Build Your Own Singularity Container

You can build or import a Singularity container in different ways. Common methods include:

  1. Download an existing container from the SingularityCE Container Library or another image repository. This will download an existing Singularity image to the FASRC cluster.
  2. Build a SIF from an OCI container image located in Docker Hub or another OCI container registry (e.g., quay.io, NVIDIA NGC Catalog, GitHub Container Registry). This will download the OCI container image and convert it into a Singularity container image on the FASRC cluster.
  3. Build a SIF file from a Singularity definition file directly on the FASRC cluster.
  4. Build an OCI-SIF from a local Dockerfile using option --oci. The resulting image can be pushed to an OCI container registry (e.g., Docker Hub) for distribution/use by other container runtimes such as Docker.

NOTE: For all options above, you need to be on a compute node. The Singularity on the cluster section above explains how to request an interactive job on Cannon and FASSE.

Download Existing Singularity Container from Library or Registry

Download the laughing cow (lolcow) image from Singularity library with singularity pull:

[jharvard@holy2c02302 ~]$ singularity pull lolcow.sif library://lolcow
INFO:    Starting build...
INFO:    Using cached image
INFO:    Verifying bootstrap image /n/home05/jharvard/.singularity/cache/library/sha256.cef378b9a9274c20e03989909930e87b411d0c08cf4d40ae3b674070b899cb5b
INFO:    Creating SIF file...
INFO:    Build complete: lolcow.sif

Download a custom JupyterLab and Seaborn image from the Seqera Containers registry (which builds/hosts OCI and Singularity container images comprising user-selected conda and Python packages):

[jharvard@holy2c02302 ~]$ singularity pull oras://community.wave.seqera.io/library/jupyterlab_seaborn:a7115e98a9fc4dbe
INFO:    Downloading oras
287.0MiB / 287.0MiB [=======================================] 100 % 7.0 MiB/s 0s

Download Existing Container from Docker Hub

Build the laughing cow (lolcow) image from Docker Hub:

[jharvard@holy2c02302 ~]$ singularity pull lolcow.sif docker://sylabsio/lolcow
INFO:    Starting build...
Getting image source signatures
Copying blob 5ca731fc36c2 done
Copying blob 16ec32c2132b done
Copying config fd0daa4d89 done
Writing manifest to image destination
Storing signatures
2023/03/01 10:29:37  info unpack layer: sha256:16ec32c2132b43494832a05f2b02f7a822479f8250c173d0ab27b3de78b2f058
2023/03/01 10:29:38  info unpack layer: sha256:5ca731fc36c28789c5ddc3216563e8bfca2ab3ea10347e07554ebba1c953242e
INFO:    Creating SIF file...
INFO:    Build complete: lolcow.sif

Build the latest Ubuntu image from Docker Hub:

[jharvard@holy2c02302 ~]$ singularity pull ubuntu.sif docker://ubuntu
INFO:    Converting OCI blobs to SIF format
INFO:    Starting build...
INFO:    Fetching OCI image...
INFO:    Extracting OCI image...
INFO:    Inserting Singularity configuration...
INFO:    Creating SIF file...
[jharvard@holy2c02302 ~]$ singularity exec ubuntu.sif head -n 1 /etc/os-release
PRETTY_NAME="Ubuntu 24.04.1 LTS"

Note that to build images from Docker Hub or another OCI registry, you can use either the build or the pull command.
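
For example, the pull commands above could equivalently be written with build (a sketch; the output file names are arbitrary):

# equivalent to: singularity pull lolcow.sif docker://sylabsio/lolcow
[jharvard@holy2c02302 ~]$ singularity build lolcow.sif docker://sylabsio/lolcow

# equivalent to: singularity pull ubuntu.sif docker://ubuntu
[jharvard@holy2c02302 ~]$ singularity build ubuntu.sif docker://ubuntu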

Build a Singularity Container from Singularity Definition File

Singularity supports building images from definition files using --fakeroot. This feature leverages rootless containers.

Step 1: Write/obtain a definition file. You will need a definition file specifying environment variables, packages, etc. Your SingularityCE image will be based on this file. See SingularityCE definition file docs for more details.

This is an example of the laughing cow definition file:

Bootstrap: docker
From: ubuntu:22.04

%post
    apt-get -y update
    apt-get -y install cowsay lolcat

%environment
    export LC_ALL=C
    export PATH=/usr/games:$PATH

%runscript
    date | cowsay | lolcat

Step 2: Build SingularityCE image

Build laughing cow image.

[jharvard@holy8a26602 jharvard]$ singularity build --fakeroot lolcow.sif lolcow.def
INFO: Starting build...
INFO: Fetching OCI image...
28.2MiB / 28.2MiB [=================================================================================================================================================] 100 % 27.9 MiB/s 0s
INFO: Extracting OCI image...
INFO: Inserting Singularity configuration...
INFO: Running post scriptlet
... omitted output ...

Running hooks in /etc/ca-certificates/update.d...
done.
INFO:    Adding environment to container
INFO:    Adding runscript
INFO:    Creating SIF file...
INFO:    Build complete: lolcow.sif
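
Once the build completes, you can check the image with the run command described earlier; per the %runscript section, it should print the current date through cowsay and lolcat:

[jharvard@holy8a26602 jharvard]$ singularity run lolcow.sif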

Building a Singularity container from a Dockerfile: OCI mode

SingularityCE supports building containers from Dockerfiles in OCI mode, using a bundled version of the BuildKit container image builder found in recent versions of Docker. The result is an OCI-SIF image file (as opposed to native mode, which builds a SIF image from a Singularity definition file). OCI mode also enables the Docker-like --compat flag, which enforces a greater degree of isolation between the container and the host environment for Docker/Podman/OCI compatibility.

An example OCI Dockerfile:

FROM ubuntu:22.04 

RUN apt-get -y update \ 
 && apt-get -y install cowsay lolcat

ENV LC_ALL=C PATH=/usr/games:$PATH 

ENTRYPOINT ["/bin/sh", "-c", "date | cowsay | lolcat"]

Build the OCI-SIF (Note that on the FASRC cluster the XDG_RUNTIME_DIR environment variable currently needs to be explicitly set to a node-local user-writable directory, such as shown below):

[jharvard@holy2c02302 ~]$ XDG_RUNTIME_DIR=$(mktemp -d) singularity build --oci lolcow.oci.sif Dockerfile
INFO:    singularity-buildkitd: running server on /tmp/tmp.fCjzW2QnfV/singularity-buildkitd/singularity-buildkitd-3709445.sock
... omitted output ...
INFO:    Terminating singularity-buildkitd (PID 3709477)
WARNING: removing singularity-buildkitd temporary directory /tmp/singularity-buildkitd-2716062861                                                                           
INFO:    Build complete: lolcow.oci.sif

To run the ENTRYPOINT command (equivalent to the Singularity definition file runscript):

[jharvard@holy2c02302 ~]$ singularity run --oci lolcow.oci.sif

OCI mode limitations

  • (As of SingularityCE 4.2) If the Dockerfile contains “USER root” as the last USER instruction, the singularity exec/run --fakeroot or --no-home options must be specified to use the OCI-SIF, or a tmpfs error will result.
  • Portability note: Apptainer does not support OCI mode, and OCI-SIF files cannot be used with Apptainer.

BioContainers

Cluster nodes automount a CernVM File System (CVMFS) repository at /cvmfs/singularity.galaxyproject.org/. This provides a universal file system namespace for Singularity images from the BioContainers project, which comprises container images automatically generated from Bioconda software packages. The Singularity images are organized into a directory hierarchy following the convention:

/cvmfs/singularity.galaxyproject.org/FIRST_LETTER/SECOND_LETTER/PACKAGE_NAME:VERSION--CONDA_BUILD

For example:

singularity exec /cvmfs/singularity.galaxyproject.org/s/a/samtools:1.13--h8c37831_0 samtools --help

The Bioconda package index lists all software available in /cvmfs/singularity.galaxyproject.org/, while the BioContainers registry provides a searchable interface.

NOTE: There will be a 10-30 second delay when first accessing /cvmfs/singularity.galaxyproject.org/ on a compute node on which it is not currently mounted; in addition, there will be a delay when accessing a Singularity image on a compute node where it has not already been accessed and cached to node-local storage.
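
As with any other Singularity image, these can also be used in a batch job. Below is a minimal sketch (the partition, resources, and samtools invocation are illustrative placeholders):

#!/bin/bash
#SBATCH -J samtools_test
#SBATCH -o samtools_test.out
#SBATCH -e samtools_test.err
#SBATCH -p test
#SBATCH -t 0-00:10
#SBATCH -c 1
#SBATCH --mem=4G

singularity exec /cvmfs/singularity.galaxyproject.org/s/a/samtools:1.13--h8c37831_0 samtools --version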

BioContainer Images in Docker Hub

A small number of BioContainers images are available only in Docker Hub under the biocontainers organization, and are not available on Cannon under /cvmfs/singularity.galaxyproject.org/.

See the BioContainers GitHub repository for a complete list of BioContainers images available in Docker Hub (note that many of the applications listed in that repository have since been ported to Bioconda, and are thus available in /cvmfs/singularity.galaxyproject.org, but a subset are still only available in Docker Hub).

These images can be fetched and built on Cannon using the singularity pull command:

singularity pull docker://biocontainers/<image>:<tag>

For example, for the container cellpose with tag 2.1.1_cv1 (see the cellpose Docker Hub page):

[jharvard@holy2c02302 bio]$ singularity pull --disable-cache docker://biocontainers/cellpose:2.1.1_cv1
INFO:    Converting OCI blobs to SIF format
INFO:    Starting build...
2023/03/13 15:58:16  info unpack layer: sha256:a603fa5e3b4127f210503aaa6189abf6286ee5a73deeaab460f8f33ebc6b64e2
INFO:    Creating SIF file...

The SIF image file cellpose_2.1.1_cv1.sif will be created:

[jharvard@holy2c02302 bio]$ ls -lh
total 2.5G
-rwxr-xr-x 1 jharvard jharvard_lab 2.4G Mar 13 15:59 cellpose_2.1.1_cv1.sif
-rwxr-xr-x 1 jharvard jharvard_lab  72M Mar 13 12:06 lolcow_latest.sif

BioContainer and Package Tips

  • The registry https://biocontainers.pro may be slow.
  • We recommend first checking the Bioconda package index, as it quickly provides a complete list of Bioconda packages, all of which have a corresponding BioContainers image in /cvmfs/singularity.galaxyproject.org/.
  • If an image doesn't exist there, there is a small chance one was generated from a Dockerfile in the BioContainers GitHub repository.
  • If your package is listed in the BioContainers GitHub repository, search for it in Docker Hub under the biocontainers organization (e.g., search for biocontainers/<package>).

Parallel computing with Singularity

Singularity is capable of both OpenMP and MPI parallelization. OpenMP is mostly straightforward: you need OpenMP-enabled code built with an OpenMP-capable compiler, and then you set the usual OpenMP environment variables, as sketched below. We have an example code in our User Codes repo. MPI, on the other hand, is much more involved.
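
A minimal sketch of running an OpenMP binary inside a container (the image name omp_test.sif and binary omp_test.x are hypothetical placeholders; host environment variables such as OMP_NUM_THREADS are passed into the container by default):

# request an interactive job with 4 cores for 1 task
salloc -p test -c 4 -t 00-00:30 --mem=4G

# run the OpenMP binary inside the container with 4 threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
singularity exec omp_test.sif ./omp_test.x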

MPI Applications

The goal of the following instructions is to help you run Message Passing Interface (MPI) programs using Singularity containers on the FASRC cluster. The MPI standard is used to implement distributed parallel applications across compute nodes of a single HPC cluster, such as Cannon, or across multiple compute systems. The two major open-source implementations of MPI are Mpich (and its derivatives, such as Mvapich) and OpenMPI. The most widely used MPI implementation on Cannon is OpenMPI.

There are several ways of developing and running MPI applications using Singularity containers, where the most popular method relies on the MPI implementation available on the host machine. This approach is named Host MPI or the Hybrid model since it uses both the MPI implementation on the host and the one in the container.

The key idea behind the Hybrid method is that when you execute a Singularity container with an MPI application, you call mpiexec, mpirun, or srun (e.g., when using the Slurm scheduler) on the singularity command itself. The MPI processes outside of the container then work together with the MPI runtime inside the container to initialize the parallel job. Therefore, it is very important that the MPI flavors and versions inside the container and on the host match.
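
Schematically, the launch pattern looks like this (a sketch; the image and application names are placeholders):

# the host MPI (via srun) launches the ranks; each rank runs the application inside the container
srun -n $SLURM_NTASKS --mpi=pmix singularity exec my_mpi_image.sif /path/to/mpi_app.x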

Code examples below can be found on our User Codes repo.

Example MPI Code

To illustrate how Singularity can be used with MPI applications, we will use a simple MPI code implemented in Fortran 90, mpitest.f90:

!=====================================================
! Fortran 90 MPI example: mpitest.f90
!=====================================================
program mpitest
  implicit none
  include 'mpif.h'
  integer(4) :: ierr
  integer(4) :: iproc
  integer(4) :: nproc
  integer(4) :: i
  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD,nproc,ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD,iproc,ierr)
  do i = 0, nproc-1
     call MPI_BARRIER(MPI_COMM_WORLD,ierr)
     if ( iproc == i ) then
        write (6,*) 'Rank',iproc,'out of',nproc
     end if
  end do
  call MPI_FINALIZE(ierr)
  if ( iproc == 0 ) write(6,*)'End of program.'
  stop
end program mpitest

Singularity Definition File

To build Singularity images you need to write a definition file, where the exact implementation will depend on the MPI flavor available on the host machine.

OpenMPI

If you intend to use OpenMPI, the definition file could look like, e.g., the one below:

Bootstrap: yum
OSVersion: 7
MirrorURL: http://mirror.centos.org/centos-%{OSVERSION}/%{OSVERSION}/os/$basearch/
Include: yum

%files
  mpitest.f90 /home/

%environment
  export OMPI_DIR=/opt/ompi
  export SINGULARITY_OMPI_DIR=$OMPI_DIR
  export SINGULARITYENV_APPEND_PATH=$OMPI_DIR/bin
  export SINGULARITYENV_APPEND_LD_LIBRARY_PATH=$OMPI_DIR/lib

%post
  yum -y install vim-minimal
  yum -y install gcc
  yum -y install gcc-gfortran
  yum -y install gcc-c++
  yum -y install which tar wget gzip bzip2
  yum -y install make
  yum -y install perl

  echo "Installing Open MPI ..."
  export OMPI_DIR=/opt/ompi
  export OMPI_VERSION=4.1.1
  export OMPI_URL="https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-$OMPI_VERSION.tar.bz2"
  mkdir -p /tmp/ompi
  mkdir -p /opt
  # --- Download ---
  cd /tmp/ompi
  wget -O openmpi-$OMPI_VERSION.tar.bz2 $OMPI_URL && tar -xjf openmpi-$OMPI_VERSION.tar.bz2
  # --- Compile and install ---
  cd /tmp/ompi/openmpi-$OMPI_VERSION
  ./configure --prefix=$OMPI_DIR && make -j4 && make install
  # --- Set environmental variables so we can compile our application ---
  export PATH=$OMPI_DIR/bin:$PATH
  export LD_LIBRARY_PATH=$OMPI_DIR/lib:$LD_LIBRARY_PATH
  export MANPATH=$OMPI_DIR/share/man:$MANPATH
  # --- Compile our application ---
  cd /home
  mpif90 -o mpitest.x mpitest.f90 -O2

MPICH

If you intend to use MPICH, the definition file could look like, e.g., the one below:

Bootstrap: yum
OSVersion: 7
MirrorURL: http://mirror.centos.org/centos-%{OSVERSION}/%{OSVERSION}/os/$basearch/
Include: yum

%files
  /n/home06/pkrastev/holyscratch01/Singularity/MPI/mpitest.f90 /home/

%environment
  export SINGULARITY_MPICH_DIR=/usr

%post
  yum -y install vim-minimal
  yum -y install gcc
  yum -y install gcc-gfortran
  yum -y install gcc-c++
  yum -y install which tar wget gzip
  yum -y install make
  cd /root/
  wget http://www.mpich.org/static/downloads/3.1.4/mpich-3.1.4.tar.gz
  tar xvfz mpich-3.1.4.tar.gz
  cd mpich-3.1.4/
  ./configure --prefix=/usr && make -j2 && make install
  cd /home
  mpif90 -o mpitest.x mpitest.f90 -O2
  cp mpitest.x /usr/bin/

Building Singularity Image

You can use the below commands to build your Singularity images, e.g.:

# --- Building the OpenMPI based image ---
$ singularity build openmpi_test.simg openmpi_test_centos7.def
# --- Building the Mpich based image ---
$ singularity build mpich_test.simg mpich_test.def

These will generate the Singularity image files openmpi_test.simg and mpich_test.simg respectively.

Executing MPI Applications with Singularity

On the FASRC cluster the standard way to execute MPI applications is through a batch-job submission script. Below are two examples, one using OpenMPI and another using MPICH.

OpenMPI

#!/bin/bash
#SBATCH -p test
#SBATCH -n 8
#SBATCH -J mpi_test
#SBATCH -o mpi_test.out
#SBATCH -e mpi_test.err
#SBATCH -t 30
#SBATCH --mem-per-cpu=1000

# --- Set up environment ---
export UCX_TLS=ib
export PMIX_MCA_gds=hash
export OMPI_MCA_btl_tcp_if_include=ib0
module load gcc/10.2.0-fasrc01 
module load openmpi/4.1.1-fasrc01

# --- Run the MPI application in the container ---
srun -n 8 --mpi=pmix singularity exec openmpi_test.simg /home/mpitest.x

Note: Please notice that the version of the OpenMPI implementation used on the host needs to match the one in the Singularity container. In this case this is version 4.1.1.
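
One way to confirm the versions match is to compare what is reported on the host and inside the container (a sketch, assuming the OpenMPI install prefix /opt/ompi from the definition file above):

# host OpenMPI version
module load gcc/10.2.0-fasrc01 openmpi/4.1.1-fasrc01
mpirun --version

# OpenMPI version inside the container
singularity exec openmpi_test.simg /opt/ompi/bin/mpirun --version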

If the above script is named run.sbatch.ompi, the MPI Singularity job is submitted as usual with:

sbatch run.sbatch.ompi

MPICH

#!/bin/bash
#SBATCH -p test
#SBATCH -n 8
#SBATCH -J mpi_test
#SBATCH -o mpi_test.out
#SBATCH -e mpi_test.err
#SBATCH -t 30
#SBATCH --mem-per-cpu=1000

# --- Set up environment ---
module load python/3.8.5-fasrc01
source activate python3_env1

# --- Run the MPI application in the container ---
srun -n 8 --mpi=pmi2 singularity exec mpich_test.simg /usr/bin/mpitest.x

If the above script is named run.sbatch.mpich, the MPI Singularity job is submitted as usual with:

$ sbatch run.sbatch.mpich

Note: Please notice that we don’t have Mpich installed as a software module on the FASRC cluster, and therefore this example assumes that Mpich is installed in your user or lab environment. The easiest way to do this is through a conda environment. You can find more information on how to set up conda environments in our computing environment here.

Provided you have set up and activated a conda environment named, e.g., python3_env1, Mpich version 3.1.4 can be installed with:

$ conda install mpich==3.1.4
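
If you have not yet set up such an environment, a minimal sketch of creating and activating it (the environment name python3_env1 and the Python module follow the MPICH batch script above) would be:

module load python/3.8.5-fasrc01
conda create -n python3_env1 python=3.8 -y
source activate python3_env1
conda install mpich==3.1.4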

Example Output

$ cat mpi_test.out
 Rank           0 out of           8
 Rank           1 out of           8
 Rank           2 out of           8
 Rank           3 out of           8
 Rank           4 out of           8
 Rank           5 out of           8
 Rank           6 out of           8
 Rank           7 out of           8
 End of program.

Compiling Code with OpenMPI inside Singularity Container

To compile inside the Singularity container, we need to request a compute node to run Singularity:

$ salloc -p test --time=0:30:00 --mem=1000 -n 1

Using the file compile_openmpi.sh, you can compile mpitest.f90 by executing bash compile_openmpi.sh inside the container openmpi_test.simg:

$ cat compile_openmpi.sh
#!/bin/bash

export PATH=$OMPI_DIR/bin:$PATH
export LD_LIBRARY_PATH=$OMPI_DIR/lib:$LD_LIBRARY_PATH

# compile fortran program
mpif90 -o mpitest.x mpitest.f90 -O2

# compile c program
mpicc -o mpitest.exe mpitest.c

$ singularity exec openmpi_test.simg bash compile_openmpi.sh

In compile_openmpi.sh, we also included the compilation command for a C program.

Hybrid (MPI+OpenMP) Codes on the FASRC cluster https://docs.rc.fas.harvard.edu/kb/hybrid-mpiopenmp-codes-on-odyssey/ Thu, 22 Oct 2015 20:24:56 +0000 https://rc.fas.harvard.edu/?page_id=14121  

Introduction

This page will help you compile and run hybrid (MPI+OpenMP) applications on the cluster. Currently we have both OpenMPI and Mvapich2 MPI libraries available, compiled with both the Intel and GNU compiler suites.

Example Code

Below are simple hybrid example codes in Fortran 90 and C++.
Fortran 90:


!=====================================================
! Program: hybrid_test.f90 (MPI + OpenMP)
!          FORTRAN 90 example - program prints out
!          rank of each MPI process and OMP thread ID
!=====================================================
program hybrid_test
  implicit none
  include "mpif.h"
  integer(4) :: ierr
  integer(4) :: iproc
  integer(4) :: nproc
  integer(4) :: icomm
  integer(4) :: i
  integer(4) :: j
  integer(4) :: nthreads
  integer(4) :: tid
  integer(4) :: omp_get_num_threads
  integer(4) :: omp_get_thread_num
  call MPI_INIT(ierr)
  icomm = MPI_COMM_WORLD
  call MPI_COMM_SIZE(icomm,nproc,ierr)
  call MPI_COMM_RANK(icomm,iproc,ierr)
!$omp parallel private( tid )
  tid = omp_get_thread_num()
  nthreads = omp_get_num_threads()
  do i = 0, nproc-1
     call MPI_BARRIER(icomm,ierr)
     do j = 0, nthreads-1
        !$omp barrier
        if ( iproc == i .and. tid == j ) then
           write (6,*) "MPI rank:", iproc, " with thread ID:", tid
        end if
     end do
  end do
!$omp end parallel
  call MPI_FINALIZE(ierr)
  stop
end program hybrid_test

C++:


//==========================================================
//  Program: hybrid_test.cpp (MPI + OpenMP)
//           C++ example - program prints out
//           rank of each MPI process and OMP thread ID
//==========================================================
#include <iostream>
#include <mpi.h>
#include <omp.h>
using namespace std;
int main(int argc, char** argv){
  int iproc;
  int nproc;
  int i;
  int j;
  int nthreads;
  int tid;
  int provided;
  MPI_Init_thread(&argc,&argv, MPI_THREAD_MULTIPLE, &provided);
  MPI_Comm_rank(MPI_COMM_WORLD,&iproc);
  MPI_Comm_size(MPI_COMM_WORLD,&nproc);
#pragma omp parallel private( tid )
  {
    tid = omp_get_thread_num();
    nthreads = omp_get_num_threads();
    for ( i = 0; i <= nproc - 1; i++ ){
      MPI_Barrier(MPI_COMM_WORLD);
      for ( j = 0; j <= nthreads - 1; j++ ){
        if ( (i == iproc) && (j == tid) ){
          cout << "MPI rank: " << iproc << " with thread ID: " << tid << endl;
        }
      }
    }
  }
  MPI_Finalize();
  return 0;
}

Compiling the program

MPI Intel, Fortran 90: [username@rclogin02 ~]$ mpif90 -o hybrid_test.x hybrid_test.f90 -openmp
MPI Intel, C++:        [username@rclogin02 ~]$ mpicxx -o hybrid_test.x hybrid_test.cpp -openmp
MPI GNU, Fortran 90:   [username@rclogin02 ~]$ mpif90 -o hybrid_test.x hybrid_test.f90 -fopenmp
MPI GNU, C++:          [username@rclogin02 ~]$ mpicxx -o hybrid_test.x hybrid_test.cpp -fopenmp

Running the program

You could use the following SLURM batch-job submission script to submit the job to the queue:

#!/bin/bash
#SBATCH -J hybrid_test
#SBATCH -o hybrid_test.out
#SBATCH -e hybrid_test.err
#SBATCH -p shared
#SBATCH -n 2
#SBATCH -c 4
#SBATCH -t 180
#SBATCH --mem-per-cpu=4G
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun -n $SLURM_NTASKS -c $SLURM_CPUS_PER_TASK --mpi=pmix ./hybrid_test.x

The OMP_NUM_THREADS environment variable sets the number of OpenMP threads per MPI process. Please notice that this job will use 2 MPI processes (set with the -n option) and 4 OpenMP threads per MPI process (set with the -c option), so overall the job will reserve and use 8 compute cores. If you name the above script omp_test.batch, for instance, the job is submitted to the queue with

sbatch omp_test.batch

Upon job completion, job output will be located in the file hybrid_test.out with the contents:

 MPI rank:           0  with thread ID:           0
 MPI rank:           0  with thread ID:           1
 MPI rank:           0  with thread ID:           2
 MPI rank:           0  with thread ID:           3
 MPI rank:           1  with thread ID:           0
 MPI rank:           1  with thread ID:           1
 MPI rank:           1  with thread ID:           2
 MPI rank:           1  with thread ID:           3