Parallel Computing – FASRC DOCS https://docs.rc.fas.harvard.edu Mon, 01 Dec 2025 15:35:35 +0000 en-US hourly 1 https://wordpress.org/?v=6.9 https://docs.rc.fas.harvard.edu/wp-content/uploads/2018/08/fasrc_64x64.png Parallel Computing – FASRC DOCS https://docs.rc.fas.harvard.edu 32 32 172380571 Parallel Computing https://docs.rc.fas.harvard.edu/kb/parallel-computing/ Thu, 09 Oct 2025 18:19:20 +0000 https://docs.rc.fas.harvard.edu/?post_type=epkb_post_type_1&p=28920 Description

Parallel computing is a computational approach that divides large problems into smaller tasks that can be executed simultaneously across multiple processing units. On the FASRC cluster, parallel computing enables researchers to solve complex computational problems that would be impractical or impossible on single-core systems by leveraging distributed memory architectures, shared memory systems, and specialized high-performance I/O libraries.

The FASRC cluster supports multiple parallel computing paradigms including:

  • Distributed Memory Computing: Message Passing Interface (MPI) for multi-node parallel processing
  • Shared Memory Computing: OpenMP for single-node multi-core parallelization and embarrassingly parallel job arrays
  • High-Performance I/O: Parallel HDF5, NetCDF, and ScaLAPACK for efficient data handling in large-scale computations

Usage

Parallel computing tools on the FASRC cluster are available through the module system. Load the appropriate modules based on your parallel computing paradigm.  Search for available modules with module spider.

Common Troubleshooting

  • Module Conflicts: Use module purge before loading new module combinations
  • MPI Version Mismatch: Ensure compilation and runtime use identical MPI implementations
  • Memory Issues: Monitor memory usage with sacct -j JOBID --format=MaxRSS
  • Performance Bottlenecks: Use profiling tools like Intel VTune or GNU gprof

Training

Examples

For an assortment of helpful examples used in our training videos, check out our User Codes/Parallel_Computing.

References

]]>
28920
R Parallel https://docs.rc.fas.harvard.edu/kb/r-parallel/ Tue, 19 Aug 2025 20:59:30 +0000 https://docs.rc.fas.harvard.edu/?post_type=epkb_post_type_1&p=28987 Description

Here, we briefly explain different ways to use R in parallel on the FASRC Cannon cluster.  The best place for information on R Parallel is our training session:

Parallel computing may be necessary to speed up a code or to deal with large datasets. It can divide the workload into chunks and each worker (i.e. core) will take one chunk. The goal of using parallel computing is to reduce the total computational time by having each worker process its workload in parallel with other workers.

Usage

Request an interactive node

salloc -p test --time=0:30:00 --mem=4000

Load required software modules.

# Compiler, MPI, and R libraries
# Use `module spider NAME` to find the correct version 
module load gcc/x.x.x openmpi/x.x.x R/x.x.x

Examples

User Codes has a summary of R parallel packages that can be used on Cannon. You can find a complete list of available packages at CRAN.

Processing large datasets

Single-node, multi-core (shared memory)

Multi-node, distributed memory

Hybrid: Multi-node + shared-memory

Using nested futures and package future.batchtools, we can perform a multi-node and multi-core job.

Resources

]]>
28987
MPI IO https://docs.rc.fas.harvard.edu/kb/mpi-io/ Wed, 26 Mar 2025 16:14:15 +0000 https://docs.rc.fas.harvard.edu/?post_type=epkb_post_type_1&p=28570 Introduction

MPI IO is a component of the Message Passing Interface standard that enables parallel input/output operations in distributed computing applications. It allows multiple processes to collectively access shared files, optimizing disk operations through features like collective I/O and file views that reorganize data access patterns. This capability is crucial for high-performance computing applications working with large datasets across multiple nodes, as it extends parallel processing efficiency to the I/O subsystem.

Installation

Utilizing MPI IO in your program depends on your programming language. Check your programming language’s documentation for details.

Using MPI I/O

Your programming language of choice should have an implementation for these basic functions from the complete list of functions. Each of the functions has a C and Fortran usage example in the docs, as well as some parameter information and common errors.

Best Practices

On Cannon <insert storage best practices for high IO jobs >.

Slurm < insert best practices for CPU and GPU >.

Examples

For associated MPI IO examples, head over to  User_Codes/Parallel_Computing/MPI_IO.

Resources

]]>
28570
MPI (Message Passing Interface) & OpenMPI https://docs.rc.fas.harvard.edu/kb/mpi-message-passing-interface/ Fri, 31 Jan 2025 18:49:19 +0000 https://docs.rc.fas.harvard.edu/?post_type=epkb_post_type_1&p=28106

Introduction

The Message Passing Interface (MPI) library allows processes in your parallel application to communicate with one another by sending and receiving messages. There is no default MPI library in your environment when you log in to the cluster. You need to choose the desired MPI implementation for your applications. This is done by loading an appropriate MPI module. Currently the available MPI implementations on our cluster are OpenMPI and Mpich. For both implementations the MPI libraries are compiled and built with either the Intel compiler suite or the GNU compiler suite. These are organized in software modules.

Installation

MPI has many forms, we’ll list a few here, and also look at User_Codes/Parallel_Computing/MPI.

mpi4py with Python

We recommend the Mambaforge Python distribution. The latest version is available with the python/3.10.9-fasrc01 software module. Since mpi4py is not available with the default module you need to install it in your user environment.

The most straightforward way to install mpi4py in your user space is to create a new conda environment with the mpi4py package. For instance, you can do something like the below:

module load python/3.10.12-fasrc01 
mamba create -n python3_env1 python numpy pip wheel mpi4py
source activate python3_env1

This will create a conda environment named python3_env1 with the mpi4py package and activate it. It will also install a MPI library required by mpi4py. By default, the above commands will install MPICH.

For most of the cases the above installation procedure should work well. However, if your workflow requires a specific flavor and/or version of MPI, you could use pip to install mpi4py in your custom conda environment as detailed below:

Load compiler and MPI software modules:

module load gcc/12.2.0-fasrc01
module load openmpi/4.1.5-fasrc03

This will load OpenMPI in your user environment. You can also look at our user documentation to learn more about software modules on the FAS cluster.

Load a Python module:

module load python/3.10.12-fasrc01

Create a conda environment:

mamba create -n python3_env2 python numpy pip wheel

Activate the new environment:

source activate python3_env2

Install mpi4py with pip:

pip install mpi4py

For example code, see Parallel_Computing/Python/mpi4py

OpenMPI with GNU Compiler

If you want to use OpenMPI compiled with the GNU compiler you need to load appropriate compiler and MPI modules. Below are some possible combinations, check module spider MODULENAME to get a full listing of possibilities.

# GCC + OpenMPI, e.g.,
module load gcc/13.2.0-fasrc01 openmpi/5.0.2-fasrc01

# GCC + Mpich, e.g.,
module load gcc/13.2.0-fasrc01 mpich/4.2.0-fasrc01

# Intel + OpenMPI, e.g.,
module load intel/24.0.1-fasrc01 openmpi/5.0.2-fasrc01

# Intel + Mpich, e.g.,
module load intel/24.0.1-fasrc01 mpich/4.2.0-fasrc01

# Intel + IntelMPI (IntelMPI runs mpich underneath), e.g.
module load intel/24.0.1-fasrc01 intelmpi/2021.11-fasrc01

For reproducibility and consistency it is recommended to use the complete module name with the module load command, as illustrated above. Modules on the cluster get updated often so check if there are more recent ones. The modules are set up so that you can only have one MPI module loaded at a time. If you try loading a second one it will automatically unload the first. This is done to avoid dependencies collisions.

There are four ways you can set up your MPI on the cluster:

  • Put the module load command in your startup files.
    Most users will find this option most convenient. You will likely only want to use a single version of MPI for all your work. This method also works with all MPI modules currently available on the cluster.

  • Load the module in your current shell.
    For the current MPI versions you do not need to have the module load command in your startup files. If you submit a job the remote processes will inherit the submission shell environment and use the proper MPI library. Note this method does not work with older versions of MPI.

  • Load the module in your job script.
    If you will be using different versions of MPI for different jobs, then you can put the module load command in your script. You need to ensure your script can execute the module load command properly.

  • Do not use modules and set environment variables yourself.
    You obviously do not need to use modules but can hard code paths. However, these locations may change without warning so you should set them in one location only and not scatter them throughout your scripts. This option could be useful if you have a customized local build of MPI you would like to use with your applications.

Parallel HDF5

Parallel HDF5 (PHDF5) is the parallel version of the HDF5 library. It utilizes MPI to perform parallel HDF5 operations. For example, when an HDF5 file is opened with an MPI communicator, all the processes within the communicator can perform various operations on the file. PHDF5 supports file operations such as file create, open and close, as well as dataset operations such as object creation, modification and querying, all in parallel using MPI-IO. User Codes has examples are intended to illustrate the use of PHDF5 on the Cannon cluster. The specific examples are implemented in Fortran, but the could be easily translated to C or C++.

Video Training

Examples

For associated MPI examples, head over to  User_Codes/Parallel_Computing/MPI.

  • Example 1: Monte-Carlo calculation of π
  • Example 2: Integration of x2 in interval [0, 4] with 80 integration points and the trapezoidal rule
  • Example 3: Parallel Lanczos diagonalization with reorthogonalization and MPI I/O

Resources

]]>
28106
Job Efficiency and Optimization Best Practices https://docs.rc.fas.harvard.edu/kb/job-efficiency-and-optimization-best-practices/ Mon, 24 Jun 2024 18:53:02 +0000 https://docs.rc.fas.harvard.edu/?post_type=epkb_post_type_1&p=27060 Overview

The art of High Performance Computing is really the art of getting the most out of the computational resources you have access to. This applies to working on a laptop, to working in the cloud, or working on a supercomputer. While this diversity of different systems and environments may seem intimidating, in reality there are some good general rules and best practices that you can use to get the most out of your code and the computer you are on.

As defined in our Glossary, the term job has a broad and a narrow sense. In the broad sense, a job is an individual run of an application, code, or script; and may be used interchangeable with those terms. This includes whether you run it from the command line, cronjob, or use a scheduler. In the narrow sense, a job is an individual allocation for a user by the scheduler. It is usually obvious from the context which is meant.

By Job Efficiency, we mean that the parameters of the job in terms of cpu, gpus, memory, network, time, etc. (refer to Glossary for definitions) are accurately defined and match what the job actually uses. As an example, a job that asks for 100 cores but only uses 1 is not efficient. A job that asks for 100GB and uses 99GB is efficient.  Efficiency is a measure of how well the user has scoped their job so that it can run in the space defined.

Finally Job Optimization means to make the job run at the maximum speed possible with the least amount of resources used. For example, a poorly optimized code may only use 50% of the gpu it was allocated, whereas a well optimized one could use 100% and see acceleration commensurate with that improved usage. Similarly, a poorly optimized code may use 1TB of memory, but a well optimized code may only use 100GB. Optimization is a measure of how well structured a code is numerically, both in terms of algorithm and implementation, so that it can get to the solution in the fastest, most accurate, and most economical way.

Efficiency and Optimization are thus two sides of the same coin. Efficiency is about accurately defining the resources that you will use and optimization is about reducing that usage. Both have the goal of getting the most out of the resources the job is using.

Architecture

Schematic showing how cores, memory, and nodes are arranged on the cluster.
Schematic showing how cores, memory, and nodes are arranged on the cluster.

Before we get into talking about Job Efficiency and Optimization, we should first discuss general cluster architecture. Supercomputers (aka clusters) are essentially a bunch of computers of similar type networked together by a high speed interconnect so that they can act in unison with each other and have a common computational environment. The fundamental building block of a cluster is a node. Each node is composed of a bunch of cores that all talk to the same block of memory. GPU nodes have in addition to cores and memory, GPU’s which can be used for specialized workflows such as machine learning. The nodes are then strung together with a network, typically Infiniband, and then a scheduler is put in front to handle what jobs get what resources.

CPU/GPU Type

Typically a cluster will be made up of a uniform set of hardware. However that is not necessarily the case. At FASRC we run a variety of hardware, spanning multiple generations and vendors. These different types of CPU and GPU have various performance characteristics and features that may have impacts on your job. We will talk about this later but being cognizant of what hardware your job works best on will be important for efficient and optimal use of the cluster.  At FASRC we split up our partitions such that each partition has a uniform set of hardware, unless otherwise noted (e.g. gpu_requeue and serial_requeue). A comprehensive list of available hardware can be found in the Job Constraints section of the Running Jobs page. You can learn more about a specific node’s hardware by running scontrol show node NODENAME.

Network Topology

How the nodes are interconnected on the Infiniband is known as the topology.  Topology becomes very important as you run larger and larger jobs, more on that later. At FASRC we generally follow a Fat Tree Infiniband topology, with a hierarchy of switching and adjacent nodes being close to each other network-wise.

We name our nodes after their location in the datacenter, which makes it easy to figure out which nodes are next to each other both in terms of space and in terms of network. Our naming convention for nodes is Datacenter/Node has a GPU or Not/Row/Pod/Rack/Chassis/Blade. So for example the node holy7c04301 is at our Holyoke Data Center in Row 7 Pod C Rack 04 Chassis 3 Blade 01, the adjacent nodes would be holy7c04212 (which is the last blade on the chassis below Chassis 3 as there are 12 blades per chassis for this hardware) and holy7c04302. Another is example is holygpu8a11404, this node is a GPU node in our Holyoke Data Center in Row 8 Pod A Rack 11 Chassis 4 Blade 4, the adjacent nodes would be holygpu8a11403 and holygpu8a11501 (which is the next blade on the chassis above Chassis 4 as there are only 4 blades per chassis for this hardware). You can see the full topology of the cluster by doing scontrol show topology.

Job Efficiency

The first step in improving job efficiency is understanding your job as it exists today. Understanding means you have a good handle on the resource needs and characteristics of your job, and thus you are able to accurately allocate resources to it thereby improving efficiency. As a general rule, one should always understand the jobs you run regardless of size. This knowledge is both beneficial for right sizing your requested resources, but also for noticing any pitfalls that may occur when scaling the job up.

There are two ways of learning about your job. The first is to have a fundamental understanding of the job you are running. Based on your knowledge of the algorithm, code, job script, and cluster architecture, you know what you should request for core count, gpu count, memory, time, and storage. Knowing your code at this level will allow you to make the most accurate estimates for what you will use.

While a full understanding of your job is ideal, often it is not possible. You may not control the code base, you may just be getting started, or you may not have time to obtain a deep understanding of the job. Even in cases where you have a good theoretical understanding of your job, you need to confirm that knowledge with hard data. In which case the second method is to test your job empirically and find out what the best job parameters are. Simply take an example that you know will be akin to what you will run in production and run it as a test job. Then once the test job is done, check to see how it performed. You then repeat, changing the job parameters until you have a good understanding of how your job performs in different situations.

That’s the rough sketch, but the details are a bit different depending on what you want to understand. Below are some methods for finding out how much memory, cores, gpus, time, and storage your job will need. These may not cover every job but should work for most situations.

Memory

Memory on the cluster is doled out in two different ways, either by node (--mem) or by core (--mem-per-cpu). If your job exceeds its memory request, you will see an error either containing Out of Memory or oom. This indicates that the scheduler terminated your job for exceeding your memory request. You will need to increase your memory allocation to give your job more space.

Here is a test plan for figuring out how much memory you should request:

  1. Come up with an initial guess as to how much memory your job will require. A good first guess is usually 2-3 times the size of any data files you are reading in, or 2-3 times the size of the data you will be generating. If you do not know either of those then a safe initial guess is 4GB. Most of the cluster has 4GB per core, so its a good initial guess that will allow you to get through the scheduler in a quick manner.
  2. Run a test job on the test partition with your guess.
  3. Check the result of your run using jobstats or sacct.
  4. If your job ran out of memory then double the amount and return to step 2.  If it ran properly (i.e, no out of memory error), then look at how much your job actually used and update your request to match with an additional 10% buffer as the scheduler’s sampling of the memory usage runs every 30 seconds and it may have missed any short term memory spikes.

Every time you change a parameter in your code, you should check to see how the memory changes. Some parameters will not change the memory usage at all. Others will change it dramatically.  If you do not know if a parameter will change the memory usage run a test to see how it behaves.

If you are working to scale up a job its good to understand how your memory usage will scale as the job increases. For example, say you are running a three dimensional code and you increase the resolution of the box you are simulating by 2. That means that your memory usage will grow by a factor of 8 because each dimension grew by a factor of 2. Likewise if you are running a simulation that ingests data, it will likely scale linearly with the amount of data you ingest. Testing by increments is the best way to validate how your memory usage will grow depending on situation.

One important warning is to make sure to use the correct memory for each type of job. When your job runs, the scheduler blocks off a segment of memory for you to use, regardless of if you actually use it. If your job asks for 100GB but only uses 1GB, the scheduler will give you 100GB and your fairshare will also be charged for that 100GB. In addition, if you had asked for 1GB your job may have been better able to fit into the gaps in the scheduler, as 1GB of memory is easier to find than 100GB. Efficient use of the cluster means selecting the right amount of memory for whatever job you are running at that time. A quick way to spot if you have any jobs with incorrect memory settings is to use the seff-account command which will plot a histogram of your job memory efficiency over a specified period.

Cores

Slurm does not automatically parallelize jobs. Even if you ask for 1000’s of cores, if your job is not set up to run in parallel, your job will just run on a single core and the other cores will remain idle. Thus when in doubt about your code, err on the side of asking for a single core and then check the code’s documentation or contact the primary author to find out whether it is parallel and what method it uses.

Broadly parallel applications fall into two categories: thread based and rank based. Thread based parallelism relies on a shared memory space and thus is constrained to a single node. This includes things like OpenMP, pthreads, and python multi-processessing. Rank based parallelism relies on individual processes that have their own dedicated memory space which communicate with each other to share information. The main example of this is MPI (Message Passing Interface). It is important to understand which method your job uses as that will make a difference how you ask for resources in Slurm and how many cores you can reasonably ask for.

Once you figure out if your code is thread based or rank based, you can then do a scaling test to figure out how your code behaves as you add more cores.  There are two types of scaling tests you can do, both test slightly different parts of your code. The first type is called strong scaling. In this test, you keep the size of the problem the same while increasing the number of cores you use. In an ideal world your job should go twice as a fast every time you double the amount of cores you use. Most codes though do not have ideal scaling. Instead various inefficiencies in the algorithm or the size of the job itself mean that there is a point of diminishing returns where adding more cores does not improve speed.  Typically when you plot a chart of strong scaling you will see:

Chart showing strong scaling with ideal line in red and experimental points in black. The black line bends away from the red showing the point where scaling becomes inefficient.
Strong Scaling Plot. Plot is Log-Log with the ideal scaling line in red and the experimental data in black.

In this example, the user would not want to run their code with more than 256 cores because after that point adding more cores has diminishing returns with respect to improving performance.

The second type is called weak scaling. In this test you increase the size of the job proportional to the number of cores asked for. So if you double the cores, you would double the job size. Job size in this case is the amount of total computational work your job does. For instance you might double the amount of data ingested (assuming your code computational needs increase linearly as you increase the data ingested) or double one of the dimensions of multidimensional grid.  In an ideal world, your job should take the same amount of time to run if the job size grows linearly with the core count.  Most codes though do not have this ideal scaling. Instead various communications inefficiencies or nonlinear growth in processing time can impact the performance of the job and thus adding more cores would be inefficient beyond a certain point. A typical plot for weak scaling looks like:

 

Chart showing weak scaling with ideal line in red and experimental points in black. The black line bends away from the red showing the point where scaling becomes inefficient.
Weak Scaling Plot. Plot is Log-Linear with the ideal scaling line in red and the experimental data in black.

In this example, the user would not want to run this job with more than about 1000 cores as, after that point, the run time grows substantially from the ideal.

Besides these more robust scaling tests, you can get a quick view of your job core usage efficiency by using the jobstats or seff-account command. Those commands will take the ratio of two numbers. The first number is how much time you actually used of the cores for the job (t_cpu), this is known as the system CPU time. Note that for historical reasons CPU and Core are used interchangably with respect to CPU time. Regardless of the name what is meant is the amount of time that the system detects as being spent computing on a specific set of cores.  The second number is your elapsed run time multiplied by the number of cores (t_elapsed*n). If your job scales perfectly, your CPU efficiency (t_cpu/(t_elapsed*n)) will be 100%. If it is less than 100%, then the ratio of that will be roughly the number of cores you should reduce your job by. So say your job uses 8 cores, but you have an efficiency of 50% in jobstats, then you should reduce your ask to 4 cores instead. This is also a good way to check quickly if your job is scaling at all as if you see your job only using one core you know that either your job is not parallel or alternatively something is wrong and you need to investigate why your job is not scaling.

Time to Science (TtS)

With these two tests you can figure out the maximum number of cores you should ask for. That said, even if your core scales perfectly you will probably not want to ask for the maximum number of cores you can.  The reason for this is that the more cores you ask for the longer your job will pend in the queue, waiting for resources to become available. Time to Science (TtS) is the sum of the amount of time your job pends for plus the amount of time your job runs for. You want to minimize both. Counterintuitively, it may be the case that asking for less cores will mean your job will pend for substantially shorter, enough to make up for the loss in the run’s speed.

As an illustration, say your code scales perfectly and your job of 256 cores will take 1 day to run.  However it turns out that you will be spending 2 days pending in the queue waiting for your job to launch, thus your total TtS is 3 days. After more investigation you find out that if you ask for 128 cores, your job will take 2 days to run but the scheduler will be able to launch it in 4 hours leading to TtS of 2.25 days. You can see that the 128 core job was “faster” than the 256 core job, simply due to the fact that the 128 core job fit better in the scheduler at that moment.

It should be noted that the scheduler state is fluid and thus one should inspect the queue before submitting. You can test when your job is scheduled to run by adding the --test-only flag to your sbatch line, that will cause the scheduler to print back when it thinks the job will execute. This is a good way of right sizing your job.

Topology

For certain codes, layout on the node (i.e. which cores on which CPU) and cluster (i.e. where the nodes are located relative to each other) matters. In these cases the topology of the run is critical to getting the peak speed out of the job. Without deep knowledge of the code base, it’s hard to know if your code is one of these codes, and in most cases your code is not.

In cases where the topology of the run matters, Slurm provides a number of options to require the scheduler to give you a certain layout for the job. Both the sbatch and srun commands have options to this effect. Note that the more constraints you add to a job the longer it will take the scheduler to find resources for your job to use. One should set the minimum necessary restrictions on a job to give the scheduler maximum flexibility. As before, it may be the case that you may see a significant speed up if given the right topology but if it comes at the cost of having to wait significantly longer to run, your TtS may actually not improve or even get worse.

GPUs

Many of the same rules that applied to cores also apply to GPU’s. For most codes, your job will use a single GPU. If your code uses multiple GPU’s then you can follow the same process as above for cores to see how your code scales.  Aside from jobstats you can also use other tools like DCGM and nvtop to get statistics on how your job is doing.

Time

It should be stated upfront that Slurm does not charge you fairshare for time you do not use. If you ask for 3 days and only use 2 hours, the scheduler will only charge you for the 2 hours you actually used. This is different than Memory, Cores, and GPU’s where you will be charged for allocating those resources whether you use them or not as the scheduler had to block them off for you to use and could not give them to anyone else.

To accurately estimate time is important not for the sake of fairshare, but rather for the sake of scheduling. The scheduler only knows what you tell it; if you tell it that a job takes 3 days, it will assume it takes 3 days even if it really takes 2 hours.  Thus when the scheduler goes to consider the job for scheduling, it will look for an allocation block the size of the length of job you request. A more accurate time estimate means that the scheduler can fit your job into tighter spots in the giant game of Tetris it is playing.  Taking our previous example, it may be that there are no spots right now for a 3 day job, but a 2 hour job may run immediately because there is a gap that the scheduler can fit it into while waiting to schedule a large high priority job. This behavior is called backfill, and is one of the two loops the scheduler engages in when scheduling. Leveraging the backfill loop is important as it is the main method through which low priority jobs, even those with zero fairshare, get scheduled. You can leap frog ahead of higher priority jobs because your job happens to fill a gap.

Assuming you are running on the same hardware (for considerations regarding different types of hardware see the next section) then you can reliably predict the runtime for certain classes of jobs. Simply run a test job and then look at how long it took using sacct or jobstats. If you run a bunch of jobs you can use seff-account to see the distribution of run times. Once you have the runtime, round it up the nearest hour and that should cover most situations. Run times can vary for various reasons but typically not more than 10%, so if your job takes 10 hours, you should ask for around 12 hours.

Finally a word about minimum run times. As described above your goal is to minimize Time to Science (TtS). You may naively think that asking for very short amounts of time would decrease TtS even more, but this is incorrect. The scheduler takes time to actually schedule jobs no matter how small your job is. To put it bluntly, you do not want the scheduler doing more work to schedule your job than your actual job is doing. For super short jobs the scheduler can get into a thrashing state where it schedules a job, the job exits immediately, and then the scheduler has to fill that slot again, similar to trying to fill a tub with the drain open. To prevent this, we require jobs to run for at least 10 minutes. Ideally jobs would last for an hour or longer. Thus when you are doing work on the cluster try to make sure you batch in increments of longer than 10 minutes and ideally longer than an hour. This will help the scheduler, and make sure your TtS is as short as possible.

Hardware

For similar job types, the run time is usually the same, with an important caveat being that you need to run on the same hardware. Different types of cores and GPU’s have different capabilities and speeds.  It is important to know how your job behaves as you switch between them. We have a table of relative speeds on the Fairshare page. It should be noted that that table only applies if your code is fully utilizing the hardware in question (more on that in the optimization section), you should always test your code to see how it actually performs as certain CPU and GPU types may work better for your code than others despite what the officially advertised benchmarks say.  While we generally validate vendor advertised performance numbers, they only apply to heavily optimized codes designed for those specific chips, as such your code speed may vary substantially.

If your are submitting to gpu_requeue or serial_requeue you will notice that your run times will vary quite a bit. This is because gpu_requeue and serial_requeue are mosaic partitions with a wide variety of hardware and thus a wide variety of performance. In cases like that you can either be very specific about which type of hardware you want using the --constraint option, or you can simply increase your time estimate to be the maximum you expect it to take on the slowest hardware available. A good rule of thumb is a factor of three variance in speed. So if your job takes 3 hours on most hardware, give it 9 hours on serial_requeue as you may end up on a substantially slower host.

Storage

The final thing that can impact Job Efficiency is the storage you use. Nothing can drag down a fast code faster than slow IO speed (Input/Output). To select the right storage, please read our Data Storage Workflow page. In general, for jobs you will want to use either Global Scratch or Local Scratch. If your job is IO heavy (i.e. it is constantly talking to the storage), Local Scratch is strongly preferred. Please also see the Data Storage Workflow page for how to best lay out your file structure, as file structure layout can impact job performance as well.

Job Optimization

Now that we have dealt with Job Efficiency the next thing to look at is Job Optimization, after all the only way to improve your Time to Science (TtS) and increase your code capability after properly structuring your job is to improve the code itself. Job optimization can be very beneficial but can also take significant time. There are in general three methods to optimize your code, each taking different amounts of time.

  1. Compiler Version, Library Version, Containers and Optimization: Compilers, Libraries, and Containers are code as well and thus subject to improvement. Simply changing or updating your compiler, libraries, and container can sometimes lead to dramatic increases in performance. In addition compilers have different optimization flags you can use that will automatically optimize your code. This option is the fastest way to get optimized code as all the work is already done for you, you just need to select the right compiler, libraries, container, and options.
  2. Partial Code Rewrite: Looking through the code as it exists now and reworking portions can create speed ups. The process of reworking your code consists of finding places where the code is spending significant time and then refactoring the code; either by updating the logic, replacing the numerical method, or by substituting an optimized library. This process can take a few weeks to months but can give substantial increases in speed. However, this method cannot fix the basic structural problems with the code.
  3. Full Code Rewrite: This can could take a significant amount of time depending on the complexity of the code (for large codes this can take up to six months to a year to complete) but is the best way to optimize your code. It will allow you to fundamentally understand how your code operates and fix any major structural problems resulting in transformative increases in performance. If you go this route you should try the other two options first; as when you do the first two steps you will have a good understanding of the quirks of your code. You should also do a cost benefit analysis to figure out if the time spent is worth the potential gains. Make sure the project has a firm end goal in mind. If your code needs continual improvement, it may be time to hire a Research Software Engineer to do that very important and necessary code development work.

Regardless of the method, you will need to grow more acquainted with your code, its numerical methods, and how it interacts with the underlying hardware. While there are some generalized rules and things to look for when optimizing code, in the end it will depend on you turning your code from a blackbox into something you understand at a fundamental level. This is also where learning how to use various debuggers and code inspectors can be very beneficial as they can help identify which portions of the code to focus on.

Important final note: Always reconfirm the results of your code whenever you change your optimization. This goes for any changes to your code, but especially when you recompile with different compilers, optimization levels, libraries, etc. You should have a standard battery of tests you know the results of that you can run to reconfirm that the results did not change, or if they did they are acceptable changes. Optimization can change the numerical methods and order of operations leading to numerical drift. Sometimes that drift is fine as its at the edge of the mantissa. Sometimes though those changes at the edge of the mantissa can build and lead to substantial changes in results. Even if your code is confirmed as working, always have a healthy suspicion of your code results and engage in independent verification as code bugs and faults can produce results that look legitimate but were arrived at by faulty logic or code.

Below we are going to give some general rules regarding optimization as well as suggestions as to different ways to go about it.

Compiler Optimization

Compiler optimization means letting the compiler look through the code for things it can improve automatically. Maybe it will change the memory layout to make it more optimal, maybe it will notice that you are doing a certain numerical technique and then substitute in a better one, maybe it will change the order of operations to improve numerical speed. Regardless of what it tries, compiler optimization relies on the authors of the compiler and their deep knowledge of numerical methods and the underlying hardware to get improved speed. Compiler optimization really applies to those compiling from C, C++, or Fortran but higher level codes like Python and R, which lean on libraries that are written in C, C++, and Fortran can also benefit. Thus if you want to really optimize your Python or R code, getting those underlying libraries built in an optimized way can lead to speed ups.

There is a generally agreed upon standard for most compilers with regard to level of optimization. After all not all optimizations are numerically safe, or will produce gains in speed. Some may in fact slow things down. As such when using compiler optimization test your code speed and accuracy at different levels of optimization, and with different compilers. Each compiler has a different implementation of the standard, some are better for certain things than others. It is also worth reading the documentation for the compiler optimization levels to see what is included. A good exercise is to take each individual flag that makes up an optimization level and test to see if it speeds up your code and if it introduces numerical issues.

The standard code optimization (-O) levels are:

-O0: Not optimized at all.  The compiler just runs your code as is with zero work done. If you turn on debugging typically this is what your code will default to.

-O1: Numerically safe optimization. This level of optimization is guaranteed to be numerical stable and safe. No corners are cut, no compromises in numerical precision are made, nothing is reordered.

-O2: Mostly numerically safe optimization. This level of optimization is the default level for most compilers. At this level, in most cases, the optimizations made are numerically okay. Generally there is no sacrifice of numerical precision, though loops may be unrolled and reordered to make things more efficient.

-O3: Heavily optimized. This level of optimization takes the approach of trying to include every possible optimization whether numerically safe or not.

As you can see the various levels of optimization make certain assumptions about how numerically safe it is trying to be. Given this, you should always test your code to make sure that it runs as it should after compilation and does not produce errant results.

One other common optimization is to leverage special features found with different chipsets. Each generation of CPU has different features built into it that you can leverage. Some example features are SSE (Streaming SIMD Extensions), FMA (Fused Multiply Add), and AVX (Advance Vector Extensions). If your code is architected to use them, you can gain substantial speed by enabling these optimizations. There are three ways to do this:

  1. Turn on each feature individually: This allows you to pick and choose which you want and makes your compiled code portable across different chipsets.
  2. Specify chipset you are building for: Compilers include flags that allow you to target a specific type of chip and include all the relevant optimizations for it.  This approach works well if you have a uniform set of hardware you are running on, or if you are not sure what features your code will leverage. Note your code will not work on other chipsets.
  3. Have the compiler autodetect what chipset you are using: Compilers usually have a flag (i.e. -xHost) that will detect the chipset you are currently on and then build specifically for that. To do this properly you will need to make sure you are on the node that is of the same type that you will run your code on. In addition your code will not be portable.

It is worth noting again that not all optimizations are safe or beneficial. Heavily optimized code can lead to substantial bloat in memory usage with little material gain. Numerical issues may occur if the compiler makes bad assumptions about your code. You should only use up to the level of optimization that is stable and beneficial and no higher. If an optimization has no impact on your code performance, it is best to leave it off.

Important final note: Always remove debugging flags and options when running in production. Debugging flags will disable optimization even you tell the compiler to optimize, as the debugging flag overrides the optimization flag. Before going to production, remove debugging flags, recompile, and test your code for accuracy and performance.

Languages, Numerical Method, and Libraries

Selecting the correct language, numerical method, and libraries are important parts of code optimizations. You always want to select the right tool for the job. For some situations Python is good enough, for others you really need Fortran. An improved numerical method may give enormous speed ups but at the cost of increased memory, or vice versa. Swapping out code you wrote for a library maintained by a domain expert may be faster, or having a more integrated code may end up being quicker.

With languages, you are usually locked into a specific one unless you do a complete code rewrite. As such, you should learn the quirks of the language you are using and make sure your code conforms to the best standards and practices for that language. If you are looking to rewrite your code, then consider changing which language you are using. It may be that a different language may lead to more speed ups in the future. As a general rule, languages that are closer to the hardware (things like C or Fortran) can be made to go faster, but they also are trickier to use.

For numerical methods you will want to stay abreast of the current literature in your field and the field that generated the relevant numerical technique (e.g. matrix multiplication, sorting). Even small changes to a numerical technique can add up to large gains in speed. They can also dramatically impact memory utilization. Simplicity is also important, as in many cases a simpler method is faster just by dint of having to do less math and logic. This is not true in all cases though, so be sure to test and verify.

Libraries are another important tool in the toolbox. By using a library you leverage someone else’s time and experience to write optimized code. This saves you from having to debug the code and optimize it, you simply plug in the library and go. Libraries can still have flaws though, so you want to make sure you keep up to date and test. If you do find flaws you should contribute back to the community (i.e., report to the library’s developers) so that everyone benefits from the improvement you suggest. One other caution with libraries is that sometimes it is better to inline the code rather than go to a library, as the gains from using the library may not outweigh the cost of accessing the library. Libraries will not automatically make your code faster, but rather are a tool you can use to potentially get more speed and efficiency.

Containers

Probably the ultimate form of library is a well maintained container. Well optimized containers have the advantage of providing a highly customized stack of optimized libraries that will allow the code to get to near its peak performance. Containers are powerful tools for especially complex software stacks, as the container can provide optimization for each individual element of the stack and ensure that all the various versions interoperate properly. Containers are not free though and do have some performance overhead, due to having an abstraction layer between the software in the container and the system hardware. For peak performance you will want to build your software stack outside of the container in the native environment. In most cases though, the performance penalties are minimal and substantially outweighed by the performance gains of using a well maintained and well optimized software stack that you do not have to build yourself.  For best performance look for containers that are provided by the hardware vendor. For instance Nvidia provides a well curated list of containers built for its GPU’s. The vendor typically has the best knowledge of the internals of their hardware and thus will know how to get the most out of it. Containers provided by primary code authors are also good sources as the code author will have the best knowledge of the internals of their code base and how to best run it.

Containers can also be handy for users dealing with operations or code bases involving many files. By including these files in the container itself it effectively hides them from the underlying storage, the storage treats it as one large single file rather than lots of smaller files. Filesystems in general behave best when interacting with single large files as traversing between files is expense, especially when there are many files to deal with. Thus if your workflow either has a software that has many files, like a Python/Anaconda/Mamba environment, or your code is engaging in IO with many files, consider putting them inside a container.

Other General Rules

Here are some rules that did not fit into other sections but are things you can look for when optimizing your code.

  1. Remove Debugging Flags and Options: Cannot emphasize this enough. Production code should not be run in debug mode as it will slow things down substantially.
  2. Use the latest compilers and libraries: Implied above, but one of the first things to try is updating your compiler and library versions to see if the various improvements to those codes improve your code performance.
  3. Leave informative comments in your code: Comments are free and having good comments can help you understand your code and improve it. A very good practice is to cite the paper and specific equation or analysis you are using so you can find the original context.
  4. Make sure your loops are appropriately ordered for your arrays: Different languages have different array ordering as to which index is fastest to traverse in memory (for instance Fortran orders its arrays with the first index being fastest, in C it is the opposite). Be aware of this and arrange your arrays and loops appropriately.
  5. Avoid if statements buried in loops: if statements are not free and cost time to execute, thus it is best if you can execute it once rather than all the time.
  6. Use temporary variables to hold constants: Multiplications are faster than divisions or exponents. Thus, instead of pi/2, use 0.5*pi; instead of 5^2, use 5*5. In addition, if you have a complicated coefficient you are multiplying or dividing by repeatedly, consider calculating that coefficient once and storing that as a temporary variable. If your coefficients are related to each other by some constant value, also consider making that a constant. For instance if you are always using 4*pi/3, store that as a variable, and then use that in place of it where ever it appears.
  7. Use the right type, size, and level of precision for variables: Integer math is faster than floating point math. Single precision math is faster than double precision. 4 byte integers use up less space than 8 byte integers. Select the size and type necessary for the numerical precision and accuracy you require and no larger.
  8. If you have a heavy arithmetic section consider using small temporary arrays for the data you are manipulating: Long strings of math in a single line are hard for the compiler to optimize, and also trend towards mistakes. Consider breaking it up in to smaller chunks that eventually sum up to the total value you need. Be careful of round off error and order of operations issues with this.
  9. Lower your cache miss rate: CPU’s and GPU’s are built with onboard memory (typically called cache), you should try to keep your processing in this onboard memory and only go out to main memory when necessary. Cache is faster to access and generally small, so doing things in smaller chunks that reuse data will be more likely to drop in the cache layer.
  10. Be aware of first touch rule for memory allocation: Memory is typically allocated on an at-need basis, and the further the code needs to search in memory, the worse the performance. Allocate frequently used arrays and variables first.
  11. Reduce memory footprint: As a general rule you want to keep your memory usage to the bare minimum you need. The more temporary arrays and variables you keep the more memory bloat your code will have.
  12. Avoid over abstraction: Pointers are useful, but pointers to pointers to pointers are not. It makes it hard for the compiler to optimize and for you (and anyone that uses your code) to follow the code.
  13. Be specific and well defined: A well structured code is easier to optimize. Declare all your variables up front, allocate your arrays as soon as you can, do not leave the variable types ambiguous.
  14. Work in Memory and Not on Disk: Accessing storage, no matter how fast, takes far more time than accessing memory. Try to only read and write to storage when necessary. If possible spin off a separate process to handle reading and writing to disk so that your main process can continue work.
  15. Avoid large numbers of files: It is better from a IO performance standpoint to have lower numbers of large files on disk rather than many small files. Bundle your data together into larger files that you read from or write to all at once.
  16. Include Restarts/Checkpoints: Include the ability for your code to pick up from where it left off by writing restart/checkpoint data to disk. The restart data should be only what is sufficient to pick up from where your calculation left off. This will allow you to recover from crashes and leverage the requeue partitions. Restarts will also allow you to use partitions with shorter time limits to bridge yourself to a longer run (e.g. using ten 3 day runs to accumulate a 30 day run).

Parallelization

There are limits to how fast you can make any single code run in serial. Once this limit is hit, parallelization needs to be considered. Sometimes this parallelization is trivial, such as launching thousands of jobs at once each with different parameters to do a parameter sweep (this is known as an embarrasingly parallel workflow). However if your code needs to be tightly coupled then other methods of parallelism will need to be considered. The three main methods of parallelization are:

  1. SIMD: Singe Instruction Multiple Data
  2. Thread: Shared Memory
  3. Rank: Distributed Memory

Regardless of what method you use, the general rule is that you want to make sure as much of your code is parallelized as possible and that communications and computation are overlapped with each other.  It is also possible to use SIMD in conjunction with Threads in conjunction with Ranks, this is known as the hybrid approach. These can lead to very powerful codes that can scale up to the largest supercomputers in the world.

Some libraries and codes (for example MATLAB, PETSc, OpenFOAM, Python Multiprocessing, HDF5) will already have parallelization included. Check with the documentation and/or inquire of the developer as to if it is able to parallelize and what method it uses. Once you know that you will be able to get the most out of the built-in parallelization.

SIMD (Single Instruction Multiple Data)

Most processors have multiple channels that can execute a specific command simultaneously on a stream of data. This is built in to the chipset itself and compilers will automatically optimize code to leverage this behavior. You can intentionally design your code to better leverage it depending on which specific compiler and instruction set you are using (such as AVX).

Threads

Threading achieves parallelism by having a shared memory space but then running multiple computational streams (threads) across it to accomplish specific instructions. Thread based parallelism is typically fairly easy to accomplish as it requires no complex interprocess communications, all the changes to memory are readily visible to each thread. Typically all the coder needs to do is to indicate which loops and sections can be threaded, and the compiler takes care of the rest. Examples are OpenMP, OpenACC, Pthreads, and Cuda.

Rank

Rank based parallelism is the most powerful but also most technically demanding type of parallelism. Each process has its own memory space and the user has to manage inter process communications themselves. Key here is making sure that communications bottlenecks are minimal, and if they exist to overlap them with computation so they do not slow down code execution. The industry standard for doing this is called MPI (Message Passing Interface).

Profiling

Knowing where to focus your time for optimizing your code is important. You will gain the most speed by optimizing the part of your code that is currently occupying the most execution time, or using the most memory.  To figure this out you need to profile your code.

The easiest and most immediate way is to use print statements combined with printing how much time each section takes. Most languages have methods of printing out time stamps or calculating elapsed time, simply use those methods with judicious use of print statements and you can quickly find out where your code is spending most of its time. Generally you should instrument your code to give you overall timing estimates, especially if your code works on some sort of large loop (i.e. such as taking time steps for doing fluid dynamics). Print statements are the quickest and easiest way to get information on your code.

Besides print statements, various profilers exist that you can use to inspect your code. Profilers will give you far more information about your code, as well as suggestions as to where your code could be improved. They can give you super precise timing for your code as well as inform you what cache/memory level your code is touching. All of this rich information can be valuable for dialing in on particularly small sections of code or subtle issues that may be causing dramatic slow downs.

Below is a list of profilers you can use:

  • VTune: Intel’s profiler
  • NSight: Nvidia’s profiler
  • DCGM: Data Center GPU Manager from Nvidia
  • top: Not really a profiler but a useful system utility for monitoring live job performance.
  • nvtop: Similar to top but for gpus.

 

]]>
27060
R and RStudio https://docs.rc.fas.harvard.edu/kb/r-and-rstudio/ Fri, 07 Jun 2024 20:46:42 +0000 https://docs.rc.fas.harvard.edu/?post_type=epkb_post_type_1&p=27082

Description

R is a language and environment for statistical computing and graphics.   There are several options to use R on the FASRC clusters.

Of those options, the FASRC recommended method is the RStudio Server stand-alone app on Open OnDemand.

We recommend using RStudio Server on Open OnDemand because it is the simplest way to install R packages (see RStudio Server). We only recommend R module and RStudio Desktop if you:

  • plan to run mpi/multi-node jobs
  • need to choose specific compilers for R package installation
  • are an experienced user, who
    • knows how to compile software from source
    • has too much time on their hands
    • likes to take risks that often don’t payoff ( Hey, there is always RStudio on Open OnDemand, right? )

Should you require it, we offer ( “at your own risk” ):

Usage

RStudio on Open OnDemand

Use RStudio on Open OnDemand to reduce your stress and that of those around you.  Here is a short 5 minute video to get you started:

RStudio Server

RStudio Server is our go-to RStudio app because it contains a wide range of precompiled R packages from bioconductor and rocker/tidyverse. This means that installing R packages in RStudio Server is pretty straightforward. Most times, it will be sufficient to simply:

> install.packages("package_name")

This simplicity was possible because RStudio Server runs inside a Singularity container, meaning that it does not use the host operating system (OS). For more information on Singularity, refer to our Singularity on the cluster docs.

Important notes:

  • User-installed R libraries will be installed in ~/R/ifxrstudio/\<IMAGE_TAG\>
  • This app contains many pre-compiled packages from bioconductor and rocker/tidyverse.
  • FAS RC environment modules (e.g. module load) and Slurm (e.g. sbatch) are not accessible from this app.
  • For the RStudio with environment module and Slurm support, see RStudio Desktop

This app is useful for most applications, including multi-core jobs. However, it is not suitable for multi-node jobs. For multi-node jobs, the recommended app is RStudio Desktop.

FASSE cluster additional settings

If you are using FASSE Open OnDemand and need to install R packages in RStudio Server, you will likely need to set the proxies as explained in our Proxy Settings documentation. Before installing packages, execute these two commands in RStudio Server:

> Sys.setenv(http_proxy="http://rcproxy.rc.fas.harvard.edu:3128")
> Sys.setenv(https_proxy="http://rcproxy.rc.fas.harvard.edu:3128")

Package Seurat

In RStudio Server Release 3.18, the default version for umap-learn is 0.5.5. However, this version contains a bug. To resolve this issue, downgrade to umap-learn version 0.5.4:

> install.packages("Seurat")
> reticulate::py_install(packages = c("umap-learn==0.5.4","numpy<2"))

And test with

> library(Seurat)
> data("pbmc_small")
> pbmc_small <- RunUMAP(object = pbmc_small, dims = 1:5, metric='correlation', umap.method='umap-learn')
UMAP(angular_rp_forest=True, local_connectivity=1, metric='correlation', min_dist=0.3, n_neighbors=30, random_state=RandomState(MT19937) at 0x14F205B9E240, verbose=True)
Wed Jul 3 17:22:55 2024 Construct fuzzy simplicial set
Wed Jul 3 17:22:56 2024 Finding Nearest Neighbors
Wed Jul 3 17:22:58 2024 Finished Nearest Neighbor Search
Wed Jul 3 17:23:00 2024 Construct embedding
Epochs completed: 100%| ██████████ 500/500 [00:00]
Wed Jul 3 17:23:01 2024 Finished embedding

R, CRAN, and RStudio Server pinned versions

FASRC RStudio server pins R, CRAN, and RStudio Server versions to a specific date to ensure R package compatibility. Therefore, we strongly recommend using > install.packages("package_name") with no repos argument specified.

For example, in Release 3.20:

> install.packages("parallelly")
Installing package into ‘/n/home12/jharvard/R/ifxrstudio/RELEASE_3_20’
(as ‘lib’ is unspecified)
trying URL 'https://p3m.dev/cran/__linux__/noble/2025-02-27/src/contrib/parallelly_1.42.0.tar.gz'
Content type 'binary/octet-stream' length 537560 bytes (524 KB)
==================================================
downloaded 524 KB

* installing *binary* package ‘parallelly’ ...
* DONE (parallelly)

The downloaded source packages are in
‘/tmp/Rtmp1AiMaa/downloaded_packages’

Above, the package is downloaded from https://p3m.dev/cran/__linux__/noble/2025-02-27/src/contrib/parallelly_1.42.0.tar.gz Note the date 2025-02-27 and not latest.

For more details see Rocker project which is the base image for FASRC’s RStudio Server.

Advanced Installation: the R package latest version ( not recommended )

This following approach is not recommended, but if you need to build the latest version of a package from source for some reason, you may specify the repo, version, or install from github. Do note that using this approach is very tricky and will likely break R package dependencies. Please do not do this.  Kittens will explode.

repos example:

install.packages("rstan", repos = "https://cloud.r-project.org")

install_github example:

> require(remotes)
> install_github("wadpac/GGIR", ref = "3.2-10")

Use R packages from RStudio Server in a batch job

The RStudio Server OOD app hosted on Cannon at rcood.rc.fas.harvard.edu and FASSE at fasseood.rc.fas.harvard.edu runs RStudio Server in a Singularity container (see Singularity on the cluster). The path to the Singularity image on both Cannon and FASSE clusters is the same:

/n/singularity_images/informatics/ifxrstudio/ifxrstudio:RELEASE_<VERSION>.sif

Where <VERSION> corresponds to the Bioconductor version listed in the “R version” dropdown menu. For example:

R 4.2.3 (Bioconductor 3.16, RStudio 2023.03.0)

uses the Singularity image:

/n/singularity_images/informatics/ifxrstudio/ifxrstudio:RELEASE_3_16.sif

As mentioned above, when using the RStudio Server OOD app, user-installed R packages by default go in:

~/R/ifxrstudio/RELEASE_<VERSION>

This is an example of a batch script named runscript.sh that executes R script myscript.R inside the Singularity container RELEASE_3_16:

#!/bin/bash
#SBATCH -c 1 # Number of cores (-c)
#SBATCH -t 0-01:00 # Runtime in D-HH:MM
#SBATCH -p test # Partition to submit to
#SBATCH --mem=1G # Memory pool for all cores (see also --mem-per-cpu)
#SBATCH -o myoutput_%j.out # File to which STDOUT will be written, %j inserts jobid
#SBATCH -e myerrors_%j.err # File to which STDERR will be written, %j inserts jobid

# set R packages and rstudio server singularity image locations
my_packages=${HOME}/R/ifxrstudio/RELEASE_3_16
rstudio_singularity_image="/n/singularity_images/informatics/ifxrstudio/ifxrstudio:RELEASE_3_16.sif"

# run myscript.R using RStudio Server signularity image
singularity exec --cleanenv --env R_LIBS_USER=${my_packages} ${rstudio_singularity_image} Rscript myscript.R

To submit the job, execute the command:

sbatch runscript.sh

R and RStudio on Windows

See our R and RStudio on Windows page.

Advanced Usage: Not Better, Not Faster and Not Recommended *

( * Fine, it could be faster if you really know what you are doing. Still not recommended.)

These options are for users familiar with software installation from the source, where you choose compilers and set your environmental variables. If you are not familiar with these concepts, we highly recommend using RStudio Server instead.

WARNING: If you got really good at using RStudio with Open OnDemand, and think you are expert now– think you are advanced now–  You are not.  Go back to Open OnDemand where you were well supported and experienced near effortless success.

R module

To use R module, you should first have taken our Introduction to the FASRC training and be familiar with running jobs on the cluster. R modules come with some basic R packages. If you use a module, you will likely have to install most of the R packages that you need.

To use R on the FASRC clusters, load R via our module system. For example, this command will load the latest R version:

module load R

If you need a specific version of R, you can search with the command

module spider R

To load a specific version

module load R/4.2.2-fasrc01

For more information on modules, see the Lmod Modules page.

To use R from the command line, you can use an R shell for interactive work. For batch jobs, you can use R CMD BATCH and RScript commands. Note that these commands have different behaviors:

  • R CMD BATCH
    • output will be directed to a .Rout file unless you specify otherwise
    • prints out input statements
    • cannot output to STDOUT
  • RScript
    • output and errors are directed to to STDOUT and STDERR, respectively, as many other programs
    • does not print input statements

For slurm batch examples, refer to FASRC User_Codes Github repository:

Examples and details of how to run R from the command line can be found at:

R Module + RStudio Desktop

RStudio Desktop depends on an R module. Although it has some precompiled R packages that comes with the R module, it is a much more limited list than the RStudio Server app.

RStudio Desktop runs on the host operating system (OS), the same environment as when you ssh to Cannon or FASSE.

This app is particularly useful to run multi-node/mpi applications because the you can specify the exact modules, compilers, and packages that you need to load.

See how to launch RStudio Desktop documentaiton.

R in Jupyter

To use R in Jupyter, you will need to create a conda/mamba virtual environment and install packages jupyter and rpy2 , which will allow you to use R in Jupyter.

Step 1:  Request an interactive job

salloc --partition test --time 02:00:00 --ntasks=1 --mem 10000

Step 2: Load python module, set environmental variables, and create an environment with the necessary packages:

module load python/3.10.13-fasrc01
export PYTHONNOUSERSITE=yes
mamba create -n rpy2_env jupyter numpy matplotlib pandas scikit-learn scipy rpy2 r-ggplot2 -c conda-forge -y

See Python instructions for more details on Python and mamba/conda environments.

After creating the mamba/conda environment, you will need to load that environment by selecting the corresponding kernel on the Jupyter Notebook to start using R in the notebook.

Step 3: Launch the Jupyter app on the OpenOnDemand VDI portal using these instructions.

You may need to load certain modules for package installations. For example, R package lme requires cmake. You can load cmake by adding the module name in the field “Extra Modules”:

Step 4: Open your Jupyter notebook. On the top right corner, click on “Python 3” (typically, it has “Python 3”, but it may be different on your Notebook). Select the created conda environment “Python [conda env:conda-rpy2_env]”:

Alternatively, you can use the top menu: Kernel -> Change Kernel -> Python [conda env:conda-rpy2_env]

Step 5: Install R packages using a Jupyter Notebooks

Refer to the example Jupyter Notebook on FASRC User_Codes Github.

R with Spack

Step 1: Install Spack by following our Spack Install and Setup instructions.

Step 2: Install the R packages with Spack from the command line. For all R package installations with Spack, ensure you are in a compute node by requesting an interactive job (if you are already in a interactive job, there is no need to request another interactive job):

[jharvard@holylogin spack]$ salloc --partition test --time 4:00:00 --mem 16G -c 8
Installing R packages with spack is fairly simple. The main steps are:
[jharvard@holy2c02302 spack]$ spack install package_name  # install software
[jharvard@holy2c02302 spack]$ spack load package_name     # load software to your environment
[jharvard@holy2c02302 spack]$ R                           # launch R
> library(package_name)                                   # load package within R
For specific examples, refer to FASRC User_Codes Github repository:

R Parallel

This is covered at R Parallel here on User Docs, and EXTENSIVELY in User Docs/Parallel_Computing/R.

Troubleshooting

Files that may configure R package installations

  • ~/.Rprofile
  • ~/.Renviron
  • ~/.bashrc
  • ~/.bash_profile
  • ~/.profile
  • ~/.config/rstudio/rstudio-prefs.json
  • ~/.R/Makevars

Examples

We offer a wealth of examples, see R in our User Codes git repository.

References

]]>
27082
Seminar Series: Parallel Computing on the Cluster https://docs.rc.fas.harvard.edu/kb/parallel-computing-on-cluster/ Thu, 10 Mar 2016 22:23:30 +0000 https://rc.fas.harvard.edu/?page_id=14961 This three-seminar series will help you understand the basics of parallel computing and will present two common frameworks for writing parallelized code. RSVP is recommended but not required.

Introduction to Parallel Computing
This new seminar will provide a brief overview of the extensive and broad topic of parallel computing, as a lead-in for the future seminars. It covers the basics of parallel computing – what it is and how it’s used, followed by discussions on terminology associated with parallel computing, parallel memory architectures, and programming models. The session also explores the considerations related to designing and running parallel programs, with some considerations of doing so on high-end compute systems. This seminar is open for persons at all skill levels, but basic unix and programming skills are assumed.
Upcoming Classes:
No classes scheduled at this time
Seminar Materials
Intro to Parallel Computing slide deck

Parallelizing Code with OpenMP
OpenMP provides a portable, scalable model for developers of shared memory parallel applications. It supports C/C++ and Fortran on a wide variety of architectures. This session provides an introduction to OpenMP, including its various constructs and directives for specifying parallel regions, work sharing, synchronization and data environment, and how to effectively use this on the cluster. Basic unix and programming skills are assumed, and attendance at the Introduction to Parallel Computing seminar strongly encouraged.
No classes scheduled at this time
Seminar Materials
Parallel Computing with OpenMP slide deck

Parallelizing Code with MPI
This session provides an introduction to the Message Passing Interface (MPI) and how to develop and run simple parallel programs according to the MPI standard. It covers topics such as MPI Environment Management, Point-to-Point, and Collective Communication. It also discusses the considerations for efficient parallelization with MPI. Illustrative examples in both C++ and Fortran are provided.
No classes scheduled at this time

]]>
14961
Hybrid (MPI+OpenMP) Codes on the FASRC cluster https://docs.rc.fas.harvard.edu/kb/hybrid-mpiopenmp-codes-on-odyssey/ Thu, 22 Oct 2015 20:24:56 +0000 https://rc.fas.harvard.edu/?page_id=14121  

Introduction

This page will help you compile and run hybrid (MPI+OpenMP) applications on the cluster. Currently we have both OpenMPI and Mvapich2 MPI libraries available, compiled with both Intel and GNU compiler suits.

Example Code

Below are simple hybrid example codes in Fortran 90 and C++.
Fortran 90:


!=====================================================
! Program: hybrid_test.f90 (MPI + OpenMP)
!          FORTRAN 90 example - program prints out
!          rank of each MPI process and OMP thread ID
!=====================================================
program hybrid_test
  implicit none
  include "mpif.h"
  integer(4) :: ierr
  integer(4) :: iproc
  integer(4) :: nproc
  integer(4) :: icomm
  integer(4) :: i
  integer(4) :: j
  integer(4) :: nthreads
  integer(4) :: tid
  integer(4) :: omp_get_num_threads
  integer(4) :: omp_get_thread_num
  call MPI_INIT(ierr)
  icomm = MPI_COMM_WORLD
  call MPI_COMM_SIZE(icomm,nproc,ierr)
  call MPI_COMM_RANK(icomm,iproc,ierr)
!$omp parallel private( tid )
  tid = omp_get_thread_num()
  nthreads = omp_get_num_threads()
  do i = 0, nproc-1
     call MPI_BARRIER(icomm,ierr)
     do j = 0, nthreads-1
        !$omp barrier
        if ( iproc == i .and. tid == j ) then
           write (6,*) "MPI rank:", iproc, " with thread ID:", tid
        end if
     end do
  end do
!$omp end parallel
  call MPI_FINALIZE(ierr)
  stop
end program hybrid_test

C++:


//==========================================================
//  Program: hybrid_test.cpp (MPI + OpenMP)
//           C++ example - program prints out
//           rank of each MPI process and OMP thread ID
//==========================================================
#include <iostream>
#include <mpi.h>
#include <omp.h>
using namespace std;
int main(int argc, char** argv){
  int iproc;
  int nproc;
  int i;
  int j;
  int nthreads;
  int tid;
  int provided;
  MPI_Init_thread(&argc,&argv, MPI_THREAD_MULTIPLE, &provided);
  MPI_Comm_rank(MPI_COMM_WORLD,&iproc);
  MPI_Comm_size(MPI_COMM_WORLD,&nproc);
#pragma omp parallel private( tid )
  {
    tid = omp_get_thread_num();
    nthreads = omp_get_num_threads();
    for ( i = 0; i <= nproc - 1; i++ ){
      MPI_Barrier(MPI_COMM_WORLD);
      for ( j = 0; j <= nthreads - 1; j++ ){
        if ( (i == iproc) && (j == tid) ){
          cout << "MPI rank: " << iproc << " with thread ID: " << tid << endl;
        }
      }
    }
  }
  MPI_Finalize();
  return 0;
}

Compiling the program

MPI Intel, Fortran 90: [username@rclogin02 ~]$ mpif90 -o hybrid_test.x hybrid_test.f90 -openmp
MPI Intel, C++:        [username@rclogin02 ~]$ mpicxx -o hybrid_test.x hybrid_test.cpp -openmp
MPI GNU, Fortran 90:   [username@rclogin02 ~]$ mpif90 -o hybrid_test.x hybrid_test.f90 -fopenmp
MPI GNU, C++:          [username@rclogin02 ~]$ mpicxx -o hybrid_test.x hybrid_test.cpp -fopenmp

Running the program

You could use the following SLURM batch-job submission script to submit the job to the queue:

#!/bin/bash
#SBATCH -J hybrid_test
#SBATCH -o hybrid_test.out
#SBATCH -e hybrid_test.err
#SBATCH -p shared
#SBATCH -n 2
#SBATCH -c 4
#SBATCH -t 180
#SBATCH --mem-per-cpu=4G
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun -n $SLURM_NTASKS -c $SLURM_CPUS_PER_TASK --mpi=pmix ./hybrid_test.x

The OMP_NUM_THREADS environmental variable is used to set the number of threads to the desired number. Please notice that this job will use 2 MPI processes (set with the -n option) and 4 OpenMP threads per MPI process (set with the -c option), so overall the job will reserve and use 8 compute cores. If you name the above script omp_test.batch, for instance, the job is submitted to the queue with

sbatch omp_test.batch

Upon job completion, job output will be located in the file hybrid_test.out with the contents:

 MPI rank:           0  with thread ID:           0
 MPI rank:           0  with thread ID:           1
 MPI rank:           0  with thread ID:           2
 MPI rank:           0  with thread ID:           3
 MPI rank:           1  with thread ID:           0
 MPI rank:           1  with thread ID:           1
 MPI rank:           1  with thread ID:           2
 MPI rank:           1  with thread ID:           3
]]>
14121
Stata https://docs.rc.fas.harvard.edu/kb/stata-on-cluster/ Tue, 05 Jan 2010 16:37:23 +0000 http://rc-dev.rc.fas.harvard.edu/stata-on-odyssey/ Description

STATA, a powerful statistical software package, is widely used by researchers, analysts, and academics across various disciplines. Renowned for its versatility, STATA enables users to efficiently analyze, manage, and visualize data, making it an indispensable tool for both novice and advanced data practitioners. Its intuitive command syntax facilitates seamless data manipulation, regression analysis, time-series modeling, and more, empowering users to uncover meaningful insights from complex datasets. With its robust suite of features and user-friendly interface, STATA continues to be a cornerstone in statistical analysis and research methodology.

Usage

Open OnDemand

STATA can be run from Open OnDemand (OOD, formerly known as VDI) by clicking on Stata icon or choosing it from the Interactive Apps menu, and specifying your resource needs. Hit Launch, wait for the session to start, and click the “Launch Stata” button.

You can also launch Stata from the Remote Desktop app on OOD.

Output file permissions

Stata appears to override filesystem-level permissions structures such as file-ACLs.  In a test using stata/17.0-fasrc01 the .dta files produced by Stata were consistent with the user’s umask, despite default file-ACLs that should have created different effective permissions.  It appears as though Stata is modifying the permissions after writing the file (i.e. after the default file-ACLs have been applied).  The solution for this should be to set the desired umask in the Slurm submission script or on the command line prior to submitting the batch job (though a ‘umask’ command in the Slurm submission script would be preferable in most cases).

Running Stata on FASSE

For how to set the proxies on Stata, see our FASSE proxy documentation.

Troubleshooting

Stata I/O error

If you get the error:

I/O error writing .dta file
Usually such I/O errors are caused by the disk or file system being full.

This is because Stata writes temporary files to disk (instead of only storing on memory). The location that Stata writes temporary files to disk is set, by default, to \tmp. If \tmp‘s size is smaller than your combined datasets, then Stata will not have enough space to write temporary files.

Solution: set the environmental variable STATATMP to a directory that is large enough.

Stata stand-alone app (Open OnDemand)

You can avoid this error by increasing the value in the option “Minimum size of STATATMP in GB”. We recommend increasing to twice the size of the datasets that you are using.

If your total size is more than 68GB, check the size available on each partition on Cannon and FASSE (see last column “/scratch size (GB)”).

On FASSE, if you do not have access to a partition that has enough /scratch space, you can:

  1. use partition serial_requeue (do note that serial_requeue jobs may be preempted)
  2. use the Stata with Remote Desktop app (explained below), but instead of using /scratch, use /n/netscratch/PI_lab

Command line interface or Stata from Remote Desktop app

You must set STATATMP before launching Stata with:

export STATATMP=/scratch/$USER/stata_dir
mkdir -p $STATATMP

Alternatively, you can set to a lab share as well. However, local scratch has better performance.

Additionally, when you submit a job, you can add the slurm option --tmp to request that the local disk has a minimum size. For example, this requests a disk with at least 150GB:

#SBATCH --tmp=150G

Examples

You can find examples on how to run STATA in both serial and parallel modes, in the FASRC User Codes GitHub repository.

Resources

These are some external resources with many examples on how to use Stata:

]]>
5385
MATLAB https://docs.rc.fas.harvard.edu/kb/matlab/ Tue, 05 Jan 2010 20:01:04 +0000 http://rc-dev.rc.fas.harvard.edu/matlab/ Introduction

MATLAB, short for Matrix Laboratory, is a high-performance language and interactive environment for numerical computation, visualization, and programming. Developed by MathWorks, MATLAB integrates computation, visualization, and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation. It is widely used in academia and industry for data analysis, algorithm development, and modeling and simulation. One of MATLAB’s key strengths is its extensive library of built-in functions and toolboxes, which cover a wide range of scientific and engineering applications, including signal processing, control systems, neural networks, and machine learning.

In addition to its robust built-in capabilities, MATLAB offers a user-friendly interface that simplifies the process of writing and debugging code. The platform supports both procedural and object-oriented programming, providing flexibility in how users approach problem-solving. MATLAB’s powerful plotting functions make it easy to visualize data and results, which is crucial for interpreting complex numerical information. Furthermore, MATLAB supports integration with other programming languages like C, C++, Java, and Python, allowing for versatile application development. Its widespread adoption in various fields, from finance and biotechnology to automotive and aerospace engineering, underscores its versatility and effectiveness as a tool for both research and practical problem-solving.

Installation

MATLAB Licenses

Research Computing has licensed MATLAB from Mathworks for use on desktops, laptops, and on the FASRC cluster. If you wish to run MATLAB on your desktop/laptop, please follow the instructions on the FAS downloads page (downloads for all platforms are available from Mathworks). Running MATLAB on the cluster can be done through a GUI using VDI, at the command line interactive, or through batch jobs.
NOTE! These instructions discuss the single-core (process) implementation. If you wish to run MATLAB as a multi-core/process job, please see our companion document.

Examples

MATLAB GUI on the FASRC cluster

The MATLAB GUI can be run using the power of the compute nodes of the cluster by initiating your session via our graphical login, and starting an interactive MATLAB session. This is almost like running MATLAB on your desktop/laptop, except all the computation is done on the cluster. There two ways of using the MATLAB GUI on the cluster:

MATLAB VDI APP

  • Log on to the cluster via vdi.rc.fas.harvard.edu. For more details on Open OnDemand (VDI) go here.
  • From the dashboard select the Matlab app.
  • Specify the resources for your job (e.g., partition, total memory, number of cores, Matlab version, etc.)
  • Launch the job.

MATLAB software module

  • Log on to the cluster via vdi.rc.fas.harvard.edu.
  • From the dashboard select the Remote Desktop app.
  • Specify the resources for your job and launch it.
  • Once the job start, open a terminal in the remote desktop.
  • In the terminal, load the MATLAB module
module load matlab # This will load the default (latest MATLAB version)
  • Start the MATLAB GUI. In the terminal type in
matlab

MATLAB on the command line (interactive terminal session)

MATLAB can also be run on the command line in an interactive terminal session. Since there is no GUI, you must include additional parameters (-nojvm -nosplash -nodesktop), and you can optionally specify an M file (e.g., script.m or function.m) to run, including any script or function parameters as required:

  • Login to the cluster as explained here.
  • Once logged in, get an interactive session as described here, e.g.,
salloc --mem=4G -t 60 -c 1 -N 1 -p test
  • Load a Matlab module, e.g,
module load matlab/R2022b-fasrc01
  • Start MATLAB interactively without a GUI
matlab

Examples

To get started with Matlab on the FAS cluster you can try the below examples:

  • Example1: Monte-Carlo computation of PI
  • Example2: Sums up integers from 1 to N
  • Example3: Generates a multi-figure on a 3 X 3 grid
  • Example4: Illustration of job arrays and parameter sweeps
  • Example5: Random vectors and job arrays

References:

]]>
5374