KNIME on the FASRC clusters
https://docs.rc.fas.harvard.edu/kb/knime-on-the-fasrc-clusters/

Description

KNIME is an open-source data analytics, reporting, and integration platform designed to support various aspects of machine learning and data mining through its modular data-pipelining concept. The platform offers a way to integrate tasks ranging from developing analytic models to deploying them and sharing insights with your team. The KNIME Analytics Platform offers users 300+ connectors to multiple data sources and integrations with all popular machine learning libraries.

The software’s key capabilities include Data Access & Transformation, Data Analytics, Visualization & Reporting, Statistics & Machine Learning, Generative AI, Collaboration, Governance, Data Apps, Automation, and AI Agents.

Given KNIME’s wide-scale use and applicability, we have converted it into a system-wide module that can be loaded from anywhere on either of the FASRC clusters, Cannon or FASSE. Additionally, we have packaged it as an app that can be launched using the cluster web interface, Open OnDemand (OOD).

KNIME as a module

KNIME is available as a module on the FASRC clusters. To learn more about the module, including the available versions and how to load one of them, execute the following from a terminal on the cluster: module spider knime

This pulls up information on the versions of KNIME that are available to load. For example, for a user jharvard on a compute node, the module spider command would produce the following output:


[jharvard@holy8a26602 ~]$ module spider knime/
knime:
Description:
An open-source data analytics, reporting, and integration platform meant to perform various aspects of machine-learning & data mining through its modular data pipelining concept.

Versions:
knime/5.4.3-fasrc01
knime/5.4.4-fasrc01

For detailed information about a specific "knime" package (including how to load the modules) use the module's full name.

Note that names that have a trailing (E) are extensions provided by other modules.

For example:
$ module spider knime/5.4.4-fasrc01


To load a specific module, one can execute: module load knime/5.4.3-fasrc01

Or, to load the default (typically the latest) version, one can run the module load knime command. This would result in, e.g.:

[jharvard@holy8a26602 ~]$ module load knime
[jharvard@holy8a26602 ~]$ module list
Currently Loaded Modules:
  1) knime/5.4.4-fasrc01

Once the knime module is loaded, you can launch the GUI by running the knime executable in the terminal, provided you SSH into the cluster with X11 forwarding enabled (preferably with the -Y option) and have XQuartz (macOS) or MobaXterm (Windows) installed on the local device you use to log in to the cluster. For example:

ssh -Y jharvard@login.rc.fas.harvard.edu

[jharvard@holylogin05 ~]$ salloc -p test --x11 --time=2:00:00 --mem=4g
[jharvard@holy8a26602 ~]$ module load knime
[jharvard@holy8a26602 ~]$ knime

You can ignore the following libGL errors; the GUI should appear as shown in the screenshot below.
libGL error: No matching fbConfigs or visuals found
libGL error: failed to load driver: swrast

Knime GUI launched directly on Cannon

Note: While you can launch KNIME directly on the cluster using X11 forwarding, it is laggy and does not lend itself well to the faster execution that certain KNIME workflows might need. To avoid issues associated with X11 forwarding, we recommend launching KNIME using OOD.

Both these modules are also available to use via the Knime OOD app, as explained below.

KNIME on OOD

KNIME can be run from Open OnDemand (OOD, formerly known as VDI) by choosing it from the Interactive Apps menu, and specifying your resource needs. Hit Launch, wait for the session to start, and click the “Launch Knime” button.

You can also launch KNIME from the Remote Desktop app on OOD.

Pre-installed Extensions

Both KNIME modules come with the following pre-installed extensions:

  1. For GIS: Geospatial Analytics Extension for KNIME
  2. For Programming:
    KNIME Python Integration
    KNIME Interactive R Statistics Integration
  3. For Machine Learning:
    KNIME H2O Machine Learning Integration
    KNIME XGBoost Integration
    KNIME Machine Learning Interpretability Extension
  4. For OpenAI, Hugging Face, and other LLMs: KNIME AI Extension
  5. For AI Assistant Coding: KNIME AI Assistant (Labs)
  6. For Google Drive Integration: KNIME Google Connectors

Note: Users cannot install new extensions on the fly, as modules do not come with write permissions.

KNIME Tutorial

The link here takes you to the KNIME tutorial that has been prepared by Lingbo Liu from Harvard’s Center for Geographic Analysis (CGA). This tutorial is best executed by launching the Knime app on OOD.

OpenAI
https://docs.rc.fas.harvard.edu/kb/openai/

Description

See OpenAI website and documentation.

Security

Please carefully read Harvard’s AI guidelines and Generative AI tool comparison. See our FASRC Guidelines for OpenAI Key and Harvard Agreement.

You may only use OpenAI and other generative AI tools on your own if the work is not Harvard-related and/or involves only public, non-sensitive data (data security level 1).

For data security levels 2 and 3, you need to work with your school to discuss your needs. It is your responsibility to make sure you get set up with the appropriate contractual coverage and environment, especially to avoid having the model learn from your input and leak sensitive information.

Installation

You can install OpenAI in a conda/mamba environment:

[jharvard@boslogin01 ~]$ salloc --partition test --time 01:00:00 --mem-per-cpu 4G -c 2
[jharvard@holy8a24301 ~]$ module load python/3.10.12-fasrc01
[jharvard@holy8a24301 ~]$ export PYTHONNOUSERSITE=yes
[jharvard@holy8a24301 ~]$ mamba create -n openai_env openai
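
To verify the installation, you can activate the environment and print the installed package version, e.g.:

[jharvard@holy8a24301 ~]$ source activate openai_env
(openai_env) [jharvard@holy8a24301 ~]$ python -c "import openai; print(openai.__version__)"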

Run OpenAI

You will need to provide an OpenAI key. You can generate one from
https://platform.openai.com/api-keys.


# Request an interactive job
[jharvard@boslogin01 ~]$ salloc --partition test --time 01:00:00 --mem-per-cpu 4G -c 2
# Source conda environment
[jharvard@holy8a24301 ~]$ source activate openai_env
# replace my_key with the key that you generated on OpenAI's website
[jharvard@holy8a24301 ~]$ export OPENAI_API_KEY='my_key'
# set SSL_CERT_FILE with system's certificate
(openai_env) [jharvard@holy8a24301 ~]$ export SSL_CERT_FILE='/etc/pki/tls/certs/ca-bundle.crt'
# run OpenAI example
(openai_env) [jharvard@holy8a24301 ~]$ python openai-test.py

Note: OpenAI uses the Python package httpx. You must set the variable SSL_CERT_FILE to use the system’s certificate. If you do not set SSL_CERT_FILE, OpenAI will give this error:


ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)
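
For reference, a minimal sketch of what an openai-test.py might contain is shown below; the model name and prompt are placeholders, not requirements:

from openai import OpenAI

# The client picks up the OPENAI_API_KEY environment variable set above
client = OpenAI()

# Send a simple chat completion request; the model name is a placeholder
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello from the FASRC cluster."}],
)
print(response.choices[0].message.content)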

Examples

See our User Codes repository for OpenAI examples, including one for OpenAI Whisper.


HeavyAI
https://docs.rc.fas.harvard.edu/kb/heavyai/

What is HeavyAI?

See HeavyAI website.

This software was formerly known as OmniSci.

HeavyAI in the FASRC Cannon cluster

HeavyAI is implemented on the FASRC cluster using Singularity.

We recommend carefully reading the HeavyAI hardware recommendations, as they provide details about the number of cores and the amount of RAM (memory) you should request. They also recommend using SSD storage, which is available through local scratch.

You may request specific GPU cards on the FASRC clusters using the --constraint flag. See our Job Constraints documentation for more details.
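
For example, a request for a specific GPU card might look like the following; the constraint name a100 is illustrative, so check the Job Constraints documentation for valid values:

salloc -p gpu -t 0-06:00 --mem=16G --gres=gpu:1 --constraint=a100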

Installation

We recommend that you use HeavyAI through the Open OnDemand interface over a VPN connection. We also offer learning sessions for Open OnDemand.

Examples

Should you need the command-line interface, see User Codes for an example of how to do so.

Note: You will need to provide your own license key to use HeavyAI. FASRC does not provide a license. You can request a free version on HeavyAI downloads page. Note that the free license only allows limited computational resources.

Resources

  • Support Portal: The HEAVY.AI Support Portal offers a knowledge base, FAQs, troubleshooting resources, and access to community discussions, providing valuable assistance for academic users.
  • Resource Center: The HEAVY.AI Resource Center features whitepapers, solution briefs, case studies, and videos that can be beneficial for academic research and teaching purposes.
Claude
https://docs.rc.fas.harvard.edu/kb/claude/

Description

See the Claude website and documentation.

Security

Please carefully read Harvard’s AI guidelines and Generative AI tool comparison.

You may only use Anthropic tools and other generative AI models on public, non-sensitive data (data security level 1) on Cannon.

For data security levels 2 and 3, you need to work with your school to discuss your needs. It is your responsibility to make sure you get set up with the appropriate contractual coverage and environment, especially to avoid having the model learn from your input and leak sensitive information.

Installation

You can install Anthropic’s Python SDK (the anthropic package) for Claude in a conda/mamba environment.

Here is a quick sequence of commands to install it (run them line by line; the salloc command first starts an interactive session):

#!/bin/bash

salloc --partition=test --time=02:00:00 --mem=8G --cpus-per-task=2
module load python
export PYTHONNOUSERSITE=yes
mamba create --name claude_env python -y
source activate claude_env
pip install anthropic
conda deactivate

Running Claude

You will need to provide an Anthropic API key. You can generate one from their API page. Also, see their quickstart guide.
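
As a minimal sketch, assuming you have exported your key as the ANTHROPIC_API_KEY environment variable and noting that the model name below is a placeholder, a first request might look like:

import anthropic

# The client reads the ANTHROPIC_API_KEY environment variable by default
client = anthropic.Anthropic()

# The model name is a placeholder; see Anthropic's docs for current models
message = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=256,
    messages=[{"role": "user", "content": "Say hello from the FASRC cluster."}],
)
print(message.content[0].text)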

Examples

See FASRC User Codes repo for example Claude scripts.

See also Anthropic’s own Anthropic Cookbooks.

Resources

Anthropic Quickstarts: A collection of projects designed to help developers quickly get started with building applications using the Anthropic API. Each quickstart provides a foundation that you can easily build upon and customize for your specific needs.

Anthropic Official Documentation: Comprehensive guide to using Claude, including setup, API usage, and troubleshooting.

Claude AI API Access: Portal to access the Claude API, set up API keys, and manage your integrations.

Claude’s Capabilities and Model Family: Learn about Claude’s different models such as Sonnet and Haiku, tailored for various performance needs.

PyTorch
https://docs.rc.fas.harvard.edu/kb/pytorch/

Description

PyTorch, developed by Facebook’s AI Research lab, is an open-source machine learning library that offers a flexible platform for building deep learning models. It features a Python front end and integrates seamlessly with Python libraries like NumPy, SciPy, and Cython to extend its functionality. Unique for its use of dynamic computational graphs, unlike TensorFlow’s static graphs, PyTorch allows for greater flexibility in model design. This is particularly advantageous for research applications involving novel architectures.

The library supports GPU acceleration, enhancing performance significantly, which is vital for tackling high-level research tasks in areas such as climate change modeling, DNA sequence analysis, and AI research that involve large datasets and complex architectures. Automatic differentiation in PyTorch is handled through a tape-based system at both the functional and neural network layers, offering both speed and flexibility as a deep learning framework.

Installing PyTorch

These instructions are intended to help you install PyTorch on the FASRC cluster.

GPU Support: For general information on running GPU jobs, refer to our user documentation. To set up PyTorch with GPU support in your user environment, please follow the steps below.

PyTorch with CUDA 12.1 in a conda environment

These instructions set up a conda environment with PyTorch version 2.3.0 and CUDA version 12.1, where the cuda-toolkit is installed directly in the conda environment.

Start an interactive job requesting GPUs, e.g. (note: you will want to start a session on the same type of hardware you will run on):

salloc -p gpu -t 0-06:00 --mem=8000 --gres=gpu:1

Load required software modules, e.g.,

module load python/3.10.13-fasrc01

Create a conda environment, e.g.,

mamba create -n pt2.3.0_cuda12.1 python=3.10 pip wheel

Activate the new conda environment:

source activate pt2.3.0_cuda12.1

Install cuda-toolkit version 12.1.0 with mamba

mamba install -c "nvidia/label/cuda-12.1.0" cuda-toolkit=12.1.0

Install PyTorch with mamba

mamba install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

Install additional Python packages, if needed, e.g.,

mamba install -c conda-forge numpy scipy pandas matplotlib seaborn h5py jupyterlab jupyterlab-spellchecker scikit-learn

PyTorch with CUDA 11.8 from a software module

These instructions set up a conda environment with PyTorch version 2.2.0 and CUDA version 11.8, where CUDA is loaded as a software module, cuda/11.8.0-fasrc01

# Start an interactive job on a GPU node (target the architecture where you plan to run), e.g.,
salloc -p gpu -t 0-06:00 --mem=8000 --gres=gpu:1

# Load the required modules, e.g.,
module load python 
module load cuda/11.8.0-fasrc01 # CUDA version 11.8.0

# Create a conda environment and activate it, e.g.,
mamba create -n pt2.2.0_cuda11.8 python=3.10 pip wheel -y
source activate pt2.2.0_cuda11.8

# Install PyTorch
mamba install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

# Install additional packages, e.g.,
mamba install pandas scikit-learn matplotlib seaborn jupyterlab -y

Installing PyG (PyTorch Geometric)

After you have created the conda environment pt2.3.0_cuda12.1 and activated it, you can install PyG in your environment with the command:

(pt2.3.0_cuda12.1) [username@holygpu7c26103 ~]$ mamba install pyg -c pyg
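
A quick way to confirm that PyG installed correctly is to import it and print its version, e.g.:

(pt2.3.0_cuda12.1) [username@holygpu7c26103 ~]$ python -c "import torch_geometric; print(torch_geometric.__version__)"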

Running PyTorch

If you are running PyTorch on a GPU with multi-instance GPU (MIG) mode on (e.g., the gpu_test partition), see PyTorch on MIG Mode below.

PyTorch checks

You can run the following tests to ensure that PyTorch was installed properly and can find the GPU card. Example output of PyTorch checks:

(pt2.3.0_cuda12.1_v0) [jharvard@holygpu7c26106 ~]$ python -c 'import torch;print(torch.__version__)'
2.3.0
(pt2.3.0_cuda12.1_v0) [jharvard@holygpu7c26106 ~]$ python -c 'import torch;print(torch.cuda.is_available())'
True
(pt2.3.0_cuda12.1_v0) [jharvard@holygpu7c26106 ~]$ python -c 'import torch;print(torch.cuda.device_count())'
1
(pt2.3.0_cuda12.1_v0) [jharvard@holygpu7c26106 ~]$ python -c 'import torch;print(torch.cuda.current_device())'
0
(pt2.3.0_cuda12.1_v0) [jharvard@holygpu7c26106 ~]$ python -c 'import torch;print(torch.cuda.device(0))'
<torch.cuda.device object at 0x14942e6579d0>
(pt2.3.0_cuda12.1_v0) [jharvard@holygpu7c26106 ~]$ python -c 'import torch;print(torch.cuda.get_device_name(0))'
NVIDIA A100-SXM4-40GB MIG 3g.20gb

Run PyTorch Interactively

For an interactive session to work with the GPUs, you can use the following:

salloc -p gpu -t 0-06:00 --mem=8000 --gres=gpu:1

Load required software modules and source your PyTorch conda environment.

[username@holygpu7c26103 ~]$ module load python/3.10.12-fasrc01
[username@holygpu7c26103 ~]$ source activate pt2.3.0_cuda12.1
(pt2.3.0_cuda12.1) [username@holygpu7c26103 ~]$

Test PyTorch interactively:

(pt2.3.0_cuda12.1) [username@holygpu7c26103 ~]$ python check_gpu.py
Using device: cuda

NVIDIA A100-SXM4-40GB
Memory Usage:
Allocated: 0.0 GB
Reserved:  0.0 GB

tensor([[-2.3792, -1.2330, -0.5143,  0.5844]], device='cuda:0')

The script check_gpu.py checks whether GPUs are available and, if so, sets up the device to use them.
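
A minimal sketch of such a script, consistent with the output shown above, might look like:

import torch

# Use the GPU if one is visible to the job, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()

if device.type == 'cuda':
    print(torch.cuda.get_device_name(0))
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3, 1), 'GB')
    print('Reserved: ', round(torch.cuda.memory_reserved(0)/1024**3, 1), 'GB')
    print()

# Allocate a small random tensor on the selected device as a sanity check
print(torch.randn(1, 4).to(device))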

Run PyTorch with Batch Jobs

An example batch-job submission script is included below:

#!/bin/bash
#SBATCH -c 1
#SBATCH -N 1
#SBATCH -t 0-00:30
#SBATCH -p gpu
#SBATCH --gres=gpu:1
#SBATCH --mem=4G
#SBATCH -o pytorch_%j.out 
#SBATCH -e pytorch_%j.err 

# Load software modules and source conda environment
module load python/3.10.12-fasrc01
source activate pt2.3.0_cuda12.1

# Run program
srun -c 1 --gres=gpu:1 python check_gpu.py

If you name the above batch-job submission script run.sbatch, for instance, the job is submitted with:

sbatch run.sbatch


PyTorch on MIG Mode

Note: Currently only the gpu_test partition has MIG mode enabled.

# Get GPU card name
nvidia-smi -L

# Set CUDA_VISIBLE_DEVICES with the MIG instance
export CUDA_VISIBLE_DEVICES=MIG-5b36b802-0ab0-5f37-af2d-ac23f40ef62d

Or automate the process with:

export CUDA_VISIBLE_DEVICES=$(nvidia-smi -L | awk '/MIG/ {gsub(/[()]/,"");print $NF}')

Best Practices

PyTorch and Jupyter Notebook on Open OnDemand

To use PyTorch in Jupyter Notebook on Open OnDemand/VDI, install ipykernel and ipywidgets:

mamba install ipykernel ipywidgets

Pull a PyTorch Singularity Container

Alternatively, you can pull and use a PyTorch singularity container:

singularity pull docker://pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
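
The pull produces an image file (pytorch_2.1.0-cuda12.1-cudnn8-runtime.sif with singularity's default naming), which can then be run with GPU support, e.g.:

singularity exec --nv pytorch_2.1.0-cuda12.1-cudnn8-runtime.sif python check_gpu.py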

Other PyTorch/CUDA versions

To install other versions, refer to the PyTorch compatibility chart on the PyTorch website.

Examples

For example scripts covering installation and use cases, see our User Codes > AI > PyTorch repo.

Python Package Installation
https://docs.rc.fas.harvard.edu/kb/python-package-installation/

Description

Python packages on the cluster are primarily managed with Mamba.  Direct use of pip, outside of a virtual environment, is discouraged on the FASRC clusters.

Mamba is a package manager that is a drop-in replacement for Conda, and is generally faster and better at resolving dependencies:

  • Speed: Mamba is written in C++, which makes it faster than Conda. Mamba uses parallel processing and efficient code to install packages faster.
  • Compatibility: Mamba is fully compatible with Conda, so it can use the same commands, packages, and environment configurations.
  • Cross-platform support: Mamba works on Mac, Linux and Windows.
  • Dependency resolution: Mamba is better at resolving dependencies than Conda.
  • Environment creation: Mamba is faster at creating environments, especially large ones.
  • Package repository: Mamba uses the conda-forge channel (via the Miniforge/Mambaforge distribution), which provides the most up-to-date packages available.

Important:
Anaconda is currently reviewing its Terms of Service for Academia and Research and is expected to conclude the update by the end of 2024. There is a possibility that Conda may no longer be free for non-profit academic research use at institutions with more than 200 employees, and downloading packages through Anaconda’s Main channel may incur costs. Hence, we recommend our users switch to the open-source conda-forge channel for package distribution when possible. Our python module is built with the Miniforge3 distribution, which has conda-forge set as its default channel.

Mamba uses the same commands and configuration options as conda, and you can swap almost all commands between conda & mamba. By default, mamba uses conda-forge, the free package repository. (In this doc, we will generally refer only to mamba.)

Usage

Mamba is available on the FASRC cluster as a software module, either as Mambaforge or as python/3*, which is aliased to mamba. One can access it by loading either of these modules:

$ module load python/{PYTHON_VERS}-fasrc01
$ python -V
Python {PYTHON_VERS}

Environments

You can create virtual environments with mamba in the same way as with conda. However, it is important to start an interactive session prior to creating an environment and installing the desired packages, in the following manner:

$ salloc --partition test --nodes=1 --cpus-per-task=2 --mem=4GB --time=0-02:00:00

$ module load python/{PYTHON_VERS}-fasrc01

You do not need to export the following two variables before creating your mamba environment; they set the package location to the standard place:

export CONDA_PKGS_DIRS=~/conda/pkgs
export CONDA_ENVS_PATH=~/conda/envs

However, if you need to locate the packages elsewhere, such as a shared directory, then specify the absolute file path:

export CONDA_PKGS_DIRS=/<FILEPATH>/conda/pkgs
export CONDA_ENVS_PATH=/<FILEPATH>/conda/envs

Create an environment using mamba: $ mamba create -n <ENV_NAME>

You can also install packages with the create command, which can speed up your setup time significantly. For example:

$ mamba create -n <ENV_NAME> <PACKAGES> 
$ mamba create -n python_env1 python={PYTHON_VERS} pip wheel

You must activate an environment in order to use it or install any packages in it. To activate and use an environment: $ source activate python_env1

To deactivate an active environment: $ source deactivate

You can list the packages currently installed in the mamba or  conda environment with: $ mamba list

To install new packages in the environment with mamba using the default channel:

 $ mamba install -y <PACKAGES>

For example: $ mamba install -y numpy 

To install a package from a specific channel, instead:

$ mamba install --channel <CHANNEL-NAME> -y <PACKAGE>

For example: $ mamba install --channel conda-forge boltons

To uninstall packages: $ mamba uninstall PACKAGE

To delete an environment: $ conda remove -n <ENV_NAME> --all -y

For additional features, please refer to the Mamba documentation.

Pip Installs

Avoid using pip outside of a mamba environment on any FASRC cluster. If you run pip install outside of a mamba environment, the installed packages will be placed in your $HOME/.local directory, which can lead to package conflicts and may cause some packages to fail to install or load correctly via mamba.

For example, if your environment name is python_env1:

$ module load python
$ source activate python_env1
$ pip install <package_name>

Best Practices

Use mamba environment in Jupyter Notebooks

If you would like to use a mamba environment as a kernel in a Jupyter Notebook on Open OnDemand (Cannon OOD or FASSE OOD), you have to install the packages ipykernel and nb_conda_kernels. These packages allow Jupyter to detect mamba environments that you created from the command line.

For example, if your environment name is python_env1:

$ module load python
$ source activate python_env1
$ mamba install ipykernel nb_conda_kernels
After these packages are installed, launch a new Jupyter Notebook job (existing Jupyter Notebook jobs will fail to “see” this environment). Then:
  1. Open a Jupyter Notebook (a .ipynb file)
  2. On the top menu, click Kernel -> Change kernel -> select the conda environment

Mamba environments in a desired location

With mamba, use the -p or --prefix option to write environment files to a desired location, such as holylabs. Don’t use your home directory, as it has very low performance due to filesystem latency. Using a lab share location also lets you share your conda environment with other people on the cluster. Keep in mind that you will need to create the destination directory and specify the Python version to use. For example:

$ mamba create --prefix /n/holylabs/{YOUR_LAB}/Lab/envs python={PYTHON_VERS}

$ source activate /n/holylabs/{YOUR_LAB}/Lab/envs

To delete an environment at that desired location: $ conda remove -p /n/holylabs/{YOUR_LAB}/Lab/envs --all -y

Troubleshooting

Interactive vs. batch jobs

If your code works in an interactive job but fails in a Slurm batch job, check the following:

  1. You are submitting your jobs from within a mamba environment.
    Solution 1: Deactivate your environment with the command mamba deactivate and submit the job or
    Solution 2: Open another terminal and submit the job from outside the environment.

  2. Check if your ~/.bashrc or ~/.bash_profile files have a conda initialize section or a source activate command. The conda initialize section is known to create issues on the FASRC clusters.
    Solution: Delete the section between the two conda initialize statements. If you have source activate in those files, delete it or comment it out.
    For more information on ~/.bashrc files, see https://docs.rc.fas.harvard.edu/kb/editing-your-bashrc/

Jupyter Notebook or JupyterLab on Open OnDemand/VDI problems

See Jupyter Notebook or JupyterLab on Open OnDemand/VDI troubleshooting section.

Unable to install packages

If you are not able to install packages, or the package installation is taking a significantly long time, check your ~/.condarc file. As stated in the Conda docs, this is an optional runtime configuration file. One can use this file to configure conda/mamba to search specific channels for package installation.

We recommend that users not have this file, or keep it empty. This allows users to install packages in their conda/mamba environments using the defaults provided by the open-source distribution, Miniforge, which we have made available to our users via our newer Python modules.

If, for any reason, ~/.condarc exists in your cluster profile, then check its contents. If the default channel is showing up as conda, edit it to conda-forge so that your ~/.condarc uses this open-source channel for package installation.

Similarly, if you created an environment a long time ago using the Anaconda distribution and it is no longer working, then it is best to create a new environment using the open-source distribution as described above, while ensuring that ~/.condarc, if it exists, points to conda-forge as its default channel.

For example, if you created a conda environment using one of our older Python modules, say Anaconda2/2019.10-fasrc01, you can see that conda is configured to use repo.anaconda.com for package installation.

$ module load Anaconda2 
$ conda info 
... 
channel URLs : 
https://repo.anaconda.com/pkgs/main/linux-64 
https://repo.anaconda.com/pkgs/main/noarch 
https://repo.anaconda.com/pkgs/r/linux-64 
https://repo.anaconda.com/pkgs/r/noarch 
...

In order to change this configuration, you can execute the conda config command to ensure that conda now points to conda-forge. This also creates a .condarc file in your $HOME, if it doesn’t already exist:

$ conda config --add default_channels https://conda.anaconda.org/conda-forge/

$ cat ~/.condarc
default_channels:
  - https://conda.anaconda.org/conda-forge/

$ conda info 
... 
channel URLs : 
https://conda.anaconda.org/conda-forge/linux-64 
https://conda.anaconda.org/conda-forge/noarch
...
FASRC Guidelines for OpenAI Key and Harvard Agreement
https://docs.rc.fas.harvard.edu/kb/openai-guidelines/

FASRC allows the use of OpenAI on the Cannon cluster. Our users are free to install the tool locally under their profile on the cluster and provide it with data. However, we ask our users to be aware of the initial guidelines that the University has put forward for the use of such tools at Harvard. (Sometimes Chrome might not work for these HUIT websites, especially if you are not on the university VPN; in that case, you can access them using Firefox.)

OpenAI on the cluster

Following is a set of guidelines that we have put together for interested parties to ensure safe usage of OpenAI tools on the cluster and the data that is provided to them. 

There are two ways to get started on the process of getting your OpenAI account attached to Harvard’s enterprise agreement.

Option 1 (needs a credit card) – Best executed by the PI

  1. Create a lab/PI/school-based OpenAI account using an email address and password of your choosing that can be easily distributed to other members of the group.
  2. This will generate an OpenAI API key that would need to be stored safely. The key will be required by other members of the account to install and use OpenAI on the cluster.
  3. After creating the account, go to your profile and click on Settings.
  4. Look for Organization ID on that page.
  5. Copy that ID and send an email using the template below to ithelp@harvard.edu or generativeAI@fas.harvard.edu to get this associated with the enterprise agreement that Harvard has with OpenAI. It could take up to a week for the association to take place. 
  6. This would ensure that whatever data is provided to OpenAI stays within that agreement and is not made public by the company.
  7. Attach a credit card with this account that will be used for billing. This could be the PI or division’s C-card. 
  8. Once the newly created OpenAI account has been associated with Harvard’s agreement and a credit card has been attached to it, the PI or the manager of this OpenAI account can now add members/students by going to the Settings page and inviting new members to the group.
  9. The OpenAI API key will also have to be shared with new members.
  10. At this point, any member of that account is ready to install OpenAI on the cluster using the instructions on https://github.com/fasrc/User_Codes/tree/master/AI/OpenAI
  11. The member will also need to be made aware of the data classification level that is attached to a certain GAI tool and its subsequent use on the cluster. 

Note: FASRC has tested using the OpenAI key for installing and using the corresponding software with Option# 1.

Email Template

Subject: Request to Associate OpenAI Organization ID with Harvard Enterprise Agreement

Dear HUIT GenAI Support Team,

I am requesting assistance in associating my OpenAI Organization ID with Harvard University’s Enterprise Agreement (EA) with OpenAI. 

As a Harvard affiliate, I want to utilize the OpenAI APIs, which offer increased API rate limits and the ability to use level 3 data, under the terms of the EA established by Harvard.

Organization ID:

  • My Organization ID is [Your Organization ID Here]. You can find this ID in the OpenAI API portal under Settings -> Organization (platform.openai.com).

I understand that once my Organization ID is submitted, the association with the Enterprise Agreement could take up to a week to be confirmed by OpenAI. Additionally, I acknowledge that API-related billing will be charged to the credit card on record with my OpenAI account, as OpenAI does not support PO/invoice billing.

Regards,

<Name>

Option 2 (HUIT recommended – needs 33-digit billing code) – Best executed by the PI

Individual Account:

  1. Go to Harvard’s API portal.
  2. Click on either of the two options: 
    1. AI Services – OpenAI Direct API
    2. Or AI Services – OpenAI via Azure
  3. Follow the instructions given on the corresponding page. 
  4. For example, for either of the options, one will have to fill out the HUIT billing form for new customers located on HarvardKey – Harvard University Authentication Service to obtain a customer account number followed by registering the app using the API portal’s Guides – Register an App | prod-common-portal
  5. Once the app is registered, you should be able to receive the corresponding API key (not the OpenAI key) from HUIT. 
  6. The API key is already associated with the enterprise agreement that Harvard has with OpenAI, so there is no need to get it associated by sending HUIT an email as mentioned in Option# 1.
  7. Use this API key to install OpenAI on the cluster using the instructions on https://github.com/fasrc/User_Codes/tree/master/AI/OpenAI 

Team Account:

This feature on the API portal allows developer teams to “own” an API consumer app, instead of individuals. See Guides – Create a Team | prod-common-portal

Following are the steps a PI can take to create a team and add members to it so that they can access the API key associated with the “team” account. The owner can manage the team (add new people as time goes on, or remove them).  Each member of the team can log into the portal to access the API key for their app. 

Please be sure to enter the email addresses carefully in the team setup, and they should be the email addresses associated with their HarvardKey.

  1. Create the team and list each developer.
  2. Register the app and select the desired team as the app owner.
  3. In the app registration, select the APIs you want access to.

Note: FASRC has not verified the use of API key for installing the OpenAI software on the cluster with Option# 2.

Reminder: All our users are allowed to work with data classified as Level 2 on Cannon and Level 3 on FASSE. The member will also need to be made aware of the data classification level that is attached to a certain GAI tool and its subsequent use on the cluster. The university has guidelines on what sort of data can and cannot be exposed to third party systems. Please see: Guidelines for Using ChatGPT and other Generative AI tools at Harvard.

Resources:

  1. AI @ The FAS 
  2. Generative AI @ Harvard 
  3. Generative Artificial Intelligence (AI) Guidelines | Harvard University Information Technology  
  4. Comparison of Generative AI Tools at Harvard
  5. HUIT AI Services – OpenAI Direct API
  6. Guides – Create a Team | prod-common-portal
VSCode Remote Development via SSH and Tunnel
https://docs.rc.fas.harvard.edu/kb/vscode-remote-development-via-ssh-or-tunnel/

This document provides the steps needed to set up a remote connection between your local VS Code and the Cannon cluster using two approaches: SSH and Tunnel. These options can be used to carry out remote development work on the cluster using VS Code, with seamless integration of your local environment and cluster resources.

Important:
We encourage all our users to utilize the Open OnDemand (OOD) web interface of the cluster to launch VS Code when remote development work is not required. The instructions to launch VS Code using the Remote Desktop app are given here.

Prerequisites

  1. Recent version of VS Code installed for your local machine
  2. Remote Explorer and Remote SSH extensions installed, if not already present by default

FASRC Recommendation

Based on our internal evaluation of the three approaches mentioned below, and on interactions with our user community, we recommend using Approach I: Remote – Tunnel via batch job over the other two to launch VSCode on a compute node. This approach submits a batch job to the scheduler on the cluster, providing resilience against network glitches that could disrupt a VSCode session on a compute node launched using Approach II or III.

Note: We limit our users to a maximum of 5 login sessions, so be aware of the number of VSCode instances you spawn on the cluster.


Approach I: Remote – Tunnel via batch job

Note: The method described here and in Approach II will launch a single VS Code session at a time for a user on the cluster. The Remote – Tunnel approaches do not support concurrent sessions on the cluster for a user. 

In order to establish a remote tunnel between your local machine and the cluster as an sbatch job, execute the following steps.

  1. Copy the vscode.job script:
    #!/bin/bash
    #SBATCH -p test         # partition. Remember to change to a desired partition
    #SBATCH --mem=4g        # memory in GB
    #SBATCH --time=04:00:00 # time in HH:MM:SS
    #SBATCH -c 1            # number of cores
    
    set -o errexit -o nounset -o pipefail
    MY_SCRATCH=$(TMPDIR=/scratch mktemp -d)
    echo $MY_SCRATCH
    
    #Obtain the tarball and untar it in $MY_SCRATCH location to obtain the
    #executable, code, and run it using the provider of your choice
    curl -Lk 'https://code.visualstudio.com/sha/download?build=stable&os=cli-alpine-x64' | tar -C $MY_SCRATCH -xzf -
    
    #VSCODE_CLI_DISABLE_KEYCHAIN_ENCRYPT=1 $MY_SCRATCH/code tunnel user login --provider github
    VSCODE_CLI_DISABLE_KEYCHAIN_ENCRYPT=1 $MY_SCRATCH/code tunnel user login --provider microsoft
    
    #Accept the license terms & launch the tunnel
    $MY_SCRATCH/code tunnel --accept-server-license-terms --name cannontunnel

The vscode.job script uses the microsoft provider, which authenticates via HarvardKey. If you would like to change the authentication method to github, substitute microsoft -> github in the vscode.job script accordingly.

  2. Submit the job from a private location (somewhere that only you have access to, for example your $HOME directory) from which others can’t see the log file

    $ sbatch vscode.job
  3. Look at the end of the output file
    $ tail -f slurm-32579761.out
    ...
    To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code ABCDEFGH to authenticate.
    Open a web browser, enter the URL, and the code. After authentication, wait a few seconds to a minute, and print the output file again:
    $ tail slurm-32579761.out
    *
    * Visual Studio Code Server
    *
    * By using the software, you agree to
    * the Visual Studio Code Server License Terms (https://aka.ms/vscode-server-license) and
    * the Microsoft Privacy Statement (https://privacy.microsoft.com/en-US/privacystatement).
    *
    Open this link in your browser https://vscode.dev/tunnel/cannon/n/home01/jharvard/vscode
  4. Now, you have two options
    1. Use a web client by opening vscode.dev link from the output above on a web browser.
    2. Use vscode local client — see below

Using vscode local client (option #2)

  1. In your local vscode (in your own laptop/desktop), add the Remote Tunnel extension (ms-vscode.remote-server)
    1. On the local VSCode, install Remote Tunnel extension
    2. Click on the VS Code Account menu and choose “Turn on Remote Tunnel Access”
  2. Connect to the cluster:
    1. Click on the bottom right corner
    2. Options will appear on the top text bar
    3. Select “Connect to Tunnel…”
  3. Then choose the authentication method that you used in vscode.job, microsoft or github
  4. Click on the Remote Explorer icon and pull up the Remote Tunnel drop-down menu
  5. Click on cannontunnel to get connected to the remote machine either in the same VS Code window (indicated by ->) or a new one (icon beside ->).
    Prior to clicking, make sure you see: Remote -> Tunnels -> cannontunnel running
  6. Finally, when you get vscode connected, you can also open a terminal on vscode that will be running on the compute node where your submitted job is running.

Enjoy your work using your local VSCode on the compute node.


Approach II: Remote – Tunnel interactive

In order to establish a remote tunnel between your local machine and the cluster as an interactive job, execute the following steps. Remember to replace <username> with your FASRC username.

  1. ssh <username>@login.rc.fas.harvard.edu

  2. curl -Lk 'https://code.visualstudio.com/sha/download?build=stable&os=cli-alpine-x64' --output vscode_cli.tar.gz

  3. tar -xzf vscode_cli.tar.gz

  4. An executable, code, will be generated in your current working directory. Either keep it in your $HOME or move it to your LABS folder, e.g.
    1. mv code /n/holylabs/rc_admin/Everyone/
  5. Add the path to your ~/.bashrc so that the executable is always available to you regardless of the node you are on, e.g.,
    1. export PATH=/n/holylabs/rc_admin/Everyone:$PATH
  6. Save ~/.bashrc, and on the terminal prompt, execute the command: source ~/.bashrc
  7. Go to a compute node, e.g.: salloc -p gpu_test --gpus 1 --mem 10000 -t 0-01:00
  8. Execute the command: code tunnel
  9. Follow the instructions on the screen and log in using either your Github or Microsoft account, e.g.: Github Account
  10. To grant access to the server, open the URL https://github.com/login/device and copy-paste the code given on the screen
  11. Name the machine, e.g.: cannoncompute
  12. Open the link that appears in your local browser and follow the authentication process as mentioned in steps# 3 & 4 of https://code.visualstudio.com/docs/remote/tunnels#_using-the-code-cli
  13. Once the authentication is complete, you can either open the link that appears on the screen on your local browser and run VS Code from there or launch it locally as mentioned below.
  14. On the local VSCode, install Remote Tunnel extension
  15. Click on the VS Code Account menu and choose “Turn on Remote Tunnel Access”
  16. Click on cannoncompute to get connected to the remote machine either in the same VS Code window (indicated by ->) or a new one (icon beside ->). Prior to clicking, make sure you see:
    Remote -> Tunnels -> cannoncompute running
The remote tunnel access should be on and the tunnel should come up as running prior to starting the work on the compute node.

Note: Every time you access a compute node, the executable, code, will be in your path. However, you will have to repeat step#10 before executing step#16 above in order to start a fresh tunnel. 


Approach III: Remote – SSH

In order to connect remotely to the cluster using VS Code, you need to edit the SSH configuration file on your local machine.

  • For Mac OS and Linux users, the file is located at ~/.ssh/config. If it’s not there, then create a file with that name.
  • For Windows users, the file is located at C:\Users\<username>\.ssh\config. Here, <username> refers to your local username on the machine. Same as above, if the file is not present, then create one.

There are two ways to get connected to the cluster remotely:

  1. Connect to the login node using VS Code.
    Important: This connection must be used for writing &/or editing your code only. Please do not use this connection to run Jupyter notebook or any other script directly on the login node.
  2. Connect to the compute node using VS Code.
    Important: This connection can be used for running notebooks and scripts directly on the compute node. Avoid using this connection for writing &/or editing your code as this is a non-compute work, which can be carried out from the login node. 

SSH configuration file

Login Node

Adding the following to your SSH configuration file will let you connect to the login node of the cluster only with the Single Sign-On option enabled. The name of the Host here is chosen to be cannon but you can name it to whatever you like, e.g., login or something else. In what follows, replace <username> with your FASRC username.

For Mac:

Host cannon
User <username>
HostName login.rc.fas.harvard.edu
ControlMaster auto
ControlPath ~/.ssh/%r@%h:%p

For Windows:

The SSH ControlMaster option for single sign-on is not supported for Windows. Hence, Windows users can only establish a connection to the login node by either disabling the ControlMaster option or not having that at all in the SSH configuration file, as shown below:

Host cannon
User <username>
HostName login.rc.fas.harvard.edu
ControlMaster no
ControlPath none

or

Host cannon
User <username>
HostName login.rc.fas.harvard.edu

Compute Node

In order to connect to the compute node of the cluster directly, execute the following two steps on your local machine:

Note: Establishing a remote SSH connection to a compute node via VSCode works only for Mac OS. For Windows users, this option is not supported and we recommend they utilize the Remote-Tunnel Approaches I or II for launching VSCode on a compute node.

  1. Generate a pair of public and private SSH keys for your local machine, if you have not done so previously, and add the public key to the login node of the cluster:
    In the ~/.ssh folder of your local machine, see if id_ed25519.pub is present. If not, then generate private and public keys using the command:

    ssh-keygen -t ed25519

    Then submit the public key to the cluster using the following command:

    ssh-copy-id -i ~/.ssh/id_ed25519.pub <username>@login.rc.fas.harvard.edu

    This will append your local public key to ~/.ssh/authorized_keys in your home directory ($HOME) on the cluster so that your local machine is recognized.

  2. Add the following to your local ~/.ssh/config file by replacing <username> with your FASRC username. Make sure that the portion for connecting to the login node from above is also present in your SSH configuration file. You can edit the name of the Host to whatever you like or keep it as compute. There are two ProxyCommand examples shown here to demonstrate how the ProxyCommand can be used to launch a job on a compute node of  the cluster with a desired configuration of resources through the salloc command. Uncommenting the first one will launch a job on the gpu_test partition of the Cannon cluster whereas uncommenting the second one will launch it on the test partition.

Host compute
UserKnownHostsFile=/dev/null
ForwardAgent yes
StrictHostKeyChecking no
LogLevel ERROR
# substitute your username here
User <username>
RequestTTY yes
# Uncomment the command below to get a GPU node on the gpu_test partition. Comment out the 2nd ProxyCommand
#ProxyCommand ssh -q cannon "salloc --immediate=180 --job-name=vscode --partition gpu_test --gres=gpu:1 --time=0-01:00 --mem=4GB --quiet /bin/bash -c 'echo $SLURM_JOBID > ~/vscode-job-id; nc \$SLURM_NODELIST 22'"

# Uncomment the command below to get a non-GPU node on the test partition. Comment out the 1st ProxyCommand
ProxyCommand ssh -q cannon "salloc --immediate=180 --job-name=vscode --partition test --time=0-01:00 --mem=4GB --quiet /bin/bash -c 'echo $SLURM_JOBID > ~/vscode-job-id; nc \$SLURM_NODELIST 22'"

Note: Remember to change the Slurm directives, such as --mem, --time, --partition, etc., in the salloc command based on your workflow and how you plan to use the VSCode session on the cluster. For example, if the program you are trying to run needs more memory, it is best to request that amount of memory using the --mem flag in the salloc command prior to launching the VSCode session; otherwise the session could fail with an Out Of Memory error.

Important: Make sure to pass the name of the Host being used for the login node to the ProxyCommand for connecting to a compute node. For example, here we have named the Host cannon for connecting to the login node. The same name, cannon, is then passed to the ProxyCommand to establish a connection to a compute node via ssh. Passing any other name to ProxyCommand ssh -q would result in a connection-not-established error.

SSH configuration file with details for establishing connection to the login (cannon) and compute (vscode/compute) node.

Once the necessary changes have been made to the SSH configuration file, open VS Code on your local machine and click on the Remote Explorer icon on the bottom left panel. You will see two options listed under SSH – cannon and compute (or whatever name you chose for the Host in your SSH configuration file).

Option to connect to the login (cannon) or compute (vscode) node under SSH after clicking on the Remote Explorer icon.

Connect using VS Code

Login Node

Click on the cannon option and select whether you would like to continue in the same window (indicated by ->) or open a new one (icon next to ->). Once selected, enter your 2FA credentials on the VS Code’s search bar when prompted. For the login node, a successful connection would look like the following.

Successful connection to the login node showing $HOME under Recent, nothing in the output log, and the Status bar on the lower left corner would show SSH:cannon.

 

Compute Node

In order to establish a successful connection to Cannon’s compute node, we need to be mindful that VS Code requires two connections to open a remote window (see the section “Connecting to systems that dynamically assign machines per connection” in VS Code’s Remote Development Tips and Tricks). Hence, there are two ways to achieve that.

Option 1

First, open a connection to cannon in a new window on VS Code by entering your FASRC credentials and then open another connection to compute/vscode on VS Code either as a new window or continue in the current window. You will not have to enter your credentials again to get connected to the compute node since the master connection is already enabled through the cannon connection that you initiated earlier on VS Code.

Successful connection to the compute node with the Status bar showing the name of the host it is connected to and under SSH, “connected” against that name.
Option 2

If you don’t want to open a new connection to cannon, then open a terminal on your local machine and type the following command, as mentioned in our Single Sign-on document, and enter your FASRC credentials to establish the master connection first.

ssh -CX -o ServerAliveInterval=30 -fN cannon

Then open VS Code and directly click on compute/vscode to get connected to the compute node. Once a successful connection is established, you should be able to run your notebook or any other script directly on the compute node using VS Code.

Note: If you have a stale SSH connection to cannon running in the background, it could pose potential problems. The session could be killed in the following manner.

$ ssh -O check cannon
Master running (pid=#)
$ ssh -O exit cannon
Exit request sent.
$ ssh -O check cannon
Control socket connect(<path-to-connection>): No such file or directory

VSCode For FASSE

You cannot use Remote SSH to establish a connection between your local device and a FASSE compute node via VSCode, as salloc is not permitted on FASSE for security reasons. However, you can use Remote SSH to connect to FASSE’s login node using VSCode.

For establishing a connection to the compute node on FASSE, launching a Remote Tunnel using Approach I is your only option. However, prior to doing that, make sure you enable access to the internet to launch Remote Tunnel on your browser. Additionally, once the tunnel is established, remember to install extensions as needed prior to starting your work using VSCode.


Add Folders to Workspace on VSCode Explorer

Once you are able to successfully launch a VSCode session on the cluster using one of the approaches mentioned above, you might need to access various folders on the cluster to execute your workflow. You can do that using the Explorer feature of VSCode. However, on the VSCode remote instance, when you click on Explorer -> Open Folder, it will open $HOME by default, as shown below.

In order to add another folder to your workspace, especially housed in locations such as netscratch, holylabs, holylfs, etc., do the following:

  1. Type >add in the VSCode Search-Welcome bar and choose Workspaces: Add Folders to Workspace...
  2. If you would like to add your folder on netscratch or holylabs or some such location, first open a terminal on the remote instance of VSCode and type that path. Copy the entire path and then paste it into the Search-Welcome bar. Do not start typing the path in the Search-Welcome bar; make sure to copy-paste the full path, otherwise VS Code may hang while attempting to list all the subdirectories of that location, e.g., /n/netscratch.

  3. Click ok to add that folder to your workspace
  4. On the remote instance, you will be prompted to answer whether you trust the authors of the files in this folder and then Reload. Go ahead and click yes (if you truly trust them), and let the session reload.
  5. Now you will be able to see your folder listed under Explorer as Untitled (Workspace)
  6. This folder would be available to you in your workspace for as long as the current session is active. For new sessions, repeat steps #1-4 to add desired folder(s) to your workspace

Best Practices

  • Maximum of 5 login sessions are allowed per user at a time. Be aware of the number of VS Code instances you spawn on the cluster 
  • Login node session 
    • Use for writing &/or editing your code only 
    • Do not use it to run Jupyter notebook, R, Matlab, or any other script
  • Compute node session
    • Use for running notebooks & scripts 
    • Avoid using for writing &/or editing your code as this is a non-compute work
  • For interactive sessions, better to be on VPN to get stable connection
  • Remember to close jobs that are launched through interactive or sbatch VSCode sessions as follows:
    • click on the icon next to Launchpad on VSCode GUI
    • click on “Close Remote Connection”
  • For Remote Tunnel session, if VS Code work is complete, in addition to above, execute the following to ensure that the Slurm job is also cancelled:
    • squeue -u <username>
    • scancel <JOBID>

Troubleshooting VSCode Connection Issues

  1. Make sure that you are on VPN to get a stable connection.
  2. Match the name being used in the SSH command to what was declared under “Host” in your SSH config file for the login node.
  3. Make sure that the --mem flag has been used in the ProxyCommand in your SSH config file and that enough memory is being allocated to your job. If you already have it, then try increasing it to see if that works for you.
  4. Open a terminal and try connecting to a login and compute node (if on Mac) by typing: ssh <Host> (replace Host with the corresponding names used for login and compute nodes). If you get connected, then your SSH configuration file is set properly.
  5. Consider commenting out conda initialization statements in ~/.bashrc to avoid dealing with plausible issues caused due to initialization.
  6. Delete the bin folder from .vscode-server or .vscode, and/or remove the Cache, CachedData, CachedExtensionsVSIXs, Code Cache, etc. folders. You can find these on Cannon in $HOME/.vscode/data and/or $HOME/.vscode-server/data/.
  7. Check your $HOME quota and remove plausible culprits, such as ~/.cache. See instructions to clear disk space on Home directory full.
  8. Make sure that there are no lingering SSO connections. See the Note at the end of Approach III – Remote SSH section.
  9. Failed to parse remote port: Try removing .lockfiles.
  10. Try implementing Approach I – Remote Tunnel via batch job to see if you are able to launch a Remote Tunnel as an sbatch job in the background to ensure that your work is not getting disrupted by any network glitches.
  11. If you continue to have problems, consider coming to our office hours to troubleshoot this live.

Dev Containers for custom software environments

VS Code supports opening a source code repository in a container. The container software environment is defined in a JSON file that adheres to the Development Containers (dev containers) specification (https://containers.dev/). The same git repository can be used to develop and run code in the same software environment on the FASRC cluster, a GitHub Codespace, or your laptop.

Using Dev Containers on the FASRC Cluster

It is recommended to (1) use VS Code in an OOD Remote Desktop session, or (2) use Approach I: Remote – Tunnel via batch job with a VS Code local client (the Visual Studio Code Dev Container extension is currently not supported in the web-browser-based VS Code (https://vscode.dev) with a remote tunnel (microsoft/vscode-remote-release issue #9059)).

Setup

  1. Install the Visual Studio Code Dev Containers extension
  2. Configure the extension to use Podman on the FASRC cluster:
    Change the following settings in Code > Settings > Extensions > Dev Containers:

    1. Dev > Containers: Docker Path – change “docker” to “podman”
    2. Dev > Containers: Docker Socket Path – change “/var/run/docker.sock” to “/tmp/podman-run-<uid>/podman/podman.sock”, replacing “<uid>” with your FASRC user ID. Your FASRC user ID can be determined by running the command “id -u” on the FASRC cluster:
      [jharvard@holylogin05 ~]$ id -u
      21442
      NOTE: if using a local VS Code (not OOD Remote Desktop), you will need to revert these changes in the future if you use dev containers via a local Docker installation
VS Code Dev Container extension configuration to use Podman on the FASRC cluster

Launch

See the VS Code dev container quick start for how to open a repository in a dev container.

Known Limitations

The following are known limitations when using VS Code Dev Containers on the FASRC cluster with Podman:

TensorFlow https://docs.rc.fas.harvard.edu/kb/tensorflow/ Wed, 11 Apr 2018 22:08:39 +0000 https://www.rc.fas.harvard.edu/?page_id=17937 Description

TensorFlow (TF) is an open-source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) that flow between them. This flexible architecture lets you deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device without rewriting code.

TensorFlow was originally developed by researchers and engineers working on the Google Brain team within Google’s Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research. The system is general enough to be applicable in a wide variety of other domains, as well.

Installation:

The instructions below are intended to help you set up TF on the FASRC cluster.

GPU Version

This example illustrates the installation of TF version 2.16.1 with Python 3.10, CUDA 12.1.0, and cuDNN 9.0.0.312. Please refer to our documentation on running GPU jobs on the FASRC cluster.

The two recommended methods for setting up TF are installing it in a conda environment in your user space or using a TF singularity container.

Installing TF in a Conda Environment

You can install your own TF instance by following these steps:

# Load required software modules, e.g.,
module load python

# Create a new conda environment with Python:
mamba create -n tf2.16.1_cuda12.1 python=3.10 pip wheel

# Activate the new conda environment, e.g.,
source activate tf2.16.1_cuda12.1

# Install CUDA and cuDNN with conda/mamba and pip:
mamba install -c "nvidia/label/cuda-12.1.0" cuda-toolkit=12.1.0
pip install nvidia-cudnn-cu12==9.0.0.312

# Configure the system paths. You can do it with the following command every time you start a new terminal after activating your conda environment:
CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/:$CUDNN_PATH/lib

# For your convenience, it is recommended that you automate it with the following commands. The system paths will be automatically configured when you activate this conda environment:
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/:$CUDNN_PATH/lib' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh

# Install extra packages required for data analytics, e.g.,
pip install scipy pandas matplotlib seaborn h5py jupyterlab jupyterlab-spellchecker scikit-learn

# Install TF plus required GPU libraries with pip, e.g.,
pip install --upgrade tensorflow[and-cuda]==2.16.*

# Set up the KERAS backend (required for KERAS version 3.0 and above)
export KERAS_BACKEND="tensorflow"

NOTE: Starting with version 2.16.1, TF includes KERAS version 3.0. Please refer to the TensorFlow 2.16.1 release notes for important changes.
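
To confirm that the new environment can actually see the GPU, you can run a quick sanity check from a GPU node, e.g.:

python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"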

Pull a TF Singularity Container

Alternatively, one can pull and use a TensorFlow singularity container:

singularity pull --name tf2.16.1_gpu.simg docker://tensorflow/tensorflow:2.16.1-gpu

This will result in the image tf2.16.1_gpu.simg. The image can then be used, e.g.:

$ KERAS_BACKEND="tensorflow" singularity exec --nv tf2.16.1_gpu.simg python3
Python 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os; os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
>>> import tensorflow as tf
>>> print(tf.__version__)
2.16.1
>>> print(tf.reduce_sum(tf.random.normal([1000, 1000])))
tf.Tensor(1365.5554, shape=(), dtype=float32)
>>> print(tf.config.list_physical_devices('GPU'))
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]

Note: The --nv option is required to make use of the NVIDIA GPU card on the host system. The KERAS_BACKEND="tensorflow" environment variable is required to set the KERAS backend to TF.

Alternatively, you can pull a container from the NVIDIA NGC Catalog, e.g.,

singularity pull docker://nvcr.io/nvidia/tensorflow:24.03-tf2-py3

This will result in the image tensorflow_24.03-tf2-py3.sif, which has TF version 2.15.0.

The NGC catalog provides access to optimized containers for many popular applications.

CPU Version

As with the GPU installation, you can either install TF in a conda environment or use a TF singularity container.

Installing TF in a Conda Environment

# (1) Load required software modules
module load python/3.10.13-fasrc01

# (2) Create conda environment
mamba create -n tf2.16.1_cpu python=3.10 pip wheel

# (3) Activate the conda environment
source activate tf2.16.1_cpu

# (4) Install required packages for data analytics, e.g.,
mamba install -c conda-forge numpy scipy pandas matplotlib seaborn h5py jupyterlab jupyterlab-spellchecker scikit-learn

# (5) Install a CPU version TF with pip
pip install --upgrade tensorflow-cpu==2.16.*

# (6) Set up KERAS backend to use TF
export KERAS_BACKEND="tensorflow"
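
To verify the CPU installation, you can run a quick check from the activated environment, e.g.:

python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"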

Pull a TF Singularity Container

singularity pull --name tf2.16.1_cpu.simg docker://tensorflow/tensorflow:2.16.1

This will result in the image tf2.16.1_cpu.simg. The image can then be used, e.g.:

KERAS_BACKEND="tensorflow" singularity exec tf2.16.1_cpu.simg python3 -c "import os; os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'; import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
tf.Tensor(2878.413, shape=(), dtype=float32)

Running TensorFlow:

Run TensorFlow Interactively

For an interactive session to work with the GPUs, you can use the following:

salloc -p gpu_test -t 0-06:00 --mem=8000 --gres=gpu:1

While on the GPU node, you can run nvidia-smi to get information about the assigned GPUs.

[username@holygpu7c26306 ~]$ nvidia-smi
Fri Apr  5 16:00:55 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:E3:00.0 Off |                   On |
| N/A   25C    P0              46W / 400W |    259MiB / 40960MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    2   0   0  |              37MiB / 19968MiB  | 42      0 |  3   0    2    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Load required modules, and source your TF environment:

[username@holygpu7c26306 ~]$ module load python/3.10.13-fasrc01  && source activate tf2.16.1_cuda12.1 
(tf2.16.1_cuda12.1) [username@holygpu7c26306 ~]$ 

Test TF:

(Example adapted from here.)

(tf2.16.1_cuda12.1) [username@holygpu7c26306 ~]$ python tf_test.py
2.16.1
Epoch 1/10
1875/1875 ━━━━━━━━━━━━━━━━━━━ 5s 839us/step - accuracy: 0.7867 - loss: 0.6247  
Epoch 2/10
1875/1875 ━━━━━━━━━━━━━━━━━━━ 2s 829us/step - accuracy: 0.8600 - loss: 0.3855
Epoch 3/10
1875/1875 ━━━━━━━━━━━━━━━━━━━ 3s 827us/step - accuracy: 0.8788 - loss: 0.3373   
Epoch 4/10
1875/1875 ━━━━━━━━━━━━━━━━━━━ 2s 831us/step - accuracy: 0.8852 - loss: 0.3124
Epoch 5/10
1875/1875 ━━━━━━━━━━━━━━━━━━━ 2s 828us/step - accuracy: 0.8912 - loss: 0.2915
Epoch 6/10
1875/1875 ━━━━━━━━━━━━━━━━━━━ 3s 830us/step - accuracy: 0.8961 - loss: 0.2773   
Epoch 7/10
1875/1875 ━━━━━━━━━━━━━━━━━━━ 2s 828us/step - accuracy: 0.9025 - loss: 0.2625
Epoch 8/10
1875/1875 ━━━━━━━━━━━━━━━━━━━ 2s 830us/step - accuracy: 0.9044 - loss: 0.2606
Epoch 9/10
1875/1875 ━━━━━━━━━━━━━━━━━━━ 2s 828us/step - accuracy: 0.9081 - loss: 0.2489
Epoch 10/10
1875/1875 ━━━━━━━━━━━━━━━━━━━ 2s 829us/step - accuracy: 0.9109 - loss: 0.2405
313/313 - 2s - 6ms/step - accuracy: 0.8804 - loss: 0.3411

Test accuracy: 0.8804000020027161
313/313 ━━━━━━━━━━━ 1s 1ms/step   
[1.0222636e-07 7.9844620e-09 4.7857565e-11 5.2755653e-09 2.7131367e-10
 2.1757800e-04 5.9717085e-09 6.6847289e-03 4.5007189e-07 9.9309713e-01]

In the above example, we used the test code tf_test.py from User Codes.
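
The exact contents of tf_test.py live in the User Codes repository; below is a minimal sketch consistent with the output above (Fashion-MNIST classification adapted from the TensorFlow/Keras beginner tutorial; the layer sizes and epoch count are assumptions):

import tensorflow as tf

print(tf.__version__)

# Load and normalize the Fashion-MNIST dataset (60,000 training images)
(train_images, train_labels), (test_images, test_labels) = \
    tf.keras.datasets.fashion_mnist.load_data()
train_images, test_images = train_images / 255.0, test_images / 255.0

# A simple fully connected classifier
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# Train for 10 epochs and evaluate on the held-out test set
model.fit(train_images, train_labels, epochs=10)
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
print("\nTest accuracy:", test_acc)

# Convert logits to probabilities and print them for the first test image
probability_model = tf.keras.Sequential([model, tf.keras.layers.Softmax()])
predictions = probability_model.predict(test_images)
print(predictions[0])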

TensorFlow Singularity Image from Definition File

You may pull a singularity TensorFlow version 2.12.0 image with the command below:

# Pull a singularity container with version 2.12.0
singularity pull --name tf2.12_gpu.simg docker://tensorflow/tensorflow:2.12.0-gpu

This image comes with a number of basic Python packages. If you need additional packages, you could use the example singularity definition file tf-2.12.def to build the singularity image:

Bootstrap: docker
From: tensorflow/tensorflow:2.12.0-gpu

%post
    pip install --upgrade pip
    pip install matplotlib
    pip install seaborn
    pip install scipy
    pip install scikit-learn
    pip install jupyterlab
    pip install notebook

You could install additional packages directly in the image with pip by adding them to the %post section of the definition file, as illustrated above. Please refer to our documentation on how to build singularity images from definition files.
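
For instance, assuming the definition file above is saved as tf-2.12.def, the image could be built with a command along these lines (building from a definition file typically requires the --fakeroot option or a dedicated build host; see the documentation referenced above):

singularity build --fakeroot tf-2.12.simg tf-2.12.def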

GPU Computing on the FASRC cluster https://docs.rc.fas.harvard.edu/kb/gpgpu-computing-on-the-cluster/ Tue, 05 Jan 2010 15:59:49 +0000 http://rc-dev.rc.fas.harvard.edu/gpgpu-computing-on-odyssey/ Introduction

The FASRC cluster has a number of nodes that have NVIDIA general-purpose graphics processing units (GPGPUs) attached to them. It is possible to use CUDA tools to run computational work on them and, in some use cases, see very significant speedups. Details on public partitions can be found here. For SEAS users, please check here for available partitions.

Training Session: GPU Computing on FASRC Clusters

You can download training slides from here.

Usage

GPGPUs with sbatch

To request a single GPU in Slurm, just add #SBATCH --gres=gpu to your submission script and it will give you access to a GPU. To request multiple GPUs, add #SBATCH --gres=gpu:n, where n is the number of GPUs. You can use this method to request both CPUs and GPGPUs independently. So if you want 1 CPU and 2 GPUs from our general-use GPU nodes in the ‘gpu’ partition, you would specify:

#SBATCH -p gpu
#SBATCH -n 1
#SBATCH --gres=gpu:2

When you submit a GPU job, SLURM automatically selects some GPUs and restricts your job to those GPUs. In your code you reference those GPUs using zero-based indexing from [0, n), where n is the number of GPUs requested. For example, if you’re using a GPU-enabled TensorFlow build and requested 2 GPUs, you would simply reference gpu:0 or gpu:1 from your code.

To request a specific type of GPU, you would need to add #SBATCH --gres=gpu:name:n where name is substituted for the GPU model being requested. The GPU models currently available on our cluster can be found here. See the official Nvidia website for more details.
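
For example, a request for two GPUs of a specific model might look like the following (the model string a100 here is hypothetical; use one of the model names listed in the link above):

#SBATCH -p gpu
#SBATCH -n 1
#SBATCH --gres=gpu:a100:2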

Interactive Sessions

For an interactive session to work with the GPUs, you can use the following:

salloc -p gpu_test -t 0-01:00 --mem 8000 --gres=gpu:1

While on the GPU node, you can run nvidia-smi to get information about the assigned GPUs.

The partition gpu_requeue is a backfill partition similar to serial_requeue and will allow you to submit jobs to idle GPU-enabled nodes. Please note that the hardware in that partition is heterogeneous. SLURM is aware of the model name and compute capability of the GPU devices each compute node has.

Name or compute capability can be requested as a constraint in your job submission. When running in gpu_requeue, nodes with a specific model can be selected using --constraint=modelname, or, more generally, nodes offering a card with a specific compute capability can be selected using --constraint=ccx.x (e.g., --constraint=cc7.0 for compute capability 7.0).

For example, if your code needs to run on devices with at least compute capability 3.7, you would specify:

#SBATCH -p gpu_requeue 
#SBATCH -n 1 
#SBATCH --gres=gpu:1
#SBATCH --constraint=cc3.7

CUDA Runtime

The version of the Nvidia driver installed on the GPU-enabled nodes may vary over time, so it is best to request an interactive job and then run nvidia-smi:

Tue Jun 8 06:12:32 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Tesla V1...   On  | 00000000:06:00.0 Off |                    0 |
| N/A   44C    P0    31W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

To load the toolkit and additional runtime libraries (cublas, cufftw, …), remember to always load the cuda module in your Slurm job script or interactive session.

$ module-query cuda

$ module load cuda/<version>

NOTE: In the past our CUDA installations were heterogeneous and different nodes on the cluster would provide different versions of the CUDA driver. For this reason you might have used the Slurm flag --constraint=cuda-$version (for example --constraint=cuda-7.5) in your job submissions to specifically request nodes supporting that version. This is no longer needed, as our cuda modules are the same throughout the cluster, and you should remove those flags from your scripts.

Using CUDA-dependent modules

CUDA-dependent applications are accessed on the cluster in a manner that is similar to compilers and MPI libraries. For these applications, a CUDA module must first be loaded before an application is available. For example, to use cuDNN, a CUDA-based neural network library from NVIDIA, the following command will work:

$ module load cuda/11.1.0-fasrc01 cudnn/8.0.4.30_cuda11.1-fasrc01

If you don’t load the CUDA module first, the cuDNN module is not available.

$ module purge
$ module load cudnn/8.0.4.30_cuda11.1-fasrc01
Lmod has detected the following error:
The following module(s) are unknown: “cudnn/8.0.4.30_cuda11.1-fasrc01”
Please use the command module-query or our user Portal to find available versions and how to load them.
More information on software modules can be found here, and how to run jobs here.

Example Codes

We experiment with different libraries based on user requests and try to document simple examples for our users.

Please visit https://github.com/fasrc/User_Codes

Note: Codes using CUDA need to be compiled on a GPU node. Additionally, installing GPU-enabled software, such as TensorFlow & PyTorch, must be carried out on a GPU node.

Performance Monitoring

Nvidia

Besides nvidia-smi and nvtop, Nvidia also provides Nsight and Data Center GPU Manager (DCGM) for monitoring job performance.  You can find a walkthrough on how to use DCGM here.  It is recommended to name the GPU group something other than allgpus as they do in their example.

Weights & Biases

Weights & Biases is also an excellent tool for monitoring job performance. It can display per-job plots of your job performance. Follow their guides to learn how to add monitoring to your jobs.
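
A minimal sketch of what instrumenting a job with Weights & Biases might look like (assumes the wandb package is installed and you have logged in with wandb login; the project and metric names are placeholders):

import wandb

# Start a run in a placeholder project
run = wandb.init(project="my-gpu-job")

# Log a metric once per epoch; replace the dummy loss with your real one
for epoch in range(10):
    loss = 1.0 / (epoch + 1)  # dummy value for illustration
    wandb.log({"epoch": epoch, "loss": loss})

run.finish()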
