This policy defines FAS RC standards and procedures for the retention and deletion of research data, outputs, temporary files, and associated digital resources managed by the FAS RC in support of research activities.
Scope:
This policy applies to all research data stored, processed, or managed on servers, workstations, cloud resources, storage systems, or backup media provisioned by the FAS Research Computing Service Group.
Data Retention:
Following the departure of faculty from the University, the associated primary department will assume responsibility for the maintenance, storage, and cost of housing the remaining research data.
Principal Investigators (PIs) should notify FAS RC 60 days prior to their departure from the University, including the duration of any appointments (courtesy or associate), with instructions and next steps for remaining datasets.
For research data associated with completed or inactive research projects and/or departed faculty where no notice has been given to FAS RC as to where the research data should be stored:
The PI's Harvard-affiliated primary department becomes responsible for the storage and cost of the research data. Closure of the PI's group and project in FAS RC will be used to track compliance.
The research data will be retained in the source storage directory for 2 years following project completion or inactivity. Completion of a project occurs after:
final reporting to the research sponsor
final financial close-out of a sponsored research award segment
final publication of research results
cessation of academic or scientific activity on a specific research project, regardless of whether its results are published, whichever is later.
Following 2 years of inactivity, data will be migrated to FASRC Long-Term Storage. The data will be retained for an additional 5 years to meet the University Data Retention guidelines. Following the completion of 5 years, the data can be deleted. Departments will be notified via email prior to the deletion.
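The timeline above can be sketched as a small date calculation. This is a minimal illustration only, not an FASRC tool: the function name retention_dates is hypothetical, it assumes GNU date (for the -d option), and it takes the project completion or last-activity date as its only input.

```shell
# Hypothetical helper (not an FASRC tool): given a completion or
# last-activity date (YYYY-MM-DD), print the two policy milestones.
# Assumes GNU date, which accepts relative expressions with -d.
retention_dates() {
    local completed="$1"
    local migrate
    local deletable
    # 2 years in the source storage directory, then migration to long-term storage
    migrate=$(date -u -d "${completed} +2 years" +%F) || return 1
    # 5 additional years in long-term storage, after which deletion is permitted
    deletable=$(date -u -d "${completed} +7 years" +%F) || return 1
    echo "migrate_to_long_term_storage=${migrate}"
    echo "eligible_for_deletion=${deletable}"
}
```

For example, retention_dates 2024-06-30 reports a migration date of 2026-06-30 and a deletion-eligibility date of 2031-06-30. Actual dates are determined by FASRC and the department, and departments are notified by email before any deletion.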
Temporary and Scratch Storage:
Data stored in scratch or temporary directories may be deleted after 90 days without notice to maximize available resources.
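To see which of your files may be at risk, you could list files older than 90 days before they are removed. The following is a sketch under stated assumptions: the function name list_old_scratch_files is hypothetical, and the policy does not say which timestamp the cleanup uses, so modification time (mtime) is assumed here.

```shell
# Hypothetical helper: list regular files under a directory whose
# modification time is more than N full days old (default 90).
# Which timestamp the real cleanup uses is an assumption; mtime is used here.
list_old_scratch_files() {
    local dir="$1"
    local days="${2:-90}"
    find "$dir" -type f -mtime +"$days" -print
}

# Usage (replace the placeholder with your own scratch directory):
#   list_old_scratch_files /path/to/your/scratch/directory
```

Anything listed that you still need should be copied to a non-scratch location.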
Deletion Procedures:
Faculty and/or departments will be notified in advance of research data being deleted, per the timelines above. If PIs or Faculty are no longer associated with the University, the relevant department leadership will be notified via email.
Data will be deleted using secure erasure methods in accordance with institutional IT security standards.
Requests for retention extension can be made in writing and are subject to approval by FASRC and the department; individuals requesting the extension will be responsible for all associated storage costs.
Ownership and Roles:
University: Harvard University owns all research data generated through projects conducted under its authority or using its resources. While PIs and researchers manage and safeguard the data, the University is ultimately responsible for compliance with legal and sponsor requirements, ensuring confidentiality and security.
Principal Investigators: Principal Investigators (PIs) are stewards of research data. If PIs choose to delegate responsibility within their research groups, the PI remains accountable to the University for stewardship of the data. Principal Investigators are responsible for ensuring proper data management, storage, and accessibility, meeting all University, legal, and sponsor requirements. This involves setting up procedures for data retention, confidentiality, and sharing while respecting data use agreements.
Departments: In the case that a PI has left the University without delegating responsibility for data, the associated primary department of the departed PI takes on the role of steward.
Researchers: Harvard community members who assist with management of data created, analyzed, and stored on FAS RC systems.
FAS RC: Responsible for executing deletions as outlined, maintaining logs of deletion actions, and responding to extension or exception requests.
Policy Review:
This policy will be reviewed and updated annually or as required by regulatory or operational changes.
Cluster storage offered and maintained by FASRC should only be used for research taking place on FASRC clusters.
Examples of data that can be stored on FASRC storage are:
Datasets
Code
Scientific software
Research results
Examples of data that should not be stored on FASRC storage include:
Clerical or lab administrative data
Data related to personnel, grant proposals, business operations, or general lab management
Data with personally identifiable or financial information
FASRC storage filesystems are only approved for Data Security Level 1 (DSL1) and DSL2 research data on the Cannon cluster. DSL3 data must be stored in the approved FASSE cluster project. Research data containing information classified as DSL 4 must be stored on an appropriate storage solution that is approved for DSL4 sensitive data.*
*A limited number of DSL4 projects exist in their own isolated environments
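The mapping above can be summarized as a simple lookup. This is an illustrative sketch only: the function name dsl_storage_location is hypothetical, and the strings it returns are descriptive labels taken from the text above, not real paths or commands.

```shell
# Illustrative lookup of where data at a given Data Security Level may be
# stored, per the text above. The returned strings are labels, not paths.
dsl_storage_location() {
    case "$1" in
        1|2) echo "Cannon (FASRC storage filesystems)" ;;
        3)   echo "FASSE (approved FASSE project storage)" ;;
        4)   echo "Storage solution approved for DSL4 data" ;;
        *)   echo "Not covered here: consult the Harvard Data Security Levels page" >&2
             return 1 ;;
    esac
}
```

For example, dsl_storage_location 3 prints the FASSE label, while any level not addressed above returns a non-zero status.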
If it comes to the attention of FASRC staff that non-research-related data is being stored on FASRC systems, we will alert the lab’s PI.
Harvard groups data into 5 data security levels depending on the sensitivity of the data. The DSL for data determines how that data must be managed.
DISCLAIMER: The information on this page relates only to the FASRC clusters and our current understanding of Harvard policy. Please refer to the Harvard Security Data Security Levels page for up-to-date university policies and information.
Web scraping is a contentious issue within research. While it is true that fair use provides for many uses of data gleaned from the Internet, in general this is applied to human information gathering, not programmatic machine scraping. That distinction makes the act of brute-force scraping an issue separate from fair use.
You, as a representative of Harvard, are not just using the source’s data, but also their servers, bandwidth, etc. in a way the source may not approve. This can lead to IP blacklisting and even legal action. So please tread carefully as your actions could negatively affect others.
Please be aware that merely being involved in academic pursuits does not exempt you from the usage policies of social media and other Internet platforms like Facebook, Twitter, etc.
Sensitive Data
If the data you are acquiring is considered sensitive, confidential, or contains human data, you will need to have this data reviewed for compliance before placing it on the FASRC cluster. If in doubt, you should always err on the side of caution and contact the Office of the Vice Provost for Research.
If your research requires you to scrape content from the web, please review the following guidelines and suggestions.
We highly discourage using the cluster itself to scrape data. Due to its size and ease of parallelization of processes, the cluster is easily weaponized and your actions could have consequences for other researchers. Please seek another avenue for data acquisition first.
You should contact FASRC before commencing any scraping activity using the FASRC cluster.
It is highly preferable that you do the scraping elsewhere and then bring the data to the FASRC cluster for processing. If the data is sensitive, confidential, contains human data, or it is unclear, then this is a requirement. See ‘Sensitive Data’ above.
Also, if you are scraping for the purpose of training a GAI/LLM model, you should respect that site’s policies on this practice (this may be posted on the site, contained in a robots.txt file, or explicitly stated in their ToS). Even if you are doing the scraping manually, you should consider yourself the same as a bot and, if a site excludes GAI/AI bots, this also applies to you. Merely being an academic does not exempt you from following the wishes of a site and/or its members; your exfiltrated data could end up in other models thereby nullifying the source’s right to exclusivity/ownership. Please contact the Harvard Office of the General Counsel or Office of the Vice Provost for Research for further guidance.
Data on the Internet should not be programmatically (or ‘brute-force’) scraped using FASRC computing resources, even for academic research purposes, unless FASRC has given permission to proceed using the cluster or some system tied to the cluster, and:
A) The source provides an API for this purpose and any requirements they impose have been met.
B) The source allows/does not prohibit scraping in their terms of service or other public notice.
C) The source is the United States government and the data in question was generated with public funds and is publicly available without encumbrance. Further, the site must not be scraped using brute-force means if an API is provided.
D) The source has given you explicit permission in writing or via a secondary document spelling out that permission.
E) The source does not exclude/forbid your use-case, such as GAI or LLM training.
Data cannot be programmatically scraped using FASRC computing resources if the source has explicitly forbidden scraping in their terms of service and written permission to do so cannot be obtained. In such a case, you should investigate other options for acquiring this or similar data.
Throttling and Blacklisting
Scraping content from websites using highly parallelized processes, even with unfettered permission from the source, should be avoided. Doing so runs the risk of having the cluster, or even the university’s, IP range blacklisted. This could have an undesirable effect on other network and cluster users. Please ensure your processes pull data at a reasonable rate unless you explicitly have written approval from the data source to download more aggressively and assurance that this will not lead to blacklisting from them or their upstream provider.
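As a rough sketch of what pulling data at a reasonable rate can look like, the function below fetches URLs strictly one at a time with a fixed pause between requests. It assumes you already have the permissions described above. The function name throttled_fetch and the FETCH_CMD variable are hypothetical, and the delay you choose is your own judgment call, not an FASRC-recommended value.

```shell
# Hypothetical helper: fetch URLs sequentially (never in parallel), pausing
# a fixed number of seconds between requests.
# FETCH_CMD is an illustrative override; it defaults to curl.
throttled_fetch() {
    local delay="$1"
    shift
    local fetch="${FETCH_CMD:-curl -fsS -O}"
    local first=1
    local url
    for url in "$@"; do
        # Sleep before every request except the first
        if [ "$first" -eq 0 ]; then
            sleep "$delay"
        fi
        first=0
        # Intentionally unquoted so the command and its options are split
        $fetch "$url" || echo "failed: $url" >&2
    done
}

# Usage (placeholder URLs):
#   throttled_fetch 5 https://example.org/a.csv https://example.org/b.csv
```

Keeping requests sequential and spaced out reduces the chance that the source or its upstream provider blocks the cluster's or the university's IP range.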
FASRC does not delete accounts once they are granted; we simply disable an account to make it inactive.
Users can only have a single account. If needed, we will move the sponsorship or upgrade the account to a more privileged role; we never issue a new account once you are in our database. Your account can be “rehydrated” again later.
Disabled Accounts
If you can’t log in, it might be because your account has been disabled. Accounts can go into the “Disabled” state for a number of reasons. Most commonly:
your account has been idle for some time because you have not logged in to one of the FASRC services (ssh into the cluster, log into SPINAL, use OOD, etc.)
your PI retired or your Sponsoring PI asked us to remove you from their lab. Without a valid, active sponsor an account will be disabled
your account had an expiration date on it and that date has passed
your account has been compromised or we were asked to disable it for some other reason
In order to have your account re-enabled and rehydrated, we will need approval from your sponsor. Ideally, have your sponsor contact us and indicate that they wish your account to be re-enabled. You may also contact us, but bear in mind that we will still need to contact your sponsor for approval, so this will take slightly longer than if they contact us directly.
Again, signing up for an additional account if you already have or have ever had a FASRC account is never the correct answer. See: Add or Change Lab Groups
Account Sharing
Sharing accounts or account credentials is against university security policy. See: Sharing Accounts
This document outlines FAS Research Computing’s policies and procedures related to the onboarding of researchers and PIs. The document is structured as a checklist, to be utilized by researchers and PIs as they enter the university or join a new lab. The document also notes differences between the onboarding of researchers and faculty (PIs).
If you have a HarvardKey, but are denied access to approve new accounts, visit and complete the FAS Onboard tool for approvers.
Faculty can sponsor FASRC accounts for any researcher working in their lab, including external collaborators. If a collaborator does not have a HarvardKey account, they may apply for an external FASRC account. External accounts need to be re-enabled every 90 days. PIs will need to request an extension every 90 days to prevent the account from being suspended.
Learn how to utilize the High Performance Compute cluster
Coldfront is a resource allocation management system FASRC adapted to manage allocations on the FASRC cluster. The platform enables the viewing and management of lab groups (Projects), and storage and cluster allocations (Allocations).
View information about storage folders associated with your group/lab
Utilize the Starfish Zones tool to view key information about your group’s storage folders. The Starfish Zone User Interface is a self-service visual tool that enables users to view group storage amounts and locations. Users can navigate folder structures to access detailed information about files and storage. Labs and groups are strongly recommended to utilize this tool to assist with their data organization and cleanup efforts.
As of December 2024, FASRC does not provide a general virtual machine service as part of its core services. It has in the past attempted to fill this gap when no other options were available, but (1) there was no funding for hardware or support for this service, and its infrastructure is old and being retired, and (2) other options, within and outside Harvard, now exist.
If you require a VM for web hosting or other needs or for hosting or sharing data sets, please see the following options.
While on FASSE nodes (compute, login, etc.) and the FASSE VPN, you have full access to the Internet through a proxy.
Generally, this means that you can push to or pull from any HTTPS, SFTP, or other service that supports a proxy.
For example, this means you should be able to pull data from data providers that provide an HTTPS, SFTP, or other service. You may need to adjust certain configurations and workflows to use the proxy – Some details on this here
With that said, given that FASSE is rated for data security level (DSL) 3 data:
Do not store DSL 3 / FASSE data in your home directory.
If you have a DUA that requires encryption at rest, you must not use scratch for any data that the DUA applies to. Neither local scratch, nor our global scratch, support encryption at rest.
FASSE VPN, login, compute, and VDI environments use a proxy. Some transfer solutions do not work through a proxy. If you run into this:
Open a ticket with rchelp@rc.fas.harvard.edu indicating
What you have tried
What you expected to happen
What actually happened
Include specific commands, where these ran, and output messages including all errors.
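One way to gather the details listed above is to record the command, where it ran, and its full output (including errors) in a single file that you can attach to the ticket. This is a minimal sketch: the function name capture_for_ticket and the log file name are hypothetical.

```shell
# Hypothetical helper: run a command and save what a ticket needs
# (host, time, exact command, stdout and stderr, exit status) to a log file
# while still showing the output on screen.
capture_for_ticket() {
    local log="$1"
    shift
    {
        local status=0
        echo "host: $(uname -n)"
        echo "date: $(date)"
        echo "command: $*"
        echo "--- output ---"
        "$@" 2>&1 || status=$?
        echo "--- exit status: ${status} ---"
    } | tee "$log"
}

# Usage (placeholder command and file name):
#   capture_for_ticket transfer_attempt.log your_transfer_command --its-options
```

Review the log before sending it and remove any passwords, tokens, or sensitive data.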
Data security level 3 / FASSE storage is intentionally not included in Globus by default. If you would like your FASSE project to be exposed through Globus, consider the following:
If any data in this project is governed by a contract / data use agreement (DUA), please review the DUA to ensure Globus is compliant. You might consult your School Security Officer for this.
An example scenario where Globus would not be compliant: a DUA indicating that a VPN or private network must be used for all access to the data. Globus makes data available over the Internet without a VPN or private network.
Please submit a ticket to rchelp@rc.fas.harvard.edu as follows:
This must include the path to the project to add to Globus (e.g. “/n/piname_project_l3”)
This must indicate that the PI attests to Globus being compliant with any contracts/DUAs governing the data in this project storage
This must be from, or receive a reply directly from the PI for this project confirming this information
For Storage, FASSE storage is intentionally not provided SMB shares by default. If you need your FASSE project exposed through an SMB share, consider the following:
Please submit a ticket to rchelp@rc.fas.harvard.edu as follows:
This must include the path to the project (e.g. “/n/piname_project_l3”)
This must indicate that the PI attests to understanding and accepting the risks of enabling SMB access to this data, given that any system or network that can talk to this tiered storage, could access this data if the credentials from an account in the project were used. Some example scenarios:
Someone with access to your storage accesses it / copies data down to an unmanaged lab computer without data security level controls
Someone with access to your storage accidentally clicks the wrong link on a computer with access to this storage. Their computer is compromised, malware identifies SMB access to your data, and compromises the confidentiality, integrity, and/or availability of your data – maybe ransomware, stealing the data, etc.
This must include a brief explanation of why SMB access is needed, and from where you will use this SMB access
This must be from, or receive a reply directly from the PI for this project confirming this information
If you have any questions or concerns, please do not hesitate to consult us at security@rc.fas.harvard.edu, although in some cases we may end up pulling in or pointing you to your school privsec officer.
PIs have a variety of responsibilities at Harvard University. This document will cover the responsibilities specific to FAS Research Computing, especially around information security and risk.
PIs are individuals who are given continuous or limited PI rights by the university and who control their own funding in a school that FAS RC supports. Co-Investigators are not considered PIs.
PIs are responsible for ensuring all accounts they sponsor follow all applicable Harvard University policies, including but not limited to Harvard Research Data Security Policy and Harvard Information Security Policy, as well as any requirements in data use agreements or contracts that impact them.
PIs are responsible for creating and maintaining accurate data documentation in the Harvard Compliance System, as required by University policies, and complying with approved data security and management plans. Guidance on which applications are needed for your data.
PIs are responsible for submitting FASSE project requests for any data security level (DSL) 3 data they plan to use at FAS RC and keeping associated data in the specific FASSE storage provided for these projects.
PIs are responsible for informing FAS RC of any changes to Research Administration applications (e.g. DAT12-1234, DUA12-1234, IRB12-1234) governing data they plan to use for their FASSE projects, before moving new data to FAS RC storage for these projects. This includes informing FASRC before adding data from a new application (e.g. DUA12-1234) to an existing FASSE project.
PIs are responsible for ensuring that any access they approve complies with all applicable Harvard University policies and DUA or compliance regimes. For example, among many other scenarios:
If a DUA requires informing or obtaining approval from the data provider before providing access to the data, the PI must ensure this is done before they approve the associated FAS RC access
If a DUA states that only Harvard staff may have access to the data, the PI is responsible for ensuring they never approve access to non-Harvard members to that data (e.g. external collaborators)
PIs are responsible for informing FAS RC when an account they have sponsored should be disabled (i.e. if they sponsor the account and the person has left or should otherwise be disabled)
PIs are responsible for informing FAS RC when any accounts should be removed from groups they manage
PIs are responsible for informing FAS RC if and when data needs secure disposal/sanitization, either as required by Harvard University policy or a DUA
Upcoming Responsibilities
Coming soon: PIs are responsible for reviewing accounts they sponsor on an annual basis [1]
Coming soon: PIs are responsible for reviewing access to groups they manage on an annual basis [1]
[1] If you would like to review spreadsheets of accounts you sponsor and group memberships for groups you approve, please contact rchelp@rc.fas.harvard.edu and ask for account and access review spreadsheets.
Step 3: When the Remote Desktop app opens, click the terminal icon to launch a terminal (or click Applications -> Terminal Emulator).
Step 4: Below, you can follow the instructions to launch various software.
Keep in mind that, for the most part, the terminal window must remain open. If the terminal window is closed, the software launched via the terminal will also be closed.
Training Session: FASRC Open On Demand Users Training
Remote Desktop login
To comply with Harvard’s security policy, if the Remote Desktop session becomes idle, the Remote Desktop session will lock. You need to enter your FASRC password to log back in.
Abaqus
In the terminal, type the commands to load the modules and launch Abaqus
(optional) Creating and loading a mamba/conda environment
Note: this is a one-time setup to ensure that your conda environment can be loaded in Jupyter Notebook.
See our Python documentation on how to create a conda environment.
Then, in order to see your conda environment in Jupyter Notebook, ensure that you have installed the packages ipykernel and nb_conda_kernels. To do so, launch a terminal in the Remote Desktop and type the commands:
After the jupyter notebook command, it may hang for a few seconds. Be patient, a Firefox window will open soon after.
To select my_conda_environment as the kernel, go to Kernel -> Change kernel, and select the kernel (i.e. conda environment) of your choice.
Note: If you prefer to launch Jupyter Lab, note that conda environments cannot be loaded when using Jupyter Lab. Only the base environment is available.
Cleanly close Jupyter Notebook
These are instructions to kill your Jupyter server so that you can exit the job cleanly.
First, close each Jupyter Notebook you have open: click on File -> Close and Halt.
Then, from the Jupyter Notebook Home Page (where you can browse files and folders), on the top right corner, click on “Quit”. Close the Firefox window.
KNIME
In the terminal, type the following commands to load the module and launch Knime.
You can see all versions of KNIME with module spider knime. For more details, see the modules page.
LibreOffice
LibreOffice is a free and open source suite that is compatible with a wide range of formats, including those from Microsoft Word (.doc, .docx), Excel (.xls, .xlsx), PowerPoint (.ppt, .pptx) and Publisher.
LibreOffice is available in the FASRC cluster (both Cannon and FASSE) through a Singularity image. Therefore, LibreOffice is only available through the Remote Desktop app. LibreOffice does not work in the Containerized Remote Desktop app.
In the terminal type the commands to pull and create a singularity image with LibreOffice installed within the container. This command is only needed once.
You can see all versions of R and RStudio with module spider R and module spider rstudio, respectively. For more details, see the modules page.
Remoteviz Partition
The “FAS-RC Remote Visualization” Open OnDemand (or VDI) app, which you may have used previously, has been decommissioned.
SageMath
You can use Sage either in an interactive shell using the command line interface or by launching a Jupyter Notebook with the SageMath kernel. To launch a Jupyter Notebook, in the terminal, type the commands to load the modules and launch Jupyter
You can see all versions of SageMath with module spider sage. For more details, see the modules page.
SAS
In the terminal, type the commands to load the modules and launch SAS
[jharvard@holy7c24102 ~]$ module load sas
[jharvard@holy7c24102 ~]$ sas &
Stata
In the terminal, type the commands to load the module and launch Stata
[jharvard@holy7c24102 ~]$ module load stata/17.0-fasrc01
# if you are using single-core jobs
[jharvard@holy7c24102 ~]$ xstata-se
# if you are using multi-core jobs
[jharvard@holy7c24102 ~]$ xstata-mp "set processors $SLURM_CPUS_PER_TASK"
TensorBoard
For TensorBoard, you will first need to create a conda environment (Step 1). You only need to create a conda environment once. If you have created one, you can skip to Step 2. Or, if you have your own environment, make sure you install the TensorBoard package, and then you can skip to Step 2.
Step 1: Create conda environment
In a terminal, load Mambaforge or Python module, create a mamba environment, activate it, and install TensorBoard inside the mamba environment
You can see different versions of Mambaforge or Python in our modules page.
Step 2: Activate conda environment and launch TensorBoard
In a terminal, set up variables for TensorBoard. Make sure that the data you need to visualize in TensorBoard is located in the log directory MY_TB_LOGDIR. You can either use the suggested path below or use another location that better suits your workflow.
# Find available port to run server on (does not output anything to screen)
[jharvard@holy7c24102 ~]$ for myport in {6818..11845}; do ! nc -z localhost ${myport} && break; done
# setup tensorboard environmental variables
[jharvard@holy7c24102 ~]$ export MY_TB_PORT=${myport}
[jharvard@holy7c24102 ~]$ export MY_TB_BASEURL=/node/${host}/${myport}/
[jharvard@holy7c24102 ~]$ export MY_TB_LOGDIR=$HOME/.tensorboard/log/$SLURM_JOBID
[jharvard@holy7c24102 ~]$ mkdir -p $MY_TB_LOGDIR
# load module, activate conda environment, and launch tensorboard
[jharvard@holy7c24102 ~]$ module load python
[jharvard@holy7c24102 ~]$ module load cuda/11.7.1-fasrc01
[jharvard@holy7c24102 ~]$ module load cudnn/8.5.0.96_cuda11-fasrc01
[jharvard@holy7c24102 ~]$ source activate tb_tf2.10_cuda11
(tb_tf2.10_cuda11) tensorboard --host localhost --port ${MY_TB_PORT} --logdir ${MY_TB_LOGDIR} --path_prefix ${MY_TB_BASEURL}
Right-click on the link that starts with “http://localhost” and click on “Open Link”. This will open a Firefox browser, where you can view your results.
Example
Using the environment created in Step 1, run the small program tb_test.py in a directory of your choice and visualize its results.
# go to the directory that contains your tb_test.py file
[jharvard@holy7c24102 ~]$ cd tb_example
# Find available port to run server on (does not output anything to screen)
[jharvard@holy7c24102 tb_example]$ for myport in {6818..11845}; do ! nc -z localhost ${myport} && break; done
# setup tensorboard environmental variables
[jharvard@holy7c24102 tb_example]$ export MY_TB_PORT=${myport}
[jharvard@holy7c24102 tb_example]$ export MY_TB_BASEURL=/node/${host}/${myport}/
# this command will set MY_TB_LOGDIR to your current working directory
[jharvard@holy7c24102 tb_example]$ export MY_TB_LOGDIR=$PWD
# load modules and activate conda environment
[jharvard@holy7c24102 tb_example]$ module load python
[jharvard@holy7c24102 tb_example]$ module load cuda/11.7.1-fasrc01
[jharvard@holy7c24102 tb_example]$ module load cudnn/8.5.0.96_cuda11-fasrc01
[jharvard@holy7c24102 tb_example]$ source activate tb_tf2.10_cuda11
# run python code
(tb_tf2.10_cuda11) python tb_test.py
# launch tensorboard
(tb_tf2.10_cuda11) tensorboard --host localhost --port ${MY_TB_PORT} --logdir ${MY_TB_LOGDIR} --path_prefix ${MY_TB_BASEURL}
Right click on the link that starts with “http://localhost” and click on “Open Link”. This will open a Firefox browser where you will be able to see your results.
TotalView
TotalView is a debugging tool particularly suitable for parallel applications. The modules you need to load depend on the compilers used in the code you are trying to debug. Due to this compiler dependency, we refer you to a more elaborate TotalView documentation.
Visual Studio Code
In the terminal, type the commands to load the modules and launch Visual Studio Code