Languages – FASRC DOCS https://docs.rc.fas.harvard.edu

Cpp, C++ Programming Language https://docs.rc.fas.harvard.edu/kb/cpp-programming-language/ Tue, 30 Apr 2024 13:44:17 +0000

Description

C++ (C plus plus) is an object-oriented, high-level programming language. C++ files typically have .cpp as the file extension. You can compile C++ code with either the GNU compilers (gcc) or the Intel compilers (intel).

Best Practice

We recommend requesting an interactive job to compile a C++ program on a compute node (instead of a login node). The compilation could take anywhere from a few seconds to a minute or more, depending on the complexity of the code. Additionally, it is best to utilize the test partition to compile and test a program before executing its production run on the cluster as a batch job.

It is best practice to compile C++ code separately and then use the executable generated during compilation in the production run via the sbatch script. If possible, avoid including the compilation command in the sbatch script, since that recompiles the program every time the job is submitted. If any changes are made to the source code, compile the source code separately, and then submit the production run as a batch job.
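For reference, a minimal sketch of such a batch script might look like this (the job name, partition, resource values, and executable name are illustrative, not prescriptive):

#!/bin/bash
#SBATCH -J sum_job           # job name (illustrative)
#SBATCH -p test              # test partition, per the best practice above
#SBATCH -n 1                 # a single task
#SBATCH -t 10                # 10-minute time limit
#SBATCH --mem=4000           # memory in MB

# Run the executable produced earlier during compilation
./sum.x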

Compilers

You can compile a C++ code using either a GNU or an Intel compiler.

GNU compiler

To use C++ with gcc on the FASRC clusters, load the gcc compiler via our module system. For example, this command will load the latest gcc version:

module load gcc

If you need a specific version of gcc, you can search with the command

module spider gcc

To load a specific version

module load gcc/10.2.0-fasrc01

For more information on modules, see the Lmod Modules page.

To compile a code using the GNU compiler and the -O2 optimization flag, you can do the following:

module load gcc 
g++ -O2 -o sum.x sum.cpp

Intel compiler

To use C++ with Intel on the FASRC clusters, load the intel compiler via our module system. For example, this command will load the latest intel version:

module load intel

If you need a specific version of the Intel compiler, you can search with the command

module spider intel

To load a specific version

module load intel/24.0.1-fasrc01

For more information on modules, see the Lmod Modules page.

Intel recommendations and notes

  • Intel released Intel oneAPI 23.2 with icpx; however, this version does not contain all the features, so we highly recommend using Intel 24 for icpx
  • Intel 17 is quite old. Avoid using it, as it can have many incompatibilities with the current operating system
  • Intel has changed its compilers in the past few years, and each module may need different flags. Below is a table of executables and possible flags
Intel module version    Command    Additional flag
intel/17.0.4-fasrc01    icpc       -std=gnu++98
intel/23.0.0-fasrc01    icpc
intel/23.2.0-fasrc01    icpx
intel/24.0.1-fasrc01    icpx

To compile using a specific version of the Intel compiler, execute:

module load intel/24.0.1-fasrc01
icpx -O2 -o sum.x sum.cpp

Examples

FASRC User Codes

Fortran Programming Language https://docs.rc.fas.harvard.edu/kb/fortran/ Tue, 30 Apr 2024 01:53:24 +0000

Description
Fortran, short for Formula Translation, is one of the oldest high-level programming languages, first developed by IBM in the 1950s. It was primarily designed for numerical and scientific computing tasks, making it highly suitable for complex mathematical calculations. Known for its efficiency and speed, Fortran has undergone several revisions over the years, with the latest version being Fortran 2018. Despite its age, Fortran remains relevant in fields such as engineering, physics, and research where performance and numerical accuracy are paramount. Its robustness in handling large-scale scientific and engineering applications, array-oriented programming, and its ability to optimize code for parallel computing have contributed to its longevity in the realm of technical computing.

Fortran Compilers

Currently, the supported Fortran compilers on the FASRC Cannon cluster are GNU gfortran, Intel ifx, and NVIDIA nvfortran.

GNU gfortran

GNU gfortran, part of the GNU Compiler Collection (GCC), is an open-source compiler known for its adherence to Fortran standards and portability across different platforms. To compile a Fortran program named example.f90 with GNU gfortran, you can use the following commands:

# Load a GCC software module, e.g.,
module load gcc/13.2.0-fasrc01

# Compile the program, e.g.,
gfortran -O2 -o example_gfortran.x example.f90

This command compiles example.f90 into an executable named example_gfortran.x. In the above example we also apply level 2 optimization with the -O2 compiler flag, improving the performance of the compiled code.

Intel ifx

Intel Fortran Compiler (ifx, formerly ifort) is renowned for its robust optimization capabilities and superior performance, particularly on Intel architectures. Developed by Intel, ifx leverages advanced optimization techniques to generate highly efficient machine code tailored to Intel processors. When compiling Fortran code with ifx, developers can take advantage of optimizations such as auto-vectorization, inter-procedural optimization, and CPU-specific tuning to achieve significant performance improvements. To compile the same Fortran program example.f90 with Intel ifx, you can use the following commands:

# Load an Intel software module, e.g.,
module load intel/24.0.1-fasrc01

# Compile the program, e.g.,
ifx -O2 -o example_ifx.x example.f90

Similar to gfortran, this command compiles example.f90 into an executable named example_ifx.x. Intel ifx typically performs optimization by default, but you can explicitly specify optimization level using the -O2 flag.

NVIDIA nvfortran (formerly PGI Fortran)

NVIDIA HPC SDK is a suite of compilers and tools designed for high-performance computing (HPC) applications, including Fortran. NVIDIA Fortran compiler provides extensive optimization capabilities tailored to both CPU and GPU architectures, making it a preferred choice for developers working in fields such as scientific computing, weather modeling, and computational fluid dynamics. When compiling Fortran code with NVIDIA nvfortran, developers can harness advanced optimizations like GPU offloading, where computationally intensive portions of the code are executed on NVIDIA GPUs for accelerated performance. Additionally, it offers support for directives such as OpenACC, allowing developers to easily parallelize and optimize their code for heterogeneous computing environments. While nvfortran excels in optimizing code for NVIDIA GPUs, it also delivers competitive performance on CPU architectures, making it a versatile choice for HPC development.

For compiling Fortran programs with NVIDIA nvfortran, the process may involve targeting both CPU and GPU architectures. Here’s an example command to compile example.f90 with nvfortran for a CPU target:

# Load a NVIDIA HPC SDK software module, e.g.,
module load nvhpc/23.7-fasrc01

# Compile the program, e.g.,
nvfortran -o example_nvfortran.x example.f90

This command produces an executable named example_nvfortran.x.
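Since nvfortran also supports OpenACC directives for GPU offload, a GPU-targeted build is, as a sketch, a matter of enabling them at compile time (this assumes the source contains OpenACC directives):

# Compile with OpenACC enabled for GPU offload, e.g.,
nvfortran -acc -o example_acc.x example.f90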

Examples

To get started with Fortran on the Harvard University FAS cluster you can try the examples in our User Codes repository.

References

C Programming Language https://docs.rc.fas.harvard.edu/kb/c-programming-language/ Tue, 23 Apr 2024 10:24:01 +0000

Description

C is a general-purpose, procedural computer programming language supporting structured programming, lexical variable scope, and recursion, while a static type system prevents unintended operations. By design, C provides constructs that map efficiently to typical machine instructions and has found lasting use in applications previously coded in assembly language. Such applications include operating systems and various application software for computers, from supercomputers to embedded systems. (Wikipedia)

Best Practice

We recommend jumping to a compute node for compiling a C program, as the compilation could take anywhere from a few seconds to a minute or more, depending on the complexity of the code. Additionally, it is best to utilize the test partition to compile and test a program before executing its production run on the cluster as a batch job.

It is best practice to compile C code separately and then use the executable generated during compilation in the production run via the sbatch script. If possible, avoid including the compilation command in the sbatch script, since that recompiles the program every time the job is submitted. If any changes are made to the source code, compile the source code separately, and then submit the production run as a batch job.

Compilers

You can compile a C code using either a GNU or an Intel compiler.

GNU gcc compiler

To get a list of currently available GNU compilers on the cluster, execute: module spider gcc

The default GNU compiler is typically the latest compiler version on the cluster and can be loaded using module load gcc

To compile a code using a specific version of the GNU compiler and the -O2 optimization flag, you can do the following:

module load gcc/9.5.0-fasrc01
gcc -O2 -o sum.x sum.c

Intel icc compiler

To get a list of currently available Intel compilers on the cluster, execute: module spider intel

To compile using a specific version of the Intel compiler, execute:

module load intel/23.0.0-fasrc01
icc -O2 -o sum.x sum.c

Note: When loading an intel module version, refer to the following table for the compiler command to use for your code

Intel compiler version    C      Fortran    C++
Below 24.0.0              icc    ifort      icpc
24.0.0 and above          icx    ifx        icpx

If you load an Intel compiler that is lower than version 24.0.0, you might get this remark

icc: remark #10441: The Intel(R) C++ Compiler Classic (ICC) is deprecated and will be removed from product release in the second half of 2023. The Intel(R) oneAPI DPC++/C++ Compiler (ICX) is the recommended compiler moving forward. Please transition to use this compiler. Use '-diag-disable=10441' to disable this message.

This is just a warning that icc is deprecated and will be removed in the second half of 2023. You can silence this warning by compiling your code with icc in the following manner:

module load intel/23.0.0-fasrc01
icc -O2 -diag-disable=10441 -o sum.x sum.c

Examples

To get started with C on the Harvard University FAS cluster you can try the example shown on our User Codes repository.

Resources

To learn and practice more in C, see the following:

Bash https://docs.rc.fas.harvard.edu/kb/bash/ Tue, 02 Apr 2024 15:23:59 +0000

Description

GNU Bash or simply Bash is a Unix shell and command language written by Brian Fox for the GNU Project as a free software replacement for the Bourne shell. First released in 1989, it has been used widely as the default login shell for most Linux distributions and Apple’s macOS Mojave and earlier versions. Bash is the default environment and scripting language for the cluster.

Bash keeps two hidden files in your home directory which are executed when you log in. The first is your .bash_profile, which is executed when you initially ssh onto the cluster. The second is your .bashrc, which is executed each time you start a shell. Updating these files is how you customize your environment on the cluster. In general, you want to keep these clean and as close to the defaults as possible, as the more you put in there, the more your login will slow down. See our guide for editing your .bashrc for more details.

For more on Bash see the following guides:

Bash Syntax Checking

You can check the syntax of a bash script (including Slurm submission scripts) without running it by executing bash -n <myscript>.
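For example, checking a hypothetical submission script might look like this (the file name is illustrative):

bash -n my_job.sbatch   # parses the script without executing any commands
echo $?                 # an exit status of 0 means the syntax checked out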

Examples

FASRC User Codes

Julia Programming Language https://docs.rc.fas.harvard.edu/kb/julia/ Thu, 31 May 2018 15:47:00 +0000

Description

Julia is a high-level, high-performance dynamic programming language for technical computing. It has syntax that is familiar to users of many other technical computing environments. Designed at MIT to tackle large-scale partial-differential equation simulations and distributed linear algebra, Julia features a robust ecosystem of tools for optimization, statistics, parallel computing, and data visualization. Julia is actively developed by teams at MIT and in industry, along with hundreds of domain-expert scientists and programmers from around the world on JuliaLang.

Installation

Julia can be easily installed by following the instructions at https://julialang.org/downloads/ for a Unix-type system or Windows. You can install Julia locally on the cluster using the command for the Unix-type system. The default location for the installation is $HOME. The command will also add Juliaup initialization to your ~/.bashrc, which adds the julia executable to your $PATH and makes it available from the command line interface (CLI).

Adding packages to Julia

One can utilize Julia's full-featured interactive command-line REPL (read-eval-print loop) to add packages needed for running a Julia program successfully. For example, in order to run a program that solves differential equations numerically in Julia, one would need to install the DifferentialEquations, SimpleDiffEq, and Plots packages prior to running the differential equations program. This can be achieved in the following manner:

julia> using Pkg
julia> Pkg.add("DifferentialEquations")
julia> Pkg.add("SimpleDiffEq")
julia> Pkg.add("Plots")

Python/Conda in Julia

Similarly, one can install Conda.jl and PyCall.jl packages to enable conda for installing Python packages in Julia and to directly call Python functions from Julia. For example, in order to install Conda, matplotlib (using conda), and PyCall, one can do the following:

julia> Pkg.add("Conda")
julia> using Conda
julia> Conda.add("matplotlib")
julia> ENV["PYTHON"]=""
julia> Pkg.add("PyCall")
julia> Pkg.build("PyCall")

Note: The PYTHON environmental variable above has been set to blank prior to building the PyCall package. This is to override the default behavior of PyCall, which is to use the system’s default Python environment in Linux, and instead install Julia’s “private” version of Python. You can find more details on how to install PyCall at a desired location and call Python functions from a Julia program on PyCall.jl and Python from Julia.

Jupyter notebook in Julia

A Julia kernel can be set up to interact with the Julia language using Jupyter’s graphical notebook. The IJulia package binds the Julia kernel with Jupyter.

In order to install a package, you can bring up the Julia package prompt pkg> by typing ] instead of using Pkg. Hence, the IJulia package can be installed as:

julia> ]
(@v1.10) pkg> add IJulia
(@v1.10) pkg> build IJulia

To come out of the pkg> mode and get back to Julia REPL, press CTRL+C or backspace (see Pkg).

Alternatively, you can execute the using Pkg command to install IJulia as follows:

julia> using Pkg
julia> Pkg.add("IJulia")
julia> Pkg.build("IJulia")
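Once IJulia is built, you can, for instance, launch the notebook interface directly from the Julia REPL (IJulia will offer to install Jupyter via Conda the first time if it cannot find an existing installation):

julia> using IJulia
julia> notebook()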

Note: The installation of Julia packages could take significant time. Therefore, we recommend that Julia packages are installed on a compute node via an interactive session.

Examples

To get started with Julia on the Harvard University FAS cluster you can try the examples shown on our User Codes repository.

Resources

The Julia Programming Language
Julia Computing
Julia Documentation
Conda.jl
PyCall.jl

 

MPI for Python (mpi4py) on the FASRC cluster https://docs.rc.fas.harvard.edu/kb/mpi-for-python-mpi4py-on-odyssey/ Mon, 06 Jun 2016 16:47:29 +0000

Introduction

This page is intended to help you run MPI Python applications on the cluster using mpi4py.
To use mpi4py you need to load an appropriate Python software module. We have the Anaconda Python distribution from Continuum Analytics. In addition to mpi4py, it includes hundreds of the most popular packages for large-scale data processing and scientific computing.
You can load Python into your user environment by running in your terminal:

module load python/2.7.14-fasrc01

Example Code

Below is a simple example code using mpi4py.

#!/usr/bin/env python
#++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# Program: mpi4py_test.py
#++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
from mpi4py import MPI
nproc = MPI.COMM_WORLD.Get_size()   # Size of communicator
iproc = MPI.COMM_WORLD.Get_rank()   # Ranks in communicator
inode = MPI.Get_processor_name()    # Node where this MPI process runs
if iproc == 0: print("This code is a test for mpi4py.")
for i in range(0,nproc):
    MPI.COMM_WORLD.Barrier()
    if iproc == i:
        print('Rank %d out of %d' % (iproc,nproc))
MPI.Finalize()

Running the program

You could use the following SLURM batch-job submission script to submit the job to the queue:

#!/bin/bash
#SBATCH -J mpi4py_test
#SBATCH -o mpi4py_test.out
#SBATCH -e mpi4py_test.err
#SBATCH -p shared
#SBATCH -n 16
#SBATCH -t 30
#SBATCH --mem-per-cpu=4000
module load python/2.7.14-fasrc01
srun -n $SLURM_NTASKS --mpi=pmi2 python mpi4py_test.py

If you name the above script run.sbatch, for instance, the job is submitted to the queue with

sbatch run.sbatch

Upon job completion, job output will be located in the file mpi4py_test.out with the contents:

This code is a test for mpi4py.
Rank 0 out of 16
Rank 1 out of 16
Rank 2 out of 16
Rank 3 out of 16
Rank 4 out of 16
Rank 5 out of 16
Rank 6 out of 16
Rank 7 out of 16
Rank 8 out of 16
Rank 9 out of 16
Rank 10 out of 16
Rank 11 out of 16
Rank 12 out of 16
Rank 13 out of 16
Rank 14 out of 16
Rank 15 out of 16

References

  • MPI for Python
  • mpi4py documentation

Ruby https://docs.rc.fas.harvard.edu/kb/ruby/ Tue, 06 May 2014 15:58:29 +0000

The Ruby programming language, known primarily for its strength as a web platform, has a growing list of useful packages for scientific programming, visualization, and data management.

Use rbenv to setup a personal Ruby environment

The cluster provides Ruby 1.8.7, the default install for CentOS 6. If you'd like a newer version of Ruby, or would like to install additional gems, a self-contained environment can be constructed using rbenv.
First, install rbenv and ruby-build in your home directory and enable it. From the website:

$ git clone https://github.com/sstephenson/rbenv.git ~/.rbenv
$ git clone https://github.com/sstephenson/ruby-build.git ~/.rbenv/plugins/ruby-build
$ echo 'export PATH="$HOME/.rbenv/bin:$PATH"' >> ~/.bashrc
$ echo 'eval "$(rbenv init -)"' >> ~/.bashrc

You may need to log out and log back in for the PATH changes to apply.
Second, install the version of Ruby that you'd like to use. rbenv install -l will list the available Ruby versions, and rbenv install <version> will install the version you choose.
The rbenv global command can be used to “activate” your new Ruby install so that it operates as the default. This setting of the “global” Ruby will persist between sessions; you do not have to set this in your .bashrc. It can be overridden temporarily or on a per-directory basis.
Once you have the version of Ruby you would like to use, you can install any Ruby gem that you find useful.
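As a sketch, a typical rbenv session might look like this (the Ruby version and gem name are illustrative):

$ rbenv install -l            # list the versions ruby-build can install
$ rbenv install 3.2.2         # install one of them
$ rbenv global 3.2.2          # make it the default Ruby
$ gem install nokogiri        # install a gem into the active Ruby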

Perl https://docs.rc.fas.harvard.edu/kb/perl/ Tue, 06 May 2014 15:56:35 +0000

Introduction & Setup

Perl is a high-level, general-purpose, interpreted, dynamic programming language. Perl packages are provided by CPAN. For newer versions of Perl, we recommend installing via Spack.

cpan

The CPAN module and its command-line tool (cpan) can be used to install modules by name from the CPAN repository. Additionally, it recursively installs any modules that the requested module depends on. If you're using the cpan command for the first time, you'll need to run through a series of configuration prompts; running the cpan command will initiate a configuration session.

You may be prompted at the end to manually set the download URL list. Use the o conf init urllist command at the cpan prompt to initiate that configuration.

You may also want cpan to automatically install dependencies. This can either be set in the interactive configuration, or later at the cpan prompt using the command:

cpan> o conf prerequisites_policy follow

Once cpan is set up, a simple one-line command at the terminal prompt can be used to install CPAN modules:

perl -MCPAN -e 'install DBD::SQLite'
# or
cpan DBD::SQLite

Of course, your module will be something other than DBD::SQLite.

Alternatively, cpan can be run interactively from its own shell.
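For example, an interactive session might look like this (the module name is illustrative):

cpan> install DBD::SQLite
cpan> exit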

Examples

FASRC User Codes

Python Programming Language https://docs.rc.fas.harvard.edu/kb/python/ Tue, 06 May 2014 15:49:13 +0000

What is Python?

Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.

Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming. It is often described as a “batteries included” language due to its comprehensive standard library.

Package Managers

FASRC clusters use mamba. Mamba is available on the FASRC cluster as a software module, either as python/3*, Miniforge3, or Mambaforge, and is aliased to mamba. For more information on Mamba and its usage on FASRC clusters, see the Python Package Installation entry.

Best Practices

Managing Python in an HPC environment requires careful handling of dependencies, performance optimization, and resource management. Below are some best practices to keep your Python code efficient and scalable on HPC clusters.   As with any coding language in HPC, familiarize yourself with our Job Efficiency and Optimization Best Practices page.  Another great resource is Python for HPC: Community Materials.

Use Mamba with Miniforge for Environment Management

We cannot emphasize this enough!  To maintain a clean and efficient workspace, use Miniforge with Mamba for creating and managing Python environments. This ensures that your dependencies are isolated and reduces the risk of conflicts. For example, to set up an environment for data analysis with pandas and related libraries, you can use:

mamba create -n data_env python=3.9 pandas numpy matplotlib scikit-learn
mamba activate data_env

This approach ensures your Python environment is isolated, optimizing your workflows on HPC clusters.

Code Quality: “Code is read much more often than it is written”

Focus on clean, quality code.  You may need support running your program, or troubleshooting an issue.  This will help others grok your code.  Take a look at the PEP 8 Style Guide for Python.   Consider using a linter for VS Code, such as Flake8 or Ruff.  Another option is Mypy for VS Code which runs mypy on Python code cells in Jupyter notebooks.  For more resources, check out Python for HPC: Community Materials

Testing

Add tests to your code; even a small test suite catches regressions early and makes results easier to reproduce.
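As a minimal sketch (this assumes pytest, which is not part of the standard library), a test file might look like:

# test_stats.py -- run with: pytest test_stats.py
import numpy as np

def moving_average(x, w):
    # simple moving average with window size w
    return np.convolve(x, np.ones(w) / w, mode="valid")

def test_moving_average():
    result = moving_average(np.array([1.0, 2.0, 3.0, 4.0]), 2)
    assert np.allclose(result, [1.5, 2.5, 3.5])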

Chunking: Pandas? Dask? Both?

Use Chunking! If you’re dealing with moderately large datasets, Dask can enhance Pandas by parallelizing operations. For very large datasets that exceed memory constraints, using Dask alone as a substitute for Pandas is a more effective solution.

Using Dask with Pandas:

Dask can work seamlessly with Pandas to parallelize operations on large datasets that don’t fit into memory. You can convert a Pandas DataFrame to a Dask DataFrame using dd.from_pandas() to distribute computations across multiple cores or even nodes. This approach allows you to scale up Pandas workflows without changing much of your existing code.
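A minimal sketch of that conversion (the column name and partition count are illustrative):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"x": range(1_000_000)})
ddf = dd.from_pandas(pdf, npartitions=8)    # split into 8 partitions
result = (ddf["x"] * 2).mean().compute()    # work runs across the partitions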

Using Dask Instead of Pandas:

Dask can be used as a drop-in replacement for Pandas when working with larger-than-memory datasets. It provides a familiar DataFrame API that mimics Pandas but works lazily, meaning computations are broken down into smaller tasks that are executed when you call .compute(). This makes it possible to handle datasets that would be too large for Pandas alone.
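For example, a set of CSV files can be read lazily and only the final aggregate materialized (the file pattern and column names are illustrative):

import dask.dataframe as dd

ddf = dd.read_csv("data-*.csv")               # lazy: nothing is read yet
summary = ddf.groupby("key")["value"].mean()  # still lazy
print(summary.compute())                      # the tasks execute here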

Manage Cluster Resources: Dask DataFrame for Parallel and Distributed Processing

Here is a tutorial for handling larger datasets or utilizing multiple cores/nodes: Dask DataFrames scale pandas operations to larger-than-memory datasets and parallelize them across your cluster.

Dask’s compatibility with pandas makes it a powerful tool in HPC, allowing familiar pandas operations while scaling up to handle larger data loads efficiently.

More on Compressed Chunks

Switching from traditional formats like CSV to more efficient ones like Parquet [Project site][Python Package] can greatly enhance I/O performance, reducing load times and storage requirements. Here’s how you can work with Parquet using pandas:

import pandas as pd
# Reading from a Parquet file
df = pd.read_parquet('large_dataset.parquet')
# Writing to Parquet
df.to_parquet('output_data.parquet', compression='snappy')

Parquet’s columnar storage format is much faster and more efficient for reading and writing operations, crucial for large-scale data handling in HPC environments.

Become One With Your Data: Ydata-Profiling

“In a dark place we find ourselves, and a little more knowledge lights our way.” – Yoda

Ydata Profiling, formerly known as pandas profiling, is a powerful tool for automating exploratory data analysis (EDA) in Python, generating comprehensive reports that provide a quick overview of data types, distributions, correlations, and missing values. In an HPC setting, it accelerates data exploration by handling large datasets efficiently, saving time and computational resources for researchers and data scientists.

Want to talk about easy to use?

from ydata_profiling import ProfileReport  # assumes the ydata-profiling package is installed
df = pd.DataFrame(...)                     # your data here (pd as imported above)
profile = ProfileReport(df, title="Profiling Report")
profile.to_file("report.html")             # write the HTML report

Done!

The tool enhances data quality and understanding by highlighting issues like missing values and incorrect data types, allowing users to make informed decisions for preprocessing and feature selection. It integrates seamlessly with HPC workflows, supporting Python-based Pandas data manipulation and analysis, and generates HTML reports that are easy to share and document, facilitating collaboration within your lab.
Ydata-profiling’s ability to provide deep insights and optimize data preprocessing helps avoid redundant computations, making it ideal for complex simulations and high-dimensional data analysis in HPC environments. Check out the strong community support over at Stack Overflow.

For a short tutorial, here is a link: Learn how to use the ydata-profiling library.

Examples

Serial Python

Examples: https://github.com/fasrc/User_Codes/tree/master/Languages/Python

Parallel Computing

Take a look at the Python documentation for the multiprocessing library. We have a few examples for parallel computing using it: https://github.com/fasrc/User_Codes/tree/master/Parallel_Computing/Python
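As a minimal sketch of the library, a process pool can map a function over inputs in parallel:

import multiprocessing as mp

def square(x):
    # any CPU-bound function works here
    return x * x

if __name__ == "__main__":
    with mp.Pool(processes=4) as pool:       # 4 worker processes
        print(pool.map(square, range(10)))   # [0, 1, 4, ..., 81]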

References

Find-n-Replace https://docs.rc.fas.harvard.edu/kb/find-n-replace/ Mon, 24 May 2010 16:11:15 +0000
Combinations of the find, xargs, grep, and sed commands allow you to recursively search for, and replace, strings of text in your files. The find command prints out names of files, and the xargs command reads them and passes them as arguments to another command (e.g. grep or sed). In order to handle filenames with spaces and other special characters in them, the options -print0 and -0 are used.
For example, we’re currently migrating home directories to new filesystems, and users with /n/home hardcoded in scripts will have to modify them. In the following, the string /n/home/$USER\b is a regular expression that matches the string /n/home/$USER, where $USER will automatically be filled in by your username, followed by a word boundary (i.e., if my username is joe, it won’t match /n/home/joel).
To recursively search all the files in the current working directory for all occurrences of your former home directory explicitly written out, you can use this command:
find . -type f -print0 | xargs -0 grep --color "/n/home/$USER\b"
The grep command searches text for strings matching regular expressions.
Add the option -l to grep if you only want to list the names of the files that match, as opposed to print the full line of text that contains the match.
To replace all those occurrences with the string ~, you can use the following:
find . -type f -print0 | xargs -0 sed -i "s?/n/home/$USER\b?~?g"
The sed command is used to make the text substitution — the stuff between the first two ?s is what to replace, and the stuff between the second two ?s is what to replace it with. The g after that says to replace all occurrences, not just the first on each line.
As with any operation that could modify all your files, use this with care, maybe on some test files first, to make sure it’s doing what you expect it to do.
Using find's -exec option, which you may see documented in other contexts, is an alternative to combining it with xargs.
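For instance, the search above could, as a sketch, be written with -exec instead of xargs:

find . -type f -exec grep --color "/n/home/$USER\b" {} +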
