MANA: CentOS 7/Rocky 8¶
Discovery Cluster at Northeastern University¶
Discovery is an HPC cluster running CentOS 7. This document uses Discovery as the model for CentOS 7 and Rocky 8/9, to discuss MANA Checkpointing and Restarting for MPI applications. However, any given site may configure SLURM differently.
Discovery is a high performance computing (HPC) resource for the Northeastern University research community. As of 2024, the Discovery cluster provides access to over 50,000 CPU cores and over 525 GPUs to all Northeastern faculty and students free of charge. Compute nodes are connected with either 10 GbE or high data rate InfiniBand (200 Gbps or 100 Gbps), supporting all types and scales of computational workloads. Users should also visit Discovery’s information page for more details.
Requesting Resources for Computation¶
There are two methods to request resources on Discovery: batch and interactive.
Using SBATCH:
Start by creating an SBATCH script:
    #!/bin/bash
    # Example SBATCH script
    #SBATCH -J MyJob            # Job name
    #SBATCH -N 2                # Number of nodes
    #SBATCH -n 16               # Number of tasks
    #SBATCH -o output_%j.txt    # Standard output file
    #SBATCH -e error_%j.txt     # Standard error file
    # SLURM does not expand shell variables such as $USER in #SBATCH lines,
    # so write the address out in full.
    #SBATCH --mail-user=<your_username>@northeastern.edu
    #SBATCH --mail-type=ALL
    ./my_program
Submit this file as a job on the cluster:
    sbatch <sbatch_script>
Monitor the output file:
    tail -f <output_file>
To interact with the allocated compute nodes:
    squeue -u <your_username>
    ssh cXXX    # where 'cXXX' is the allocated node's ID
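The submit-and-monitor steps above can be scripted. The following is a minimal sketch, assuming that sbatch prints its usual "Submitted batch job <id>" line and that the script uses the output_%j.txt pattern from the example. The function names job_id_from_sbatch and output_file_for_job, and the file name my_job.sbatch, are hypothetical and not part of SLURM or Discovery.

```shell
#!/bin/bash
# Hypothetical helpers (not part of SLURM or Discovery).

# Extract the numeric job ID from sbatch's "Submitted batch job <id>" reply.
job_id_from_sbatch() {
    printf '%s\n' "$1" | awk '/^Submitted batch job/ {print $4}'
}

# Map a job ID to the file name implied by "#SBATCH -o output_%j.txt".
output_file_for_job() {
    printf 'output_%s.txt\n' "$1"
}

# Usage on a login node (left as comments so the sketch is safe to source):
#   reply=$(sbatch my_job.sbatch)
#   jobid=$(job_id_from_sbatch "$reply")
#   tail -F "$(output_file_for_job "$jobid")"   # -F keeps retrying until the file appears
```

The parsing is kept in small functions so that it can be checked without access to a SLURM installation.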
Interactive Allocation Tool:
Load the job-assist module:
    module load job-assist
Run the interactive tool:
    job-assist
    # Example menu options:
    # SLURM Menu:
    # 1. Default mode (srun --pty /bin/bash)
    # 2. Interactive Mode
    # 3. Batch Mode
    # 4. Exit
The policy on Discovery is to run on a compute node, not on the original login node. If you just need a shell for compilation, choose option 1 (srun --pty /bin/bash) to obtain a shell on a compute node.
Interactive session using srun:
The srun command is useful for running jobs interactively once you are on a compute node. In this example, instead of using job-assist, we ask for a shell on the command line. Note that on Discovery, compute nodes may be shared. Even if you ask for all of the CPU cores (as specified by --ntasks), if you are not currently running a job, the system may still allocate another user to the same node. Further, on Discovery, nodes may use either TCP/IP (Ethernet) or InfiniBand. Optionally, add --constraint=ib to srun to request nodes with InfiniBand. (Other SLURM sites may give the ib feature a different name.)
    srun --partition=short --nodes=1 --ntasks=8 --cpus-per-task=1 --time=08:00:00 --mem=8GB --pty /bin/bash
- --partition=short¶
Define type of partition required.
- --nodes=1¶
Request one node to compute on. (Max allowed=2 for short partitions)
- --ntasks=8¶
Number of tasks (CPU cores) to run on requested compute nodes.
- --cpus-per-task=1¶
Inform resource manager that we will run one process per CPU-core.
- --time=08:00:00¶
Request the node for 8 hours uninterrupted.
- --mem=8GB¶
Request 8 GB of memory per node. (In SLURM, --mem is a per-node limit; use --mem-per-cpu for a per-core limit.)
- --constraint=ib¶
Option specific to Discovery: request InfiniBand nodes
- --pty /bin/bash¶
Create an interactive shell using /bin/bash.
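The options above can be assembled by a small helper, which makes the optional InfiniBand constraint easy to toggle. This is only a sketch: the function name interactive_srun_cmd is hypothetical, and the partition, time, and memory values are simply copied from the example rather than recommended settings. The helper prints the command instead of running it, so it can be reviewed first.

```shell
#!/bin/bash
# Hypothetical helper: print an srun command line for an interactive shell.
#   $1: number of tasks (default 8)
#   $2: optional node feature to pass to --constraint (e.g. "ib" on Discovery)
interactive_srun_cmd() {
    local ntasks="${1:-8}"
    local constraint="${2:-}"
    local cmd="srun --partition=short --nodes=1 --ntasks=${ntasks} --cpus-per-task=1 --time=08:00:00 --mem=8GB"
    if [ -n "$constraint" ]; then
        cmd="$cmd --constraint=${constraint}"
    fi
    # --pty /bin/bash goes last so the shell is the command that srun runs.
    printf '%s --pty /bin/bash\n' "$cmd"
}

# Usage: print the command, review it, then paste it into the login shell.
#   interactive_srun_cmd 8 ib
```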
Compiling MANA on Discovery¶
When running on the Discovery cluster, MANA compilation must be performed on a compute node. Login nodes are restricted from running compilations or other long commands by the admin.
Steps to compile MANA:
Switch to an interactive compute node using the instructions above.
Confirm you are on a compute node by running hostname (the name should start with either ‘c’ or ‘d’).
Set your modules to a reasonable default. As of early 2025, the default is gcc-4.8, python-2.7, and no MPI. We currently are choosing:
    # Check for compatible gcc, python, mpi
    module avail gcc
    module load gcc/8.1.0
    module avail python
    module load python/3.8.1
Next, choose your preferred MPI. When in doubt, use module show <modulefile> to get more information on the module. Here, we see a user switching choices.
    module avail mpi
    module avail mpich
    module avail openmpi    # Default is currently openmpi/3.1.2
    module load mpich       # Accept default: currently mpich/3.3.2
    module list
Now proceed with installing MANA on Discovery. For more detailed instructions, visit the MANA Home page.
    git clone https://github.com/mpickpt/mana
    cd mana
    git submodule init
    git submodule update
    ./configure
    make -j8
We use -j8 because we requested --ntasks=8 earlier. If you are developing software and wish to see the internals of MANA, choose ./configure --enable-debug instead.
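Two of the checks in this section can be automated: confirming that the shell is on a compute node, and matching the make parallelism to the allocation. The sketch below assumes the Discovery naming convention stated above (compute-node names start with 'c' or 'd') and relies on the SLURM_NTASKS environment variable that SLURM sets inside an allocation. The function names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helpers for the build step.

# Succeed if the host name (argument, or the current host) starts with 'c' or 'd'.
# This is only the naming heuristic described for Discovery, not a general test.
is_compute_node() {
    case "${1:-$(hostname)}" in
        c*|d*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Print the number of parallel make jobs: SLURM_NTASKS if set, otherwise 1.
make_jobs() {
    printf '%s\n' "${SLURM_NTASKS:-1}"
}

# Usage inside the mana source directory (left as comments):
#   is_compute_node || echo "Warning: this does not look like a compute node" >&2
#   ./configure && make -j"$(make_jobs)"
```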
Testing MANA on Discovery¶
Steps for testing MANA on the Discovery cluster:
Request a compute node interactively. As before, do:
srun --partition=short --nodes=1 --ntasks=8 --cpus-per-task=1 --time=08:00:00 --mem=8GB --pty /bin/bash
Open two terminals connected to the same compute node. Request the compute node using the instructions in the sections above, then SSH into it from a new terminal so that both terminals are attached to the same node. Consider the following points:
Your .ssh directory should be configured to use a key-handshake with localhost.
You can find the hostname to connect to by running squeue --me, which lists all the compute nodes assigned to your username.
Running ssh cXXX will connect to your compute node via ssh. (Here cXXX is a placeholder for your compute-node name.)
Launch a MANA coordinator in Terminal 1:
    PATH_TO_MANA/bin/mana_coordinator
The mana_coordinator command also supports these command line arguments:
- -p, --coord-port PORT_NUM (environment variable DMTCP_COORD_PORT)¶
Port to listen on (default: 7779)
- --port-file filename¶
File to write listener port number. (Useful with --port 0, which is used to assign a random port)
- --status-file filename¶
File to write host, port, pid, etc., info.
- --ckptdir (environment variable DMTCP_CHECKPOINT_DIR):¶
Directory to store dmtcp_restart_script.sh (default: ./)
- --tmpdir (environment variable DMTCP_TMPDIR):¶
Directory to store temporary files (default: env var TMPDIR or /tmp)
- --write-kv-data:¶
Writes key-value store data to a json file in the working directory
- --exit-on-last¶
Exit automatically when last client disconnects
- --kill-after-ckpt¶
Kill peer processes of computation after first checkpoint is created
- --timeout seconds¶
Coordinator exits after <seconds> even if jobs are active (Useful during testing to prevent runaway coordinator processes)
- --stale-timeout seconds¶
Coordinator exits after <seconds> if no active job (default: 8 hrs). (The default prevents runaway coordinators; override with a larger timeout or -1.)
- --daemon¶
Run silently in the background after detaching from the parent process.
- -i, --interval (environment variable DMTCP_CHECKPOINT_INTERVAL):¶
Time in seconds between automatic checkpoints (default: 0, disabled)
- --coord-logfile PATH (environment variable DMTCP_COORD_LOG_FILENAME)¶
Coordinator will dump its logs to the given file
- -q, --quiet¶
Skip startup msg; Skip NOTE msgs; if given twice, also skip WARNINGs
- --help:¶
Print this message and exit.
- --version:¶
Print version information and exit.
Launch the MPI process under MANA:
    mkdir ckpt_images
    mpirun -n 2 PATH_TO_MANA/bin/mana_launch --ckptdir ckpt_images PATH_TO_MANA/mpi-proxy-split/test/ping_pong.exe
NOTE: Usually, you use mana_launch directly with an executable compiled with the local mpicc command. In some cases (e.g., MPICH-4.x), we have encountered an MPI library that depends on other libraries with constructors (e.g., Intel or UCX libraries) that gain control before MANA. This can interfere with the proper functioning of mana_launch. If you encounter this, there are two possible workarounds.
NOTE: For background, a MANA computation uses a split-process architecture with two programs. The upper-half program contains the user MPI application, but it uses stub libraries that link MPI calls to an MPI library within a lower-half program. The lower half is a standalone MANA-specific MPI application. At checkpoint time, only the upper half is saved. At restart time, the lower-half program restores the memory of the upper half and re-binds it to the lower-half MPI library. For details, see the original MANA paper.
For both open and closed source MPI applications, we provide an option to use shadow libraries for the upper half of MANA only. This adds to the library search path a directory of dummy libraries that shadow certain libraries related to MPI. The lower half of MANA uses all of the standard MPI libraries. The directory of shadow libraries is PATH_TO_MANA/lib/tmp and is used ONLY with mana_launch.
- --use-shadowlibs¶
Launch MANA and use the shadow libraries in the upper half.
For open source MPI applications, a custom MANA compiler may be used: PATH_TO_MANA/bin/mpicc_mana. (Do not use --use-shadowlibs in this case.)
    mpicc_mana my_mpi_application.c
Create a checkpoint using Terminal 2:
PATH_TO_MANA/bin/mana_status -c
Restart from the checkpointed state:
PATH_TO_MANA/bin/mana_restart --restartdir ckpt_images
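The launch, checkpoint, and restart commands above can be collected into one file for rehearsal. The sketch below wraps each command from this section in a function and uses a hypothetical run helper that only prints commands while DRY_RUN=1 (the default here). MANA_ROOT stands in for PATH_TO_MANA, and all function and variable names are assumptions made for illustration, not part of MANA.

```shell
#!/bin/bash
# Hypothetical dry-run wrapper around the commands shown in this section.
MANA_ROOT="${MANA_ROOT:-PATH_TO_MANA}"   # placeholder for the MANA source tree
CKPT_DIR="${CKPT_DIR:-ckpt_images}"
DRY_RUN="${DRY_RUN:-1}"                  # 1: print commands only; 0: execute them

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Terminal 1: start the coordinator.
mana_coordinator_step() {
    run "$MANA_ROOT/bin/mana_coordinator"
}

# Launch the MPI test program under MANA.
mana_launch_step() {
    run mkdir -p "$CKPT_DIR"
    run mpirun -n 2 "$MANA_ROOT/bin/mana_launch" --ckptdir "$CKPT_DIR" \
        "$MANA_ROOT/mpi-proxy-split/test/ping_pong.exe"
}

# Terminal 2: request a checkpoint.
mana_checkpoint_step() {
    run "$MANA_ROOT/bin/mana_status" -c
}

# Restart from the saved images.
mana_restart_step() {
    run "$MANA_ROOT/bin/mana_restart" --restartdir "$CKPT_DIR"
}
```

With DRY_RUN=1, calling mana_launch_step followed by mana_checkpoint_step prints the commands in order, which is a cheap way to confirm paths before using a real allocation.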
Note:¶
Checkpoint images cannot be created after the application calls MPI_Finalize. This restriction avoids creating corrupt checkpoint images, which would cause a segmentation fault at restart.
Note: three ways to create checkpoints¶
There are three ways to create a checkpoint.
Using mana_status -c, as above.
Periodic checkpointing with -i 60 (60 seconds). This option can be used with mana_coordinator, mana_launch, or mana_restart.
In advanced usage, there is a way to request a checkpoint under program control.
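As an illustration of the periodic option, the sketch below builds a mana_coordinator command line with a validated -i value. The function name periodic_coordinator_cmd is hypothetical, and the only flag used is the -i option listed among the coordinator arguments earlier. The command is printed rather than executed.

```shell
#!/bin/bash
# Hypothetical helper: print a mana_coordinator command with periodic checkpoints.
#   $1: path to the MANA tree (stands in for PATH_TO_MANA)
#   $2: interval in seconds (default 60; 0 disables periodic checkpoints)
periodic_coordinator_cmd() {
    local mana_root="$1"
    local interval="${2:-60}"
    # Reject anything that is not a non-negative integer.
    case "$interval" in
        ''|*[!0-9]*)
            echo "interval must be a non-negative integer" >&2
            return 1
            ;;
    esac
    printf '%s/bin/mana_coordinator -i %s\n' "$mana_root" "$interval"
}

# Usage:
#   periodic_coordinator_cmd PATH_TO_MANA 60
```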