MANA: SUSE Linux 13 (Perlmutter)
================================
--------------------------------
Perlmutter Cluster at NERSC/LBNL
--------------------------------
Perlmutter is a supercomputer at NERSC running **SUSE Enterprise**. This document
uses Perlmutter as the model for SUSE Enterprise systems when discussing MANA
checkpointing and restarting of MPI applications. Note that each
site may configure SLURM differently.
When it was introduced, Perlmutter was the #5 supercomputer on the TOP500 list
of June 2021. Users should also visit Perlmutter's documentation
`page `_ for more details.
.. contents:: Contents of this page
:backlinks: entry
:local:
:depth: 2
-------------------------------------
Requesting Resources for Computation
-------------------------------------
There are two methods to request resources on Perlmutter:
1. **Using SBATCH:**
* Start by creating a SBATCH script
.. code:: shell
#!/bin/bash
# Example SBATCH script (the shebang must be the first line of the file)
#SBATCH -A
#SBATCH -C cpu
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
srun -n 32 --cpu-bind=cores -c 16 ./myapp
In SLURM terminology, a *cpu* is a CPU core, a *node* is a computer
host, and a *task* is an MPI process. The *qos* (quality of service)
names the type of partition from which nodes are requested (``debug`` in this example).
* Submit this file as a job on the cluster:
.. code:: shell
sbatch
* Monitor the output file:
.. code:: shell
tail -f
* To interact with allocated compute nodes:
.. code:: shell
squeue --me # squeue -u $USER
ssh XXX # where 'XXX' is the allocated node's hostname
2. **Interactive session using salloc:**
* The **salloc** command is useful for allocating interactive jobs.
The ``--constraint cpu`` option specifies CPU-only nodes (no GPU).
.. code:: shell
salloc --qos interactive --nodes 1 --time 04:00:00 --constraint cpu --account XXX
.. option:: --nodes 1
Request one node to compute on.
.. option:: --time=04:00:00
Request the node for 4 hours uninterrupted.
.. option:: --account
Account name of the project this computation will be charged to.
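As a sanity check on the SBATCH example above, the task count passed to
``srun -n`` should equal the number of nodes times the tasks per node. The
sketch below recomputes it with shell arithmetic. The variable names are
illustrative only; SLURM does not read them.

```shell
#!/bin/bash
# Values copied from the example SBATCH script above.
# The variable names are illustrative; SLURM does not read them.
nodes=2              # --nodes=2
tasks_per_node=16    # --ntasks-per-node=16
cpus_per_task=16     # srun -c 16

# Total MPI processes: this is the value to pass to 'srun -n'.
total_tasks=$((nodes * tasks_per_node))

# CPUs (in SLURM's sense) requested on each node by this layout.
cpus_per_node=$((tasks_per_node * cpus_per_task))

echo "total_tasks=${total_tasks}"
echo "cpus_per_node=${cpus_per_node}"
```

If ``--nodes`` or ``--ntasks-per-node`` is changed, the ``-n`` value on the
``srun`` line should be changed to match.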
----------------------------
Compiling MANA on Perlmutter
----------------------------
On the Perlmutter cluster, we recommend compiling MANA
on a login node.
Steps to compile MANA:
.. code:: shell
git clone https://github.com/mpickpt/mana
cd mana
git submodule init
git submodule update
./configure
make -j$(nproc)
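After the build finishes, it can be useful to confirm that the commands used
later on this page are present. The following is a minimal sketch, assuming
the build places them under ``bin/`` in the source tree (as the
``PATH_TO_MANA/bin/...`` paths on this page suggest). ``check_mana_build`` is
a hypothetical helper and is not part of MANA.

```shell
#!/bin/bash
# Report which of the MANA commands named on this page are missing from a
# build tree. 'check_mana_build' is a hypothetical helper, not part of MANA;
# the bin/ layout is assumed from the PATH_TO_MANA/bin paths used here.
check_mana_build() {
  local mana_root="$1"
  local missing=0
  local cmd
  for cmd in mana_coordinator mana_launch mana_restart mana_status; do
    if [ ! -x "${mana_root}/bin/${cmd}" ]; then
      echo "missing: ${cmd}"
      missing=$((missing + 1))
    fi
  done
  if [ "${missing}" -eq 0 ]; then
    echo "all MANA commands found"
  fi
  # The return status is the number of missing commands (0 means success).
  return "${missing}"
}

# Usage (run from the directory that contains the cloned 'mana' tree):
#   check_mana_build "$PWD/mana"
```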
--------------------------
Testing MANA on Perlmutter
--------------------------
Steps for testing MANA on the Perlmutter cluster:
1. Request one or more compute nodes interactively using salloc:
See the interactive ``salloc`` instructions in the section above.
2. Open two terminals connected to the same compute node. Request the compute
node as described in the sections above, then SSH into
it from a second terminal so that both terminals are attached to the same
compute node. Consider the following points:
* To find the hostname to use with ssh, run
``squeue --me``, which lists the compute nodes assigned to
your username.
* Running ``ssh XXXX`` will connect to your compute node via ssh.
(Here ``XXXX`` is a placeholder for your compute-node name.)
3. Launch a MANA coordinator in Terminal 1:
.. code:: shell
PATH_TO_MANA/bin/mana_coordinator
The ``mana_coordinator`` command also supports these command line arguments:
.. option:: -p, --coord-port PORT_NUM (environment variable DMTCP_COORD_PORT)
Port to listen on (default: 7779)
.. option:: --port-file filename
File to write listener port number.
(Useful with '--port 0', which is used to assign a random port)
.. option:: --status-file filename
File to write host, port, pid, etc., info.
.. option:: --ckptdir (environment variable DMTCP_CHECKPOINT_DIR)
Directory to store dmtcp_restart_script.sh (default: ./)
.. option:: --tmpdir (environment variable DMTCP_TMPDIR)
Directory to store temporary files (default: env var TMPDIR or /tmp)
.. option:: --write-kv-data
Writes key-value store data to a JSON file in the working directory
.. option:: --exit-on-last
Exit automatically when last client disconnects
.. option:: --kill-after-ckpt
Kill peer processes of computation after first checkpoint is created
.. option:: --timeout seconds
Coordinator exits after the given number of seconds, even if jobs are active
(Useful during testing to prevent runaway coordinator processes)
.. option:: --stale-timeout seconds
Coordinator exits after the given number of seconds if there is no active job (default: 8 hrs)
(Default prevents runaway coord's; Override w/ larger timeout or -1)
.. option:: --daemon
Run silently in the background after detaching from the parent process.
.. option:: -i, --interval (environment variable DMTCP_CHECKPOINT_INTERVAL):
Time in seconds between automatic checkpoints
(default: 0, disabled)
.. option:: --coord-logfile PATH (environment variable DMTCP_COORD_LOG_FILENAME)
Coordinator will dump its logs to the given file
.. option:: -q, --quiet
Skip startup msg; Skip NOTE msgs; if given twice, also skip WARNINGs
.. option:: --help
Print this message and exit.
.. option:: --version
Print version information and exit.
4. Launch the MPI process under MANA using srun:
.. code:: shell
mkdir ckpt_images
srun -n 2 PATH_TO_MANA/bin/mana_launch --ckptdir ckpt_images PATH_TO_MANA/mpi-proxy-split/test/ping_pong.exe
Use ``mpirun`` instead of ``srun`` if you are using the Open MPI module.
**NOTE:** Usually, you use ``mana_launch`` directly with an executable
compiled with the local ``mpicc`` command. In some cases (e.g., MPICH-4.x),
we have encountered an MPI library that depends on other libraries with
constructors (e.g., Intel or UCX libraries) that gain control before MANA.
This can interfere with the proper functioning of ``mana_launch``.
If you encounter this, there are two possible workarounds.
**NOTE:** For background, a MANA computation uses a split-process
architecture with two programs. The upper-half program contains the user MPI
application, but it uses stub libraries that link MPI calls to an MPI
library within the lower-half program. The lower half is a standalone
MANA-specific MPI application. At checkpoint time, only the upper
half is saved. At restart time, the lower-half program restores the
memory of the upper half and re-binds it to the lower-half MPI library.
For details, see the original :ref:`MANA paper`.
A. For both open and closed source MPI applications, we provide
an option to use *shadow libraries* for the ``upper half`` of MANA,
only. This adds to the library search path a directory of dummy
libraries to shadow certain libraries related to MPI. The ``lower
half`` of MANA uses all of the standard MPI libraries. The directory
of shadow libraries is contained in ``PATH_TO_MANA/lib/tmp`` and
is used ONLY with ``mana_launch``.
.. option:: --use-shadowlibs
Launch MANA and use the shadow libraries in the upper half.
B. For open source MPI applications, a custom MANA compiler may be used:
``PATH_TO_MANA/bin/mpicc_mana``. (And do not use ``--use-shadowlibs``
in this case.)
.. code:: shell
mpicc_mana my_mpi_application.c
5. Create a checkpoint using Terminal 2:
.. code:: shell
PATH_TO_MANA/bin/mana_status -c
6. Restart from the checkpointed state:
.. code:: shell
PATH_TO_MANA/bin/mana_restart --restartdir ckpt_images
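Steps 3 through 6 above can also be collected into a single script. The
following is an untested sketch rather than a supported workflow. It assumes
that ``--daemon`` and ``--kill-after-ckpt`` behave as described in the option
list above, and that a fixed delay is long enough for the job to start. The
helpers ``run`` and ``mana_test_cycle`` and the variable ``MANA_DRY_RUN`` are
hypothetical names introduced here. In dry-run mode (the default in this
sketch) the commands are only printed.

```shell
#!/bin/bash
# Sketch of steps 3-6 as one script. 'run', 'mana_test_cycle', and
# MANA_DRY_RUN are hypothetical names, not part of MANA. With
# MANA_DRY_RUN=1 the commands are printed instead of executed.
MANA_DRY_RUN="${MANA_DRY_RUN:-1}"

run() {
  if [ "${MANA_DRY_RUN}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

mana_test_cycle() {
  local mana="$1"       # PATH_TO_MANA
  local ckptdir="$2"    # directory for checkpoint images
  local app="$3"        # MPI executable to run under MANA
  local wait_secs="$4"  # how long to let the job run before checkpointing

  run mkdir -p "${ckptdir}"
  # Step 3: --daemon detaches the coordinator; --kill-after-ckpt ends the
  # computation once the first checkpoint has been created.
  run "${mana}/bin/mana_coordinator" --daemon --kill-after-ckpt
  # Step 4: launch in the background so this script can request a checkpoint.
  run srun -n 2 "${mana}/bin/mana_launch" --ckptdir "${ckptdir}" "${app}" &
  run sleep "${wait_secs}"
  # Step 5: request the checkpoint, then wait for the launched job to exit.
  run "${mana}/bin/mana_status" -c
  wait
  # Step 6: restart from the saved images.
  run "${mana}/bin/mana_restart" --restartdir "${ckptdir}"
}

# Usage (prints the commands; set MANA_DRY_RUN=0 on a compute node to run them):
#   mana_test_cycle PATH_TO_MANA ckpt_images PATH_TO_MANA/mpi-proxy-split/test/ping_pong.exe 10
```

Printing the commands first makes it easy to compare them with the manual
steps before running anything on an allocated node.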
--------------------------------------
Note:
--------------------------------------
Checkpoint images cannot be created after the application calls ``MPI_Finalize``. This
avoids creating corrupt checkpoint images, which would cause a segmentation fault at restart.
--------------------------------------
Note: three ways to create checkpoints
--------------------------------------
There are three ways to create a checkpoint.
1. Using ``mana_status -c`` as above.
2. Periodic checkpointing with ``-i 60`` (60 seconds). This option
can be used with any of ``mana_coordinator``, ``mana_launch``, or
``mana_restart``.
3. In advanced usage, there's a way to request a checkpoint under program control.
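For periodic checkpointing, the ``-i`` interval is given in seconds. The
sketch below converts a value in minutes to the matching ``-i`` argument.
``ckpt_interval_flag`` is a hypothetical helper and is not part of MANA; the
command is printed rather than executed.

```shell
#!/bin/bash
# Print the '-i SECONDS' argument for a checkpoint interval given in minutes.
# 'ckpt_interval_flag' is a hypothetical helper, not part of MANA.
ckpt_interval_flag() {
  local minutes="$1"
  echo "-i $((minutes * 60))"
}

# Usage: build a coordinator command that checkpoints every 30 minutes.
# The command is only printed here, not run.
echo "PATH_TO_MANA/bin/mana_coordinator $(ckpt_interval_flag 30)"
```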