What Is Checkpointing?

Checkpointing is a method of taking a snapshot of a running application and recording the values captured (memory content, register values, PIDs, etc.) on disk.

How Can It Help Me?

An application that crashes or is suspended can be restarted from its last checkpoint on the same machine, or migrated to another machine and resumed there. In particular, if the system suspends a running job, that job can be migrated and resumed on a different machine.

When a guest runs a job on a machine owned by another user, that job is suspended as soon as the owner runs a job of his or her own. The suspended process stops running, yet does not release the memory it uses. Checkpointing allows the suspended job to migrate to another machine, freeing all of its memory on the original machine and sparing the guest the time otherwise wasted waiting for the owner's job to complete.

Does It Work Under All Circumstances?

Checkpointing, which is performed with the BLCR libraries, does have some limitations. The following information is taken from the BLCR FAQ:

  • BLCR does not support checkpointing of certain process resources. While the following list is not exhaustive, it lists the most significant issues we are aware of.
  • BLCR will not checkpoint and/or restore open sockets (TCP/IP, UNIX domain, etc.). At restart time, any sockets will appear to have been closed.
  • BLCR will not checkpoint and/or restore open character or block devices (e.g. serial ports or raw partitions). At restart time, any devices will appear to have been closed.
  • BLCR does not handle SysV IPC objects (see the svipc(7) man page). Such resources are silently ignored at checkpoint time and are not restored.
  • If a checkpoint is taken of a process with any "zombie" children, these children will not be recreated at restart time. A "zombie" is defined as a process that has exited, but whose exit status has not yet been reaped by its parent (via wait() or a related function). This means that a wait() family call made after a restart will never return a status for such a child.

When Are Job Checkpoints Taken?

Job checkpoints are taken once every 3 hours. Users can override this default with a longer (but not a shorter) interval via the -k option.
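For example, using the -k syntax described under Instructions for Use below, the interval can be extended to six hours (the application name and parameters are placeholders):

bsub -k "$HOME/.checkpoints 360 method=blcr" cr_run <your application> <your parameters>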

What Applications Have Been Certified by the WEXAC Team for Checkpointing?

The following applications were tested with LSF, using three BLCR checkpointing methods: sequential, parallel intra-node, and parallel inter-node.

  • MATLAB serial and parallel: BLCR sequential: Successful; BLCR parallel intra-node: Failed; BLCR parallel inter-node: N/A (requires the MATLAB Parallel toolbox, which is not owned by the Weizmann Institute)
  • C serial and parallel [1]: BLCR sequential: Successful; BLCR parallel intra-node: Code-dependent; BLCR parallel inter-node: Code-dependent
  • Bowtie serial [2]: BLCR sequential: Successful; BLCR parallel intra-node: N/A; BLCR parallel inter-node: N/A
  • Java serial [3]: compliance development effort required for all three methods

[1] For the BLCR parallel intra-node and inter-node checkpointing methods, code needs to be BLCR-compliant, and developers are advised to avoid local disk usage.
[2] Bowtie applications are serial only.
[3] By default, data saved by Java is stored in /tmp on the local disk, and can therefore not be captured by a checkpoint. Developers are advised to modify application code so as to avoid local disk usage.
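As one illustration of avoiding local-disk temporary files (this alone does not make a Java application BLCR-compliant), the JVM's standard java.io.tmpdir property can redirect temporary data to a directory on shared storage instead of /tmp; the jar name and directory below are placeholders:

bsub -k "$HOME/.checkpoints 180 method=blcr" cr_run java -Djava.io.tmpdir=$HOME/tmp -jar <your application>.jar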

What About Other Applications That the WEXAC Team Hasn't Checked?

Should BLCR fail to checkpoint an application, its user will receive an e-mail notification to that effect. Note that when the owner of a machine runs a job, guest jobs are suspended and terminated even if checkpointing has failed. We therefore recommend that guest users rewrite or reconfigure their applications to be reentrant (see below).

Instructions for Use 

Great! How Can I Use It?

The public queue all.q has been configured for checkpointing. Checkpointing is not optional: it is a mandatory policy applied to all jobs submitted to public queues such as all.q.

The checkpoint directory’s path is $HOME/.checkpoints. Checkpointing is invoked every 3 hours, and the default checkpointing method is BLCR.
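Assuming LSF's usual layout, the checkpoint files for each job are written to a subdirectory named after the job ID, which you can inspect with, for example:

ls $HOME/.checkpoints/<job ID>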

cr_run must be executed with your script or application provided as a parameter. See the following examples:

Example 1:

bsub -k "$HOME/.checkpoints 180 method=blcr" cr_run <your application> <your parameters>

The -k parameter supports the three following options:

  1. Checkpoint directory (this is just a filler and can't actually be changed)
  2. Checkpoint interval in minutes
  3. Checkpointing method (blcr or reent); cr_run is used for BLCR checkpointing

Example 2:

bsub cr_run myscript.sh

LSF uses $HOME/.checkpoints for checkpoint file storage. The default checkpointing method is BLCR, and the default checkpoint interval is 3 hours. Your jobs will also be migrated 5 minutes after being suspended by LSF (SSUSP).
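In addition to the automatic interval, LSF's standard bchkpnt and brestart commands can be used to checkpoint a job on demand and to restart it from the checkpoint directory. A sketch with a placeholder job ID:

bchkpnt <job ID>
brestart $HOME/.checkpoints <job ID>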

All BLCR libraries and executables must be locatable by your jobs. To achieve this, include module load blcr in your .bashrc script.
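For example, append the following line to $HOME/.bashrc:

module load blcr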

Can I Checkpoint Parallel Applications?

Only if you use the Intel MPI libraries. See the earlier section What Applications Have Been Certified by the WEXAC Team for Checkpointing? for more information.

Reentrant Checkpointing

Reentrant checkpointing can be applied to applications designed to save snapshots of their own state (current position and any relevant parameters) so that they can later pick up where they left off.

Reentrant checkpointing is considered a best practice for present-day HPC workloads and for HPC cloud service scenarios. Unlike BLCR, it is agnostic to the underlying checkpointing mechanism and lets developers checkpoint and resume at will. Users are, of course, responsible for writing the code appropriate for this checkpointing method.

Should your job already be reentrant (i.e. you developed the job with reentrant support), you must declare it to LSF as such, as follows:

bsub -k "/some/directory 180 method=reent" myscript.sh

A reentrant myscript.sh might look like the following:

#!/bin/bash

# Detect a restart by looking for the "Y" marker that this script prints
# to its own output on the first run (bpeek shows the job's output so far).
LSB_RESTART=`bpeek $LSB_JOBID | /bin/grep "Y"`

if [ -z "$LSB_RESTART" ]; then
    # First run: print the marker, then perform the initial work.
    echo "Y"
    # do stuff in the beginning
elif [ "$LSB_RESTART" == "Y" ]; then
    # The marker was found, so the job has been restarted.
    echo "RESTARTED the script"
    # continue from where the job was last stopped
fi

 

LSB_RESTART must be used in an if statement to check whether or not the job has been restarted.
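As a minimal illustration of the reentrant pattern itself (the state file location and the loop are hypothetical), a job can record its progress in a file and resume from the last completed step after a restart:

#!/bin/bash
# Hypothetical reentrant sketch: progress is written to a state file so that
# a restarted job can resume from the last completed step.
STATE="$HOME/.checkpoints/myjob.state"

START=1
if [ -f "$STATE" ]; then
    START=`cat "$STATE"`      # resume from the last recorded step
fi

for (( i=START; i<=1000; i++ )); do
    # ... perform one unit of work here ...
    echo "$i" > "$STATE"      # record progress after each step
done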