PBS: job submission

Hooray! We are now ready to send jobs.

As already mentioned, the preferred way of submitting a job is using a script. The script is read by the batch system (PBS Professional), which looks for "directives": lines at the start of the file beginning with "#PBS". These are taken as the runtime options. The first line is always executed, and any following line that contains a command and does not begin with # is executed as well. Lines starting with # (but not #PBS) are ignored and may be used as comments (e.g. ##PBS is interpreted as a comment).
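As a rough illustration of these rules, a minimal script might look like the sketch below. The job name example_job, the resource values and the echoed message are placeholders chosen for this example, not values prescribed by this guide.

```shell
#!/bin/bash
#PBS -N example_job
#PBS -l select=1:ncpus=1:mem=1gb
##PBS -m eb   (disabled: a line starting with "##PBS" is an ordinary comment)
# A plain comment, also ignored by PBS.

# Everything below is an ordinary shell command and is executed.
greeting="Hello from the job script"
echo "$greeting"
```

Directives are only recognised at the start of the file, so it is safest to keep all "#PBS" lines together before the first command.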

In the previous section we noted that PBS operates with chunks of resources. Therefore, the very important PBS directive

#PBS -l ...

asks for resources. For example, let’s ask for 8 CPUs and 6 GB of memory:

#PBS -l select=1:ncpus=8:mem=6gb

Note the use of the “select” keyword: it specifies how many “chunks” of the same kind we are interested in. Suppose that we need more than one chunk of 8 CPUs, let’s say three:

#PBS -l select=3:ncpus=8:mem=6gb

This request will give us 24 CPUs and 18 GB of memory. It does not specify whether these chunks have to be on a single node in the cluster or on different ones. If you are using an MPI-based program, you can also specify a number of MPI processes per chunk that differs from the number of CPUs. This is done using the “mpiprocs” keyword:

#PBS -l select=3:ncpus=8:mpiprocs=8:mem=6gb

This directive specifies 8 MPI processes for the 8 CPUs in each of the 3 chunks, i.e. 24 MPI processes in total.
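To make the arithmetic explicit, the following sketch totals up what such a request asks for. It is only an illustration: the variable names are made up for this example, and the parsing handles just the keywords used above and a "gb" memory suffix.

```shell
#!/bin/bash
# Hypothetical helper: total up the resources named in a "select" request.
spec="select=3:ncpus=8:mpiprocs=8:mem=6gb"

chunks=1; ncpus=1; mpiprocs=0; mem_gb=0

# Split the request on ":" into key=value fields.
old_ifs=$IFS
IFS=':'
set -- $spec
IFS=$old_ifs

for field in "$@"; do
    key=${field%%=*}
    value=${field#*=}
    case "$key" in
        select)   chunks=$value ;;
        ncpus)    ncpus=$value ;;
        mpiprocs) mpiprocs=$value ;;
        mem)      mem_gb=${value%gb} ;;   # this sketch only understands "gb"
    esac
done

# Per-chunk values are multiplied by the number of chunks.
total_cpus=$((chunks * ncpus))
total_mem_gb=$((chunks * mem_gb))
total_ranks=$((chunks * mpiprocs))
echo "CPUs: $total_cpus, memory: $total_mem_gb GB, MPI processes: $total_ranks"
```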

Please note that the default resource request in our cluster is 1 CPU and 1000 MB of memory, which may not be sufficient for your application.

Now to the most important point of this guide: how are the requested resources related to your program? The short answer is: they are not! A user may ask for 20 CPUs and 100 GB of memory; they will get them and will be charged for those resources. However, it is solely the user’s responsibility to ensure that the application indeed utilizes those resources. Based on this information, you may now understand why the answer to the question "if I asked for 8 cores, will my Matlab run faster than on my PC?" is "it depends". Your executable has to be capable of utilizing the requested computational resources, and you have to let it know how many CPUs you would like it to use. The exception is MPI-based codes, where the mpiprocs keyword creates a special file for the mpirun command to read.
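For a multithreaded program, one common way to "let it know" is an environment variable. The sketch below assumes an OpenMP-based program, which reads the standard OMP_NUM_THREADS variable; other applications have their own options, so check your program's documentation. The value 8 is simply repeated by hand from the ncpus request.

```shell
#!/bin/bash
#PBS -l select=1:ncpus=8:mem=6gb

# Assumption: the program is OpenMP-based and honours OMP_NUM_THREADS.
ncpus_requested=8                  # keep in sync with ncpus in the line above
export OMP_NUM_THREADS=$ncpus_requested
echo "running with $OMP_NUM_THREADS threads"
# ./my_program                     # the real executable would be started here
```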

Now, suppose we have a program that can utilize 24 cores and 18 GB of memory. What else do we need in our submission script?

Most importantly, we need to specify a queue:

#PBS -q

Note: if you don’t specify a queue, the sleep queue will be set by default.

It is a good idea to give a distinguishable name to a job. This will make your life easier:

#PBS -N

It would also be nice to receive an email notification when the job begins and when it ends:

#PBS -m eb

and unless you want to manage another inbox, it is standard practice to specify where this mail should arrive:

#PBS -M my.mail@somewhere.com

Since the job is non-interactive, the standard output and standard error will be redirected to files named “job_name.o{job_id}” and “job_name.e{job_id}” respectively. You may prefer to combine everything in a single file. In this case add the following option:

#PBS -j oe

Note: the files above are located in the /var partition on the compute nodes, which has limited capacity. Therefore, if your application produces massive standard output and/or standard error, it is a good idea to redirect them to a file in your work directory. For the standard output:

./my_program > std_out.file

For the standard error:

./my_program 2> std_err.file

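For both the standard output and the standard error (the order matters: first redirect the standard output to the file, then send the standard error to the same place):

./my_program > std_out_and_err.file 2>&1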
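The shell processes redirections from left to right, so the position of "2>&1" matters. The sketch below demonstrates this with a made-up function, noisy, that stands in for a real program and writes one line to each stream.

```shell
#!/bin/bash
# Hypothetical stand-in for a program that writes to both streams.
noisy() { echo "to stdout"; echo "to stderr" >&2; }

both=$(mktemp)
only_out=$(mktemp)

# Stdout goes to the file first, then stderr is pointed at the same place.
noisy > "$both" 2>&1

# Here stderr is pointed at the *old* stdout before stdout moves to the file,
# so only the stdout line ends up in the file.
noisy 2>&1 > "$only_out"

n_both=$(grep -c 'to std' "$both")
n_only_out=$(grep -c 'to std' "$only_out")
echo "lines captured: $n_both versus $n_only_out"
rm -f "$both" "$only_out"
```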

In order to discuss the next option, we first need to expand on the fair use of computational allocations. Chemfarm is a shared (“public”) resource, and a user is generally allowed to use resources primarily allocated to others on a free-space basis. Therefore, a job that exceeds the user or group allocation may be suspended (status="S"). If the exact same resources become free again, the job will try to resume from the state it was in before the suspension. However, jobs that remain suspended for more than one hour will be killed and, by default, automatically re-queued. In other words, the job will re-enter the queued state with the same job number, and once it starts, it will start from the beginning. The user is responsible for making sure that such a restart does not overwrite any file that might be needed later. In order to prevent the re-queue mechanism, include the following directive in the job script:

#PBS -r n
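If you do leave re-queueing enabled, one possible way to protect earlier output is to move it aside at the top of the job script. This is only a sketch: results.dat is a hypothetical file name, and a temporary directory stands in for the job's working directory so that the example can be tried anywhere.

```shell
#!/bin/bash
# Dry-run setup (hypothetical): pretend an earlier attempt left some output.
workdir=$(mktemp -d)          # stands in for $PBS_O_WORKDIR
cd "$workdir" || exit 1
echo "partial results from the first attempt" > results.dat

# At the top of the real job script: keep output from an earlier attempt.
if [ -e results.dat ]; then
    backup="results.dat.$(date +%Y%m%d-%H%M%S)"
    mv results.dat "$backup"
    echo "kept earlier output as $backup"
fi
```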

Based on previous experience, these PBS options are the most important and useful ones. Nevertheless, many more are available; they can be found here. A useful quick guide is here.

Now that the PBS part of our submission script is defined, we can move on to the actual content of the job script - your calculation.

For all practical purposes, any command written in the script after the last #PBS option is executed by a (random) CPU on one of the compute nodes allocated to your job. While writing the script, you may treat this just as if you had logged in to another computer that has access to your files. Later, the "User Environment" section will explain the main differences between the compute nodes and an actual login node. It is important to note that, just as if you had actually logged in to the node, the initial directory when the job begins is the user's home directory (e.g. /home/$USER). Since the job input files are probably in a different directory, you will need a `cd` command. Conveniently, the directory from which you submitted the job (where the qsub command was entered) is already stored in a shell environment variable, $PBS_O_WORKDIR. Most jobs therefore begin with the line:

cd $PBS_O_WORKDIR

Typically, one would then proceed to execute the program that performs the calculation. For example, if your executable is found in $PBS_O_WORKDIR, you would now write (note that the path ./ is prepended to the file name):

./my_program

If the executable file my_program is placed in a folder different from the folder in which the job started, just give the path to that folder:

PATHtoProgram/my_program

If your program does not report timing information, it might be useful to ask the shell to time your calculation by using the 'time' utility:

time ./my_program

If your program is MPI-enabled, you only need to add the 'mpirun' command to automatically run the same program with the allocated number of cpus (remember the previous discussion about the mpiprocs PBS option). Note that you can still use the 'time' utility as well:

time mpirun ./my_program

If you have experience running MPI programs, you may find the previous line slightly odd, since the 'mpirun' command usually needs to be told the number of processes to 'spawn'. However, if the resource line of a submission script is written properly, all the necessary information for 'mpirun' will already be included by PBS in a special file that 'mpirun' is aware of.
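On PBS Professional this file is normally the one named by the $PBS_NODEFILE environment variable, which lists one host name per MPI process. The sketch below shows how a job script might inspect it; outside a job the variable is unset, so the example builds a stand-in file with invented host names (node01, node02) purely for illustration.

```shell
#!/bin/bash
# Inside a job PBS sets PBS_NODEFILE. Outside a job, build a hypothetical
# stand-in with the same layout: one host name per MPI process.
if [ -z "${PBS_NODEFILE:-}" ]; then
    PBS_NODEFILE=$(mktemp)
    printf 'node01\nnode01\nnode02\nnode02\n' > "$PBS_NODEFILE"
fi

nprocs=$(grep -c '' "$PBS_NODEFILE")               # number of lines
nnodes=$(sort -u "$PBS_NODEFILE" | grep -c '')     # number of distinct hosts
echo "MPI processes: $nprocs on $nnodes node(s)"
```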

Lastly, these useful additional lines in the job file are suggested:

echo `hostname`

to know on which compute node your job was running. For a multi-node job, you will see only the first node.

date

If you add this command before and after the application line, it will write the date to the standard output, which will let you easily see when the actual calculation started and finished.
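If you would rather have the elapsed time computed for you, the two time stamps can be taken in seconds and subtracted. This is a small sketch; "sleep 1" is merely a stand-in for the real application line.

```shell
#!/bin/bash
start=$(date +%s)      # seconds since the epoch
date                   # human-readable start time

sleep 1                # stand-in for: ./my_program

date                   # human-readable end time
end=$(date +%s)
elapsed=$((end - start))
echo "elapsed: ${elapsed} s"
```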

Now that we have covered both parts of the job file, let us put everything together. The following are working examples for a serial job (one CPU per job), and here for a parallel MPI job (many CPUs per job).
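For orientation, a serial job script assembled from the pieces discussed so far might look roughly like the sketch below. It is not the cluster's official example: the job name and resource values are placeholders, no queue is named (so the default applies), and a shell function stands in for the real executable so that the script can be dry-run outside PBS.

```shell
#!/bin/bash
#PBS -N serial_example
#PBS -l select=1:ncpus=1:mem=1gb
#PBS -m eb
#PBS -M my.mail@somewhere.com
#PBS -j oe
#PBS -r n

# Outside PBS (a dry run) PBS_O_WORKDIR is unset, so fall back to the
# current directory.
cd "${PBS_O_WORKDIR:-$PWD}" || exit 1

# Hypothetical stand-in for the real executable; replace with ./my_program.
my_program() { echo "calculation finished"; }

hostname
date
my_program
status=$?
date
echo "exit status: $status"
```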

Unfortunately, some applications, such as Matlab, are extremely "unfriendly" to networked file systems. Therefore, users running them are requested to use the compute node's (small) local storage, which is physically inside the node. Otherwise, a single job could degrade the performance of the entire cluster and interfere with other users' jobs. In order to use the local storage, please use the examples below to make sure that the movement of data before and after the calculation is complete before the job exits. You can also make sure that, if the job finishes successfully, the scratch directory containing the temporary data is removed (as a courtesy to other users). If the job fails, the directory /scratch/username/{job_id} will be retained on the executing node (whose name will be printed in the job output file by the "echo `hostname`" command discussed earlier).

Because the local storage on the compute nodes is small, it is important to clean your directories on the scratch disks from time to time. The scratch directory may be accessed during runtime by a standard "ssh" to the executing node and changing directory to the chosen scratch directory "cd /scratch/username/{job_id}". Since you cannot access the job standard output while it is running, you will need to use the commands in the "Job Monitoring" section to find out on which node your job is running.
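One possible way to spot leftovers is to list scratch directories that have not been modified for a while. The sketch below is a dry run: a temporary directory stands in for /scratch/username on a compute node, the job directory names are invented, and the 7-day threshold is an arbitrary choice. It only lists candidates; review the list before deleting anything.

```shell
#!/bin/bash
# Dry-run setup (hypothetical): two job directories, one made to look old.
scratch_root=$(mktemp -d)
mkdir "$scratch_root/1001.hypothetical" "$scratch_root/1002.hypothetical"
touch -t 202001010000 "$scratch_root/1001.hypothetical"

# List job directories not modified during the last 7 days.
stale=$(find "$scratch_root" -mindepth 1 -maxdepth 1 -type d -mtime +7)
echo "$stale"
```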

As promised, the following is an example of a submission script for a job, running a Matlab program (spaces are important). The script implies that the input files for the program are placed in a subdirectory called "input", while the output files will be placed in an (existing) subdirectory "output". The Matlab program is placed in the working directory itself. Note that if your m-file is a function "my_func.m" you may use the "matlab -r my_func" command (Important: the function inside the file has to be named my_func too!). If your m-file is just a list of Matlab commands, you may use the more universal "matlab < my_file.m" command.

First, create your scratch directory on the local disk if it does not exist:

if ! [ -d "/scratch/$USER" ]; then
    mkdir "/scratch/$USER"
fi

Then, using the job number stored in $PBS_JOBID, create the specific scratch directory and hold its name as $SCRDIR:

export SCRDIR="/scratch/$USER/$PBS_JOBID"
mkdir "$SCRDIR"

Remind yourself where the job was running:

echo `hostname`

Proceed to copy any files you need from the job directory to scratch:

Change directory to the "input" directory:

cd "$PBS_O_WORKDIR/input"

Loop over all the files in there and copy them one by one to $SCRDIR (the quotes keep file names containing spaces intact):

for i in *; do
    cp "$i" "$SCRDIR/$i"
done

Copy the Matlab program to $SCRDIR:

cd "$PBS_O_WORKDIR"
cp matlab_program.m "$SCRDIR"

Change to the scratch directory

cd "$SCRDIR"

and run the job:

time /usr/local/bin/matlab-2019b/matlab -r matlab_program

Now finish up and collect the results:

for j in *.mat; do
    cp "$j" "$PBS_O_WORKDIR/output/$j"
done

A simple trick to make PBS wait until all the files are copied:

if [ 0 -eq 0 ]; then
    cd "$PBS_O_WORKDIR"
    rm -rf "$SCRDIR"
fi
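As noted earlier, the scratch directory should be removed when the job succeeds and kept when it fails. One possible way to make that explicit, shown here as a hedged sketch and not as the cluster's official script, is to test the exit status of the calculation and of the copy-back step. Temporary directories stand in for $PBS_O_WORKDIR and the scratch directory, and run_calculation is a made-up stand-in for the Matlab command.

```shell
#!/bin/bash
# Dry-run stand-ins (hypothetical) so that the sketch runs anywhere.
workdir=$(mktemp -d)          # plays the role of $PBS_O_WORKDIR
SCRDIR=$(mktemp -d)           # plays the role of /scratch/$USER/$PBS_JOBID
mkdir "$workdir/input" "$workdir/output"
echo "42" > "$workdir/input/data.txt"

# Hypothetical stand-in for the real calculation.
run_calculation() { cp data.txt result.mat; }

# Stage in, run inside scratch, and remember the exit status.
cp "$workdir"/input/* "$SCRDIR"/
cd "$SCRDIR" || exit 1
run_calculation
status=$?

# Copy results back; clean up only if both the run and the copy succeeded.
if [ "$status" -eq 0 ] && cp ./*.mat "$workdir/output/"; then
    cd "$workdir" || exit 1
    rm -rf "$SCRDIR"
else
    echo "run failed; scratch data kept in $SCRDIR on $(hostname)" >&2
fi
```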

A full example is found here. Finally, do not forget to submit your job:

[user@login01 ~]$ qsub my_script