PBS: job debugging

The previous chapters of this guide described the situation in which everything works. Unfortunately, life is full of surprises, and sometimes things go wrong. This section provides some basic tools for understanding what went wrong. As you gain experience with "job debugging", you will learn to recognize the recurring patterns and to identify the cause of a failure quickly.

My job finally started and it does not work. What do I do?

There are three main reasons why a job does not complete successfully:

1. Input files are wrong or missing
2. The shell environment is not defined properly
3. The code itself has a problem

When hunting for the reason your job is failing, it is very helpful to use the "short" queue. This queue is intended for debugging purposes, especially for reason (3).

If a job does not even start running, try submitting a job with the same code that you know should work, either because it ran before or because someone else has run it successfully. If this does not help, you will need to dig a little deeper.

The first and most important sources of information are the "standard" output and error files that PBS creates for every job. Unless configured otherwise, they appear in the directory from which the job was submitted, once the job has completed. Their extensions are .o$JOB_ID and .e$JOB_ID, where $JOB_ID is the same as in the qstat command. When a job finishes abnormally, the standard error file usually contains an explanation, or at least a hint, of what happened.
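A small helper like the one below can pull up both files for a given job. It is only a sketch: the function name show_job_logs is hypothetical, and it assumes the default PBS naming <job_name>.o<N> and <job_name>.e<N>, where N is the numeric part of the job ID, and that the files are in the directory you pass to it.

```shell
# Print the last lines of the PBS output and error files of one job.
# Assumption: default naming <job_name>.o<N> / <job_name>.e<N>, where N is the
# numeric part of the job ID (e.g. 12345 for 12345.server).
show_job_logs() {
    local job_id="${1%%.*}"    # strip a ".server" suffix, if present
    local dir="${2:-.}"        # directory the job was submitted from
    local f
    for f in "$dir"/*.o"$job_id" "$dir"/*.e"$job_id"; do
        [ -f "$f" ] || continue    # skip patterns that matched nothing
        echo "==> $f <=="
        tail -n 20 "$f"
    done
}

# Usage: show_job_logs 12345.server /path/to/submit/dir
```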

In addition, it is always worth checking your quota on the file system you are trying to write to, since job crashes caused by quota enforcement are not uncommon.

The 'tracejob' command generates a full log of the job's history from the moment it was submitted. Running

tracejob [-n days] $JOB_ID

may give you additional information if your job finishes with a non-zero exit code. The optional -n argument sets how many days back the logs are searched. The output of this command is quite intricate, but with some experience it becomes easy to extract useful information from it. If none of the above explains the source of your error, things get trickier, since the cause is probably within the job itself.
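Since the tracejob output is long, it can help to filter it for the exit status first. The sketch below assumes that the log lines carry a field of the form Exit_status=<number>, as PBS server and accounting logs usually do; job_exit_status is a hypothetical helper name, and the exact log layout on your system may differ.

```shell
# Read tracejob output on stdin and print each distinct exit status found.
# Assumption: the logs contain fields of the form Exit_status=<number>.
job_exit_status() {
    grep -o 'Exit_status=-\{0,1\}[0-9][0-9]*' | sort -u
}

# Usage (on a host where tracejob is available):
#   tracejob -n 3 "$JOB_ID" | job_exit_status
```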

As described earlier, your jobs do not run on the server you are logged into, so extracting information from a job while it is running is not straightforward. On Chemfarm, PBS makes this possible: all you need to do is submit an interactive job, which gives you access via SSH to an active job on a compute node:

qsub -I

Like a regular job, an interactive job enters the queue, but instead of running your application it gives you a prompt on the compute node where the job has started.

An interactive job should not be submitted via a script; the parameters discussed in the "Job Submission" section have to be provided on the command line. (If you insist on submitting a script with the -I flag, note that none of its commands will be run; only the PBS definitions will be used.) For a simple "debug" case not all of the parameters are required, but at least typical resources should be requested, e.g.:

qsub -I -l select=2:ncpus=8:mem=11gb
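Once the interactive prompt appears, a quick look at the PBS variables can confirm that the shell really is running inside the job. The following is a minimal sketch; show_pbs_env is a hypothetical helper name, while PBS_JOBID, PBS_O_WORKDIR and PBS_NODEFILE are variables that PBS sets for a job.

```shell
# Print the PBS variables a job usually relies on. "(unset)" means the
# variable is missing, e.g. because the shell is not running inside a job.
show_pbs_env() {
    echo "PBS_JOBID=${PBS_JOBID:-(unset)}"
    echo "PBS_O_WORKDIR=${PBS_O_WORKDIR:-(unset)}"
    echo "PBS_NODEFILE=${PBS_NODEFILE:-(unset)}"
    # List each allocated node once, if the node file is readable.
    if [ -r "${PBS_NODEFILE:-}" ]; then
        sort -u "$PBS_NODEFILE"
    fi
}

# Usage on the compute node: show_pbs_env
```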

When the prompt reappears inside the job, you are "on the node" and can test the commands you wrote in the job script. You now have access to all the PBS environment variables available to the job, such as $PBS_O_WORKDIR or $PBS_NODEFILE. You still have to change to the appropriate directory, load the necessary modules and run your code yourself. Unfortunately, there is no complete recipe for troubleshooting, and everyone has to do their own detective work. However, there are some general tips and recommendations:

SSH

If you are running a multinode job, check that passwordless SSH between the nodes is working. If it is not, ask the system manager for help.
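One way to run this check from inside an interactive job is sketched below. The function name check_ssh_nodes is hypothetical; it assumes that the node file lists one host name per line, and it uses the OpenSSH BatchMode option so that ssh fails instead of asking for a password.

```shell
# Try a non-interactive SSH login to every node allocated to the job.
# BatchMode=yes makes ssh fail instead of prompting for a password.
check_ssh_nodes() {
    local nodefile="${1:-$PBS_NODEFILE}"
    local host
    local status=0
    for host in $(sort -u "$nodefile"); do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true </dev/null; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            status=1
        fi
    done
    return "$status"
}

# Usage inside a job: check_ssh_nodes "$PBS_NODEFILE"
```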

User environment

If you have changed your shell configuration, check that the change was done properly and that it is propagated to the job (hint: check the non-interactive configuration files). You can also put the command "echo $VAR" in the job script, where VAR is the shell variable you wish to probe. The output of this command will appear in the standard output file.

Check that the modules you used while developing and testing are the same ones you use in the submitted job. We encourage you to routinely put the command "module list" in the job script so that its output can be read from the standard output file.
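The "echo" and "module list" checks can be grouped into a short diagnostic preamble near the top of the job script. This is a sketch only; print_job_environment is a hypothetical name, and which variables are worth printing depends on your application.

```shell
# Diagnostic preamble for a job script: everything printed here ends up in
# the job's standard output file.
print_job_environment() {
    echo "node: $(uname -n)"
    echo "PATH=$PATH"
    echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-}"
    # "module" is normally a shell function and may be missing in a
    # non-interactive shell; that alone is a useful finding.
    if command -v module >/dev/null 2>&1; then
        module list 2>&1    # module tools commonly print to stderr
    else
        echo "module command not available in this shell"
    fi
}

# Usage: call print_job_environment before starting the actual program.
```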

If you use custom libraries or programs, ensure that they are listed correctly in the LD_LIBRARY_PATH and PATH variables, respectively, preferably by using the "echo" command as explained above.
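Beyond printing the variables, it is possible to verify that every directory they list actually exists on the compute node. The helper below is a sketch under that idea; check_search_path is a hypothetical name.

```shell
# Report entries of a colon-separated search path that are not existing
# directories. Returns non-zero if at least one entry is missing.
check_search_path() {
    local name="$1"
    local value="$2"
    local entry
    local status=0
    local old_ifs="$IFS"
    IFS=:
    for entry in $value; do
        [ -n "$entry" ] || continue    # ignore empty entries
        if [ ! -d "$entry" ]; then
            echo "$name: missing directory $entry"
            status=1
        fi
    done
    IFS="$old_ifs"
    return "$status"
}

# Usage inside a job script:
#   check_search_path PATH "$PATH"
#   check_search_path LD_LIBRARY_PATH "${LD_LIBRARY_PATH:-}"
```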

Finally, use the 'ldd' command to check that an executable actually "knows" about all the libraries it uses. Compare the output of the command 'ldd program' in the job's standard output file with the output you get when running it in the login shell. Note that "program" is a generic name; substitute it with the executable you are trying to run.
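The part of the ldd output that matters most is any library reported as "not found". A minimal sketch that prints only those lines is shown here; missing_libs is a hypothetical helper name.

```shell
# Print the shared libraries that the dynamic loader cannot resolve for a
# program. Returns non-zero if at least one library is "not found".
missing_libs() {
    if ldd "$1" 2>&1 | grep 'not found'; then
        return 1
    fi
    return 0
}

# Usage: missing_libs ./program
# Run it both in the job script and in the login shell and compare.
```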

Check the permissions of all the files you are using. A program file has to be executable; files you read from need at least read permission; and if you write output, ensure you have write permission on the folder you are trying to write to.
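These three checks map directly onto the shell's file tests. The sketch below assumes one program, one input file and one output directory; check_job_files is a hypothetical name, and a real job will usually have more files to check.

```shell
# Check the three permissions a typical job needs:
# execute on the program, read on the input, write on the output directory.
check_job_files() {
    local program="$1"
    local input="$2"
    local outdir="$3"
    local status=0
    [ -x "$program" ] || { echo "not executable: $program"; status=1; }
    [ -r "$input" ]   || { echo "not readable: $input"; status=1; }
    [ -w "$outdir" ]  || { echo "not writable: $outdir"; status=1; }
    return "$status"
}

# Usage: check_job_files ./program input.dat "$PBS_O_WORKDIR"
```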

General tips related to the actual run

If your program writes to standard output, make sure to redirect it to a file; otherwise you may fill up a local partition (/var) on the compute node and your job will fail.
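In a job script the redirection can be wrapped in a small function so that it is never forgotten. This is a sketch; run_logged and the file names run.out and run.err are hypothetical choices.

```shell
# Run a command with standard output and standard error sent to files in
# a chosen directory instead of the node-local spool area.
run_logged() {
    local outdir="$1"
    shift
    "$@" > "$outdir/run.out" 2> "$outdir/run.err"
}

# Usage in a job script:
#   run_logged "$PBS_O_WORKDIR" ./program input.dat
```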

If you are using the local "/scratch", ensure that you do not run out of space during the execution of your code.
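A job script can refuse to start when too little space is left. The sketch below relies on the POSIX output format of df; has_free_space is a hypothetical name, and the threshold in the usage line is an arbitrary example.

```shell
# Succeed only if the file system holding DIR has at least MIN_KB
# kilobytes available.
has_free_space() {
    local dir="$1"
    local min_kb="$2"
    local avail_kb
    # With -P, line 2 of the df output holds the numbers; column 4 is "Available".
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    echo "$dir: $avail_kb KB available"
    [ "$avail_kb" -ge "$min_kb" ]
}

# Usage: has_free_space /scratch 10485760 || echo "less than 10 GB left on /scratch"
```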

Check that your code does not run out of memory; this usually explains "segmentation fault" and "killed with signal 6" types of errors. Note that running out of memory does not necessarily happen right at the start of the job; it may take hours before such a failure occurs. Try to artificially create a scenario that reproduces the real case as closely as possible.
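One possible way to provoke a memory failure sooner is to run the program under an artificially low memory cap. The sketch below uses the shell's "ulimit -v" for this, which is a different mechanism from the memory limit that PBS enforces, so the failure may not look exactly the same; run_with_mem_limit is a hypothetical name.

```shell
# Run a command in a subshell with its virtual memory capped at LIMIT_KB
# kilobytes. The cap applies only to the subshell, not to the calling shell.
run_with_mem_limit() {
    local limit_kb="$1"
    shift
    ( ulimit -v "$limit_kb" && "$@" )
}

# Usage: run_with_mem_limit 2097152 ./program input.dat   (cap of 2 GB)
```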

There are different generations of processors in Chemfarm; ensure that your code is compiled to use an instruction set that is available on the CPUs it will run on.
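To see which instruction set extensions a given compute node supports, you can inspect its CPU flags. The sketch below assumes an x86 Linux node, where /proc/cpuinfo contains a "flags" line; cpu_has_flag is a hypothetical name.

```shell
# Succeed if the CPU advertises the given flag (e.g. avx2, avx512f).
# Assumption: x86 Linux, where /proc/cpuinfo has a "flags" line per core.
cpu_has_flag() {
    local flag="$1"
    local cpuinfo="${2:-/proc/cpuinfo}"
    grep -m 1 '^flags' "$cpuinfo" | grep -q -w "$flag"
}

# Usage on a compute node:
#   cpu_has_flag avx2 && echo "AVX2 is supported on $(uname -n)"
```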