For two of my projects, we’ve been using the HPC cluster at NCSU pretty heavily. This cluster uses LSF for job control and one of the problems we encountered is that several of the worker nodes have scratch directories that are full. Luckily LSF has a solution for that. When submitting jobs, LSF allows you to specify a pre-execution command (
bsub -E) that can be used to determine, if a node has the resources to complete a job. If the command is a success, then the job is run on the node, otherwise the job is put back in the queue to wait for another node.
We use the pre-execution command to test whether there is enough space on
/scratch to hold our files. For one set of runs, the rule of thumb I use is greater than 1GB of space. Here is our solution:
bsub -E "test `df -B 1G /scratch | grep /scratch | cut -c42-50` -gt 1" [...]
This command works by extracting the freespace on
df and then comparing it against our limit.
For my second trick, I’ve written my application to respond gracefully to interrupts that ask it to terminate early. When it receives a
ctrl-c (INT signal), the program stops cycling, reports partial results, and then exits. This allows the program to be stopped early without wasting the results that it has already gathered. On the cluster, when my jobs have run longer than their time limit, LSF sends a USR2 signal to them, which they treat like the INT signal above.
This was actually not that difficult to implement, the
signal function was well documented. However, on the cluster my jobs are contained in a shell script that sets up the environment, calls my application, and then cleans up, moving the output from scratch to a permanent location. This shell script was not handling the USR2 signal gracefully and was thus dieing instead of preforming cleanup. After some digging, I found that I needed to convert my shell script from
tcsh (C Shell), which has limited signal handling, to
bash (Bourne-again Shell), which has flexible signal handling. Under
bash, all I have to do is issue the following command to force
bash to ignore USR2 and INT signals.
trap ":" SIGUSR2 SIGINT
This allows the shell script to proceed with cleanup after my application has terminated early.