Cluster Magic with LSF

|

For two of my projects, we’ve been using the HPC cluster at NCSU pretty heavily. This cluster uses LSF for job control and one of the problems we encountered is that several of the worker nodes have scratch directories that are full. Luckily LSF has a solution for that. When submitting jobs, LSF allows you to specify a pre-execution command (bsub -E) that can be used to determine, if a node has the resources to complete a job. If the command is a success, then the job is run on the node, otherwise the job is put back in the queue to wait for another node.

We use the pre-execution command to test whether there is enough space on /scratch to hold our files. For one set of runs, the rule of thumb I use is greater than 1GB of space. Here is our solution:

bsub -E "test `df -B 1G /scratch | grep /scratch | cut -c42-50` -gt 1" [...]

This command works by extracting the freespace on /scratch from df and then comparing it against our limit.

For my second trick, I’ve written my application to respond gracefully to interrupts that ask it to terminate early. When it receives a ctrl-c (INT signal), the program stops cycling, reports partial results, and then exits. This allows the program to be stopped early without wasting the results that it has already gathered. On the cluster, when my jobs have run longer than their time limit, LSF sends a USR2 signal to them, which they treat like the INT signal above.

This was actually not that difficult to implement, the signal function was well documented. However, on the cluster my jobs are contained in a shell script that sets up the environment, calls my application, and then cleans up, moving the output from scratch to a permanent location. This shell script was not handling the USR2 signal gracefully and was thus dieing instead of preforming cleanup. After some digging, I found that I needed to convert my shell script from tcsh (C Shell), which has limited signal handling, to bash (Bourne-again Shell), which has flexible signal handling. Under bash, all I have to do is issue the following command to force bash to ignore USR2 and INT signals.

trap ":" SIGUSR2 SIGINT

This allows the shell script to proceed with cleanup after my application has terminated early.

About this Entry

This page contains a single entry by Reed A. Cartwright published on February 25, 2009 8:23 AM.

Using LaTeX Lettrine Package to Create Drop Caps was the previous entry in this blog.

Finally: Antagonism between Local Dispersal and Self-Incompatibility Systems is the next entry in this blog.

Find recent content on the main index or look in the archives to find all content.

Archives

Powered by Movable Type 4.37