Running jobs on grove


(cluster web page: http://grove.ne.tamu.edu)

All jobs on our Linux cluster must be submitted to a queue. The following queues are available:

queue name   priority   nice   availability   hosts   time limit   processor limit
short        35         20     all users      all     30 minutes   4 processors
normal       30         20     all users      all     8 hours      10 processors
long         25         20     all users      all     unlimited    20 processors
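
As a rough illustration of how the limits in the queue table translate into a choice of queue, the sketch below picks the highest-priority queue that can hold a request. The function name pick_queue is hypothetical and not part of Lava; the limits are copied from the table.

```shell
#!/bin/sh
# pick_queue PROCS MINUTES  (hypothetical helper, not a Lava command)
# Prints the first queue, in priority order, whose processor and time
# limits cover the request. Returns 1 if no queue can hold the job.
pick_queue() {
    procs=$1
    minutes=$2
    if [ "$procs" -le 4 ] && [ "$minutes" -le 30 ]; then
        echo short      # 30 minutes, 4 processors
    elif [ "$procs" -le 10 ] && [ "$minutes" -le 480 ]; then
        echo normal     # 8 hours, 10 processors
    elif [ "$procs" -le 20 ]; then
        echo long       # unlimited time, 20 processors
    else
        echo "no queue allows $procs processors" >&2
        return 1
    fi
}
```

For example, "pick_queue 8 120" prints "normal". Assuming bsub on this cluster accepts "-q queue_name" to select a queue (check "man bsub"), the result could be passed along as bsub -q "$(pick_queue 8 120)" -n 8 my_job.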

Queues are managed by Platform Lava. The full guide to Lava 1.0 is available for viewing at http://grove.ne.tamu.edu/kits/lava/1.0/lava_using_1.0.pdf.

Quick Start instructions (also see "man bsub"):

bsub my_job          submits my_job (an executable) to the "normal" queue
bsub -n 4 my_job     submits my_job as a parallel job that starts when 4 processors are available
bkill jobID          kills job "jobID"
bjobs                reports the status of Lava jobs

To submit an MPI job:

bsub -n <num_processors> mpirun -np <num_processors> <MPI_JOB> <ARGUMENTS>

The MCNP executables, mcnp5.mpi and mcnp5, are in /usr/local/bin. To run MCNP on multiple processors, you must use the mcnp5.mpi executable and submit your job using MPI. To submit an MCNP job to the queue:

bsub -n <num_processors> mpirun -np <num_processors> /usr/local/bin/mcnp5.mpi i=input eol
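
The processor count appears twice in the command above (once for bsub, once for mpirun), and the two values need to agree. A minimal sketch of a wrapper that builds the line from a single count follows. The function name mcnp_submit_cmd is hypothetical, and it assumes that "input" in the command above stands for the name of your input file.

```shell
#!/bin/sh
# mcnp_submit_cmd NPROCS INPUT  (hypothetical helper)
# Prints the bsub command line for an MPI MCNP run so that -n and -np
# always carry the same value. It only prints; it does not submit.
mcnp_submit_cmd() {
    nprocs=$1
    input=$2
    case $nprocs in
        ''|*[!0-9]*)
            echo "processor count must be a positive integer" >&2
            return 1 ;;
    esac
    if [ "$nprocs" -lt 1 ]; then
        echo "processor count must be a positive integer" >&2
        return 1
    fi
    echo "bsub -n $nprocs mpirun -np $nprocs /usr/local/bin/mcnp5.mpi i=$input eol"
}
```

Running "mcnp_submit_cmd 4 input" prints the command for review; piping the output to sh would submit it.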

Scratch space is available on the cluster under /scratch. Run "cd /scratch" and then "mkdir username" (substituting your username) to create a directory for your executables. Run your jobs from /scratch whenever possible. Remember that /scratch is not backed up, so copy important results back to your home space.
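
The scratch workflow described above might be scripted along these lines. This is a sketch only: the names SCRATCH_ROOT, make_scratch_dir, and save_results are hypothetical, and the destination for saved results is whatever directory in your home space you choose.

```shell
#!/bin/sh
# SCRATCH_ROOT defaults to /scratch on the cluster; it is a variable only
# so the functions can be tried out elsewhere.
SCRATCH_ROOT=${SCRATCH_ROOT:-/scratch}

# make_scratch_dir  (hypothetical helper)
# Creates your per-user directory under the scratch root if it does not
# exist yet, then prints its path.
make_scratch_dir() {
    dir="$SCRATCH_ROOT/${USER:-$(id -un)}"
    mkdir -p "$dir" && echo "$dir"
}

# save_results SRC_DIR DEST_DIR  (hypothetical helper)
# Copies everything in SRC_DIR to DEST_DIR, for example from scratch to a
# directory in your home space, because scratch is not backed up.
save_results() {
    mkdir -p "$2" && cp -R "$1"/. "$2"/
}
```

A typical session would be cd "$(make_scratch_dir)", then submitting the job from there, then save_results "$(make_scratch_dir)" "$HOME/results" once it finishes.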

Commands you may need if your jobs do not finish completely:

ipcs     reports information on IPC (inter-process communication) facilities; used to determine whether a user has a job still held in system memory
ipcrm    removes a message queue, semaphore set, or shared memory ID; used to remove the entries that ipcs reports
pdsh     runs the command that follows on all nodes; e.g., pdsh -a ipcrm
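
To show how ipcs and ipcrm fit together, the sketch below extracts the shared memory IDs that belong to one user from the output of "ipcs -m". The function name shm_ids_for is hypothetical, and it assumes the usual Linux column layout (key, shmid, owner, ...); check the output on the nodes before relying on it.

```shell
#!/bin/sh
# shm_ids_for OWNER  (hypothetical helper)
# Reads "ipcs -m" output on standard input and prints the shmid (column 2)
# of every segment whose owner (column 3) matches OWNER. Header lines are
# skipped because their second field is not a number.
shm_ids_for() {
    awk -v owner="$1" '$3 == owner && $2 ~ /^[0-9]+$/ { print $2 }'
}
```

On a node, ipcs -m | shm_ids_for "$USER" lists your leftover segments. Each ID can then be handed to ipcrm; the argument form differs between versions ("ipcrm -m ID" on newer systems, "ipcrm shm ID" on older ones), so consult "man ipcrm" first.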

If you see an error message such as p4_error: alloc_p4_msg failed: 0, you have requested more memory than the node(s) can handle. Reduce your job size.