Running jobs on grove
(cluster web page: http://grove.ne.tamu.edu)
All jobs on our Linux cluster must be submitted to a queue. The following queues are available:
| queue name | priority | nice | availability | hosts | time limit | processor limit |
|---|---|---|---|---|---|---|
| short | 35 | 20 | all users | all | 30 minutes | 4 processors |
| normal | 30 | 20 | all users | all | 8 hours | 10 processors |
| long | 25 | 20 | all users | all | unlimited | 20 processors |
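As a rough aid for choosing among the queues listed above, the following sketch picks the first queue whose time and processor limits fit a request. The function name `pick_queue` is hypothetical, and it assumes the limits apply per job; confirm against the Lava guide before relying on it.

```shell
# pick_queue MINUTES PROCESSORS
# Prints the first queue (short, normal, long) whose limits from the
# queue table accommodate the request; returns 1 if none does.
# Assumption: the time and processor limits are per-job limits.
pick_queue() {
    minutes=$1
    procs=$2
    if [ "$minutes" -le 30 ] && [ "$procs" -le 4 ]; then
        echo short
    elif [ "$minutes" -le 480 ] && [ "$procs" -le 10 ]; then
        echo normal
    elif [ "$procs" -le 20 ]; then
        echo long
    else
        echo "no queue allows $procs processors" >&2
        return 1
    fi
}
```

The result could then be passed to bsub with its queue option, for example `bsub -q "$(pick_queue 45 8)" -n 8 my_job`; see "man bsub" to confirm that `-q` behaves this way on grove.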
Queues are managed by Platform Lava. The full guide to Lava 1.0 is available for viewing at http://grove.ne.tamu.edu/kits/lava/1.0/lava_using_1.0.pdf.
Quick Start instructions (also see "man bsub"):
| bsub my_job | submits my_job (an executable) to the "normal" queue |
| bsub -n 4 my_job | submits my_job as a parallel job that starts when 4 processors are available |
| bkill jobID | kills job "jobID" |
| bjobs | reports the status of Lava jobs |
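The commands above are often chained: submit a job, keep its ID, then check or kill it. The sketch below extracts the job ID from bsub's confirmation message. The helper name `job_id_from_bsub` is hypothetical, and the message format it parses is the usual LSF-style one, which is assumed (not confirmed here) to be what Lava prints on grove.

```shell
# job_id_from_bsub: read bsub's confirmation message on stdin and print
# the numeric job ID. Assumed message format (LSF style):
#   Job <1234> is submitted to queue <normal>.
# Prints nothing if the input does not match.
job_id_from_bsub() {
    sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p'
}

# Possible usage on the cluster (commented out so the sketch has no side effects):
#   id=$(bsub my_job | job_id_from_bsub)
#   bjobs "$id"    # status of this job only
#   bkill "$id"    # kill it if something went wrong
```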
To submit a parallel (MPI) job:
bsub -n <num_processors> mpirun -np <num_processors> <MPI_JOB> <ARGUMENTS>
The MCNP executables, mcnp5.mpi and mcnp5, are in /usr/local/bin. To run MCNP on multiple processors, you must use the mcnp5.mpi executable and submit your job using MPI. To submit an MCNP job to the queue:
bsub -n <num_processors> mpirun -np <num_processors> /usr/local/bin/mcnp5.mpi i=input eol
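Because the processor count must appear twice (once for bsub, once for mpirun), a small wrapper can keep the two in step. The sketch below only prints the command line so it can be inspected first. The function name `mcnp_submit_cmd` is hypothetical, and the serial form in the `else` branch is an assumption rather than something stated above.

```shell
# mcnp_submit_cmd NPROCS INPUT_FILE
# Prints (does not run) a bsub command line for an MCNP job.
mcnp_submit_cmd() {
    nprocs=$1
    input=$2
    if [ "$nprocs" -gt 1 ]; then
        # Parallel run: MPI build, same processor count for bsub and mpirun.
        echo "bsub -n $nprocs mpirun -np $nprocs /usr/local/bin/mcnp5.mpi i=$input eol"
    else
        # Serial run with the non-MPI executable (assumed invocation).
        echo "bsub /usr/local/bin/mcnp5 i=$input"
    fi
}
```

For example, `mcnp_submit_cmd 4 input` prints the four-processor form of the command shown above; once it looks right, it can be run by pasting it at the prompt.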
Scratch space is available on the cluster under /scratch. To create a directory for your executables, run "cd /scratch" followed by "mkdir username" (substituting your username). It is recommended that you run your jobs from /scratch if possible. Remember that /scratch is not backed up, so copy your important results back to your home space.
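The scratch workflow described above (create a directory, run there, copy results home) might be scripted along these lines. The function names and the `SCRATCH_ROOT` override are hypothetical conveniences, and the per-job subdirectory is an added assumption, not a cluster requirement.

```shell
# make_run_dir JOB_NAME
# Creates <scratch>/<username>/<JOB_NAME> and prints its path.
# <scratch> defaults to /scratch; set SCRATCH_ROOT to try this elsewhere.
make_run_dir() {
    run_dir="${SCRATCH_ROOT:-/scratch}/$(id -un)/$1"
    mkdir -p "$run_dir" && echo "$run_dir"
}

# save_results RUN_DIR DEST_DIR
# Copies the contents of RUN_DIR into DEST_DIR (created if missing),
# since /scratch is not backed up.
save_results() {
    mkdir -p "$2" && cp -R "$1"/. "$2"/
}
```

A typical session might be `cd "$(make_run_dir test1)"`, submit the job from there, and afterwards `save_results /scratch/username/test1 "$HOME/results/test1"`.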
Commands you may need if your jobs do not finish cleanly:
| ipcs | provides information on IPC (inter-process communication) facilities; used to determine whether a user has a job still held in system memory |
| ipcrm | removes a message queue, semaphore set, or shared memory ID; used to remove the resources reported by ipcs |
| pdsh | runs the following command on all nodes; e.g., pdsh -a ipcrm |
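To see how ipcs and ipcrm fit together, the sketch below turns `ipcs -m` output into a list of ipcrm commands for one user's shared memory segments, without removing anything. The function name `stale_shm_cmds` is hypothetical. It assumes the common Linux column layout (key, shmid, owner, ...) and the `ipcrm -m <shmid>` form; older ipcrm versions may expect `ipcrm shm <shmid>` instead, so check "man ipcrm" on the nodes.

```shell
# stale_shm_cmds USER
# Reads `ipcs -m` output on stdin and prints one "ipcrm -m <shmid>" line
# for each shared memory segment owned by USER. Dry run: nothing is removed.
# Assumes column 2 is the shmid and column 3 is the owner.
stale_shm_cmds() {
    awk -v user="$1" '$3 == user && $2 ~ /^[0-9]+$/ { print "ipcrm -m " $2 }'
}

# Possible usage on a node (commented out so the sketch has no side effects):
#   ipcs -m | stale_shm_cmds "$USER"
```

Review the printed commands before running any of them, and only remove segments once none of your jobs are still running on that node.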
If you see an error message such as p4_error: alloc_p4_msg failed: 0, you have requested more memory than the node(s) can handle. Reduce your job size.







