Queuing System
Obsidian uses the TORQUE Resource Manager to control batch jobs and distributed computing resources. All jobs should be submitted via TORQUE. Please be courteous to other users and do not run computational tasks on the head node. Any process running on the head node that adversely impacts system operation is subject to termination without notice.
Use Infiniband and IPoIB
- Serial code
- MPI parallel code
- Scripts and shell commands
By default, IPoIB (IP over Infiniband) is enabled, so file I/O in serial programs will use Infiniband.
To use Infiniband for message passing (MPI programs), load the mvapich2 module when compiling, and load it again when submitting jobs. Alternatively, load the module in your .bash_profile so you don't need to explicitly reload it every time.
module load mvapich2-2.1/gcc
When you use a shell command to transfer files between nodes (e.g., scp), append "-ib" to the node id if you want to use IPoIB. For example,
scp file_name node_id:remote_filename
uses the 1 Gb/s Ethernet connection, while
scp file_name node_id-ib:remote_filename
uses the Infiniband connection.
Submitting Batch Jobs
- qsub -- submit a job. For more information, see the qsub manual (man qsub).
- Most commonly used qsub options
- qstat -- monitor the status of a job or jobs. For more information, see the qstat manual (man qstat)
- qdel -- delete a job
To submit a batch job, use qsub job_script (see Sample Job Scripts below). Once you submit a job, the job id is displayed on the next line. For example:
qsub pbs_mpi.sh
174.obsidian.cluster
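Because qsub prints the job id on submission, you can capture it in a shell variable for later use with qstat or qdel. A minimal sketch; the literal string below stands in for real qsub output (in practice you would use jobid=$(qsub pbs_mpi.sh)):

```shell
# Stand-in for the id string qsub prints, in the format shown above.
jobid="174.obsidian.cluster"

# Strip everything after the first dot to get the numeric job number.
jobnum="${jobid%%.*}"
echo "$jobnum"    # -> 174
```

The numeric part alone is enough for qdel, but passing the full id also works.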
Option | Description | Example |
---|---|---|
-l walltime=hh:mm:ss | Requests the amount of time needed for the job. Default is one hour. | #PBS -l walltime=10:00:00 |
-l nodes=n:ppn=p | Requests the number of nodes and processors per node. The maximum ppn is 20. Default is one processor on one node. | #PBS -l nodes=2:ppn=20 |
-N job_name | Sets the job name, which you will see when you use qstat to check the status of your jobs. | #PBS -N job_test |
-j oe | By default, PBS returns two log files, one for the standard output stream and one for the standard error stream. This option joins both into a single log file. | #PBS -j oe |
-o file_name | Renames the output log file to the filename you specify. | #PBS -o job_test.out |
-I | Requests an interactive batch job (see Interactive Job at the bottom). | qsub -I |
For example, to see your jobs in the queue, type qstat:
qstat
Job ID Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
174.obsidian cpi-test ybai1 0 Q batch
To see all jobs in the queue in full detail, use
qstat -a -f
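If you have many jobs, standard text tools can filter the qstat listing. A sketch, run here against sample text mimicking the column layout shown above (in practice, pipe real qstat output instead):

```shell
# Sample text in the same column layout as the qstat listing above.
sample="174.obsidian   cpi-test   ybai1   0   Q   batch
175.obsidian   mpi-job    ybai1   0   R   batch"

# Count jobs still waiting in the queue (state column 'Q').
queued=$(printf '%s\n' "$sample" | awk '$5 == "Q" { n++ } END { print n + 0 }')
echo "$queued"    # -> 1
```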
To delete your job, use qdel job_id. For example:
qdel 176.obsidian
Sample Job Scripts
- Serial Jobs
- MPI Jobs
- OpenMP Jobs
- Hybrid MPI/OpenMP Jobs
A sample serial job script:
#!/bin/bash
#PBS -S /bin/bash
#PBS -N cpi-test
#PBS -j oe
#PBS -o ./cpi-test.out
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:30:00
#PBS -M yihua.bai@indstate.edu
cd $PBS_O_WORKDIR
./a.out
A sample MPI job script:
#!/bin/bash
#PBS -S /bin/bash
#PBS -N cpi-test
#PBS -j oe
#PBS -o ./cpi-test.out
#PBS -l nodes=2:ppn=20
#PBS -l walltime=00:30:00
#PBS -M yihua.bai@indstate.edu
module load mvapich2-2.1/gcc
cd $PBS_O_WORKDIR
mpirun -np 40 -machinefile ${PBS_NODEFILE} ./a.out
Unless you load the module by default in your .bash_profile or .cshrc, you need to load the same module you used when compiling the program in your job script. Otherwise, the compute nodes may not inherit the same environment variables as the head node in the batch job.
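Rather than hard-coding -np 40, you can derive the process count from ${PBS_NODEFILE}, which (assuming TORQUE's usual behavior) lists one line per allocated core. A sketch, using a temporary file in place of the real nodefile:

```shell
# Simulate a nodefile for a nodes=2:ppn=20 request (40 entries); inside
# a real job, read "$PBS_NODEFILE" instead.
nodefile=$(mktemp)
for i in $(seq 1 40); do echo "node01"; done > "$nodefile"

# One line per allocated core, so the line count is the process count:
#   mpirun -np "$np" -machinefile "$nodefile" ./a.out
np=$(awk 'END { print NR }' "$nodefile")
echo "$np"    # -> 40
rm -f "$nodefile"
```

This way the script stays correct if you change the nodes or ppn request.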
Due to NUMA effects, you may want to specify OMP_NUM_THREADS up to 10 to get optimum scalability.
A sample OpenMP job script:
#!/bin/bash
#PBS -S /bin/bash
#PBS -N cpi-test
#PBS -j oe
#PBS -o ./omp-test.out
#PBS -l nodes=1:ppn=10
#PBS -l walltime=00:30:00
#PBS -M yihua.bai@indstate.edu
export OMP_NUM_THREADS=10
cd $PBS_O_WORKDIR
./a.out
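Instead of hard-coding OMP_NUM_THREADS, you can set it from the number of cores allocated on the node, again reading ${PBS_NODEFILE} (assumed: one line per allocated core). A sketch with a simulated nodefile:

```shell
# Simulated nodefile for a nodes=1:ppn=10 request; inside a real job,
# read "$PBS_NODEFILE" instead.
nodefile=$(mktemp)
for i in $(seq 1 10); do echo "node01"; done > "$nodefile"

# Use the number of allocated cores as the OpenMP thread count.
export OMP_NUM_THREADS=$(awk 'END { print NR }' "$nodefile")
echo "$OMP_NUM_THREADS"    # -> 10
rm -f "$nodefile"
```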
To request a consultation, submit an OIT help desk ticket and ask for it to be assigned to the Research Computing Queue.
Interactive Job
If you need to run jobs interactively, use qsub -I to request an interactive job. For example:
qsub -I -l nodes=2:ppn=20,walltime=00:30:00
qsub: waiting for job 179.obsidian.cluster to start
qsub: job 179.obsidian.cluster ready
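Once the session is ready, you are placed in a shell on an allocated compute node and can run commands directly. For example (a sketch of cluster-side commands; it assumes the MPI sample above was compiled with the mvapich2 module):

```shell
# Run these at the interactive prompt on the compute node.
module load mvapich2-2.1/gcc
cd $PBS_O_WORKDIR
mpirun -np 40 -machinefile ${PBS_NODEFILE} ./a.out
```

When you are done, type exit to end the interactive job and release the nodes.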