
Moab Batch System Guide

User introduction

The commands for managing jobs work only on the nodes linx64, wukong, and ock.
The command qstat shows the running jobs.
qstat -q shows the available queues and their time, memory, and number-of-CPU limits.
qsub my.job is used to submit jobs. You can either supply flags on the command line or include options in your script, which is the recommended way.
pbsnodes -a is used to check the status and properties of the nodes in the clusters.
In addition to these Torque commands, there are Moab commands that do the same thing; sometimes they provide more detailed information. The command showq shows the jobs in the queue, like qstat.
checkjob JOBID gives details about a job. This is most useful when the job is not running yet and may tell you why it is waiting, although the message can be quite cryptic.
The command checknode NODENAME tells you about a node, such as which jobs are running on it and which jobs in the queue have a reservation to run on it. The reservation shows the estimated start time.
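For example, a typical sequence to inspect the queues, the jobs, and a waiting job might look like the following; the job ID 12345 and the node name arwen07 are placeholders.

qstat -q              # list the queues and their limits
showq                 # list running and queued jobs
checkjob 12345        # ask Moab why job 12345 (placeholder ID) is still waiting
checknode arwen07     # show jobs and reservations on node arwen07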

NOTE:
If you create the job script myscript.job on a Windows machine, you MUST run
dos2unix myscript.job
to remove the Carriage Returns at the end of each line; if you fail to do this, the script will terminate with strange errors such as "unexpected end-of-file on standard input".

Queue definitions

The queue definitions can be checked with qstat -q and qstat -Q -f.

  1. quick Used for debugging, testing, and short production jobs.
  2. clever Used for all production work.
  3. brute Used for jobs that run very long, such as old serial programs that have not been parallelized.
If you do not specify a time limit, the default will be used as listed in the table below.
Queue name   Default walltime   Maximum walltime
quick        0:30               2:00
clever       38:00              96:00 (4 days)
brute        192:00             672:00 (28 days)
The default physical memory allocation per CPU is 900 MB. You can change this by adding the request line
#PBS -l pmem=600mb

Because arwen nodes have only 1,500 MB per node, the default of 900 MB for two CPUs does not fit, and such jobs will linger in the queue forever. See further below for example jobs.
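For example, to run in the clever queue with a 24-hour walltime and 600 MB of memory per CPU, the top of a job script could contain the following lines (the values are only an illustration):

#PBS -q clever
#PBS -l walltime=24:00:00
#PBS -l pmem=600mb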

Property definitions

Torque allows each node to be assigned properties. These can be specified in the job scripts so that you can direct Moab to allocate the correct node(s) to your job. You can check the properties with pbsnodes -a.

  1. The properties arwen, haku, ra, surg, wukong, and ock denote the group of nodes belonging to the cluster with that name.
  2. The properties i686, x8664, amd, and ia64 denote the CPU architecture. The nodes in Arwen have IA32 architecture and the arch command returns i686. The nodes in Haku have EM64T architecture and the arch command returns x86_64; since Torque does not allow underscores in properties, we use x8664. The nodes in Ra are AMD Opteron 64-bit processors and the arch command also returns x86_64; we defined the additional property amd for the Ra nodes to distinguish them from the Haku nodes. The Altix ock has IA64 architecture and the arch command returns ia64. An example of combining these properties in a resource request is shown after this list.
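For instance, to request one node with two processors that belongs to the Haku cluster and has the EM64T architecture, a job script could combine the cluster and architecture properties in the resource request (the combination shown is only an illustration):

#PBS -l nodes=1:ppn=2:haku:x8664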

Use of switches for parallel jobs

Moab will try to schedule parallel jobs so that all nodes allocated to a job come from a group of nodes on the same switch, to reduce communication delays. This is automatic. The groups are internally called bc1, bc2, hs0, hs1, rs0, rs1, rs2, rs3, and ws0. They denote the groups of 14 nodes in Arwen, 16 nodes in Haku, 19 nodes in RA, and 16 nodes in wukong that share the same switch.
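Normally you do not need to request a switch group yourself, but because the group names are defined as node properties, you can force a job onto one particular group if you want to. A sketch, using the Arwen group bc1, would be:

#PBS -l nodes=4:ppn=2:bc1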

Also note that by default Torque scripts start in your $HOME directory, not the directory where you are when you submit the job. The error and output files are created by default in the directory where you submit the job.
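If you want your script to run in the directory from which it was submitted, change to it explicitly at the top of the script; Torque sets the variable PBS_O_WORKDIR to that directory:

# go to the directory where qsub was issued
cd $PBS_O_WORKDIR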

Moab scheduling principles

The Moab scheduler is programmed using the following principles. Knowing them lets you plan and organize your work and submit jobs so that you get the best turnaround time.

  • Jobs in the queue from users who do not have any running jobs at that time have preference over jobs in the queue from users who also have running jobs. As a result, if you have no jobs running, the first cores that come free will be yours.
  • Jobs requesting many nodes are considered first, and single-node jobs are scheduled to fill the "holes". The same is true for jobs requesting many CPUs. Because one can ask either for many CPUs arbitrarily distributed (-l ncpus=8) or for CPUs distributed over nodes in a specific way (-l nodes=2:ppn=2), the scheduler looks at the nodes first, because this requirement is harder to satisfy.
  • Jobs requesting a shorter wall clock time are considered before those with longer wall clock time limits. CPU time limits are not considered.

Serial job example

A simple serial job (download the job script file)

#!/bin/sh
#PBS -N serial
#PBS -o serial.out
#PBS -e serial.err
#PBS -l nodes=1:ppn=1:wukong
#PBS -l walltime=12:00:00
#PBS -l pmem=7500mb
echo Testing...
hostname
echo run a serial program with 7.5 GB of RAM...
echo done
	  
This job declares the name with "-N serial", which will show up in qstat; if no name is specified, the name of the script file is used as the job name.

The standard output and standard error files are specified with "-o" and "-e".

The flag "-m abe" specifies the a mail message should be sent to the address specified with "-M" when the job "b"egins, "e"nds, and "a"borts.

The queue in which to run the job must be specified and it must be consistent with the requested walltime "-l walltime=12:00:00". Note that Moab schedules jobs that request shorter walltimes sooner and delays jobs that ask for a very long walltime.

The "-l nodes=1:ppn=1:wukong" specifies that this job asks for 1 CPU resource, in the form 1 node and 1 processor per node. It also specifies the property "wukong", so that the job will run only on wukong nodes. You can also specify "-l ncpus=1". To specify the type of node, you then also need "-l nodes=wukong". Since the configuraqtion specifies that each node has 2, 4 or 8 CPUs, two jobs can be run on each physical node, if they each ask for one CPU resource. You can use qstat -f jubname to see what node/CPU combination has been allocated to your job(s). This is also shown graphically on the cluster status pages on this web site.

It is also a good idea to specify the memory required by your program with "-l pmem=900mb". The default is 900 MB for QTP and 600 MB for HPC. For parallel jobs, this is the memory needed per processor.

Parallel job examples

The following is a parallel job to run an OpenMP or POSIX Threads shared memory program, or an OpenMPI distributed memory parallel program, on a single wukong node. QTP clusters have 2, 4, or 8 cores per node, so 8 is the maximum you can ask for with this kind of parallel job. (download the job script file)

#!/bin/sh
#PBS -N para
#PBS -o para.out
#PBS -e para.err
#PBS -q quick
#PBS -l nodes=1:ppn=8:wukong
echo Testing single node parallel...
hostname
# some preparations here

# run your shared memory parallel program here
# for example g03 with 8 processors
g03 benzene.inp > benzene.log

# or run 8-way parallel MPI job with mpirun
mpirun vasp chickenwire.inp > chickenwire.log

# some cleanup here
echo done.
Shared memory parallel programs have multiple cores working on the same data, whereas distributed memory parallel programs have a section of RAM dedicated to each core, and the cores send messages to each other to communicate. If all parts of a distributed memory parallel program run on a single multi-core node, the communication is just a memory-to-memory copy, which is usually faster than communication with cores in other nodes. This is especially true on the QTP clusters, where nodes can only communicate via Gigabit Ethernet. The HPC Center cluster has nodes that can communicate over InfiniBand, which is much faster than Gigabit Ethernet. See Programming Intro for more details.
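For an OpenMP program the number of threads is usually controlled with the standard OMP_NUM_THREADS environment variable. In a single-node job like the one above you could set it to the number of requested cores before starting the program (my_openmp_prog is a hypothetical program name):

# use as many threads as cores requested on the node
export OMP_NUM_THREADS=8
./my_openmp_prog > prog.log 2>&1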

The following is a parallel job to run an OpenMPI program on multiple nodes, with 2 CPUs (or cores) per node. (download the job script file)

#!/bin/sh
#PBS -N para
#PBS -o para.out
#PBS -e para.err
#PBS -q quick
#PBS -l nodes=2:ppn=2:arwen
echo Testing parallel...
hostname
# simple way....
mpirun hello > para.log 2>&1

# hard way...
echo PBS_NODEFILE
cat $PBS_NODEFILE
N=`wc -l $PBS_NODEFILE | awk '{print $1}'`
echo Nr nodes $N
echo starting hello...
mpirun -np $N hello > para.log 2>&1
echo done.
All flags are the same as for the serial job, except "-l". It specifies that you request a list of hosts of certain types. The format is nodes=N1:ppn=M:type1 to ask for N1 nodes with M processors per node of type type1. Currently two types are defined on arwen: bc1 and bc2 for BladeCenter 1 and 2 in the cluster arwen; these are the nodes arwen0* and arwen1*, respectively. It is advantageous for MPI jobs to run inside the same BladeCenter to reduce communication delays. The example job requests 2 nodes of type arwen and specifies ppn=2, i.e. both processors in each node. The scheduler will try to find two nodes in one of the two blocks of 14 nodes in arwen (bc1 or bc2) that share a switch, so that communication between the nodes is as fast as possible. As a result the PBS_NODEFILE will contain 4 entries (each node name appearing twice) for the job to use.

The environment variable PBS_NODEFILE is set by PBS and is the name of a file listing the hosts that have been assigned to your job.

You can download the small program hello.cpp to try running this job.
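Assuming the OpenMPI compiler wrappers are in your path, a minimal sequence to build the example and submit the job could be (para.job is the name assumed here for the script shown above):

mpicxx hello.cpp -o hello   # compile with the OpenMPI C++ wrapper
qsub para.job               # submit the parallel job script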

NOTE on LAM MPI (obsolete)
The LAM MPI commands create a temporary directory of the form lam-user@host-pbs-jobnr.arwen in the directory TMPDIR=/scr_1/tmp. This directory will be deleted by the scratch cleaning daemon. To avoid this, make a directory with the correct name and then add to your script
LAM_MPI_SESSION_PREFIX=/scr_1/tmp/hostname.myjob.PID
export LAM_MPI_SESSION_PREFIX
Then lamboot will create the directory in a place that is safe for as long as your job runs.

Administrator introduction

  1. Torque server daemon The server daemon pbs_server runs on wukong and is the interface to the nodes of the cluster. The configuration of the daemon is done through commands; these are collected in the file server_config_cmds.

    All nodes are listed in a nodes file with properties indicating the switch group (for example bc1 or bc2 for the BladeCenter an arwen node is part of), the cluster name, and the architecture, together with the number of CPUs (np), as follows:

    arwen00 bc1 arwen i686 np=2
    arwen01 bc1 arwen i686 np=2
    ...
    arwen10 bc2 arwen i686 np=2
    ...
    haku00 hs0 haku x8664 np=2
    ...
    haku10 hs1 haku x8664 np=2
    ...
    ra00 rs0 ra x8664 np=2
    ...
    ra10 rs1 ra x8664 np=2
    ...
    ra20 rs2 ra x8664 np=2
    ...
    ra30 rs3 ra x8664 np=2
    ...
    wukongf np=8 ws0 wukong x8664
    
    Jobs will be allocated to node/CPU slots.

    The properties defined for the nodes allow you to target specific groups of nodes easily.
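    As a sketch of what the configuration commands in server_config_cmds look like, a queue similar to quick could be defined with qmgr roughly as follows (the exact settings here are an illustration, not the actual site configuration):

    qmgr -c "create queue quick queue_type=execution"
    qmgr -c "set queue quick resources_default.walltime=0:30:00"
    qmgr -c "set queue quick resources_max.walltime=2:00:00"
    qmgr -c "set queue quick enabled=true"
    qmgr -c "set queue quick started=true"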

  2. Torque execution daemon Each node runs a pbs_mom daemon with configuration file
    arch x86_64
    logevent       511
    $clienthost     wukong
    $restricted     wukong
    $ideal_load	8.0
    $max_load	10.0
    $usecp	*:/	/
    $tmpdir /scr_1/tmp
    size[fs=/scr_1/tmp]
    $timeout 120
    $fatal_job_poll_failure false
    	      
  3. Moab scheduler The scheduler Moab also runs on wukong. The scheduler configuration can be complex and is defined in the file moab.cfg; see the Cluster Resources website for documentation.
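    As a minimal illustration, the top of a moab.cfg typically identifies the scheduler and its resource manager with lines of the following form (the host name and port shown are assumptions, not the actual site values):

    SCHEDCFG[Moab]  SERVER=wukong:42559
    RMCFG[base]     TYPE=PBS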
