PBSPro 7.0 Batch System Guide
User introduction
The commands for managing jobs only work on the node arwen.
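As a rough sketch of day-to-day use on arwen (the script name myjob.pbs and the job id 1234.arwen are made-up placeholders, not examples from this guide):

qsub myjob.pbs        # submit a job script; prints the job id
qstat -u $USER        # list your own queued and running jobs
qdel 1234.arwen       # delete a job by its id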
Queue definitions

The queue definitions can be checked with qstat -q and qstat -Q -f.
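For example, to get a summary of all queues and then the full settings of a single queue (here the quick queue used in the job scripts below):

qstat -q              # one-line summary of every queue
qstat -Q -f quick     # full attribute listing for the quick queue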
Property definitions

OpenPBS allows each node to be assigned properties. These can be specified in your job scripts so that you can direct PBS to allocate the correct node(s) to your job. You can check the properties with pbsnodes -a.
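For example, you can list the properties and then request them in a resource line; the nodes=2:bc1 request below is only an illustration using the bc1 property described in the parallel job section.

pbsnodes -a           # list every node with its state and properties

and in a job script:

#PBS -l nodes=2:bc1   # ask for 2 CPU resources on nodes with the bc1 property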
Use of switches for parallel jobs

PBS will schedule parallel jobs so that all nodes allocated will come from a group of nodes on the same switch, to reduce communication delays. This is automatic. The groups are internally called bc1, bc2, hs0, hs1, rs0, rs1, rs2, rs3. They denote the groups of 14 nodes in Arwen, of 16 nodes in Haku, and of 19 nodes in RA that share the same switch.

Also note that by default PBS scripts start in your $HOME directory, not in the directory where you are when you submit the job. The error and output files, however, do live by default in the directory from which you submit the job.

A simple serial job

(download the job script file)

#!/bin/sh
#PBS -N serial
#PBS -o serial.out
#PBS -e serial.err
#PBS -m abe
#PBS -M deumens@qtp.ufl.edu
#PBS -q quick
#PBS -l nodes=1
echo Testing...
hostname
echo run a serial program...
echo done

This job declares the name "serial" with "-N serial", which will show up in qstat; if no name is specified, the job is named after the script file. The standard output and standard error files are specified with "-o" and "-e". The flag "-m abe" specifies that a mail message should be sent to the address given with "-M" when the job "b"egins, "e"nds, or "a"borts. The queue in which to run the job is specified with "-q quick". The "-l nodes=1" specifies that this job asks for 1 CPU resource, which unfortunately is called "nodes". Since the configuration specifies that each node has 2 CPUs, two jobs can run on each physical node if they each ask for one CPU resource. You can use qstat -f jobid to see what node/CPU combination has been allocated to your job(s).

The following is a parallel job to run a LAM MPI program (download the job script file):

#!/bin/sh
#PBS -N para
#PBS -o para.out
#PBS -e para.err
#PBS -m abe
#PBS -M deumens@qtp.ufl.edu
#PBS -q quick
#PBS -l nodes=4:bc2:ppn=2

cleanup()
{
    lamhalt -v
    echo job killed
    exit
}

echo Testing parallel...
hostname
echo PBS_NODEFILE
cat $PBS_NODEFILE
N=`wc -l $PBS_NODEFILE | awk '{print $1}'`
echo Nr nodes $N
lamboot -v $PBS_NODEFILE
# Catch TERM and KILL signals to shut down MPI
trap cleanup TERM KILL
echo starting hello...
mpirun -np $N hello > para.log 2>&1
lamhalt -v
echo done.

All flags are the same as for the serial job, except "-l". It specifies that you request a list of hosts of certain types. The format is n1:type1+n2:type2+... to ask for n1 nodes of type1 and n2 nodes of type2, etc. Currently two types are defined on arwen: bc1 and bc2 for BladeCenter 1 and 2. These are nodes arwen0* and arwen1* respectively. It is advantageous for MPI jobs to run inside the same BladeCenter to reduce communication delays. The example job requests 4 nodes of type bc2, i.e. in BladeCenter 2, and specifies the further type ppn=2, i.e. processors per node equal to 2; as a result the PBS_NODEFILE will contain 8 entries (each name appearing twice) for the job to use. The environment variable PBS_NODEFILE is set by PBS and is the name of a file listing the hosts that have been assigned to your job. You can download the small program hello.cpp to try running this job.

The LAM MPI commands create a temporary directory of the form lam-user@host-pbs-jobnr.arwen in the directory TMPDIR=/scr_1/tmp. This directory will be deleted by the scratch cleaning daemon. To avoid this, make a directory with the correct name and then add to your script

LAM_MPI_SESSION_PREFIX=/scr_1/tmp/hostname.myjob.PID
export LAM_MPI_SESSION_PREFIX

Then lamboot will create the directory in a place that is safe as long as your job runs.

Administrator introduction