LoadLeveler Scheduler Guide
LoadLeveler Definitions

The software that manages the many long-running jobs on the SP systems and RS/6000 servers is called LoadLeveler. An overview of the different job classes and the principles of their use is explained below. The set of classes is carefully designed to ensure that a user in need of computing resources can use a large portion of the machine if it is idle, without compromising the turnaround time of others. Any user is able to submit a job, and jobs should start to run within at most five days.

Class is the term used to designate a class of jobs. On the QTP system, we use classes to divide jobs in a rough way according to total runtime and whether the job performs heavy disk activity (large input/output bandwidth used continuously).

Server is the machine that registers with the LoadLeveler system to run jobs from a given class. Because LoadLeveler is so dynamic, a typo in a class name will not be reported as an error: LoadLeveler assumes that the server(s) for that class are temporarily down and queues the job, patiently waiting for the matching server to register itself.

Requirements and Preferences allow finer and more flexible control than the class over which server will run a job. You can specify that some of the concepts below are a requirement for your job, which means the server must have the condition or feature, or that they are a preference, which means that you would like the job to run on a server that has the condition or feature, but that if another server is free, you prefer the job to start earlier on the server without it.

Arch There are two architectures in the LoadLeveler complex: p2sc and power3. XENA III has 80 160-MHz POWER2SC CPUs, and XENA II has 135-MHz POWER2SC CPUs. Nodes yena7c through yena7f of XENA III have two 200 MHz POWER3 CPUs. Nodes simu1 through simu9 have four 375 MHz POWER3 CPUs.
You must specify the Arch requirement in your job command file.

OpSys The operating system can also be used to distinguish servers, but this characteristic is not used on our system. QUANTA is running the AIX 4.3 operating system; XENA II is running AIX 5.1. A job intended for that system would need a corresponding OpSys requirement line.

Memory You can specify that your job requires or prefers servers with more RAM, in MBytes, than a given number. QTP has servers with 64 MB, 128 MB, 256 MB, 1024 MB and 4096 MB. You can specify Memory >= 1024 to request a server with RAM equal to or greater than 1 GB. You can monitor the available memory with llstatus -l.

Disk Similarly, a job can have a requirement or preference for more than a given amount of disk space, in KiloBytes, available in the one file system monitored by LoadLeveler. The scratch file system pointed to by /scr_1/tmp on each node of XENA II and III has been configured to be this monitored file system. You can specify Disk >= 1000000 to request a server with at least 1 GB of free space in /scr_1/tmp at the time the job is started. Note that some file systems are shared by several servers, and therefore this is no guarantee that the space will be available to that one job all the time. Nevertheless, the requirement or preference is useful to select the correct server to run on. You can obtain the available disk space from LoadLeveler with llstatus -l.

Feature In addition to the predefined requirements and preferences listed above, there are a few site-specific criteria, called Features. No features are currently defined.
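Requirements and preferences of this kind are written as job-command-file directives. A minimal sketch, in which the class name "short" and the threshold values are illustrative assumptions rather than actual QTP settings:

```shell
# Hypothetical fragment of a LoadLeveler job command file.
# Require a POWER3 server with at least 1 GB of RAM, and prefer one
# with roughly 1 GB free in the monitored scratch file system.
# @ class        = short
# @ requirements = (Arch == "power3") && (Memory >= 1024)
# @ preferences  = (Disk >= 1000000)
# @ queue
```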
Interactive nodes The SP nodes YENA00 through YENA70 and YENA7C
through YENA7F and XENA00,
XENA10, XENA20, XENA30, XENA40, XENA50, XENA60, XENA70, XENA80,
XENA90, XENAA0, and XENAB0 should be
used for interactive work. You can submit jobs and check the status of
jobs and servers from your desktop workstation. You cannot log in to other
SP nodes. Consult the section on interactive
use for details on how to run interactive jobs on the SP.
LoadLeveler Use

The sections below present a quick overview of how to use the system effectively. Some of the material is repeated to serve as a quick reference. Full documentation is available online. The best tool for learning how to create, submit, and follow a job is xloadl. This X-based tool is started by typing xloadl.
From xloadl you can prepare a job. All fields are shown in fill-in panels, and available choices are explicitly shown or available on a pull-down menu. Do not name an executable when creating a job script; always use the "append a script" option from the job creation menu. The reason is that the environment for a directly named executable cannot be controlled as precisely, which makes that option more confusing, even to the expert user. Using a script is much more effective and much easier to debug.

In this script you should create the necessary scratch directory in /scr_1/tmp for your job according to the conventions in place at QTP. You should then start the program that you wish to run from the script. Finally, you should save results from the calculation to a safe place and delete the scratch directory. Note the following: the directories must have the form "hostname"."string"."PID".

In the job script you must specify a class. Details about the classes can be found in one of the xloadl menus or with the llclass command. Once the job command file is created, the script appended, and both saved in a file with the name and location of your choice, you must submit the job. This can be done from xloadl or, when you become more experienced, with the llsubmit command. After the job is submitted, it gets assigned a name of the form pixi.1234.0 and it will show up in the queue. After some time it will start executing: its status in the llq listing will become R. As long as the job is not running, you can check LoadLeveler's evaluation of whether or not to run your job, and on what server, with llq -s. If a job, queued or running, is incorrect and you want to delete it, use llcancel.

For debugging and testing it is important to be able to run interactive parallel jobs. This is done through LoadLeveler and uses the special class interactive. Even though this facility is designed for parallel jobs, it can be used for serial jobs as well by asking for just one processing node.
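The scratch-directory convention just described can be sketched as a small shell fragment. The base directory /tmp and the label myjob are stand-ins used here for illustration; on the SP nodes the scratch base is /scr_1/tmp:

```shell
# Sketch of the QTP scratch convention: "hostname"."string"."PID".
# BASE and the label "myjob" are illustrative assumptions.
BASE=/tmp                    # on the SP nodes: BASE=/scr_1/tmp
HOST=`hostname`
SCRDIR=$BASE/$HOST.myjob.$$  # $$ is the PID of the running script
mkdir -p "$SCRDIR"
cd "$SCRDIR"
# ... start the program you wish to run here ...
# ... save results from the calculation to a safe place ...
cd /
rm -rf "$SCRDIR"             # always delete the scratch directory
```

Using the PID in the directory name keeps concurrent jobs on the same node from colliding.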
Parallel MPI programs are started with the command poe; see the online documentation under Parallel Environment for details. poe is the IBM equivalent of the MPICH command mpirun. When a program is compiled with mpxlf or mpxlc or any of the equivalent commands, the resulting executable has poe built in, and the program can be started in parallel by just naming the executable. Environment variables like MP_PROCS or arguments like -procs are used to tell the system what kind of parallel execution is required.

To avoid over-allocation of resources, the QTP system has been configured to channel all requests through the LoadLeveler system. This happens automatically: poe writes the command file and submits it to LoadLeveler, LoadLeveler assigns nodes, and then poe starts the program on the allocated nodes. You can check with llclass interactive whether there are still free slots in the interactive class at any given time. If there are no free slots, interactive jobs started with poe will fail with an error message. It is possible, with an environment variable, to ask poe to keep repeating the request to LoadLeveler for a given time.

There are several pools of nodes. To run interactively on P2SC nodes, log in to one of XENA*0 or YENA*0, and create a file host.list with the names yena00 through yena70 or xena00 through xenab0, one per line. MP_PROCS must be no more than the number of such nodes. Note that you cannot run in parallel over the SP switch mixing XENA II and XENA III nodes, because they are attached to different switches. To run interactively on POWER3 nodes you must log in to YENA7C through YENA7F, and make a host.list file with yena7c twice, one per line. MP_PROCS must be 2 in this case.
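For example, a two-task interactive run on the POWER3 pair just described could be set up as follows. Here a.out stands for an MPI executable built with mpxlf (an assumption for illustration), and the poe invocation itself is shown as a comment since it only works on the SP nodes:

```shell
# Build the host list: the node name once per task, one name per line.
cat > host.list <<EOF
yena7c
yena7c
EOF
export MP_PROCS=2    # number of parallel tasks
# poe reads ./host.list by default and starts the tasks:
#   poe ./a.out
```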
Job Classes and Servers

The Slater Lab has two LoadLeveler domains defined:
Selecting the right Class and Server
The requirements on the design of a workload management system are complex
and often contradictory. A small number of classes gives the system more
control to run jobs in the queue when slots become available, because all
slots are of the same kind. Providing specific classes for different jobs,
on the other hand, is better for making sure the jobs run on the right
system. The old configuration of QUANTA opted for many classes, but then
some CPUs could turn idle if a user chose to submit a job in one class,
even though the job could run equally well in another. With the expansion
brought about by XENA, we use the other approach of fewer classes. To make
job processing efficient, some new capabilities of LoadLeveler will be
exploited.
Permissions to run Jobs

Until now, the configuration at QTP was such that users were only permitted to submit jobs in particular classes. This made it possible to guarantee that the system was used efficiently, but often required extra work on the user's part to find the class that was most likely to run a job soonest, and sometimes required moving jobs from one class to another. This often happened with jobs that were appropriate for, or fit in, several classes. With the significant increase in compute power, the opposite approach will be tried: everyone is allowed to submit any kind of job. The hope is that users will take the time to study the architecture of the system and the procedures and rules explained in these pages and make good use of the system. Any user who is found to abuse the privileges granted will be denied
access to the system. Privileges will be restored only after he or she
passes a test on the architecture and proper use of the system.
Introduction to Submitting Parallel LoadLeveler Jobs
LoadLeveler supports many types of parallel jobs:
LoadLeveler controls nodes and MPI processes very precisely in the sense that one user cannot adversely affect any other user. There are two exceptions to that:
A typical LoadLeveler job consists of a shell script with LoadLeveler directives, marked by the @ sign at the beginning, embedded as comments. The easiest way to create such files is by using xloadl.
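A minimal job command file of this form might look as follows; the class name, the output file names, and the trivial script body are illustrative assumptions, not actual QTP conventions:

```shell
#!/bin/ksh
# @ class  = short
# @ output = job.$(jobid).out
# @ error  = job.$(jobid).err
# @ notification = never
# @ queue
# ==== executable statements start here ====
echo "running on `hostname`"
```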
In the sections below, the differences from the job script above are
shown for each of the kinds of jobs supported by a modern RS/6000 SP system.
MPI Jobs
# ---- begin specification of the kind of parallel job
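A set of directives matching the description below, patterned on the mixed-job example later in this section, could be (the exact values are assumptions for illustration):

```shell
# Sketch: request 2 tasks on each of 2 nodes, with a User Space
# switch slot for each task.
# @ job_type = parallel
# @ node = 2,2
# @ tasks_per_node = 2
# @ network.MPI = css0,shared,US
# ---- end specification of the kind of parallel job
```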
LoadLeveler reserves two slots on each of 2 nodes for this job and 2 fast switch slots in US (User Space) mode. There are now nodes with more than 4 CPUs, and then it is not possible in PSSP 3.1.1 to get a US mode slot for each task if more than four tasks are requested. You can still use the switch in IP mode.

SMP Jobs
LoadLeveler can reserve slots for CPUs to be used by shared memory programming (SMP) executables. The following job asks for 3 CPUs and then starts an executable compiled with OpenMP that will run with three threads. AIX will schedule these threads on three CPUs. The job makes the request by asking for one node and three tasks, and by not asking for any fast switch slots.

Mixed SMP and MPI Jobs
The easiest way to get a mixed MPI SMP job is to request a number of dedicated nodes. LoadLeveler will allocate one task each on a number of nodes and not allocate any other tasks from any other jobs on the same node. It will also allocate one fast switch slot per node to the job. The host list passed to the poe command will contain the list of nodes and poe will therefore start one process on each node. Each process will then in turn start as many threads as there are CPUs on the nodes; in the example we assumed power3 high nodes with 8 CPUs.
# @ job_type = parallel
# @ node_usage = not_shared
# @ node = 2,2
# @ tasks_per_node = 1
# @ network.MPI = css0,not_shared,US
# ---- end specification of the kind of parallel job
...
# ==== begin executable statement
export OMP_NUM_THREADS=8
/usr/bin/poe $HOME/mpiopenmp.exe -euilib us
# ==== end executable statement
To make LoadLeveler allocate the correct number of tasks for this kind of parallel job is beyond the scope of this tutorial. LoadLeveler does provide a way to allocate the nodes and slots for such jobs, but the host list passed to poe will convey to poe that one task should be started per allocated slot/CPU. This is not what one wants for this kind of job; rather, one needs one task per node, which then splits off one thread per CPU.
Have a Question? Contact us. Last Updated 12/15/07