University of Florida - The Foundation of the Gator Nation
University of Florida College of Liberal Arts and Sciences
Quantum Theory Project QTP Home page
Slater Lab

LoadLeveler Scheduler Guide

LoadLeveler Definitions

The software that manages the many long-running jobs on the SP systems and RS/6000 servers is called LoadLeveler. An overview of the different job classes and the principles of their use is explained below. The set of classes is carefully designed to ensure that a user in need of computing resources can use a large portion of the machine if it is idle, without compromising the turnaround time of others. Any user is able to submit a job, and jobs should start to run within at most five days. 

Class is the term used to designate a class of jobs. On the QTP system, we use classes to divide jobs in a rough way according to total runtime and whether the job performs heavy disk activity (large input/output bandwidth used continuously). 

Server is the machine that registers with the LoadLeveler system to run jobs from a given class. Because LoadLeveler is so dynamic, a typo in a class name will not be flagged as an error: LoadLeveler assumes that the server(s) for that class are temporarily down and queues the job, patiently waiting for the matching server to register itself. 

Requirements and Preferences allow finer and more flexible control than the class over which server will run a job. You can specify that some of the concepts below are a requirement for your job, which means the server must have the condition or feature, or a preference, which means that you would like the job to run on a server that has the condition or feature, but that if another server is free, you prefer the job to start earlier on the server without it. 
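In a job command file these take the form of the requirements and preferences keywords. A minimal sketch (the particular conditions chosen here are illustrative):

```shell
# Job command file fragment (sketch): the server MUST be a power3 node
# running AIX 5.1, and SHOULD have at least 1 GB of RAM if such a node is free.
# @ requirements = (Arch == "power3") && (OpSys == "AIX51")
# @ preferences  = (Memory >= 1024)
```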

Arch There are two architectures in the LoadLeveler complex: p2sc and power3. Xena III has 80 160 MHz POWER2SC CPUs, and XENA II has 135 MHz POWER2SC CPUs. Nodes yena7c through yena7f of Xena III have two 200 MHz POWER3 CPUs. Nodes simu1 through simu9 have four 375 MHz POWER3 CPUs. You must specify the Arch requirement, for example 

Requirements = (Arch == "power3") && (OpSys == "AIX51")
because the default is powerpc, which is the architecture of the LoadLeveler manager workstation, and it will never run any job itself. 

OpSys The operating system can also be used to distinguish servers, but this characteristic is not used on our system. QUANTA is running the AIX 4.3 operating system; XENA II is running AIX 5.1. That system needs the line

Requirements = (Arch == "p2sc") && (OpSys == "AIX51")

Memory You can specify that your job requires or prefers servers with more RAM in MBytes than a given number. QTP has servers with 64 MB, 128 MB, 256 MB, 1024 MB, and 4096 MB. You can specify Memory >= 1024 to request a server with RAM equal to or greater than 1 GB. You can monitor the available memory space with 

llstatus -l hostname 
or more specifically 
llstatus -r %m hostname 
For example, to ask for a power3 node with at least 1 GB of RAM: 
Requirements = (Arch == "power3") && (OpSys == "AIX51") && (Memory >= 1024)

Disk Similarly, a job can have a requirement or preference for more than a given amount of disk space, in KiloBytes, available in one file system monitored by LoadLeveler. The scratch file system pointed to by /scr_1/tmp on each node of XENA II and III has been configured to be this monitored file system. You can specify Disk >= 1000000 to request a server with at least 1 GB of free space in /scr_1/tmp at the time the job is started. Note that some file systems are shared by several servers and therefore this is no guarantee that the space will be available to that one job all the time. Nevertheless, the requirement or preference is useful to select the correct server to run on. You can obtain the available disk space from LoadLeveler with 

llstatus -l hostname 

or more specifically 

llstatus -r %d hostname 
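Because the scratch file system may be shared, a job script can double-check the free space itself when it starts. A minimal sketch using the standard df command (here /tmp stands in for the monitored /scr_1/tmp file system, and the threshold is illustrative):

```shell
#!/bin/sh
# Sketch: verify free scratch space before starting the real work.
# /tmp stands in for /scr_1/tmp; NEED_KB is an illustrative threshold.
SCRATCH=/tmp
NEED_KB=1000
# df -kP gives POSIX one-line-per-filesystem output; column 4 is available KB
free_kb=`df -kP "$SCRATCH" | awk 'NR==2 {print $4}'`
if [ "$free_kb" -ge "$NEED_KB" ]; then
    echo "enough scratch space: ${free_kb} KB free in $SCRATCH"
else
    echo "not enough scratch space in $SCRATCH" >&2
    exit 1
fi
```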

Feature In addition to the predefined requirements and preferences listed above, there are a few site-specific criteria, called Features. No features are currently defined.

Interactive Nodes

The SP nodes YENA00 through YENA70 and YENA7C through YENA7F, and XENA00, XENA10, XENA20, XENA30, XENA40, XENA50, XENA60, XENA70, XENA80, XENA90, XENAA0, XENAB0 should be used for interactive work. You can submit jobs and check the status of jobs and servers from your desktop workstation. You cannot log in to other SP nodes. Consult the section on interactive use for details on how to run interactive jobs on the SP. 
There are 192 nodes in Xena II, 172 of which are for computation, 12 are interactive nodes XENA00, XENA10, XENA20, XENA30, XENA40, XENA50, XENA60, XENA70, XENA80, XENA90, XENAA0, XENAB0, and 8 are parallel file system server nodes. The latter can be used for interactive work as well, but may be very slow if I/O-intensive jobs are running on the compute nodes.
There are 90 nodes in Xena III, 75 of which are for computation, 8 are interactive nodes YENA00, YENA10, YENA20, YENA30, YENA40, YENA50, YENA60, YENA70, and 8 are parallel file system server nodes. The latter can be used for interactive work as well, but may be very slow if I/O-intensive jobs are running on the compute nodes.

LoadLeveler Use

The sections below present a quick overview of how to use the system effectively. Some of the material is repeated to serve as a quick reference. Full documentation is available online.

The best tool to learn about creating, submitting, and following a job is xloadl. This X-based tool is started by typing 

xloadl &
and it shows several panels: 
  • One with the running jobs, the output of llq
  • One with the participating servers, the output of llstatus 
  • One with the messages from LoadLeveler about the commands you issued from xloadl 
A visual summary of llstatus, updated every 15 minutes, is accessible on the web.

From xloadl you can prepare a job. All fields are shown in fill-in panels, and available choices are explicitly shown or available on a pull-down menu. Do not name an executable when creating a job script; always use the "append a script" option from the job creation menu. The reason is that the environment for a directly named executable cannot be controlled as precisely, which makes that option confusing, even to the expert user. Using a script is much more effective and much easier to debug. In this script you should create the necessary scratch directory in /scr_1/tmp for your job according to the conventions in place at QTP. You should then start the program that you wish to run from the script. Finally, you should save results from the calculation to a safe place and delete the scratch directory. 

Note the following: The directories must have the form "hostname"."string"."PID".
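A minimal sketch of that convention in a job script (the label "myjob" and the use of /tmp in place of /scr_1/tmp are illustrative):

```shell
#!/bin/sh
# Create a scratch directory named "hostname"."string"."PID",
# run the program there, then clean up.
SCR_ROOT=/tmp                  # stands in for /scr_1/tmp
host=`uname -n`
scrdir="$SCR_ROOT/$host.myjob.$$"
mkdir -p "$scrdir"
echo "scratch directory: $scrdir"
# ... run the program with its scratch files in $scrdir ...
# ... copy results to a safe place ...
rm -rf "$scrdir"
```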

In the job script you must specify a class. Details about the class can be found with one of the xloadl menus or with the command 

llclass
llclass -l classname will display all details about a specific class. There are also fields to specify the Requirements and Preferences defined before. The full explanation of how to use these is given later. Requirements are characteristics of the server, or set of servers for a parallel job, executing your job that must be present for your job to run successfully. Preferences are characteristics that you consider nice to have on the server(s) that run your job, maybe because the job then runs faster or more efficiently. You can specify whether you want to receive e-mail from LoadLeveler when your job completes. 

Once the job command file is created and the script appended and both are saved in a file with the name and location of your choice, you must submit the job. This can be done from xloadl, or when you become more experienced, with the command: 

llsubmit jobfile
Job files are usually given a .cmd or .job extension. This is not mandatory. 
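For example, a minimal serial job file might look like this (a sketch; the class quick and the file name myjob.cmd are illustrative, adjust the class to the output of llclass):

```shell
#!/bin/ksh
# myjob.cmd -- minimal serial job (sketch); submit with: llsubmit myjob.cmd
# @ class        = quick
# @ output       = myjob.out
# @ error        = myjob.err
# @ requirements = (Arch == "p2sc") && (OpSys == "AIX51")
# @ queue
host=`uname -n`
echo "running on $host"
```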

After the job is submitted, it is assigned a name of the form pixi.1234.0 and it will show up in the queue. After some time it will start executing: its status in llq will become R. As long as the job is not running, you can check LoadLeveler's evaluation of whether or not to run your job and on what server by running the command 

llq -s pixi.1234.0
where we used the job name of the example. This will show in detail which servers are eligible to run your job because they meet your Requirements and maybe Preferences, and why one or more of them did not start your job yet. A valid reason could be that all eligible servers are busy running other jobs. An invalid reason could be that you asked for a class or Feature that does not exist because of a typo in the name. 

If a job, in the queued state or running, is incorrect and you want to delete it, use 

llcancel pixi.1234.0
and LoadLeveler will terminate the job if it is running and remove it from the queue. 

For debugging and testing it is important to be able to run interactive parallel jobs. Even though this facility is designed for parallel jobs, it can be used for serial jobs as well by asking for just one processing node. This is done through LoadLeveler and uses the special class interactive. Parallel MPI programs are started with the command poe; see the online documentation under Parallel Environment for details. poe is the IBM equivalent of the MPICH command mpirun. When a program is compiled with mpxlf or mpxlc or any of the equivalent commands, the resulting executable has poe built in and the program can be started in parallel by just naming the executable. 

Environment variables like MP_PROCS or arguments like -procs are used to tell the system what kind of parallel execution is required. To avoid over-allocation of resources, the QTP system has been configured to channel all requests through the LoadLeveler system. This happens automatically: poe writes the command file and submits it to LoadLeveler, LoadLeveler assigns nodes, and then poe starts the program on the allocated nodes. You can check with llclass interactive whether there are still free slots in the interactive class at any given time. If there are no free slots, interactive jobs started with poe will fail with an error message. It is possible, with an environment variable, to ask poe to keep repeating the request to LoadLeveler for a given time. 
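The retry behaviour is controlled by environment variables. A sketch (MP_RETRY is the wait in seconds between attempts and MP_RETRYCOUNT the number of attempts; check the Parallel Environment documentation for your PE level, and note that mympi.exe is a placeholder name):

```shell
# Ask poe to retry the LoadLeveler node request instead of failing
# immediately when the interactive class is full.
export MP_PROCS=4         # number of parallel processes wanted
export MP_RETRY=60        # seconds to wait between requests to LoadLeveler
export MP_RETRYCOUNT=30   # number of retries (here: keep trying ~30 minutes)
# then start the program as usual:
# poe ./mympi.exe
```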

There are several pools of nodes. To run interactively on P2SC nodes, log in to one of XENA*0 or YENA*0, and create a file host.list with the names yena00 through yena70 or xena00 through xenab0, one per line. MP_PROCS must not exceed the number of such nodes.
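Writing host.list by hand is error-prone; here is a small sketch that generates it for the interactive P2SC pool named above (yena00 through yena70):

```shell
#!/bin/sh
# Generate host.list for poe: one interactive node name per line.
: > host.list
for n in 00 10 20 30 40 50 60 70; do
    echo "yena$n" >> host.list
done
cat host.list
```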

Note that you cannot run in parallel over the SP switch mixing XENA II and XENA III nodes, because they are attached to different switches. 

To run interactively on POWER3 nodes you must log in to YENA7C through YENA7F, and make a host.list file with yena7c twice, one per line. MP_PROCS must be 2 in this case. 

Job Classes and Servers

The Slater Lab has two LoadLeveler domains defined:

  • One for Xena II, which also manages the SIMU cluster in the Visualization Lab. This complex is accessed by appending a "2" to the commands as described in the LoadLeveler documentation: for example llq2 to list the jobs in the queue.
  • One for XENA III, accessed by appending a "3" to the commands: for example llq3 to list the jobs in the queue.
In the two LoadLeveler complexes some or all of the following classes are defined:
  • interactive for testing and debugging on some QUANTA and XENA nodes.
  • quick for short running serial and parallel development and production jobs on some QUANTA and XENA nodes. These jobs can have intense I/O or large RAM requirements.
  • lhuge for long running, serial and parallel development and production jobs requiring very large RAM on QUANTA nodes 8 through 16.
  • lhugex for long running, serial and parallel development and production jobs requiring very large RAM on XENA II nodes.
In the two LoadLeveler complexes the following sets of servers are defined:
  • XENA III nodes with 2 CPUs, 1000 GB GPFS disk and 18 GB local disk; 1 GB or 4 GB RAM.
  • XENA III nodes with 1000 GB GPFS disk and 512 MB RAM.
  • XENA II nodes with 400 GB GPFS disk, 4 GB local disk and 1 GB RAM.
  • SIMU nodes with 1 or 4 CPUs, 18 GB local disk and 256 MB or 3 GB RAM.
Use llclass2 or llclass3 and llstatus2 or llstatus3 to find out details for the classes and servers at any time. Use the Architecture and Features in the Requirements and Preferences defined before, as explained in the next section, to direct your jobs to the proper server(s). 

Selecting the right Class and Server

The requirements on the design of a workload management system are complex and often contradictory: a small number of classes gives the system more control to run jobs in the queue when slots become available, because all slots are of the same kind. On the other hand, providing specific classes for different jobs makes it easier to ensure that jobs run on the right system. The old configuration of QUANTA opted for many classes, but then some CPUs could turn idle if a user chose to submit a job in one class, even though the job could run equally well in another. With the expansion brought about by XENA, we use the other approach of fewer classes. To make job processing efficient, some new capabilities of LoadLeveler will be exploited. 
Consult the disk usage documentation for general guidelines on how to decide where to put the different files used by your jobs.

Permissions to run Jobs

Until now, the configuration at QTP was such that users were only permitted to submit jobs in particular classes. This made it possible to guarantee that the system was used efficiently, but it often required extra work on the user's part to find the class that was most likely to run a job soonest, and sometimes required moving jobs from one class to another. This often happens with jobs that are appropriate for, or fit in, several classes. 

With the significant increase in compute power, the opposite approach will be tried: Everyone is allowed to submit any kind of job. The hope is that users will take the time to study the architecture of the system and the procedures and rules explained in these pages and make good use of the system. 

Any user who is found to abuse the privileges granted will be denied access to the system. Privileges will be restored only after he or she passes a test on the architecture and proper use of the system. 

Introduction to Submitting Parallel LoadLeveler Jobs

LoadLeveler supports many types of parallel jobs: 

  • Pure MPI jobs with each task/process running on a separate node 
  • Pure MPI jobs with all tasks running on a single SMP node 
  • Pure MPI jobs with multiple tasks/processes running on some or all SMP nodes 
  • Pure SMP jobs running on a single SMP node 
  • Mixed MPI and SMP jobs running on one or more SMP nodes 
  • Mixed MPI and SMP jobs running on one or more SMP nodes with dedicated nodes 
This brief tutorial gives example LoadLeveler command files to run jobs of each kind. It is assumed that the system has several SMP nodes and that a class par_class has been defined in such a way that the number of LoadLeveler initiators matches the number of CPUs on each node. 

LoadLeveler controls nodes and MPI processes very precisely in the sense that one user cannot adversely affect any other user. There are two exceptions to that: 

  • Intense I/O by one job can adversely affect other jobs running on other CPUs on the same node by flooding the memory and I/O subsystems. 
  • A user can request one LoadLeveler slot on an SMP node, but once LoadLeveler starts the task on that node there is no simple way for the system to prevent that task from spawning multiple threads, each competing for CPUs on that node and thereby adversely affecting jobs allocated by LoadLeveler to other slots on the same node. 
An introduction to creating parallel programs is available. 

A typical LoadLeveler job consists of a shell script with LoadLeveler directives, marked by the @ sign at the beginning, embedded as comments. The easiest way to create such files is by using xloadl. 

#!/bin/ksh
# @ output = mpi.out
# @ error = mpi.err
# @ class = par_class
# @ notification = complete
# @ notify_user = name@domain
# @ checkpoint = no
# @ restart = no
# @ requirements = (Arch == "R6000" && OpSys == "AIX43")
# @ wall_clock_limit = 10:00,9:00
#
# ---- begin specification of the kind of parallel job
# @ job_type = parallel
# @ node_usage = shared
# @ node = 2,2
# @ tasks_per_node = 1
# @ network.MPI = css0,not_shared,US
# ---- end specification of the kind of parallel job
#
# mandatory directive to process the job.
# @ queue
#
# User commands follow....
echo "pure MPI job on 2 different nodes:"
echo master: `uname -n`
# ==== begin executable statement
/usr/bin/poe $HOME/mpi.exe -euilib us
# ==== end executable statement
echo "Job done."

In the sections below, the differences from the above job script are shown for each of the kinds of jobs supported by a modern RS/6000 SP system. 

MPI Jobs

  • Pure MPI jobs with each task/process running on a separate node 
  • LoadLeveler reserves one task/process slot per node on exactly two nodes for this job, and one slot per node on the fast switch in USER mode. The poe command in the script starts the tasks on the allocated nodes. 

    # ---- begin specification of the kind of parallel job
    # @ job_type = parallel
    # @ node_usage = shared
    # @ node = 2,2
    # @ tasks_per_node = 1
    # @ network.MPI = css0,not_shared,US
    # ---- end specification of the kind of parallel job
    ...
    # ==== begin executable statement
    /usr/bin/poe $HOME/mpi.exe -euilib us
    # ==== end executable statement

    With the node keyword node=2,4 the user can specify that at least 2 nodes are required, but 4 are preferred. The remaining CPUs can be allocated to other serial or parallel jobs, but each task is guaranteed the full use of a single CPU, even if it does not fully utilize it. A task may fail to utilize a CPU at 100%, for example if it spends a lot of time waiting for messages from other tasks. 
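The corresponding directive, as a fragment of the job command file above, would read:

```shell
# @ node = 2,4      # at least 2 nodes required, 4 preferred
```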

    • Pure MPI jobs with all tasks running on a single SMP node 
      For this job LoadLeveler reserves two tasks on a single node. The specification for using the switch in USER mode is irrelevant for PSSP 3.1.1 or greater, since communication across the switch to tasks on the same node is short-circuited through shared memory with faster response. 

      # ---- begin specification of the kind of parallel job
      # @ job_type = parallel
      # @ node_usage = shared
      # @ node = 1,1
      # @ tasks_per_node = 2
      # @ network.MPI = css0,not_shared,US
      # ---- end specification of the kind of parallel job
      ...
      # ==== begin executable statement
      /usr/bin/poe $HOME/mpi.exe -euilib us
      # ==== end executable statement
       

    • Pure MPI jobs with multiple tasks/processes running on some or all SMP nodes 
    LoadLeveler reserves two slots on each of 2 nodes for this job and 2 fast switch slots in USER mode. There are now nodes with more than 4 CPUs, and in PSSP 3.1.1 it is not possible to get a USER mode slot for each task if more than four tasks are requested. You can still use the switch in IP mode. 

    # ---- begin specification of the kind of parallel job
    # @ job_type = parallel
    # @ node_usage = shared
    # @ node = 2,2
    # @ tasks_per_node = 2
    # @ network.MPI = css0,not_shared,US
    # ---- end specification of the kind of parallel job
    ...
    # ==== begin executable statement
    /usr/bin/poe $HOME/mpi.exe -euilib us
    # ==== end executable statement

    A variation on the previous example is the following, where the user specifies the number of nodes and the total number of tasks, instead of the number of nodes and tasks per node. This gives LoadLeveler extra flexibility in finding slots for the job, so it is likely to be scheduled sooner. Instead of 3 tasks on each of the 2 nodes, LoadLeveler may allocate 2 tasks on one node and 4 on another, and so on. 

    # ---- begin specification of the kind of parallel job
    # @ job_type = parallel
    # @ node_usage = shared
    # @ node = 2
    # @ total_tasks = 6
    # @ network.MPI = css0,not_shared,US
    # ---- end specification of the kind of parallel job
    ...
    # ==== begin executable statement
    /usr/bin/poe $HOME/mpi.exe -euilib us
    # ==== end executable statement
     

    SMP Jobs

    • Pure SMP jobs running on a single SMP node 
    LoadLeveler can reserve slots for CPUs to be used by shared memory programming (SMP) executables. The following job asks for 3 CPUs and then starts an executable compiled with OpenMP that will run with three threads. AIX will schedule these threads on three CPUs. It makes the request by asking for one node and three tasks and not asking for any fast switch slots. 

    # ---- begin specification of the kind of parallel job
    # @ job_type = parallel
    # @ node_usage = shared
    # @ node = 1,1
    # @ tasks_per_node = 3
    # ---- end specification of the kind of parallel job
    ...
    # ==== begin executable statement
    # this is not needed for g98, you specify the number of CPUs in benzene.inp
    # For OpenMP program with XLF 7.1, use
    export OMP_NUM_THREADS=3
    # For OpenMP program with XLF 6.1, use
    export XLSMPOPTS="parthds=3"
    $g98root/g98 benzene.inp > benzene.log
    # clean up shared memory segments with g98
    $g98root/bsd/clearipc
    # ==== end executable statement

    The program started in this example is the Gaussian 98 executable. It is assumed that the Gaussian command %NProc=3 is included in the input file benzene.inp. 

    Obviously, a request for an SMP job that asks for more tasks on a single node than any node in the SP complex has cannot be satisfied. Currently that limit is 8, with the power3 high node. 

    Mixed SMP and MPI Jobs

    • Mixed MPI and SMP jobs running on one or more dedicated SMP nodes 
    The easiest way to run a mixed MPI/SMP job is to request a number of dedicated nodes. LoadLeveler will allocate one task each on a number of nodes and not allocate any other tasks from any other jobs on the same nodes. It will also allocate one fast switch slot per node to the job. The host list passed to the poe command will contain the list of nodes, and poe will therefore start one process on each node. Each process will then in turn start as many threads as there are CPUs on the node; in the example we assumed power3 high nodes with 8 CPUs. 
      # ---- begin specification of the kind of parallel job
      # @ job_type = parallel
      # @ node_usage = not_shared
      # @ node = 2,2
      # @ tasks_per_node = 1
      # @ network.MPI = css0,not_shared,US
      # ---- end specification of the kind of parallel job
      ...
      # ==== begin executable statement
      export OMP_NUM_THREADS=8
      /usr/bin/poe $HOME/mpiopenmp.exe -euilib us
      # ==== end executable statement
    • Mixed MPI and SMP jobs running on one or more SMP nodes with shared nodes 
    Making LoadLeveler allocate the correct number of tasks for this kind of parallel job is beyond the scope of this tutorial. LoadLeveler does provide a way to allocate the nodes and slots for such jobs, but the host list passed to poe will convey to poe that one task should be started per allocated slot/CPU. This is not what one wants for this kind of job; rather, one needs one task per node, which then splits off one thread per CPU. 


Have a Question? Contact us.
Last Updated 12/15/07
University of Florida