Making the correct paths for the 192 cables that connect the 6 switches
in each pair of node frames to the 8 switches in the centrally located
switch frame proved to be a challenge. The cables were barely long
enough to reach, because of the shape of the room and because we
routed the cables up high. The advantage of putting the cables high
is that they are visible to visitors and show the complexity of the
system more clearly. With a raised computer room floor, that complexity
is hidden from visitors.
Power Up
On April 12 at 11 am the system was powered up. It is fed by 3 breakers,
1 of 200A and 2 of 100A. With the system completely switched on the 200A
breaker was measured to supply 69A on each phase (very much in balance)
and the 100A breakers were giving 36A per phase. At startup 4 nodes and
1 switch were showing diagnostic trouble lights. Once the control workstation
installation has progressed further, it will be possible to run diagnostics
on these nodes to determine what repairs are called for.
Four power supplies were replaced on April 26 from the spare parts
pool. This did not cause any downtime, since each frame has three
power supplies, one of which is redundant.
One of the twenty SP switches had a bad supervisor card. It too was
replaced, on May 16, with a switch from the spare parts pool.
These hardware failures were discovered during the software installation
and configuration of the control workstation. During this process, one
gains increasing control over, and information about, the system as a whole.
First job
On July 31, LoadLeveler was started on about 147 nodes and they
started taking jobs. The first project was a novel protein folding
algorithm invented by Adrian Roitberg and implemented with Robert
Abel, an undergraduate working at QTP. The work consisted of 512 jobs,
each taking about 6 days.
While Xena II was working on the 512 jobs, nodes were repaired and
taken into service. On August 23 all 192 nodes were up and running and
the SP Switch was operating between them.
Building the global GPFS file system
Eight nodes are attached to 192 disks of 2.2 GB each using SSA
(Serial Storage Architecture), the precursor to the Fibre Channel
Arbitrated Loop architecture standard.
The disks are arranged in 3 racks with 4 drawers each. The eight nodes
are divided into four pairs and each pair is connected to a row of
three drawers, one in each rack.
Disks | Rack 1 | Rack 2 | Rack 3 |
Drawer 4 | 35.2 GB | 35.2 GB | 35.2 GB |
Drawer 3 | 35.2 GB | 35.2 GB | 35.2 GB |
Drawer 2 | 35.2 GB | 35.2 GB | 35.2 GB |
Drawer 1 | 35.2 GB | 35.2 GB | 35.2 GB |
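The per-drawer figure in the table follows from the disk counts given in the text. A minimal sketch of that arithmetic, using only numbers stated above (the variable names are illustrative):

```python
# Layout figures taken from the text; the names are illustrative only.
RACKS = 3
DRAWERS_PER_RACK = 4
DISKS_TOTAL = 192
DISK_SIZE_GB = 2.2

drawers = RACKS * DRAWERS_PER_RACK            # 12 drawers in total
disks_per_drawer = DISKS_TOTAL // drawers     # 16 disks in each drawer
drawer_capacity_gb = disks_per_drawer * DISK_SIZE_GB
raw_capacity_gb = DISKS_TOTAL * DISK_SIZE_GB

print(f"{drawers} drawers, {disks_per_drawer} disks each")
print(f"per drawer: {drawer_capacity_gb:.1f} GB, total raw: {raw_capacity_gb:.1f} GB")
```

This gives 35.2 GB per drawer and 422.4 GB of raw capacity across all 192 disks.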
Nodes | Frame 2 | Frame 4 | Frame 6 | Frame 8 |
Node 5 | xena0a | xena1a | xena2a | xena3a |
Node 3 | xena09 | xena19 | xena29 | xena39 |
Each node in the pair has two SSA adapters, called ssa0 and
ssa1, and each adapter has two ports, called A and
B, each of which can be connected to one loop. The set of three
drawers contains 48 = 3 x 16 = 4 x 12 disks and is divided into four
loops. Each loop contains 12 disks and one adapter/port of each
node. Every disk has a primary node, which under normal operation does
all data access to the disk, and a backup node, which takes over
data access to the disk if the primary fails. This way the
global data is always accessible even if one node fails.
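The primary/backup arrangement for one node pair can be sketched as a small lookup table. This is only an illustration, not the actual VSD configuration: it assumes that each loop's 12 disks are split evenly, six per node, and the disk identifiers `(loop, position)` are hypothetical.

```python
from itertools import product

NODES = ("xena3a", "xena39")     # one node pair
ADAPTERS = ("ssa0", "ssa1")
PORTS = ("A", "B")
DISKS_PER_LOOP = 12

# Four loops per node pair: one per (adapter, port) combination.
loops = [f"{adapter}/{port}" for adapter, port in product(ADAPTERS, PORTS)]

def build_disk_table():
    """Return {(loop, position): (primary, backup)} for the 48 disks of one pair.

    Assumption: the disks of each loop are split evenly, six per node.
    """
    table = {}
    for loop in loops:
        for position in range(DISKS_PER_LOOP):
            primary = NODES[0] if position < DISKS_PER_LOOP // 2 else NODES[1]
            backup = NODES[1] if primary == NODES[0] else NODES[0]
            table[(loop, position)] = (primary, backup)
    return table

def serving_node(disk_id, table, failed=frozenset()):
    """Pick the node that serves a disk, falling back to its backup node."""
    primary, backup = table[disk_id]
    if primary not in failed:
        return primary
    if backup not in failed:
        return backup
    raise RuntimeError(f"no server left for disk {disk_id}")

table = build_disk_table()
print(len(table))                                             # 48
print(serving_node(("ssa0/A", 0), table))                     # xena3a
print(serving_node(("ssa0/A", 0), table, failed={"xena3a"}))  # xena39
```

With one node marked as failed, every disk is still served by the other node of the pair, which is the property the text relies on.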
Each SSA port has two access points to the loop: A1 and
A2, or B1 and B2. The SSA protocol allows full
access and data transfer along both paths at the same time, for a total
of 2 reads and 2 writes at 20 MB/sec each.
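As a rough upper bound, the figures above can be multiplied out. The sketch assumes that all four transfers on a port run at the quoted 20 MB/sec simultaneously and ignores any limits in the adapter or the node itself, so the results are theoretical peaks only.

```python
LINK_RATE_MB_S = 20      # rate per transfer, as quoted in the text
READS = 2                # concurrent reads per port
WRITES = 2               # concurrent writes per port
ADAPTERS_PER_NODE = 2
PORTS_PER_ADAPTER = 2

# Assumption: all transfers proceed at the full link rate at the same time.
port_peak_mb_s = (READS + WRITES) * LINK_RATE_MB_S
node_peak_mb_s = port_peak_mb_s * ADAPTERS_PER_NODE * PORTS_PER_ADAPTER

print(port_peak_mb_s)    # 80
print(node_peak_mb_s)    # 320
```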
By choosing the disks whose primary node is xena3a to be those
closest to xena3a in the loop, we make sure that no data access
from xena3a will interfere with data access from xena39
to its own disks. This allows maximum performance from both nodes to all
their disks. When one node is down or the loop is broken, the other
node still has access to all disks, though possibly at reduced performance.
Disks are grouped together into building blocks called VSDs (virtual
shared disks).
All VSDs are combined into one global file system using
GPFS (General Parallel File System), which is accessed as /scr_2 on
every node in the system.
VSD with striping across adapters
Each VSD has one disk in each loop connected to a host, and data is
striped across all 4. This way each write to one VSD can proceed on
each adapter freely to get high performance.
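The effect of striping over four disks can be illustrated with a simple round-robin mapping. This is a generic sketch of striping, not the VSD implementation; the stripe unit size is not given in the text, so `STRIPE_UNIT` is a hypothetical value.

```python
from typing import Tuple

STRIPE_UNIT = 64 * 1024   # hypothetical stripe unit in bytes; not from the text
DISKS_PER_VSD = 4         # one disk in each of the four loops

def locate(offset: int) -> Tuple[int, int]:
    """Map a byte offset within a VSD to (disk index, byte offset on that disk)."""
    unit, within = divmod(offset, STRIPE_UNIT)
    disk = unit % DISKS_PER_VSD       # consecutive units rotate over the disks
    row = unit // DISKS_PER_VSD       # full rounds already placed on each disk
    return disk, row * STRIPE_UNIT + within

# A write of four consecutive stripe units touches each of the four disks once.
touched = [locate(i * STRIPE_UNIT)[0] for i in range(4)]
print(touched)   # [0, 1, 2, 3]
```

Because consecutive stripe units land on different loops, a large sequential write keeps all four adapter ports busy at once.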
The loops and VSDs for /scr_2 look as follows.
The labels next to the disks designate the global names of the disks.
We take one disk from each of the four loops a node is connected to,
put them together in a group of four and stripe them for optimal
performance. Such a group is called, e.g., d3c1h0na, which
stands for
drawer 3, connector 1, hopcount 0, node a. The
connector can be 1 or 2 in each of the four loops (two adapters and
two ports). The hopcount is the number of hops from the connector to
the disk. The node designates the primary node for the disk.
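The naming scheme is regular enough to decode mechanically. A minimal sketch of a parser follows; the function and class names are hypothetical, and it assumes the node field is the final character of the host name (9 or a), which is inferred from the example rather than stated.

```python
import re
from typing import NamedTuple

class StripedGroupName(NamedTuple):
    drawer: int
    connector: int
    hopcount: int
    node: str

# d<drawer>c<connector>h<hopcount>n<node>, e.g. "d3c1h0na".
# Assumption: the node field is the last character of the host name (9 or a).
_PATTERN = re.compile(r"d(\d+)c([12])h(\d+)n([9a])")

def parse_striped_name(name: str) -> StripedGroupName:
    """Split a group name such as 'd3c1h0na' into its four fields."""
    match = _PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid group name: {name!r}")
    drawer, connector, hopcount, node = match.groups()
    return StripedGroupName(int(drawer), int(connector), int(hopcount), node)

print(parse_striped_name("d3c1h0na"))
# StripedGroupName(drawer=3, connector=1, hopcount=0, node='a')
```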
VSD as string on an adapter
A second file system, /scr_3, is constructed a bit differently to compare
the performance.
Here each VSD is built out of the 6 disks in each loop closest to the
adapter.
Although the layout of /scr_2 optimizes writes for a single VSD,
the GPFS file system always balances the load across all VSDs. Thus
every write to one VSD will be accompanied by writes to other
VSDs. The layout of /scr_2 may then see contention on all loops,
whereas the layout of /scr_3 will not. It is not clear a priori which
will be better. Neither option may be better for every application;
one may be better for some read/write patterns. That is why we ran the
test.
This second file system turns out to be 10% faster.
The labels next to the disks designate the global names of the disks.
Such a group is called, e.g., d0a0pan9, which
stands for
drawer 0, adapter 0, port a, node 9. The
adapter can be 0 or 1 and the port can be a or b, making four loops (two
adapters and two ports). The node designates the primary node for the
disk.
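Since the adapter and port fields take two values each, the four names for a given drawer and primary node can be enumerated directly. The helper below is a hypothetical illustration of the naming pattern only; how drawer numbers map to the physical drawers is not stated in the text.

```python
from itertools import product
from typing import List

def loop_labels(drawer: int, node: str) -> List[str]:
    """Build the four names d<drawer>a<adapter>p<port>n<node>, one per loop."""
    return [
        f"d{drawer}a{adapter}p{port}n{node}"
        for adapter, port in product((0, 1), ("a", "b"))
    ]

print(loop_labels(0, "9"))
# ['d0a0pan9', 'd0a0pbn9', 'd0a1pan9', 'd0a1pbn9']
```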
VSD equal to one physical disk
The final form of the global scratch file system has one VSD per
physical disk. That way GPFS can optimize the performance of access to all
adapters and VSD servers. This form was implemented on Dec 18 2002
and resulted in a /scr_2 size of 410 GB. This last configuration
turns out to be another 25% faster.
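The two quoted improvements can be combined into one figure relative to the first layout. The sketch assumes that the 25% is measured relative to the second layout and that the gains compound multiplicatively; the text does not say how the measurements were made.

```python
# Relative throughput of the three layouts, first layout as the baseline.
striped_across_adapters = 1.00
string_on_adapter = striped_across_adapters * 1.10   # "10% faster"
one_vsd_per_disk = string_on_adapter * 1.25          # "another 25% faster"

print(f"{one_vsd_per_disk:.3f}")   # 1.375
```

Under these assumptions the final configuration is about 37.5% faster than the original striped layout.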
Configuration of Xena II
The XENA II system has 192 nodes, each with a 135 MHz POWER2SC CPU,
1 GB of RAM and 9 GB of disk space. All nodes are connected by a 150
MB/sec full duplex, redundant path SP switch. The system has 420
GB of global storage consisting of 192 2.2 GB disks on 16 SSA adapters,
made available to each node through the SP switch as a GPFS (General Parallel
File System) file system.
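The system-wide totals implied by these per-node figures can be worked out as follows. All inputs come from the text; the variable names are illustrative.

```python
NODES = 192
RAM_PER_NODE_GB = 1
LOCAL_DISK_PER_NODE_GB = 9
DISK_SERVER_NODES = 8        # nodes attached to the SSA disks
ADAPTERS_PER_SERVER = 2      # ssa0 and ssa1

total_ram_gb = NODES * RAM_PER_NODE_GB                  # aggregate memory
total_local_disk_gb = NODES * LOCAL_DISK_PER_NODE_GB    # aggregate local disk
ssa_adapters = DISK_SERVER_NODES * ADAPTERS_PER_SERVER  # matches the 16 quoted

print(total_ram_gb, total_local_disk_gb, ssa_adapters)  # 192 1728 16
```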
Applications
XENA II will be used mostly for large scale production runs and for
experimentation with applications that require large RAM, MPI
programming, parallel algorithms and parallel I/O.
Three programming styles are supported: the standard distributed memory
style with message passing (most of the MPI 2.0 standard), the IBM specific
low-latency style with LAPI, and the Cray T3E style with SHMEM.
The system is suited for naturally parallel problems that Beowulf
clusters can run, but it also supports problems that need fast access
to large global datasets and problems that require fast internode
communication, such as problems involving Fast Fourier Transforms.