MPI Link-Checker 2.0.0 Application Note
Copyright 2004 by Microway Inc.
Microway's MPI Link-Checker is a software product which exercises an MPI cluster and uses the data collected to
find problems with the system it is running on. It can detect both subtle and gross problems with everything in
the MPI path: the OS, BIOS, device drivers, MPI implementation, motherboard, PCI bus, NICs, cables and switches.
It does this by statistically comparing the latency and bandwidth of all the nodes in the system.

Figure 1
MPI Link Checker Screen Shot of 22 Node Cluster: Opteron-InfiniBand
This release can be used in a number of different modes, including real time and off line. While it can detect a
whole host of issues, it does not pinpoint the precise cause of the problem, just the nodes which have problems
and the impact of the problem on latency and bandwidth. Often, problems like bad cables do not prevent a link from
functioning, just from functioning at full speed. In a large cluster running a parallel application which has
serial dependencies, a single bad cable can bring the entire cluster to its knees. The technique we originally
used at Microway to validate MPI clusters ran a suite of MPI applications and checked the accuracy of the results.
This technique did not catch the low level problems that MPI Link-Checker does. The real problem with simply
running benchmarks to validate a cluster's performance is that performance is sensitive to the software components
which frequently change. In addition, many parallel benchmarks are not sensitive to the individual hardware
components in a system, just the worst node. A single bad cable does not prevent a cluster from running, but it
can dramatically impact the overall speed of the cluster. So, when benchmarks underperform, you cannot tell
whether you have a bad node, a system-wide problem or a poorly implemented algorithm! There are times
when bad cables or NICs can introduce data integrity problems. This rare case does get picked up running a
benchmark like the NAS suite. MPI Link-Checker also detects this rare case and points it out by flashing a yellow
background on the bad node. For most MPI clusters, the things that MPI Link-Checker uncovers can be very
revealing about the operation of specific nodes. It can often find problems that are very subtle or that would
have been impossible to find without a tool which combines data collection techniques with a sophisticated visual
front end. A good example of a subtle problem can be seen in the latency and bandwidth plots in Figure 1, which
exhibit a pair of problems discussed below. Understanding these issues makes it possible to adapt algorithms to
the hardware and MPI implementation being used.
The software can be run in a real time mode, displaying cluster data as it is collected, or in an off line mode
collecting data for days or weeks and presenting that data to the graphics front end on an as-needed basis. When
left in data collection mode, it can detect hard to find intermittent problems, as the bad results get collected
along with the good, but only show up when the statistics are set to find the worst results. The data collected
in off line mode can quickly add up to hundreds of megabytes, and takes longer to analyze. The analysis front end
statistically examines the data and presents it in a multi-dimensional format, making it possible to examine the
information in a rational manner. Often, a problem will show up for specific packet sizes and can affect either
latency or bandwidth, but not necessarily both. It can also appear in reception or transmission or both.
Problems can also fade in and out with things like operating temperature, or when people disturb cables
that are bad or not properly connected.
Figure 1 is a screen shot of a 22 node Opteron InfiniBand cluster. The two pinkish-red matrices which dominate
the figure are referred to as grids. They contain snapshots of the latency and bandwidth data collected for
particular packet sizes using the statistical summation technique chosen by the user. The plots below display the
same data as a function of packet size, for an individual node. Moving the cursor across the screen changes the
node being plotted. Examining the last, best, average and worst cases makes it possible to pinpoint things like
problems that are intermittent. To change the packet sizes being displayed on the grids, simply move the vertical
line in either plot with the mouse and then double click. The left and right hand side packet sizes are not
linked to each other. The algorithms used to collect the information are not identical, and as a result, the
information displayed in the left and right hand grids and plots is not 100% correlated. By latency, we mean
the time required for a zero byte payload message to go back and forth between two nodes; this is the so-called
"ping pong" test. The left side can display either the latency or the packet transfer time. When we
measure bandwidth, we use an algorithm which floods the system with many packets at the same time, in an attempt
to saturate the fabric. We had to "back off" on the bandwidth paradigm we initially used to avoid
breaking one particular MPI release, demonstrating that the product can be used to partially verify MPI
operation. What that also means is that the product in its current form does not present the absolute highest
bandwidth achievable by a cluster.
The dark red cross that appears on both of the grids in Figure 1 indicates that a problem exists in this cluster
in both latency and bandwidth for node 8. The left hand cross tells us explicitly that for a packet size of 16
Kbytes, node 8 has a problem with packet transfer times. For all of the good nodes outside of the cross, the
average transmission time is 44 microseconds. For node 8, these times grow to around 90 microseconds. Looking up
the column of the cross, we can read off the transmission times between node 8 and the other nodes in the system.
These times vary between 91 and 93 microseconds. Looking across the row that forms the cross, we can read off the
receive times, which vary from 90 to 93 microseconds. Similarly, looking at the right hand grid, we discover that for
packet sizes of 8 Kbytes, node 8 only achieves bandwidths of 190 MB/sec, while the other nodes in the cluster typically
average 430 MB/sec! The packet size used for all the nodes in the grid can be read off the heading at the top
of the grids, along with the time or bandwidth units and the statistical method employed, which in this case is
average. The background colors chosen for the grid are summarized in the legend beneath each grid.
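
The way a "cross" singles out a bad node can be illustrated with a short sketch. This is not the product's algorithm, only a minimal illustration under stated assumptions: the grid is a square list of transfer times in microseconds indexed as grid[receiver][transmitter], the diagonal is ignored, and a node is flagged when the median of both its row and its column exceeds the cluster-wide median by a hypothetical factor (ratio). The function names and the example grid, which is built from the 44 and 92 microsecond figures quoted above, are illustrative only.

```python
from statistics import median


def find_cross_nodes(grid, ratio=1.5):
    """Return indices of nodes whose row AND column both stand out.

    grid[rx][tx] is a transfer time in microseconds for transmitter tx
    and receiver rx. Diagonal (intra-node) entries are ignored.
    The ratio threshold is an assumption, not a product setting.
    """
    n = len(grid)
    off_diag = [grid[r][c] for r in range(n) for c in range(n) if r != c]
    baseline = median(off_diag)
    flagged = []
    for k in range(n):
        row = [grid[k][c] for c in range(n) if c != k]  # node k receiving
        col = [grid[r][k] for r in range(n) if r != k]  # node k transmitting
        if median(row) > ratio * baseline and median(col) > ratio * baseline:
            flagged.append(k)
    return flagged


def make_example_grid(n=22, bad=8, good_us=44.0, bad_us=92.0):
    """Build a synthetic grid in which one node is slow in both directions."""
    grid = [[good_us] * n for _ in range(n)]
    for i in range(n):
        grid[i][bad] = bad_us  # everything transmitted by the bad node
        grid[bad][i] = bad_us  # everything received by the bad node
    return grid


if __name__ == "__main__":
    print(find_cross_nodes(make_example_grid()))
```

Using medians rather than means keeps a single slow link from dragging a healthy node's row or column over the threshold, which is why only the node at the center of the cross is reported.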

Figure 2
The values along the diagonals of the grids are the transmission times within each processor or node, which are a
strong function of how the particular MPI implementation handles shared memory. The master node, ma, is shown
in the upper left corner. For it, we show the transmission times between two processes on one CPU and the times
between two processes on the same motherboard but on different CPUs. Whether or not this information appears is
a function of how many processors we tell MPI it has to work with. If we select more, the diagonal will expand,
but the performance will go down. Since our goal is to explore problems with the hardware and not the MPI
implementation, we recommend not overpopulating the processors with MPI tasks. Looking at the legends at the
bottom of the grids, we can quickly determine that the transmission time within the CPU for a 16K packet is 1.8
microseconds while that between CPUs on the same motherboard is 18 microseconds. Similarly, the bandwidth for 8K
packets between processes on a single CPU is 5.6 GB/sec and between processors on the same node it goes down to
1.1 GB/sec. This feature of MPI Link-Checker makes it easy to compare the MPI performance of four and
eight way processor systems to singles and duals. Both of the plots below the grids are for node 17 transmitting
to node 15. This can be seen by examining the location of the purple boxes in the lower right quadrant of both
grids. The X axis of the grids specifies the transmitter, while the Y axis specifies the receiver. Looking at the
transfer time plot, we discover that a zero byte packet takes 6 microseconds, which is the latency of this
InfiniBand combination. We also see a kink at 2 Kbytes, which is caused by the transition from Eager to
Rendezvous protocols within MPI. Looking at the right hand side, we can see the impact of this jump in latency
along with a drop off above 256 Kbytes. The latter is an artifact of the InfiniBand driver being used. The two
plots always refer to the same connection.
Figure 2 shows the same cluster running with smaller packet sizes (256 bytes on the latency side and 512 bytes on
the bandwidth side). Note that the crosses are much less visible as the difference in both latency and bandwidth
is now only about 15%. Note also that instead of looking at average times, we are now looking at the best times.
Being able to vary the statistics being used to look for problems turns out to be a very important feature of the
product. When using the tool on a cluster with intermittent faults, the problem we are
looking for might occur only 5% of the time, or possibly only after a certain temperature is reached. What makes it
possible to find these kinds of problems is the fact that we can look for a worst case value that is hidden
among rather good average values. When the worst case value crops up on just a single node, or a small group of
nodes, that tells us we have found an intermittent problem.
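
The role of the four statistics can be sketched in a few lines. This is a hedged illustration, not the product's code: it assumes each link's history is a list of transfer times in microseconds (lower is better, so "best" is the minimum), and it flags a link as intermittent when its worst sample exceeds its own average by a hypothetical factor (ratio). For bandwidth data the sense of best and worst would be reversed.

```python
def summarize(samples):
    """Reduce repeated transfer-time measurements of one link to four statistics."""
    return {
        "last": samples[-1],
        "best": min(samples),     # shortest transfer time
        "average": sum(samples) / len(samples),
        "worst": max(samples),    # longest transfer time
    }


def intermittent_links(history, ratio=1.5):
    """Return links whose worst case is hidden behind a good average.

    history maps (tx, rx) -> list of transfer times in microseconds.
    The ratio threshold is an assumption chosen for illustration.
    """
    flagged = []
    for link, samples in history.items():
        stats = summarize(samples)
        if stats["worst"] > ratio * stats["average"]:
            flagged.append(link)
    return flagged


if __name__ == "__main__":
    history = {
        (8, 3): [44.0] * 19 + [90.0],  # slow 5% of the time
        (1, 2): [44.0] * 20,           # consistently good
    }
    print(intermittent_links(history))
```

In the synthetic history, the link that is slow in one sample out of twenty still averages about 46 microseconds, so it looks healthy on an average grid and only stands out when the worst case is examined.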
Advanced Features
In addition to the latency and bandwidth plots and off line data collection features, Version 2.0.0 offers a
number of features which can be used to speed up data analysis. To reduce display times, we have made it possible
to reduce the number of nodes being presented. Rather than looking at all the data, we make it possible to divide
the cluster up into groups of nodes and then examine the data for each group against the others. This also makes
it possible to look at the cluster in a meaningful manner, using knowledge of the cluster topology. If we
create the groups taking advantage of this knowledge, the grids that result have uniform patterns. The new release contains a GUI which
makes it possible to define groups of nodes and then use stored definitions to drive the front end. Even if you
are willing to wait for the system to analyze a dense grid, you will still have another problem:
the tool does not display the individual values when the cluster being analyzed has more than 40 nodes. While crosses still
show up, the details are no longer legible and are not written to the screen. To solve the problem we added a
drill down feature, which makes it possible to zoom into the grid until the information becomes legible.
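
Examining one group against another amounts to cutting a rectangular block out of the full grid. The sketch below is only an illustration of that idea, not the product's grouping GUI or file format: the group names, the node indices and the sub_grid helper are all hypothetical, and the grid is assumed to be indexed as grid[receiver][transmitter].

```python
def sub_grid(grid, rx_nodes, tx_nodes):
    """Extract the block of the grid for one receiver group and one transmitter group."""
    return [[grid[r][c] for c in tx_nodes] for r in rx_nodes]


if __name__ == "__main__":
    # Hypothetical grouping by switch; the names and membership are made up.
    groups = {"switch_a": [0, 1], "switch_b": [2, 3]}
    # Synthetic 4x4 grid whose cell value encodes its (row, column) position.
    grid = [[r * 10 + c for c in range(4)] for r in range(4)]
    print(sub_grid(grid, groups["switch_a"], groups["switch_b"]))
```

When the groups follow the physical topology, for example one group per switch, every cell in a block crosses the same set of switches and cables, which is one reason the resulting sub-grids tend to look uniform.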
MPI Link-Checker can also be used to measure the transfer times between processes running on the same node and
the same processor! In Figure 2, we arranged the tasks so that the master node had extra processes running on
it. The data we collected for the master node appears in the upper left hand part of the grid. From this we can
deduce that the transfer time for 256-byte messages between MPI processes running on the same processor is 250 ns,
while the transfer time for processes on the same motherboard is 840 ns. The bandwidths are also displayed and peak
at 2.6 GB/sec. These results are very dependent on the MPI implementation being used.
Future Features - Very Large Clusters
The next version, Version 3.0.0, will include several features now under development for very large clusters.
These will simplify searching for bad nodes and will also make it possible to display the information on screens
much faster. The data collection process we use collects an amount of information proportional to 2*N^2*P, where N
is the number of nodes in the cluster and P is the number of packet sizes probed (typically 85). For a 40x40 cluster, this
requires us to run 272,000 fast executing probe routines. For a cluster that is 1,000 x 1,000, this jumps to 170
million probes. And, it turns out that some of these probes take a while to converge on accurate results, so the
data collection time can really add up! However, is it really necessary to collect the information between every
node in a cluster and every other node? We think not. Once the local structures of a large cluster are understood
and tested, it becomes possible to examine the way that the nodes in different regions interact with each other,
without checking all of the possible interactions. For this test to be useful, all of the intermediate switches
and cables in the cluster have to be tested, which is not that hard to arrange after the local switches and
cables have been scanned. A feature that will be released shortly uses our grouping facility to
reduce the time it takes to define tests which do a complete search of a cluster connection space. Once these
tests have been defined, cluster verification is reduced to running specific tests at periodic intervals, and
then either comparing the results with historic results or looking for "crosses" in grids. The
future features also include an automatic search facility which finds system problems, so that System Administrators need
not examine the results of overnight tests every morning.
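
The probe counts quoted above follow from a count of 2 times N squared times P, with N nodes and P packet sizes. As a quick check of the arithmetic, here is a minimal sketch; the function name is illustrative, and the default of 85 packet sizes is the typical value given in the text.

```python
def probe_count(nodes, packet_sizes=85):
    """Number of probes for a full scan: 2 * N^2 * P."""
    return 2 * nodes ** 2 * packet_sizes


if __name__ == "__main__":
    print(probe_count(40))    # 40 x 40 cluster
    print(probe_count(1000))  # 1,000 x 1,000 cluster
```

Because the count grows with the square of the cluster size, a cluster 25 times larger needs 625 times as many probes, which is the motivation for testing local structures first and then sampling the interactions between regions.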