MPI Link-Checker™ 2.0.0
Application Note
Copyright 2004 by Microway Inc.

Microway's MPI Link-Checker is a software product which exercises an MPI cluster and uses the data collected to find problems with the system it is running on. It can detect both subtle and gross problems with everything in the MPI path: the OS, BIOS, device drivers, MPI implementation, motherboard, PCI bus, NICs, cables and switches. It does this by statistically comparing the latency and bandwidth of all the nodes in the system.



Figure 1
MPI Link Checker Screen Shot of 22 Node Cluster: Opteron-InfiniBand


This release can be used in a number of different modes, including real time and off line. While it can detect a whole host of issues, it does not pinpoint the precise cause of a problem, only the nodes which have problems and the impact of the problem on latency and bandwidth. Often, problems like bad cables do not prevent a link from functioning, just from functioning at full speed. In a large cluster running a parallel application which has serial dependencies, a single bad cable can bring the entire cluster to its knees.

The technique we originally used at Microway to validate MPI clusters ran a suite of MPI applications and checked the accuracy of the results. This technique did not catch the low level problems that MPI Link-Checker does. The real problem with simply running benchmarks to validate a cluster's performance is that performance is sensitive to software components which change frequently. In addition, many parallel benchmarks are not sensitive to the individual hardware components in a system, just to the worst node. A single bad cable does not prevent a cluster from running, but it can dramatically reduce the overall speed of the cluster. So, when benchmarks underperform, you cannot tell whether you have a bad node, a system wide problem or a poorly implemented algorithm!

There are times when bad cables or NICs can introduce data integrity problems. This rare case does get picked up by running a benchmark like the NAS suite. MPI Link-Checker also detects this rare case and points it out by flashing a yellow background on the bad node.

For most MPI clusters, the things that MPI Link-Checker uncovers can be very revealing about the operation of specific nodes. It can often find problems that are very subtle or that would have been impossible to find without a tool which combines data collection techniques with a sophisticated visual front end. A good example of a subtle problem can be seen in the latency and bandwidth plots in Figure 1, which exhibit a pair of problems discussed below. Understanding these issues makes it possible to adapt algorithms to the hardware and MPI implementation being used.

The software can be run in a real time mode, displaying cluster data as it is collected, or in an off line mode, collecting data for days or weeks and presenting that data to the graphics front end on an as-needed basis. When left in data collection mode, it can detect hard to find intermittent problems: the bad results get collected along with the good, but only show up when the statistics are set to find the worst results. The data collected in off line mode can quickly add up to hundreds of megabytes, and takes longer to analyze. The analysis front end statistically examines the data and presents it in a multi-dimensional format, making it possible to examine the information in a rational manner. Often, a problem will show up only for specific packet sizes and can affect either latency or bandwidth, but not necessarily both. It can also appear in reception, in transmission or in both. Problems can also fade in and out, for example with changes in operating temperature or when people disturb cables that are bad or not properly connected.

Figure 1 is a screen shot of a 22 node Opteron InfiniBand cluster. The two pinkish-red matrices which dominate the figure are referred to as grids. They contain snapshots of the latency and bandwidth data collected for particular packet sizes, summarized with the statistic chosen by the user. The plots below display the same data as a function of packet size, for an individual node. Moving the cursor across the screen changes the node being plotted. Examining the last, best, average and worst cases makes it possible to pinpoint things like intermittent problems. To change the packet sizes being displayed on the grids, simply move the vertical line in either plot with the mouse and then double click. The left and right hand side packet sizes are not linked to each other.

The algorithms used to collect the information are not identical, and as a result the information displayed in the left and right hand grids and plots is not 100% correlated. By latency, we mean the time required for a zero byte payload message to go back and forth between two nodes; this is the so called "ping pong" test. The left side can display either the latency or the packet transfer time. When we measure bandwidth, we use an algorithm which floods the system with many packets at the same time, in an attempt to saturate the fabric. We had to "back off" on the bandwidth paradigm we initially used to avoid breaking one particular MPI release, which demonstrates that the product can be used to partially verify MPI operation. It also means that the product in its current form does not present the absolute highest bandwidth achievable by a cluster.

The dark red cross that appears on both of the grids in Figure 1 indicates that a problem exists in this cluster in both latency and bandwidth for node 8. The left hand cross tells us explicitly that for a packet size of 16 Kbytes, node 8 has a problem with packet transfer times. For all of the good nodes outside of the cross, the average transmission time is 44 microseconds. For node 8, these times grow to around 90 microseconds. Looking up the column of the cross, we can read off the transmission times between node 8 and the other nodes in the system. These times vary between 91 and 93 microseconds. Looking across the row that forms the cross, we can read off the receive times, which vary from 90 to 93 microseconds. Similarly, looking at the right hand grid, we discover that for a packet size of 8 Kbytes, node 8 only achieves bandwidths of 190 MB/sec, while the other nodes in the cluster typically average 430 MB/sec! The packet size used for all the nodes in the grid can be read off the heading at the top of each grid, along with the time or bandwidth units and the statistical method employed, which in this case is average. The background colors chosen for the grid are summarized in the legend beneath each grid.
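
The idea of spotting a cross can be illustrated with a short Python sketch. This is not the product's algorithm, which the note does not describe; it simply compares each node's column (transmit) and row (receive) against the grid-wide median. The function names, the 1.5x threshold and the 6 node test grid are assumptions made for illustration, with 44 and 92 microseconds borrowed from the figures quoted above.

```python
from statistics import median

def make_grid(n, good, bad_node, bad):
    """Build a hypothetical n x n transfer-time grid with one slow node."""
    return [[bad if bad_node in (r, c) else good for c in range(n)]
            for r in range(n)]

def find_slow_nodes(grid, threshold=1.5):
    """Return nodes whose median send or receive time exceeds
    threshold times the median of all off-diagonal cells.

    grid[rx][tx] holds the time for column tx transmitting to row rx.
    The diagonal (same-node transfers) is ignored.
    """
    n = len(grid)
    baseline = median(grid[r][c] for r in range(n) for c in range(n) if r != c)
    slow = []
    for node in range(n):
        send = [grid[r][node] for r in range(n) if r != node]  # column of the cross
        recv = [grid[node][c] for c in range(n) if c != node]  # row of the cross
        if max(median(send), median(recv)) > threshold * baseline:
            slow.append(node)
    return slow

# Assumed example: 6 nodes, healthy links at 44 us, node 2 at 92 us.
print(find_slow_nodes(make_grid(6, 44.0, 2, 92.0)))
```

Medians are used so that the single slow partner every healthy node has does not pull that healthy node over the threshold.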



Figure 2


The values along the diagonals of the grids are the transmission times within each processor or node, which are a strong function of how the particular MPI implements its shared memory paradigm. The master node, ma, is shown in the upper left corner. For it, we show the transmission times between two processes on one CPU and the times between two processes on the same motherboard, but on different CPUs. Whether or not this information appears is a function of how many processors we tell MPI it has to work with. If we select more, the diagonal will expand, but the performance will go down. Since our goal is to explore problems with the hardware and not the MPI implementation, we recommend not over populating the processors with MPI tasks. Looking at the legends at the bottom of the grids, we can quickly determine that the transmission time within the CPU for a 16K packet is 1.8 microseconds, while that between CPUs on the same motherboard is 18 microseconds. Similarly, the bandwidth for 8K packets between processes on a single CPU is 5.6 GB/sec, and between processors on the same node it goes down to 1.1 GB/sec. Using this resource of MPI Link-Checker makes it easy to compare the MPI performance of four and eight way processor systems to singles and duals.

Both of the plots below the grids are for node 17 transmitting to node 15; the two plots always refer to the same connection. This can be seen by examining the location of the purple boxes in the lower right quadrant of both grids. The X axis of the grids specifies the transmitter, while the Y axis specifies the receiver. Looking at the transfer time plot, we discover that a zero byte packet takes 6 microseconds, which is the latency of this InfiniBand combination. We also see a kink at 2 Kbytes, which is caused by the transition from Eager to Rendezvous protocols within MPI. Looking at the right hand side, we can see the impact of this jump in latency, along with a drop off above 256 Kbytes. The latter is an artifact of the InfiniBand driver being used.

Figure 2 shows the same cluster running with smaller packet sizes (256 bytes on the latency side and 512 bytes on the bandwidth side). Note that the crosses are much less visible, as the difference in both latency and bandwidth is now only about 15%. Note also that instead of looking at average times, we are now looking at the best times. Being able to vary the statistics used to look for problems turns out to be a very important feature of the product. When we use the tool on a cluster with intermittent faults, the problem we are looking for might occur only 5% of the time, or possibly only after a certain temperature is reached. What makes it possible to find these kinds of problems is the fact that we can look for a worst case value that is hidden amongst rather good average values. When the worst case value crops up on just a single node, or a small group of nodes, that tells us we have found an intermittent problem.
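
To make the role of the statistics concrete, here is a minimal Python sketch of how last, best, average and worst values could be derived from a series of samples, and how a worst case that stands out on one node hints at an intermittent fault. The function names, the 3x ratio and the sample values are hypothetical; the note does not say how the product computes or compares these statistics.

```python
from statistics import median

def summarize(samples, higher_is_better=False):
    """Reduce one link's time-ordered samples to last, best, average, worst.

    For latency, lower is better; pass higher_is_better=True for bandwidth.
    """
    best, worst = (max, min) if higher_is_better else (min, max)
    return {
        "last": samples[-1],
        "best": best(samples),
        "average": sum(samples) / len(samples),
        "worst": worst(samples),
    }

def intermittent_nodes(samples_by_node, ratio=3.0):
    """Flag nodes whose worst latency is far above the typical worst case."""
    worst = {node: summarize(s)["worst"] for node, s in samples_by_node.items()}
    baseline = median(worst.values())
    return sorted(node for node, w in worst.items() if w > ratio * baseline)

# Assumed latency samples in microseconds: node 8 is slow about 5% of the time.
samples = {node: [6.0] * 100 for node in (5, 6, 7, 9)}
samples[8] = [6.0] * 95 + [60.0] * 5

print(summarize(samples[8]))
print(intermittent_nodes(samples))
```

In this made-up data the best value for node 8 is identical to that of its neighbours, and only the worst case statistic separates it clearly from the rest.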

Advanced Features

In addition to the latency and bandwidth plots and off line data collection features, Version 2.0.0 offers a number of features which can be used to speed up data analysis. To reduce display times, we have made it possible to reduce the number of nodes being presented. Rather than looking at all the data, we make it possible to divide the cluster up into groups of nodes and then examine the data for each group against the others. This also makes it possible to look at the cluster in a way that reflects its topology. If we create the groups taking advantage of this knowledge, the grids that result have uniform patterns. The new release contains a GUI which makes it possible to define groups of nodes and then use the stored definitions to drive the front end. Even if you are willing to wait for the system to analyze a dense grid, you will still have another problem: the tool does not display the individual cell values when the cluster being analyzed has more than 40 nodes. While crosses still show up, the details would no longer be legible, so they are not written to the screen. To solve this problem we added a drill down feature, which makes it possible to zoom into the grid until the information becomes legible.
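
One way to picture the grouping idea is to collapse a node-by-node grid into a much smaller group-by-group grid. The Python sketch below does this under assumptions of our own: the group names, the choice of the worst (maximum) time as the summary for each block, and the 4 node example are all hypothetical, and the note does not state how the front end summarizes a group.

```python
def group_grid(grid, groups, reduce=max):
    """Collapse a node grid into a group grid.

    grid[rx][tx] is the transfer time for node tx sending to node rx.
    groups maps a group name to the list of node indices it contains.
    Each (receiver group, transmitter group) block is summarized with
    `reduce`; same-node cells on the diagonal are skipped.
    """
    out = {}
    for rx_name, rx_nodes in groups.items():
        for tx_name, tx_nodes in groups.items():
            cells = [grid[r][c] for r in rx_nodes for c in tx_nodes if r != c]
            out[(rx_name, tx_name)] = reduce(cells) if cells else None
    return out

# Assumed 4 node grid in microseconds; node 3 is slow in both directions.
grid = [
    [44.0, 44.0, 44.0, 92.0],
    [44.0, 44.0, 44.0, 92.0],
    [44.0, 44.0, 44.0, 92.0],
    [92.0, 92.0, 92.0, 92.0],
]
# Hypothetical groups, for example the nodes attached to each of two switches.
groups = {"switch_a": [0, 1], "switch_b": [2, 3]}

for block, value in group_grid(grid, groups).items():
    print(block, value)
```

Summarizing with the worst value keeps a bad node visible after grouping; an average would dilute it as the groups grow.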

MPI Link-Checker can also be used to measure the transfer times between processes running on the same node and the same processor! In Figure 2, we arranged the tasks so that the master node had extra processes running on it. The data we collected for the master node appears in the upper left hand part of the grid. From this we can deduce that the transfer time for 256-byte messages between MPI processes running on the same processor is 250 ns, while the transfer time for processes on the same motherboard is 840 ns. The bandwidths are also displayed and peak at 2.6 GB/sec. These results are very dependent on the MPI implementation being used.

Future Features - Very Large Clusters

The next version, Version 3.0.0, will include several features now under development for very large clusters. These will simplify searching for bad nodes and will also make it possible to display the information on screen much faster. The amount of information our data collection process gathers is proportional to 2*N^2*P, where N is the number of nodes in the cluster and P is the number of packet sizes probed (typically 85). For a 40x40 cluster, this requires us to run 272,000 fast executing probe routines. For a cluster that is 1,000 x 1,000, this jumps to 170 million probes. And, it turns out that some of these probes take a while to converge on accurate results, so the data collection time can really add up!

However, is it really necessary to collect the information between every node in a cluster and every other node? We think not. Once the local structures of a large cluster are understood and tested, it becomes possible to examine the way that the nodes in different regions interact with each other, without checking all of the possible interactions. For this test to be useful, all of the intermediate switches and cables in the cluster have to be tested, which is not that hard to arrange after the local switches and cables have been scanned. We will shortly release a feature which uses our grouping facility to reduce the time it takes to define tests which do a complete search of a cluster connection space. Once these tests have been defined, cluster verification is reduced to running specific tests at periodic intervals, and then either comparing the results with historic results or looking for "crosses" in grids. The future features also include an automatic search facility which finds system problems, so that System Administrators need not examine the results of overnight tests every morning.
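
The probe counts quoted above follow directly from the 2*N^2*P relationship, as this small Python check shows. The function name is ours, and the default of 85 packet sizes is the typical value given in the text.

```python
def probe_count(nodes, packet_sizes=85):
    """Number of probes for a full scan: 2 * N^2 * P."""
    return 2 * nodes ** 2 * packet_sizes

for n in (40, 1000):
    print(f"{n} nodes: {probe_count(n):,} probes")
# prints:
# 40 nodes: 272,000 probes
# 1000 nodes: 170,000,000 probes
```

Because the count grows with the square of the node count, going from 40 to 1,000 nodes (25 times as many) multiplies the number of probes by 625.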
