sales@microway.com
Phone: 508-746-7341
Fax: 508-746-4678
Microway (R) Technology you can count on.

21st Century Solutions

How good is your MPI?

The answer may surprise you...


The performance of an MPI application is often limited by the speed of its slowest node! Microway has developed MPI diagnostic tools to help you find the weak link in your cluster.


MMDS: Microway MPI Diagnostic Suite
Microway's MPI Diagnostic Suite™ consists of two applications: MPI Link-Checker™ and MPI Fast-Check™. The two applications perform complementary functions. MPI Fast-Check™ performs a fast one-time test of all nodes in your cluster and reports any nodes that are clearly underperforming. MPI Link-Checker™ performs a much more extensive set of latency, bandwidth, and data integrity tests between all pairs of nodes.

MPI Link-Checker™
MPI Link-Checker™ measures the bandwidth and latency between all nodes in a cluster, then summarizes the data graphically so you can easily spot problem nodes. The MPI Link-Checker™ tool detects issues with processor caching, motherboards, PCI buses, BIOSes, riser cards and PCI interconnects. It can even detect intermittently failing cables and crossbar switches!

MPI Link-Checker™ starts an MPI application on each node and uses it to collect data. It then displays a pair of screen plots that show the latency and bandwidth between all pairs of nodes in the system. Several features make it easy to discover useful data about a cluster. For example, if a node has a problem with its PCI bus that affects its performance, the problem might show up only as a small reduction in bandwidth or a small increase in latency. Problems of this kind are very difficult to spot without such a tool.

MPI Link-Checker™ makes it possible to identify an under-performing node by using different statistical methods to accumulate data. The problem node shows up as a large dark cross on the image. If there is more than one such node, a number of crosses will appear. The tool picks up both gross problems, such as communications links that are not working or that are experiencing data errors, and subtle problems, such as links that appear to be working but are not working reliably. The tool does not tell you precisely what is causing a problem; it tells you that a problem exists and where it is.
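
To illustrate why a single bad node appears as a cross, here is a minimal Python sketch that flags nodes from a matrix of pairwise latencies. It is not Microway's method: the function name `flag_slow_nodes`, the `tolerance` factor, and the sample data are all hypothetical, and the median comparison is only one of many statistics that could be used. A slow node degrades every link it takes part in, so its whole row and column of the matrix stand out.

```python
from statistics import median


def flag_slow_nodes(latency_us, tolerance=1.5):
    """Return indices of nodes whose typical link latency is unusually high.

    latency_us[i][j] is the measured latency (microseconds) from node i to
    node j. Diagonal entries are ignored. The tolerance factor is an
    assumption chosen for illustration only.
    """
    n = len(latency_us)
    per_node = []
    for i in range(n):
        # Collect every link that touches node i: its row and its column.
        links = [latency_us[i][j] for j in range(n) if j != i]
        links += [latency_us[j][i] for j in range(n) if j != i]
        per_node.append(median(links))
    # Compare each node's typical latency with the cluster-wide typical value.
    baseline = median(per_node)
    return [i for i, value in enumerate(per_node) if value > tolerance * baseline]


# Hypothetical 5-node cluster: every link takes 2 us except links touching node 3.
matrix = [[0.0 if i == j else (6.0 if 3 in (i, j) else 2.0) for j in range(5)]
          for i in range(5)]
print(flag_slow_nodes(matrix))  # prints [3]
```

Using the median rather than the mean keeps the healthy nodes from being flagged just because one of their links leads to the slow node.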

In Microway's production environment, MPI Link-Checker™ finds issues in about 50% of clusters that have already been burned in and validated. Many of these issues are simple ones that do not show up when you simply run MPI validation suites. For example, all that it takes to slow down a PCI bus is for a BIOS to set up a particular slot incorrectly, or for a manufacturer to ship a riser that works at 66 MHz but not at 133 MHz. These minor mishaps occur all the time and are very difficult to spot. Often all it takes to slow down an entire cluster running an MPI "fork and join" style application is for a single node to run slower than the rest. This is also the reason why veteran cluster users don't add new nodes to old clusters: the resulting cluster runs only as fast as its slowest nodes.
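
The "slowest node" effect can be shown with a few lines of arithmetic. The Python sketch below is a simplified model, not a measurement: it assumes each fork-and-join step ends only when every node has finished, and the node counts and timings are hypothetical.

```python
def fork_join_time(per_node_seconds):
    """Time for one fork-and-join step: all nodes wait for the slowest one."""
    return max(per_node_seconds)


def efficiency(per_node_seconds):
    """Fraction of the total node time spent working rather than waiting."""
    step = fork_join_time(per_node_seconds)
    return sum(per_node_seconds) / (len(per_node_seconds) * step)


healthy = [10.0] * 16             # hypothetical: 16 identical nodes
one_slow = [10.0] * 15 + [13.0]   # hypothetical: one node is 30% slower

print(fork_join_time(healthy))    # prints 10.0
print(fork_join_time(one_slow))   # prints 13.0
print(round(efficiency(one_slow), 3))
```

In this model a single node that is 30% slower makes every step 30% longer for the whole cluster, while the other fifteen nodes sit idle for the difference.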

MPI Link-Checker™ can also be used to gather statistics about your cluster that will help you design algorithms that run efficiently on it. Self-connected regions of a cluster show up on the screen in a uniform color. Regions that are separated from other regions by connectivity barriers are also easy to spot. These barriers arise because, in many clusters, nearby nodes are connected by one or two hops through the switch. As a rule, the latency between two nodes increases with the number of hops through which they are connected. Regions connected by a single hop will show up lighter on the screen than regions connected by two hops.

The tool colors the screen with different colored boxes to represent the observed latencies and bandwidths. It can also be set up to display the observed numbers on the graphic, and to show detailed data for any connection on the screen below the graphic. MPI Link-Checker™ does this in real time. On large clusters, the current version takes a while to collect the n-squared pieces of information, and the representation loses detail. However, the tool makes it possible to drill down into any section of a cluster and recover the information for specific nodes or specific regions.
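
The idea of single-hop regions can be sketched in Python as a grouping problem. This is an illustration under stated assumptions, not a description of how MPI Link-Checker™ works internally: the function name `single_hop_regions`, the threshold `one_hop_max_us`, and the sample latencies are hypothetical, and the matrix is assumed to be symmetric so only one direction of each link is examined.

```python
def single_hop_regions(latency_us, one_hop_max_us):
    """Group nodes into regions linked by latencies of at most one_hop_max_us.

    latency_us[i][j] is the latency between nodes i and j in microseconds.
    Only pairs with i < j are read, which assumes a symmetric matrix.
    """
    n = len(latency_us)
    parent = list(range(n))  # parent[i] == i marks the representative of a region

    def find(i):
        # Follow parent links up to the region's representative.
        while parent[i] != i:
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if latency_us[i][j] <= one_hop_max_us:
                parent[find(j)] = find(i)  # merge the two regions

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical 6-node cluster: nodes 0-2 share one switch, nodes 3-5 another.
matrix = [[0.0 if i == j else (2.0 if i // 3 == j // 3 else 3.5) for j in range(6)]
          for i in range(6)]
print(single_hop_regions(matrix, one_hop_max_us=2.5))  # prints [[0, 1, 2], [3, 4, 5]]
```

An algorithm designer could use such a grouping to keep the most communication-heavy ranks inside one region.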

MPI Fast-Check™
MPI Fast-Check™ provides a simple text-based check of the cluster. When running on very large clusters, the standard Link-Checker may take minutes per pass. Fast-Check tests fewer connections in total yet still tests every node. In as little as 5 to 10 seconds, you can get bandwidth and latency information as well as alerts about any problematic nodes.
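
To see how a test can cover every node with far fewer connections, compare an all-pairs schedule with a simple ring schedule. The ring is an assumption chosen for illustration; the page does not say which pairing scheme MPI Fast-Check™ actually uses, and the function names below are hypothetical.

```python
from itertools import combinations


def all_pairs(n):
    """Every unordered pair of nodes: n * (n - 1) / 2 connections."""
    return list(combinations(range(n), 2))


def ring_pairs(n):
    """Each node is paired with its next neighbour: n connections for n >= 2."""
    return [(i, (i + 1) % n) for i in range(n)]


n = 64  # hypothetical cluster size
print(len(all_pairs(n)))   # prints 2016
print(len(ring_pairs(n)))  # prints 64

# Every node still takes part in at least one test.
tested = {node for pair in ring_pairs(n) for node in pair}
print(tested == set(range(n)))  # prints True
```

The trade-off is coverage: a reduced schedule exercises every node but not every link, which is why the full pairwise test remains useful for tracking down a specific faulty connection.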

Our mission is to provide customers with leading-edge technologies for high performance computing (HPC) solutions. We establish and maintain industry-recognized products and expertise for Beowulf cluster interconnect (InfiniBand), cluster management and HPC storage solutions.