September/October 2003
In this issue:

Bits and Bytes - News From Microway

Tech Tips

Directions in Parallel Processing, Stephen S. Fried

Media Spotlight - Clusterworld


VISIT MICROWAY AT SC2003 (BOOTH 2227) November 17-20, 2003 in Phoenix


Bits and Bytes Four Xeon processors in 1U - Microway's Newest Hardware Design
  We are pleased to announce the availability in January, 2004 of our new Xeon XBlade technology. This Microway-designed XBladeTM is the result of three man-years of engineering development. It will run all current Xeon speeds up to 3.06GHz and produce the most compact high speed 32-bit processing power available. Our 1U XBlade chassis with Across the BoardTM cooling is also designed in house. One cabinet will house up to 160 processors, running LinuxBIOS, networked with Myrinet connectivity. Pricing will be available in November. This platform will be available for viewing at Microway Booth 2227 during the upcoming SC2003 trade show in November 17-20 in Phoenix.


64-bit Operating Systems
Microway delivers servers and clusters with 64 bit Operating Systems. For complete information on either Red Hat or SuSE Linux operating systems integration, please contact your Microway salesperson or email productinfo@microway.com
For Itanium2 dual and quad processor servers and clusters we install Red Hat Advanced Server.
For OpteronTM dual and quad processor servers and clusters our O/S of choice is SuSE Linux Enterprise Server for AMD64
 

Clusterworld-Microway Navion TM Clusters Awarded Best 64-bit Turnkey Solution
  Microway's Navion clusters, based on the AMD Opteron won the award for Best 64-bit Turnkey Solution at the Clusterworld Conference and Exposition in San Jose, California. The Navion was chosen from among eight entrants for this award.
Navion clusters are designed for HPC and other memory intensive applications. They are offered in 1U dual or 4U quad processor chassis, up to 32 Gigabytes of memory per node and low latency, high bandwidth Myrinet interconnects. Microway's custom 1U CoolFlowTM chassis and CoolRackTM cabinets employ unique cooling techniques to yield the lowest CPU die temperatures in the industry. The award winning clusters include Microway's NodeWatch/ MCMS remote cluster monitoring and management tools. NodeWatch allows remote cluster management, temperature monitoring and complete cluster shutdown and restart via a web-based GUI. Platform Computing recently announced completion of Opteron compatibility testing of Platform LSF on Microway's Navion system.
 

Supercomputing 2003
  Visit us at SC2003, November 17-20, Booth 2227
Microway will exhibit several new products at SC2003 in Phoenix Arizona. Our senior sales and management people have attended this show for many years.
We value the Supercomputing show as an incubator for new ideas and a chance to meet many of our previous (and future) customers. This show provides HPC users the opportunity to make new alliances with companies exhibiting interesting technologies and ofers intense tutorials and a technical program to suit just about every interest. For more information, please visit http://www.sc-conference.org/
If you would like a complimentary pass to the exhibits, please send email to productinfo@microway.com with your mailing address.


Tech tips Bash shell tips and tricks
 
  1. Command line completion:
    Using tab can automatically complete command and file names. For example, if you wanted to run mkreiserfs on /dev/sda1, typing 'mkre' and hitting tab would fill in the rest of 'mkreiserfs'. You can then type '/d' and hit tab. This will leave you with '/dev/'. Using tab for completion can dramatically reduce the amount of typing needed to get something done.
  2. Skipping to the beginning or end of a line:
    CTRL-a skips to the beginning of a line. CTRL-e skips to the end.
  3. Cut and paste:
    CTRL-w cuts the previous word in your command. CTRL-k cuts from the current position to the end of the line. CTRL-y pastes. Try using CTRL-a and then CTRL-k to cut an entire line.
  4. Curly brackets ({}):
    This tip has very broad applications, a common usage is renaming a file from something.type to something.type.old. This command will do it: 'mv something.type{,.old}' Things entered with curly brackets expand to multiple items, 1 per value in the brackets. This particular command would evaluate to 'mv something.type something.type.old' something.type would have nothing added the first time, and .old the second time.
  5. Wildcards:
    Wildcards are special characters that can represent other characters. The asterisk (*) is the full wildcard. It represents anything. The question mark (?) can represent any single character. A range specified like [0-9] can represent any of the numbers in that range. These wildcards can be used to work on a large number of files at the same time. To list all files that start with the letter 'A', you can type 'ls A*'. For all files that end in a 3 character extension (.XXX), you could do 'ls *.???'. For all files with a character in the range of f-h in it (case sensitive-watch out), type 'ls *[f-h]*'.
  6. For loops:
    Scripting isn't just for scripts. Small one time use scripts can simply be entered at the command prompt. If, for example, you wanted to delete the file /tmp/foo from nodes 3, 7, 10, 13, 16, and 19, you could just rsh into each node and delete it. In a fraction of that time however, you could type 'for i in 3 7 10 13 16 19 ; do rsh node$i rm /tmp/foo ; done' This command would step i through the 'in' values, performing the commands between do and done.
  7. While loops:
    Another useful scripting tool is the while loop. Let us again imagine having to delete /tmp/foo from some nodes. However, this time it is 40 consecutive nodes from 21 to 60. This command would solve the problem quickly: 'declare -i count=21 ; while [ $count -le 60 ] ; do rsh node$count rm /tmp/foo ; count=$count+1 ; done' The declare -i count=21 command forces bash to use $count as an integer variable (variables default to strings for that type of assignment).
  8. The read function:
    The read function in bash is a bit tricky to get used to. It works by reading from standard input into variables specified after the function call. The first word of the input goes into the first variable, the second into the second, etc, etc. The last variable listed in the read command gets all the remaining words. With an input of 'Microway, Technology you can count on' to 'read var1 var2 var3' would leave var1 containing 'Microway', var2 with 'Technology', and var3 with 'you can count on'. An example use is as follows: 'cat /etc/hosts | while read ip fullname shortname ; do echo 'IP $ip has a fullname of $fullname' ; done'
  9. Redirects:
    When a program is run on a console (bash shell), it has 2 output targets, standard out (stdout), and standard error(stderr). These 2 output streams can be redirected to different places. If you want the normal messages from a program to be ignored, but want errors to be output, you can run 'command >/dev/null'. This will redirect the main output(stdout) to /dev/null, but leave the error output (stderr) untouched. Bash uses the number 1 to respresent stdout, and 2 for stderr. Running 'command 2>/dev/null' would do just the opposite of our previous command; it would display all normal messages, but dump errors into /dev/null. Another target of the redirect can be a file. 'command 2>/tmp/errorlog >/tmp/outputlog' would put all stderr messages in /tmp/errorlog, and put all normal output into /tmp/outputlog. The '>&' redirect sends all output streams to the same place. 'command >&/dev/null' would dump all output from a program into /dev/null.
  10. Pipes:
    Pipes are very simliar to redirects. Rather than directing output into a file or /dev/null, a pipe (the '|' character) redirects output as the input to another program. This is commonly used to redirect the output of a program into a paging utility like less or more. 'ls /dev | less' would list the files in /dev, one page at a time.
 
Intel Compiler Floating Point Assist Fault
  The Intel brand compilers are very robust and contain cutting edge optimizations. Intel also provides updates to their product is very frequently. For this reason, it is very important to register with Intel Premier Support to obtain all updates and bug fix information. This is stated in the Microway instructions you receive with your cluster or workstation if you have purchased Intel brand compilers.

By registering, you will receive your own permanent license file for your system, as well as a login and password for their support web site, which contains problem reporting opportunities and all relevent downloads.

Recently, there was a bug in a default optimization in the latest 7.1 version of the Fortran compiler for LINUX. This resulted in floating point assist fault errors in the log files and invalid answers.

The following code produces this problem. It has been corrected in the 8.0 build 25 (8.0.025 ) version of the compiler:
! if you use the commented lines of code below, the problem goes away. program problem real*8 ar(1),ar2(1),dr ar2(1) = 3.13789d0 ar(1) = 2.01047d0 do i=1,1 dr = dexp( -( ar(i) - ar2(i) ) ) ! dr = ar(i) - ar2(i) ! this is OK enddo ! dr = dexp( -( ar(1) - ar2(1) ) ) ! this is OK too write(*,*)dr end


Directions in Parallel Processing
Stephen S. Fried
The effect of interconnect latency on cluster efficiency running fine grain problems

Over the last 15 years, there has been a 6,000 fold increase in x86 CPU floating point throughput. At the same time we have seen only a gradual improvement in the bandwidth and latency of interconnects which handle communications between processors. As a consequence, the size of the smallest problems that could be parallelized has dramatically increased. Where it was once possible to connect processors together and have them efficiently run problems whose fine grain parallel granularity was a single floating point instruction, today the smallest fine grain parallel problem that can be run on a cluster of Xeons needs to carry out 500,000 floating point operations! Clearly, issues other than CPU floating point performance have contributed the increase in the size of the smallest problems that can be parallelized. This article addresses the reasons for this increase in size and also compares the impact of latency on the size of fine grain problems running with different interconnect technologies.

To begin with we provide a comparison that illustrates the problem. In 1990, the fastest commodity processor was the Intel i860. The i860 outperformed the Pentium of its day by a factor of 20. This device provided 80 megaflops of throughput and 160 MB/Sec of interconnect bandwidth. Its bus was clocked at 20 ns and could be directly hooked up to interface devices. This meant that its effective latency was just 10 microseconds greater than the typical high quality interconnects available today, while its bandwidths were comparable.

At the same time that i860's were being used to build Supercomputers, a number of university researchers started hooking up PC's using 100 Megabit Ethernet. This was the start of the Beowulf revolution, which in my opinion ought to be called the MPI revolution, as it was the development of this interface library that resulted in a single standard for writing parallel applications. Today a typical Xeon or Opteron can crank out 6,000 Mflops. While the CPU's we use have gotten much more powerful, the majority of the clusters we build today, use interconnects which are at best no better than those of the i860. Based on this metric, the ratio of FPU throughput to interconnect bandwidth has fallen off by a factor of 75. However, this is not the complete story. Some of the interconnects used today were designed to reliably connect computers over the web that are hundreds of miles apart. The TCP/IP based backplanes used in some Beowulf clusters today continue to rely on a protocol which is designed for long distance communication. Long distance communication is not always reliable and TCP/IP includes features to increase reliability which also increase transfer time. This in turn increases the interconnect latency of clusters which connect through the "TCP/IP stack" by a factor of 10!

While the granularity of the problems that can be tackled has been on the rise, it does not necessarily mean that fine grain problems have been completely neglected. There is a tendency in the parallel processing community to respond to hardware bottlenecks with new algorithms. The bottlenecks are sometimes artificial. The Transputer was more efficient at solving fine grain problems than today's processors because it had built in DMA engines. However, users often took the easy way out when it came time to linking processors together. Instead of using the four links available on each CPU, they often chose to use just two links and connect them together in rings. Researchers then starting publishing papers on algorithms designed to run parallel applications on processors connected by rings. Similarly, problems that were heretofore considered too fine grained to solve on clusters are now being solved using new algorithms which do the computations in larger chunks, avoiding the inefficiencies being placed in their path by increasingly underperforming interconnect schemes.

There is a limit to improvements which can be expected by mangling algorithms so that they run efficiently on parallel machines whose fine grain parallel thresholds are continually being exceeded. Eventually, every problem reaches a limit where no amount of chicanery can turn it into a coarser grained problem. When these limits are reached, the only solution left is to improve the performance of the interconnects. The point where a problem fails to scale is a function of the efficiency that the users desires from the cluster. Assume we need to build a 90% efficient cluster. This point gets reached when the time to communicate is 10% of the computational load. One of the conclusions we will draw below is that the smallest problem we can efficiently parallelize on a system made up of dual Xeons requires us to perform a million floating point operations between inter-node communications.

Three big problems behind the increase in the size of the smallest problems that can be parallelized are the increased complexity of CPU's, the difficulty of driving busses at high speeds and the finite speed of the speed of light. It turns out these are the same problems that one encounters when designing chips today. However, the circuits in a processor die take just picoseconds to get their job done, so it is still possible to build semiconductors which share computational units which efficiently communicate with each other running in parallel. Starting in 1988, it became impossible to build processors that were located on two or more chips. This is the reason that Intel built the i860's FPU into its die and was forced later on to follow suit with the Pentium, doing away with its very profitable math coprocessor business. The challenge associated with communicating between chips on a printed circuit board is much worse than that of communicating between the units in a CPU and has continued to worsen with time. As the clock speed of the busses used to connect components moves above 66 MHz, the number of devices that can be connected together by a bus goes down. For example, where a 66 MHz PCI bus could connect to four PCI cards, a 133 MHz PCI-X can only be populated by a single card (i.e., above 100 MHz, only point to point communications work). This also implies that the devices which disseminate I/O signals across a motherboard all behave as hubs. In the case of a Xeon, the processor bus (called the front side bus) runs at 533 MHz and terminates in a device called the Plumas MCH (memory controller hub). The MCH is able to synchronously drive six new busses, because the signals used to drive these busses originate on die, where things still happen fast enough to synchronize busses. The busses it drives includes a pair of memory controllers and four I/O channels. Three of the I/O channels can be used to drive very high speed hubs which in turn can drive a pair of PCI-X 133 MHz slots. This takes advantage of the fact that the hubs on die circuits are fast enough to split signals and use them to drive a pair of external busses. The bottom line is that when a Xeon has to communicate with the outside world, it now has to pass its data out through a chain of hubs, each of which takes time and adds latency.

Personal computers are often bottlenecked by communications bandwidth, because of the volume of information pulled down from sources like the web. Bandwidth is less of an issue with the nodes of high performance clusters performing fine grain parallel problems. The primary concern of this article is not bandwidth, but the much more important subject of latency in a distributed memory parallel system, what causes it and its implications for the efficiency of a cluster.

The latency of a cluster interconnect can be defined as the time it takes a small packet to go from one node to another. For small packets, the latency becomes the rate limiting effect that determines the execution speed of fine grain parallel problems. A message going from the FPU registers of Xeon to the PCI bus passes through at least two hubs before it gets driven out onto the PCI bus. As it gets to each of these devices, the message passes through a FIFO (first in first out buffer). Some time later, the FIFO's contents are dumped out onto the output side of the hub and passed to the next device in the chain. Eventually it arrives at the PCI bus, where it has to wait again, this time for something called arbitration to occur, before it can finally make its way into a card designed to drive it out onto a "fabric." The fabric contains one or more devices which route the message to a card located in the PCI bus of the destination node, where it gets to repeat the process of entering a PCI bus and then making its way to the memory of the destination node. It typically takes close to a microsecond for data leaving an FPU to make it onto the PCI bus and into a card plugged into the bus.

Working with PCI hardware developed by StarGen which makes it possible to share memory between the PCI nodes of a cluster, we did a number of experiments with an MPICH port to their fabric. Their interface uses PCI bridges to connect PCI busses on different nodes together and as a result is very clean. Using the processors to move data from the memory on one node to another, we have been able to send small data packets between nodes in just 3.2 microseconds. We determined the latency of the cross bar used by their fabric by adding a second cross bar, which increased the raw hardware transfer time by a microsecond. We believe that the 3.2 microseconds measured is close to the best latency that can be achieved at the present time, connecting a pair of PCI nodes. We have heard claims for cross bar switches used in fabrics whose latency is just 200 ns, which would reduce the 3.2 microseconds down to 2.5. However, in practical large scale fabrics, the number of hops required to pass through the cross bars used in switches is often between three and six, so the best hardware latency times will almost always be at least 5 microseconds.

The experiments just described provide us with a lower limit on the latency that we can expect from a cluster whose nodes are connected by fabrics that communicate over the PCI bus. However, most commercial interconnects typically take 5 or more microseconds for small transfers between nodes to complete. Where it was possible with simple CPU's like the 8088 or 80386 to communicate between processors (such as an 80386 that was attached to an 80387) in times that were comparable to a CPU's cycle times, it now takes hundreds to tens of thousands of CPU cycles for messages to travel from one processor to a neighbor simply because modern pipelined CPU's are no longer intended to work directly with attached hardware.

With the arrival of the Opteron processor, it is possible to build small shared memory clusters which have significantly better latencies than "PCI connected" clusters. In situations where four processors sharing memory provide adequate parallel performance, the latency provided by the Opteron's HyperTransport bus, will be an order of magnitude smaller than those provided by PCI connected nodes. The Opteron's memory latency for a "cross court" transfer is just 140 ns. To transfer data from the memory of one CPU to another takes twice this, or 300 ns. This is roughly a factor of 10 less than the 3.2 microseconds it takes to make a memory to memory transfer between PCI connected nodes. However, even here, the latency will increase the moment we start talking about MPI, which adds a microsecond or more to the time it takes to send small messages. The implication here is if you want to take advantage of the Opteron's Transputer-like legacy, you will need to use a clean library interface, like that afforded by SHMEM.

Using calls to clock routines, it becomes possible to measure the time it takes for messages to pass through the various stages of an MPI transfer. On Xeons we have measured the overhead associated with MPICH 2.0 by subtracting the time it takes for the hardware to send a message from the total time measured for an MPI transfer. This time is 1.1 microseconds. We have done a large number of latency and bandwidth measurements at Microway running MPI on a number of different interconnect solutions. The smallest total latency measured to date is 4.5 microseconds using the StarGen cards and a fabric with a single 6 port switch. The technique used by StarGen is used by other hardware, but is normally not exposed by software as an API. We think this technique will become increasingly popular in the future. It is frequently referred to as RDMA (remote direct memory addressing). A number of new architectures including InfiniBand make it possible to perform RDMA operations. RDMA is also being experimented with by a at least one start up which plans to release NIC like devices which accelerate conventional TCP/IP transfers using standard GigE switches. One might think that using more efficient shared memory libraries like SHMEM ought to have a payoff in an RDMA environment. However, since our measurements show that MPI only adds 25% of the latency to an RDMA enabled distributed shared memory system, improving the library can at most only make a 25% improvement in speed for PCI connected nodes.

The latencies just mentioned are the best case scenarios. The typical latencies found in inter-processor interconnects designed for the current HPC market run between 5 and 10 microseconds and are over 50 microseconds for standard GigE. Assuming an average of 10 microseconds, we discover that the average latency of a high performance HPC interconnect is roughly 30,000 Xeon clock cycles. This turns out to be the amount of time it takes a 3 GHz Xeon to perform 60,000 64-bit floating point operations. To determine the smallest problem we can effectively parallelize, we must compare the time to communicate with the time to compute. For a problem in which these times are equal, the efficiency of a single node in a cluster will be 50%. The following relationship tells us what the efficiency of a single node in our cluster is: parallel efficiency = Tcomp /(Tcomp + Ttrans)

where the times are those to perform a computation and transfer the data needed to support the computation. This simple model assumes no overlapped operations, which is probably a good assumption for novice users writing MPI applications. However, our goal in this business is to produce clusters which scale as we add nodes. If we desire a cluster whose overall efficiency is 90%, then this increases the burden on our latency by roughly a factor of 9. We can no longer be happy with the case where we are 50% efficient, but now have to shoot for the case where the computation takes up 90% of the time and the transfer of data, just 10%. In the case of the Xeon, this means the smallest amount of time we can run our Xeon is 90 microseconds, if our goal is to make up for a 10 microsecond latency hit. This increases the size of the smallest problem that "scales," from 30,000 CPU cycles to 270,000 CPU cycles or 540,000 floating point operations. And, taking into account that the typical Xeon node has a pair of processors, this increases the fine grain parallel limit to 1.08 million floating point operations. This last figure is starting to approach the number of operations required to perform LINPACK on a 100x100 array, hardly a small problem by any means. However, examining the GigE case, where the latency is 50 microseconds, we conclude that the smallest problem that we should consider parallelizing will have to perform 5 million floating point operations! These are the reasons why DARPA is now funding development work at companies like IBM to develop chips which contain many parallel units running on the same die. Placing many FPU's and complete memory systems on a single die dramatically improves the latency of both memories and inter-processor communication. This approach has worked well in the DSP world, where chip volumes are huge and processors with four or more FPU's are common place. It has also been applied in a minor way in the x86 line, using SIMD instructions, which execute up to four floating point instructions at the same time. However, there is a limit to what you can do with SIMD devices, due to the awkwardness of working with VLIW (very long instruction word) instruction sets. These new devices are being designed to get around these problems, and will hopefully result in parallel computers which can again run small fine grain parallel problems at very high speeds.

The plot below can be used to compute the 50% efficient point for the four technologies we have been experimenting with. This plot was not made from the data collected, but using a two parameter model that characterized each of the devices. This model uses the latency and asymptotic measured bandwidth numbers, and approximates the actual data using a straight line. The top line shows the performance of GigE for small MPI packets. Assume we are doing a computation which transfers 5,000 bytes between nodes at each iteration of an algorithm. A GigE connection will take about 50 microseconds (its latency) plus the time it needs to move 5,000 bytes at a rate of 80 MB/Sec across the nodes. This turns out to be 112.5 microseconds. On the right hand side is the corresponding number of double precision floating point operations that a pair of Xeons can carry out in the same time. This plot basically reveals that using GigE to get to 50% efficiency, the processor needs to perform over a million operations. However, this will only provide us with 50% efficiency. In the real world, to get to 90% efficiency, at which point it would be worthwhile running this on a cluster with 10 or more nodes, we would actually need a problem which actually performed 9 million floating point operations!

The three bottom lines represent technologies which provide "low latency" interconnects. The best results are those lines which lie as close to the X axis as possible. We are not trying to pick favorites here - the times given are for specific hardware, but as products from companies like Myrinet catch up with the speed of InfiniBand when these measurements were made, they will follow the same line as the InfiniBand line on this plot. The high speed winners of this contest turn out to be any interconnect which can take full advantage of the PCI-X bus and also provides 10 microseconds or less of latency. The bottom line was taken with a 4X InfiniBand card from Mellanox. This device transfers 800 MB/sec and has a 7 microsecond latency - it also comes very close to saturating the PCI-X bus, and you only get the 800 MB/sec transfer rates when the packet sizes get very large. The StarGen and Myrinet fabrics had bandwidths of 160 and 235 MB/Sec, respectively. For very small transfers, the winner turns out to be the StarGen solution simply because it provides the cleanest interface to the PCI bus and can be programmed without the use of interrupts. The problem with StarGen is that it does not scale to clusters larger than 64 nodes. For any packet larger than 1000 bytes, the current winner is the one that saturates the PCI-X bus, which makes it possible to achieve 800 MB/sec of bandwidth for very large packets.



Media Spotlight As a service to our readers, we will from time to time in this column tell you about publications and trade shows that we believe impart useful information. This column will start with the new Clusterworld Magazine and Trade Show. What follows is an announcement provided by their management. Feedback is appreciated.

Announcing ClusterWorld

Clustered solutions represent a significant and growing IT market segment, deployed in an ever-increasing number of key industry verticals and major non-profit enterprises such as research labs and academic institutions. Clustered solutions includes high performance computing, grid and parallel computing systems, and mission-critical systems where high availability and redundancy are mandatory. Clustering deserves its own trade show and its own magazine to meet the specialized information needs of cluster users and buyers on a regular basis.

QuarterPower Media, publishers of Linux Magazine for the past five years, are pleased to announce the launch of our newest monthly publication, ClusterWorld Magazine. We are also pleased to announce FREE three-month trial subscriptions to ClusterWorld to qualified readers. To request a subscription form, please email circulation director Joan Braunstein: jbraunstein@linux-mag.com.

With the support of over 50 event sponsors and exhibitors we successfully launched the first ClusterWorld Conference and Expo (CWCE) in June, 2003. At the show, Microway won a prestigious ClusterWorld Award for the Best 64-Bit Turnkey Solution. The next CWCE is slated for 6-8 April 2004 in San Jose. For information on the 2004 Conference program and how to register and take advantage of early bird pricing, please visit our website: clusterworldexpo.com.



If you no longer wish to receive this newsletter, please reply to this email and change the subject line to UNSUBSCRIBE.