You Can Count On It by Stephen Fried
First Look: The AMD Opteron™ Processor
I had the great pleasure to attend the AMD Opteron™ launch on
Tuesday, April 22, in New York City. Executives from AMD, IBM,
Computer Associates, Microsoft and Oracle told the press how easy
it was to port their software to Opteron. Of course, what they
were telling us was obvious – since the Opteron supports
the same 32-bit instruction set as the Intel® Xeon™,
porting to it should be a snap. In fact, you don't even have to
port to it; all you need to do is run your code. If it doesn't
run, something is wrong. However, if you have code designed to
run on both 32- and 64-bit machines and you have “#ifdef'd”
your C++ sources, inserting macros to handle 32- and 64-bit
issues (something required to get code to run on both Alphas and
Xeons), then even porting to the new architecture is still going
to be a snap, as the only thing that will change is the size of
the integers. The applications that should benefit most readily from
64 bits are file servers and database managers, which use large
memories to cache data. While it will be fairly straightforward
to get the wins that come from memory and cache size, it's going to take a
bit longer to see the total benefit which AMD64 technology can
bring to the floating point world. To really hit full speed, you
will need to take full advantage of the Intel SSE2 instruction
set along with the extensions to SSE2 that AMD added to the
Opteron. And, as of this writing, even though PGI is working on this
problem, the best compilers by far for generating high quality
SSE2 code still come from Intel.
The Opteron brings a lot to the HPC
table. It contains a number of features that have appeared in the
past in parallel processing engines, including the Inmos
Transputer and the Alpha EV79. All of these devices contain a
memory controller built into the chip in addition to the CPU.
They also contain additional circuits which make it possible for
the CPUs to talk to their nearest neighbors. There are several
major benefits to this approach. First, Opterons can communicate
between themselves without having to send packets down a bus
which they share with other processors and which terminates at a
memory controller hub. In a two processor Xeon system, in which
the Northbridge (memory controller) has the same bandwidth as a
single Opteron, the Northbridge is shared between two CPUs and
the total bandwidth available to a Xeon pair will be half that
available to an Opteron pair – the Opterons in this
situation have two memory controllers compared to the Xeon pair’s
one. In addition, the time to fetch data out of memory for the
Opterons will be less, simply because the only off-chip access
will be the control signals required to drive the memory bus.
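The bandwidth argument above comes down to a few lines of arithmetic. The sketch below is only an illustration: the function name and the unit bandwidth are my own assumptions, and it takes the text's premise that a Northbridge and a single Opteron's on-chip controller deliver the same bandwidth.

```python
def aggregate_memory_bandwidth(n_cpus, controller_bw, shared_controller):
    """Total memory bandwidth available to a group of CPUs.

    shared_controller=True  -> one Northbridge serves every CPU (Xeon style).
    shared_controller=False -> each CPU has its own on-chip controller
                               (Opteron style).
    """
    if shared_controller:
        return controller_bw
    return n_cpus * controller_bw


# Assumption: arbitrary units, same bandwidth per controller on both systems.
CONTROLLER_BW = 1.0

xeon_pair = aggregate_memory_bandwidth(2, CONTROLLER_BW, shared_controller=True)
opteron_pair = aggregate_memory_bandwidth(2, CONTROLLER_BW, shared_controller=False)
ratio = xeon_pair / opteron_pair

print(f"Xeon pair / Opteron pair bandwidth: {ratio}")  # 0.5
```

The same function suggests why the gap widens with more processors: the shared figure stays flat while the per-CPU figure grows with the processor count.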
Next, in an Opteron pair, each
processor can access the memory of the other. This is done over a
low-latency, high-speed bus which maintains internal cache
coherency, and is known as the coherent HyperTransport™
bus. The efficient Inmos Transputer parallel architecture, in
which each CPU had its own memory controller and four CPU
interconnects called Links, is closely mimicked by the Opteron.
The Opteron has three Links instead of four, and they use a bus
that takes the new name contributed by AMD:
HyperTransport. Two of the three links are coherent and are used to
link Opterons together. The third is not, and is used by the
processor to talk to lower speed I/O busses. The current
generation of Opteron can link four CPUs together – in the
future this will jump to eight.
HyperTransport is a “pass it
along” style interconnect, i.e., a sequence of
HyperTransport devices can be linked up in a row and talk to each
other. This is not exactly what happens in a bus, and the reason
this new approach had to be taken is the typical problem that we
encounter with all electronics when it comes time to speed things
up. The problem is called fan out. In a bus, everyone sits on the
same group of parallel lines and listens in on the conversation.
When the message is intended for them, they figure that out by
decoding the bus’s addresses and turning on their latches,
which hook the data. The problem with high speed busses is that
all of those ears on the parallel lines end up loading down the
lines, making it impossible to get really high speed performance.
This is the reason that the number of PCI slots on a PCI hub (a
gadget on a motherboard which distributes signals to the PCI bus)
goes down as the frequency goes up. A typical hub can support
four slots at 66 MHz, two slots at 100 MHz, but just one at 133
MHz. Although busses are an acceptable methodology when bus
frequencies are on the order of 33 to 66 MHz, they don’t
work with signals running well above 100 MHz, which is where
HyperTransport runs. So, another technique was needed to make it
possible to interface a number of devices, say four PCI slots
(each can be thought of as a device when something is plugged
into it). The technique used is to eliminate the snoopers. In
HyperTransport, the signal moves from point to point down a
series of chips, which act as bridges. When a packet of
information is intended for them, they pass it along to their
client device; otherwise they pass it along to the next HT device
in the chain. The typical time (I am using numbers here for
typical LVDS devices) required to make a hop using the LVDS “pass
it along” paradigm is on the order of 100 ns per hop. This
means if a part is on one end of an HT chain and it wants to talk
to one on the other end, it will take several hundred nanoseconds
for the message to run the gauntlet. However, properly buffered,
the added latency of the paradigm will not affect bandwidth, as
each part in the chain contains input and output buffers which
are set up as FIFOs. The case discussed above is dedicated to
performing I/O.
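The “pass it along” behavior described above can be sketched as a small simulation. This is a toy model under stated assumptions, not the HyperTransport protocol itself: the device names and the `deliver` function are hypothetical, and the 100 ns per hop is the ballpark LVDS figure quoted in the text.

```python
HOP_LATENCY_NS = 100  # ballpark per-hop figure from the text (typical LVDS)


def deliver(chain, source_index, target_name):
    """Forward a packet down the chain, one bridge at a time.

    Each bridge either hands the packet to its client device (when the
    name matches) or passes it along to the next bridge in the chain.
    Returns (hops, latency_ns).
    """
    hops = 0
    for name in chain[source_index + 1:]:
        hops += 1
        if name == target_name:
            return hops, hops * HOP_LATENCY_NS
    raise ValueError(f"{target_name} is not downstream of position {source_index}")


# Hypothetical chain: a host followed by four bridged PCI devices.
chain = ["host", "pci_a", "pci_b", "pci_c", "pci_d"]

hops, latency_ns = deliver(chain, 0, "pci_d")
print(hops, latency_ns)  # 4 hops, 400 ns end to end
```

The model covers latency only. Because each bridge buffers its input and output in FIFOs, the extra hops delay a message without reducing the rate at which messages can flow.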
In the case of an Opteron
motherboard with four CPUs, the worst case latency is a two hop
transfer across the square. This adds just 70 ns. For
inter-processor communication the coherent HT bus is extremely
efficient. Even when used to drive a long chain of peripheral
devices, it is still very efficient, as the typical latencies of
I/O peripherals run from tens of microseconds all the way
out to milliseconds, which basically means that the half
microsecond you waste going down the chain is not observable.
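To put a number on how little that half microsecond matters, here is a rough calculation. The function name is an assumption of mine, and the two device latencies are simply the ends of the range quoted above (10 microseconds is the most pessimistic reading of “tens of microseconds”).

```python
CHAIN_DELAY_NS = 500  # the "half microsecond" spent going down the chain


def chain_overhead(device_latency_ns, chain_delay_ns=CHAIN_DELAY_NS):
    """Chain delay expressed as a fraction of the device's own latency."""
    return chain_delay_ns / device_latency_ns


fast_device = chain_overhead(10_000)      # a 10 microsecond peripheral
slow_device = chain_overhead(1_000_000)   # a 1 millisecond peripheral

print(f"{fast_device:.2%}")  # 5.00%
print(f"{slow_device:.2%}")  # 0.05%
```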
Finally, we get to the real benefit
of coherent HyperTransport. In a ring of four processors, the
busses between the CPUs can be simultaneously active. This is the
same thing we saw with Transputers, where the most common
parallel topology turned out to be the ring. In the case of a
ring, the total bandwidth of the interconnect is the total number
of active connections times the speed of a single
interconnection. In the case of a four-way ring, each CPU has two
nearest neighbors and one diagonal partner, so with evenly spread
traffic about 66% of the communications will be with the nearest
neighbors, and 33% with
the diagonal member. And, if two or more processors are active at
the same time on the HT interconnect, it becomes possible to
effectively increase the inter-processor bandwidth. The only
disadvantage of the HT paradigm is inter-processor latency. When
a single bus, à la Intel, connects four CPUs together, only a
single processor can be on the bus at a time, and the latency
between that processor and the only thing it communicates with,
the Northbridge, will be less than the latency of a cross corner
transfer on the HT solution – communications can occur on a
bus without hops, reducing inter-processor latency. However, as
we mentioned above, the Opteron more than makes up for this by
virtue of the fact that each processor has its own private on
chip memory controller which eliminates the extra time a bus
architecture needs to access its memory controller hub.
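The ring arithmetic above is easy to check. The sketch below assumes traffic is spread evenly over the other processors and that every link runs at the same speed; the function names are mine, chosen for illustration.

```python
def ring_hops(a, b, n=4):
    """Shortest number of hops between CPUs a and b on an n-way ring."""
    d = abs(a - b) % n
    return min(d, n - d)


def neighbor_share(n=4):
    """Fraction of one CPU's partners that are exactly one hop away."""
    partners = [p for p in range(n) if p != 0]
    near = sum(1 for p in partners if ring_hops(0, p, n) == 1)
    return near / len(partners)


def ring_bandwidth(active_links, link_bw):
    """Aggregate interconnect bandwidth: active links times link speed."""
    return active_links * link_bw


near = neighbor_share(4)
print(f"nearest neighbors: {near:.1%}, diagonal: {1 - near:.1%}")
print(ring_hops(0, 2))           # the diagonal is the two-hop, worst case
print(ring_bandwidth(4, 1.0))    # all four links busy, in units of one link
```

A shared bus, by contrast, behaves like `ring_bandwidth(1, link_bw)` no matter how many CPUs sit on it, which is the trade the text describes: the bus wins on hop count, the ring wins on aggregate bandwidth.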
While Transputers talked to each
other using links, Intel Xeons don't have any such feature.
Rather, when two Xeons talk to each other, they actually do that
by sharing memory. Therefore, if two Intel processors on the same
bus want to share information, what they actually have to do is
read and write memory through the Northbridge. On the other hand,
Opterons can access each other’s memory over the coherent
HyperTransport bus, and it turns out that these off chip fetches
only take between 100 and 140 ns. This is roughly the time
a single Intel Xeon needs to access its own memory (AMD claims the
latency of Athlon/Xeon motherboards with Northbridges is
typically 170 ns). The bottom line is, because the pair of
Opterons communicating in cross court memory transactions each
have their own private memories and controllers, each can fetch
memory in about half the time of a Xeon or Athlon. In fact, it
will take the Opteron system less time (140ns) to perform a cross
court fetch than it takes a Xeon to make a simple fetch through
the motherboard. In other words, the Opteron has a very low
latency inter-processor connection. Where this really comes in
handy is in running SMP applications. True SMP problems that run
faster in a low latency shared memory environment, like those on
the heavy iron built by companies like Fujitsu, will run faster
on an Opteron cluster than on any other commodity cluster, excepting
possibly the expensive HP Alpha EV79 based solution, which
features a similar architecture. To put things in perspective,
the best latency between nodes in an MPI cluster is typically on
the order of 10 microseconds. This is a factor of 100 greater
than the inter-processor latency of an Opteron cluster. This
means that an Opteron SMP cluster can execute parallel problems
whose granularity is a factor of 100 “finer” than
those executing on an MPI connected system.
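The factor of 100 follows directly from the latencies quoted above. A quick check, using the 10 microsecond MPI figure and both ends of the 100 to 140 ns Opteron range (the function name is an assumption for illustration):

```python
MPI_LATENCY_NS = 10_000  # roughly 10 microseconds between MPI nodes


def granularity_ratio(slow_latency_ns, fast_latency_ns):
    """How many times finer the work can be sliced on the faster link."""
    return slow_latency_ns / fast_latency_ns


best_case = granularity_ratio(MPI_LATENCY_NS, 100)   # 100 ns fetch
worst_case = granularity_ratio(MPI_LATENCY_NS, 140)  # 140 ns cross-court fetch

print(best_case)               # 100.0
print(round(worst_case, 1))    # 71.4
```

So “a factor of 100” is the best case; at the 140 ns end of the range the advantage is closer to a factor of 70, which is still about two orders of magnitude.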