Achieve the Best Performance: Intel Xeon E5-2600 “Sandy Bridge”

Intel has once again done an excellent job designing a high-performance processor. The new Xeon E5-2600 “Sandy Bridge EP” processors run as much as 2.2 times faster than the previous-generation Xeon 5600 “Westmere” processors. Combined with the new Xeon server/workstation platforms, they will be extremely attractive to anyone with computationally intensive workloads.

The new Intel architecture provides many benefits right out of the box; others require changes on your end. Read on to make sure you’re achieving the best performance.

Intel Advanced Vector Extensions (AVX) Instructions

One of the largest performance improvements, as far as HPC is concerned, is AVX. Intel AVX accelerates vector and floating point computations by increasing maximum vector size from 128 to 256 bits. Essentially, the floating point capability of Intel processors has been doubled. Very exciting, but some work is required to take advantage of this improvement.

Your current applications will run on the new processors, but they will only use the first 128 bits of the vector units. In most cases, all that’s required is re-compiling your application(s). However, you’ll need a compiler that supports the new AVX instructions, and the operating system must support the 256-bit-wide vector state.

For the operating system, you’ll need Linux kernel version 2.6.30 or later (or a vendor who has backported the features to their kernel, such as Red Hat). Windows users will need Windows 7 SP1 or Windows Server 2008 R2 SP1.

These are the best compiler options currently available:

  • Intel Composer XE (or the older Intel Compiler Suite version 11.1)
  • GCC version 4.6 or later
  • The Portland Group compiler 2011 version 11.6 (newer versions include further enhancements)
  • Microsoft Visual Studio 2010

PCI-Express generation 3.0 (Integrated I/O)

This is a major feature which comes for free on all of Microway’s new Intel Xeon systems. Having support for gen 3 PCI-Express will be highly desirable when the new Intel MIC and NVIDIA Tesla compute processor products are released. There is a ~2X bandwidth improvement between PCI-E generations 2 and 3, so anyone installing a gen 3 device in a gen 2 platform will be sacrificing significant performance.

Furthermore, Intel built the PCI-Express controller into the CPU itself. This removes one hop between the host and the PCI-E device, reducing latency by ~30%. Initial reports suggest that this change improves application performance by 10+%, even for PCI-E gen 2 devices!

Memory Speed and Capacity

HPC experts know that getting data to the processor is one of the most common bottlenecks. Improvements to memory are always welcome, and there are several in the new architecture. First, peak memory clock speeds have been increased to 1600MHz. Second, the older triple-channel controller has been replaced with a quad-channel controller. This allows for faster access to memory and a larger number of DIMMs (up to 24, depending upon the platform). Third, L3 cache sizes have been increased to 20MB.

However, not all of the new processors feature the fastest options, so you will have to choose which model to purchase. There are three distinct performance levels:

  • Basic @ 1066MHz (10MB L3 cache)
  • Standard @ 1333MHz (15MB L3 cache)
  • Advanced @ 1600MHz (20MB L3 cache)

Given the slower performance, we do not recommend Basic models. For reference, here is a table of the processor SKUs:

List of all E5-2600 Processor Models

Turbo Boost 2.0

Turbo Boost allows the processor frequency to be temporarily increased as long as the processor is running within its power and thermal envelopes. This capability is enabled by default and is managed automatically by the CPU hardware. You don’t have to do anything to take advantage of the speedup, but understanding Turbo Boost behavior is useful.

When only a few cores of a multi-core chip are in use, the clock speeds of those cores are boosted significantly. When more cores are in use, the clock is still boosted, but there is less margin for increases. With all cores in use, it’s still possible to see a boost, but the increases will be smaller. Each boost level is 100MHz, but total Turbo Boost capacity varies from model to model. For the Standard and Advanced processor models, the boost ranges from 300MHz or 400MHz (when all cores are in use) to as high as 800MHz or 900MHz (when only a single core is in use). The Basic models have essentially no boost capability.

According to Intel, processors with Turbo Boost 2.0 enter boost mode more frequently and stay there longer than previous models. Note that processors with Turbo Boost 2.0 may operate above TDP for short periods of time to maximize performance.

Quick Path Interconnect (QPI)

QPI provides communication between the processor sockets. In addition to higher clock speeds, the new Xeon platforms introduce a second link between sockets (more than doubling the potential communication between processor sockets). This provides significant benefits for multi-threaded/parallel applications which send large quantities of data between threads.

The QPI links are also used in other situations. For example, the second CPU may need access to memory or a PCI-Express device which is physically connected to the first CPU. All of this traffic also passes across the QPI links – two fast buses reduce the likelihood of bottlenecks.

Much like the memory speed improvements, QPI speed varies by processor model. There are three distinct performance levels:

  • Basic @ 6.4 GT/s
  • Standard @ 7.2 GT/s
  • Advanced @ 8.0 GT/s

Refer to the processor SKU table above for complete details.

Hyperthreading

Hyperthreading has long been a part of Intel processor designs. However, it has rarely shown benefit for computationally intensive applications. It doesn’t provide faster access to data or a larger number of math units; it simply allows additional threads to be in flight at the same time.

You will have to test your application to determine whether it offers any benefit. With Hyperthreading enabled, the operating system sees twice as many processor cores as are actually present in the hardware. You’ll want to run test jobs with both the real and virtual numbers of cores. Then disable Hyperthreading (it may be disabled from the BIOS) and run the test again using one thread per physical processor core. Typically, we do not see dramatic performance differences.

Conclusion

Overall, Microway’s Intel Xeon E5-2600 based workstations, servers and clusters provide many benefits out-of-the-box. Improvements to memory bandwidth, cache and QPI speeds don’t require any special changes on the part of the users, but careful analysis must be made during the purchasing process to choose the best option. HPC users will need to recompile their applications to take advantage of the 2X performance boost made possible by the AVX extensions. Those planning to use high-performance add-on cards, such as GPUs and MIC, should choose these new Xeon platforms to ensure the lowest-latency, highest-bandwidth path between the compute units and the host.

Eliot Eshelman

About Eliot Eshelman

My interests span from astrophysics to bacteriophages; high-performance computers to small spherical magnets. I've been an avid Linux geek (with a focus on HPC) for more than a decade. I work as Microway's Vice President of Strategic Accounts and HPC Initiatives.
