DDR4 Memory on Xeon E5-2600v3 with 3 DIMMs per channel

This week I had the opportunity to run the STREAM memory benchmark on a Microway 2U NumberSmasher server which supports up to 3 DIMMs per channel.  In practice, this system is typically configured with 768GB or 1.5TB of DDR4 memory. A key goal of this benchmarking was to examine how RAM quantity and clock frequency affect bandwidth performance.  When fully loading all three DIMMs per channel, the memory frequency defaults to 1600MHz.  At two DIMMs per channel, the default memory frequency increases to 1866MHz.  With one DIMM per channel, the frequency maxes out at 2133MHz.
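To put those memory speeds in context, here is a rough sketch of the theoretical peak bandwidth at each setting. It assumes four memory channels per socket and an 8-byte data bus per channel, which the article does not state; the 32GB DIMM size comes from the test system described below. The function name and constants are illustrative only.

```python
# Theoretical peak DRAM bandwidth for a dual-socket Xeon E5-2600 v3 system.
# Assumptions (not stated in the article): 4 memory channels per socket and
# a 64-bit (8-byte) data bus per channel.
SOCKETS = 2
CHANNELS_PER_SOCKET = 4
BYTES_PER_TRANSFER = 8
DIMM_SIZE_GB = 32  # DIMM size used in this test system


def peak_bandwidth_gbs(mega_transfers_per_s: int) -> float:
    """Return theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    channels = SOCKETS * CHANNELS_PER_SOCKET
    return channels * BYTES_PER_TRANSFER * mega_transfers_per_s / 1000.0


# DIMMs per channel -> default memory speed reported in the article
CONFIGS = {3: 1600, 2: 1866, 1: 2133}

for dpc, speed in CONFIGS.items():
    dimms = SOCKETS * CHANNELS_PER_SOCKET * dpc
    capacity_gb = dimms * DIMM_SIZE_GB
    print(f"{dpc} DIMM(s) per channel: {dimms} DIMMs, {capacity_gb} GB, "
          f"DDR4-{speed}, peak {peak_bandwidth_gbs(speed):.1f} GB/s")
```

Under these assumptions the ceiling is roughly 102, 119 and 137 GB/s for the three configurations. Measured STREAM numbers always land below the theoretical figure.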

Photo of the Supermicro X10DRU-i motherboard

The Test System

System: NumberSmasher 2U Server based on SYS-6028U-TR4+
Motherboard: X10DRU-i+
Processors x 2: Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz
DIMMs: 32GB DDR4-2133 ECC/Registered Samsung M393A4K40BB0-CPB0Q
Operating System: CentOS Linux release 7.2.1511 (Core)
Kernel Version: 3.10.0-327.10.1.el7.x86_64
Compiler: Intel Parallel Studio XE 2016

Close-up photo of the Supermicro SYS-6028U-TR4 2U server supporting 3 DIMMs per channel

Benchmark Compilation and Execution

When compiling STREAM with the Intel compiler, I used the following compiler knobs in the makefile:

CC = icc
CFLAGS = -O3 -xHost -openmp -DSTREAM_ARRAY_SIZE=64000000 -opt-streaming-cache-evict=0 -opt-streaming-stores always -opt-prefetch-distance=64,8

Information on compiling STREAM can be found in an Intel Developer Zone article on STREAM Triad Optimization.  The STREAM FAQ at the University of Virginia site is also helpful.
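As a sanity check on the -DSTREAM_ARRAY_SIZE=64000000 setting, the sketch below works out how much memory the STREAM arrays occupy and how that compares to the processor cache. It assumes STREAM's default double-precision element type and 15 MB of L3 cache per E5-2637 v3; neither figure is stated in this article. The STREAM source recommends making each array at least four times the total cache size.

```python
# Memory footprint of the STREAM arrays for -DSTREAM_ARRAY_SIZE=64000000.
ARRAY_ELEMENTS = 64_000_000   # from CFLAGS in the article
BYTES_PER_ELEMENT = 8         # assumption: STREAM's default type is double
NUM_ARRAYS = 3                # STREAM works on three arrays: a, b and c

# Assumption: 15 MB of L3 cache per E5-2637 v3, two sockets.
L3_BYTES_TOTAL = 2 * 15 * 1024**2

array_bytes = ARRAY_ELEMENTS * BYTES_PER_ELEMENT
total_bytes = NUM_ARRAYS * array_bytes
cache_multiple = array_bytes / L3_BYTES_TOTAL

print(f"Per array : {array_bytes / 1e6:.0f} MB")
print(f"All arrays: {total_bytes / 1e6:.0f} MB")
print(f"Each array is {cache_multiple:.1f}x the combined L3 cache")
```

With these assumptions each array is well over four times the combined L3, so the benchmark should be measuring DRAM rather than cache.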

I set the KMP_AFFINITY and OMP_NUM_THREADS environment variables before running STREAM:

export KMP_AFFINITY=granularity=core,compact

On a system that has hyper-threading turned on, I could have used the GOMP_CPU_AFFINITY environment variable to focus on real cores, but I elected to turn off hyper-threading in the BIOS instead.

STREAM Performance Results

Results with 3 DIMMs per Channel – 768GB RAM @ 1600MHz

Task     Best Rate (MB/s)   Avg time (s)   Min time (s)   Max time (s)
Copy     73,876.7           0.013882       0.013861       0.013905
Scale    73,430.8           0.013967       0.013945       0.013989
Add      70,320.2           0.021891       0.021843       0.022147
Triad    70,555.8           0.021859       0.021770       0.022379
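The Best Rate column can be reproduced from the Min time column. The sketch below assumes the usual STREAM accounting, where Copy and Scale touch two arrays, Add and Triad touch three, elements are 8-byte doubles, and 1 MB is 10^6 bytes. The function and dictionary names are my own.

```python
# Reproduce STREAM's "Best Rate" from the minimum kernel time.
ARRAY_ELEMENTS = 64_000_000   # -DSTREAM_ARRAY_SIZE from the makefile
BYTES_PER_ELEMENT = 8         # assumption: double precision

# Number of arrays each kernel reads or writes per element.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# Min times (seconds) from the 3 DIMMs per channel table above.
MIN_TIME_S = {"Copy": 0.013861, "Scale": 0.013945,
              "Add": 0.021843, "Triad": 0.021770}


def best_rate_mbs(kernel: str, min_time_s: float) -> float:
    """Bytes moved per iteration divided by the fastest time, in MB/s."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_ELEMENT * ARRAY_ELEMENTS
    return bytes_moved / min_time_s / 1e6


for kernel, t in MIN_TIME_S.items():
    print(f"{kernel:<6} {best_rate_mbs(kernel, t):>10,.1f} MB/s")
```

The computed values differ from the table by a fraction of a MB/s because the published times are rounded to six decimal places.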

Results with 2 DIMMs per Channel – 512GB RAM @ 1866MHz

Task     Best Rate (MB/s)   Avg time (s)   Min time (s)   Max time (s)
Copy     88,413.8           0.011661       0.011582       0.011900
Scale    87,867.6           0.011765       0.011654       0.012166
Add      90,289.8           0.017417       0.017012       0.018789
Triad    89,756.5           0.017596       0.017113       0.018941

Results with 1 DIMM per Channel – 256GB RAM @ 2133MHz

Task     Best Rate (MB/s)   Avg time (s)   Min time (s)   Max time (s)
Copy     89,242.5           0.011479       0.011468       0.011495
Scale    87,724.0           0.011699       0.011673       0.011757
Add      90,363.3           0.017031       0.016998       0.017057
Triad    90,411.5           0.017006       0.016989       0.017027

Plot of STREAM Triad memory performance for Intel Xeon E5-2637 v3 CPUs with 768GB, 512GB and 256GB of DDR4 memory

Summary of Results

Notice in the chart how sharply performance improves moving from 3 DIMMs per channel (768GB at 1600MHz) to 2 DIMMs per channel (512GB at 1866MHz): Triad rises from 70,555.8 MB/s to 89,756.5 MB/s, about 27%.  Going from 2 DIMMs per channel to 1 DIMM per channel (256GB at 2133MHz) changes very little; Triad gains less than 1%.
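The Triad figures from the three tables can be set against the memory clock increases directly. This is only arithmetic on the published numbers; the helper name is hypothetical.

```python
# Compare the Triad gain with the memory clock gain between configurations.
# (label, memory speed in MHz, Triad best rate in MB/s) from the tables above.
RESULTS = [
    ("3 DIMMs per channel", 1600, 70555.8),
    ("2 DIMMs per channel", 1866, 89756.5),
    ("1 DIMM per channel", 2133, 90411.5),
]


def percent_gain(old: float, new: float) -> float:
    """Relative increase from old to new, in percent."""
    return (new - old) / old * 100.0


for (old_label, old_mhz, old_bw), (new_label, new_mhz, new_bw) in zip(
        RESULTS, RESULTS[1:]):
    print(f"{old_label} -> {new_label}: "
          f"clock +{percent_gain(old_mhz, new_mhz):.1f}%, "
          f"Triad +{percent_gain(old_bw, new_bw):.1f}%")
```

The first step gains more bandwidth than the clock increase alone would suggest, while the second step gains far less.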

This is significant when deciding how much RAM to spec on a new system, or how much to add when upgrading. Outfitting a server with eight or sixteen DIMMs results in excellent performance. Outfitting a server with twenty-four DIMMs provides exceptional memory capacity, but results in reduced performance. Thus, there is a trade-off between memory capacity and memory performance.

Keep in mind too that the E5-2637 v3 processors, with only 4 physical cores each, limit the STREAM results.  Had I used something like the E5-2690 v3 processors, with 12 physical cores each, the peak STREAM throughput would be roughly 110GB/sec.
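One way to read the core-count point is to express each figure as a fraction of theoretical peak and as bandwidth per core. The sketch below assumes four memory channels per socket and 8 bytes per transfer at 2133MHz, neither of which is stated in the article. The 110GB/sec value is the estimate quoted above, not a measurement.

```python
# Fraction of theoretical peak and per-core bandwidth at DDR4-2133.
SOCKETS = 2
CHANNELS_PER_SOCKET = 4   # assumption: E5-2600 v3 memory channels per socket
BYTES_PER_TRANSFER = 8    # assumption: 64-bit data bus per channel
PEAK_GBS = SOCKETS * CHANNELS_PER_SOCKET * BYTES_PER_TRANSFER * 2133 / 1000.0


def summarize(label: str, triad_gbs: float, cores_per_socket: int) -> dict:
    """Return fraction of peak and GB/s per core for one configuration."""
    total_cores = SOCKETS * cores_per_socket
    return {
        "label": label,
        "fraction_of_peak": triad_gbs / PEAK_GBS,
        "gbs_per_core": triad_gbs / total_cores,
    }


measured = summarize("E5-2637 v3 (measured Triad)", 90.4115, 4)
estimated = summarize("E5-2690 v3 (estimate from the text)", 110.0, 12)

for row in (measured, estimated):
    print(f"{row['label']}: {row['fraction_of_peak']:.0%} of peak, "
          f"{row['gbs_per_core']:.1f} GB/s per core")
```

Under these assumptions the eight cores tested here each sustain far more bandwidth than each core of a 24-core configuration would, but in aggregate they still leave about a third of the theoretical peak unused.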

Results with 2 DIMMs per Channel – 512GB RAM @ 2133MHz (Forced in BIOS)

The best overall performance for the day (though not graphed above) came from forcing the 512GB configuration to 2133MHz in the BIOS:

Task     Best Rate (MB/s)   Avg time (s)   Min time (s)   Max time (s)
Copy     89,510.2           0.011477       0.011440       0.011605
Scale    88,981.7           0.011523       0.011508       0.011539
Add      92,473.6           0.016640       0.016610       0.016665
Triad    92,403.3           0.016674       0.016623       0.016710

Be careful though – a configuration like this needs to be heavily tested to ensure stability.  Call us at Microway if you are not sure or have questions about memory configuration on your next server.

Photo of the Supermicro SYS-6028U-TR4 2U server
