NVIDIA Tesla K40 GPUs, the High Performance Choice for Many Applications

Brett Newman · January 7, 2014

NVIDIA Tesla K40 is now the leading Tesla GPU for performance. Here are some important use cases where Tesla K40 could substantially speed up your GPU-accelerated applications:

Pick Tesla K40 for

Large Data Sets

GPU memory has always been at a greater premium than system memory. If you have a large data set, the 12GB of GDDR5 on Tesla K40 could be an excellent match for your application. Many common CUDA codes break data into chunks explicitly sized to the GPU memory space and run their compute algorithm on each chunk. Chunk size was previously limited to 5GB (Tesla K20) or 6GB (Tesla K20X). With Tesla K40, your chunks can be at least twice the size:

GPU	Tesla K20	Tesla K20X	Tesla K40
Memory Capacity	5GB GDDR5	6GB GDDR5	12GB GDDR5
Max Memory Bandwidth	208GB/sec	250GB/sec	288GB/sec
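To get a rough feel for what the larger memory means for a chunked workload, the short Python sketch below counts how many chunks a data set would need on each GPU. The 100GB data set and the `usable_fraction` allowance are hypothetical values chosen for illustration, not measurements; only the memory capacities come from the table above.

```python
import math

# Memory capacities from the table above, in GB.
GPU_MEMORY_GB = {"Tesla K20": 5, "Tesla K20X": 6, "Tesla K40": 12}


def chunks_needed(dataset_gb, gpu_memory_gb, usable_fraction=0.9):
    """Number of chunks needed to stream a data set through one GPU.

    usable_fraction is a hypothetical allowance for memory the application
    cannot use for data (CUDA context, scratch buffers); tune it per code.
    """
    usable_gb = gpu_memory_gb * usable_fraction
    return math.ceil(dataset_gb / usable_gb)


# Hypothetical 100GB data set, chosen only for illustration.
DATASET_GB = 100
chunk_counts = {
    gpu: chunks_needed(DATASET_GB, mem) for gpu, mem in GPU_MEMORY_GB.items()
}

for gpu, count in chunk_counts.items():
    print(f"{gpu}: {count} chunks")
```

Fewer chunks generally means fewer kernel launches and fewer host-to-device transfers, although the real benefit depends on how your code overlaps transfers with compute.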

Frequent PCI-E Bus Transfers

Tesla K40 fully supports PCI-Express Gen3. CUDA codes that constantly move substantial data (>6GB/sec) across the PCI-E bus will benefit greatly from Gen3. We’ve seen Tesla K40 GPUs deliver up to 10GB/sec in NVIDIA’s CUDA bandwidth tests.

[root@node3 tests]# ./gpu_bandwidthTest --memory=pinned --device=0
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla K40m
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     10038.7

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     10046.7

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     202665.0

Result = PASS

Moreover, if your application transfers >6GB/sec to the GPU in small bursts, Tesla K40 will still deliver better performance. Assess the frequency of these transfers and the performance benefits of faster burst transfers to determine whether Tesla K40 is right for you. Tesla K40 is a must-buy when you can pair this capability with another benefit.
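One way to assess burst transfers is to estimate the time a single transfer takes at each rate. The sketch below uses the 32MB transfer size and the host-to-device rate from the test output above, and compares it against the 6GB/sec threshold mentioned in the text. It assumes "MB" means 10^6 bytes (the test tool's exact unit convention is not confirmed here) and it ignores per-transfer latency, so treat the results as rough estimates.

```python
def transfer_seconds(size_bytes, bandwidth_mb_per_s):
    """Time to move size_bytes at a sustained rate given in MB/s.

    Assumption: 1 MB = 10**6 bytes. Fixed per-transfer latency is ignored.
    """
    return size_bytes / (bandwidth_mb_per_s * 1e6)


TRANSFER_BYTES = 33554432      # transfer size from the test output above
GEN3_MEASURED_MB_S = 10038.7   # host-to-device rate from the test output above
THRESHOLD_MB_S = 6000.0        # assumption: the ~6GB/sec figure cited in the text

gen3_ms = transfer_seconds(TRANSFER_BYTES, GEN3_MEASURED_MB_S) * 1e3
threshold_ms = transfer_seconds(TRANSFER_BYTES, THRESHOLD_MB_S) * 1e3
speedup = threshold_ms / gen3_ms

print(f"At 10038.7 MB/s: {gen3_ms:.2f} ms per transfer")
print(f"At 6000.0 MB/s:  {threshold_ms:.2f} ms per transfer")
print(f"Speedup per transfer: {speedup:.2f}x")
```

Multiply the per-transfer saving by the number of transfers in a run to see whether it is a meaningful share of your total runtime.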

Fastest double precision performance

Tesla K40 enables the 15th and final SMX in the Kepler GPU architecture. This increases the CUDA core count from 2688 (Tesla K20X) to 2880, and peak performance rises to 1.43 TFLOPS double precision and 4.29 TFLOPS single precision.

GPU	Tesla K20	Tesla K20X	Tesla K40
DP FLOPS	1.17 TFLOPS	1.32 TFLOPS	1.43 TFLOPS
SP FLOPS	3.52 TFLOPS	3.95 TFLOPS	4.29 TFLOPS

If your code isn’t multi-GPU enabled, Tesla K40 will be the fastest double-precision device by far. In raw FLOPS that’s a 22% DP gain vs. Tesla K20 and a more limited gain of roughly 8% over Tesla K20X, based on the figures above. That’s before the new GPU Boost feature is used.
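The relative gains can be recomputed directly from the double-precision figures in the table. The sketch below uses only those rounded table values, so its results are only as precise as the table itself; the `percent_gain` helper is a name introduced here for illustration.

```python
# Peak double-precision figures from the table above, in TFLOPS.
DP_TFLOPS = {"Tesla K20": 1.17, "Tesla K20X": 1.32, "Tesla K40": 1.43}


def percent_gain(new, old):
    """Relative improvement of new over old, as a percentage."""
    return (new / old - 1.0) * 100.0


# Gain of Tesla K40 over each of the older GPUs.
gains = {
    gpu: percent_gain(DP_TFLOPS["Tesla K40"], tflops)
    for gpu, tflops in DP_TFLOPS.items()
    if gpu != "Tesla K40"
}

for gpu, gain in gains.items():
    print(f"Tesla K40 vs. {gpu}: {gain:.1f}% peak DP gain")
```

These are peak numbers; the gain an application actually sees depends on how close it runs to peak on each device.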

Applications that underutilize the GPU

GPU Boost 2.0 is one of the exciting new features exclusive to Tesla K40. NVIDIA engineers observed a number of applications that left Tesla GPUs well under full utilization and well under TDP.

Rather than leave this thermal headroom on the table, they engineered a GPU Boost feature specific to Tesla that allows you to convert power headroom into performance. GPU Boost 2.0 raises the 745MHz base clock of your Tesla SMX CUDA cores to one of two user-selectable boost levels: Boost 1 @ 810MHz and Boost 2 @ 875MHz.

Should you approach TDP, Tesla K40 will automatically clock down to a lower boost state or the base clock. Boost clock levels are always deterministic, unlike GPU Boost for GeForce GPUs. This is extremely important for HPC applications: non-deterministic boost behavior in GeForce cards can result in variable compute times for each run or application segment.
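To see what each clock level could mean on paper, the sketch below derives theoretical peak throughput from the 2880 CUDA cores and the three clock levels quoted above. It rests on two assumptions that are not stated in this article: each core retires one fused multiply-add (two floating-point operations) per clock, and double-precision peak is one third of single-precision peak, which matches the 1.43 and 4.29 TFLOPS figures given earlier.

```python
CUDA_CORES = 2880              # Tesla K40 core count from the text
FLOPS_PER_CORE_PER_CLOCK = 2   # assumption: one fused multiply-add per clock
DP_TO_SP_RATIO = 1 / 3         # assumption: consistent with 1.43 / 4.29 TFLOPS

# Clock levels quoted in the text, in MHz.
CLOCKS_MHZ = {"base": 745, "boost 1": 810, "boost 2": 875}


def peak_tflops(cores, clock_mhz):
    """Return (single, double) precision theoretical peak in TFLOPS."""
    sp = cores * clock_mhz * 1e6 * FLOPS_PER_CORE_PER_CLOCK / 1e12
    return sp, sp * DP_TO_SP_RATIO


peaks = {name: peak_tflops(CUDA_CORES, mhz) for name, mhz in CLOCKS_MHZ.items()}

for name, (sp, dp) in peaks.items():
    print(f"{name:8s} {CLOCKS_MHZ[name]} MHz: {sp:.2f} TFLOPS SP, {dp:.2f} TFLOPS DP")
```

These are theoretical ceilings only. The speedup a real application sees from a boost level depends on whether it is compute bound and on how much power headroom it leaves.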

For more information on how to manipulate GPU Boost, you may wish to read NVIDIA’s Application Note.

Consider another Tesla GPU when your focus is exclusively

FLOPS per dollar (Double-Precision)

In Servers and Clusters running mixed workloads: Tesla K20X
In a Workstation or WhisperStation: Tesla K20

FLOPS per dollar (Single-Precision)

For servers only, the Tesla K10 GPU may offer better performance for single-precision applications with high GPU memory bandwidth requirements.

Performance without a need for extensive GPU memory

If your memory requirements do not dictate Tesla K40, Tesla K20 or K20X may offer the best balance of features, performance and price. We would be happy to advise you.

Need Help Selecting a GPU for Your Application?

Microway HPC specialists leverage their extensive application experience to help walk you through the hardware selection process. We can help you assess which GPU is right for you and develop a custom hardware configuration that suits your needs.