NVIDIA Tesla K40 is now the leading Tesla GPU for performance. Here are the main use cases where Tesla K40 could significantly speed up your GPU-accelerated applications:
Pick Tesla K40 for
Large Data Sets
GPU memory has always been at a greater premium than its CPU equivalent. If you have a large data set, the 12GB of GDDR5 on Tesla K40 could be an excellent match for your application. Many common CUDA codes break data into chunks explicitly sized to fit the GPU memory space and run their compute algorithm on each chunk in turn. Chunk size was previously limited to 5GB (Tesla K20) or 6GB (Tesla K20X). With Tesla K40, your chunks can be at least twice the size:
|GPU||Tesla K20||Tesla K20X||Tesla K40|
|Memory Capacity||5GB GDDR5||6GB GDDR5||12GB GDDR5|
|Max Memory Bandwidth||208GB/sec||250GB/sec||288GB/sec|
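To make the chunking argument concrete, here is a minimal Python sketch that estimates how many chunks a data set needs on each GPU. The capacities come from the table above. The 100GB data set and the `usable_fraction` allowance (memory left over for the CUDA context and working buffers) are illustrative assumptions, not measured values.

```python
import math

# Memory capacities in GB, taken from the table above.
GPU_MEMORY_GB = {"Tesla K20": 5, "Tesla K20X": 6, "Tesla K40": 12}


def chunks_needed(dataset_gb, gpu_memory_gb, usable_fraction=0.9):
    """Estimate how many chunks are needed to stream a data set through one GPU.

    usable_fraction is a hypothetical allowance for memory that the
    application cannot devote to the chunk itself.
    """
    usable_gb = gpu_memory_gb * usable_fraction
    return math.ceil(dataset_gb / usable_gb)


if __name__ == "__main__":
    dataset_gb = 100  # hypothetical data set size
    for name, capacity_gb in GPU_MEMORY_GB.items():
        print(f"{name}: {chunks_needed(dataset_gb, capacity_gb)} chunks")
```

Fewer, larger chunks mean fewer kernel launches and fewer separate transfers for the same data set, which is where the extra capacity tends to pay off.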
Frequent PCI-E Bus Transfers
Tesla K40 fully supports PCI-Express Gen3. CUDA codes that constantly move substantial data (>6GB/sec) across the PCI-E bus will benefit greatly from Gen3. We’ve seen Tesla K40 GPUs deliver up to 10GB/sec in NVIDIA’s CUDA bandwidth tests.
[root@node3 tests]# ./gpu_bandwidthTest --memory=pinned --device=0
[CUDA Bandwidth Test] - Starting...
Running on...
 Device 0: Tesla K40m
 Quick Mode
 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432                 10038.7
 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432                 10046.7
 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432                 202665.0
Result = PASS
Moreover, if your application transfers >6GB/sec to the GPU in small bursts, Tesla K40 will still deliver better performance. Assess the frequency of these transfers and the performance benefits of faster burst transfers to determine whether Tesla K40 is right for you. Tesla K40 is a must-buy when you can pair this capability with another benefit.
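One rough way to assess those transfers is to estimate how much of each iteration is spent on the PCI-E bus. The sketch below compares the 6GB/sec threshold mentioned above with the roughly 10GB/sec we measured. The chunk size and compute time are hypothetical inputs, and the model assumes transfers are not overlapped with computation, which many real codes avoid by using streams.

```python
def transfer_seconds(size_gb, bandwidth_gb_per_sec):
    """Time to move size_gb across the bus at a sustained bandwidth."""
    return size_gb / bandwidth_gb_per_sec


def transfer_share(size_gb, bandwidth_gb_per_sec, compute_seconds):
    """Fraction of one iteration spent on the bus.

    Assumes the transfer and the compute phase run back to back
    with no overlap (a simplifying assumption).
    """
    t = transfer_seconds(size_gb, bandwidth_gb_per_sec)
    return t / (t + compute_seconds)


if __name__ == "__main__":
    chunk_gb = 4.0         # hypothetical chunk size
    compute_seconds = 1.0  # hypothetical kernel time per chunk
    for label, bandwidth in (("6GB/sec", 6.0), ("10GB/sec", 10.0)):
        share = transfer_share(chunk_gb, bandwidth, compute_seconds)
        print(f"{label}: {share:.0%} of each iteration spent on transfers")
```

If the share stays small at either bandwidth, faster transfers alone are unlikely to justify the upgrade.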
Fastest double precision performance
Tesla K40 enables the 15th and final SMX unit of the Kepler GPU architecture. This increases the CUDA core count from 2688 (Tesla K20X) to 2880, and peak GPU performance rises to 1.43 TFLOPS double precision and 4.29 TFLOPS single precision.
|GPU||Tesla K20||Tesla K20X||Tesla K40|
|DP FLOPS||1.17 TFLOPS||1.32 TFLOPS||1.43 TFLOPS|
|SP FLOPS||3.52 TFLOPS||3.95 TFLOPS||4.29 TFLOPS|
If your code isn’t multi-GPU enabled, Tesla K40 will be the fastest double-precision device available to it. Based on the raw FLOPS figures above, that’s a 22% DP gain vs. Tesla K20 and a more limited gain of about 8% over Tesla K20X. That’s before the new GPU Boost feature is utilized.
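The relative gains can be recomputed directly from the rounded figures in the table above. This short Python sketch does only that arithmetic; results computed from unrounded specifications may differ by a fraction of a percent.

```python
# Peak double-precision throughput in TFLOPS, from the table above.
DP_TFLOPS = {"Tesla K20": 1.17, "Tesla K20X": 1.32, "Tesla K40": 1.43}


def percent_gain(new, old):
    """Relative improvement of new over old, in percent."""
    return (new / old - 1.0) * 100.0


if __name__ == "__main__":
    k40 = DP_TFLOPS["Tesla K40"]
    for name in ("Tesla K20", "Tesla K20X"):
        gain = percent_gain(k40, DP_TFLOPS[name])
        print(f"Tesla K40 vs. {name}: {gain:.1f}% DP gain")
```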
Applications that underutilize the GPU
GPU Boost 2.0 is one of the exciting new features exclusive to Tesla K40. NVIDIA engineers observed a number of applications that left Tesla GPUs well under full utilization and well under TDP.
Rather than leave this headroom on the table, they engineered a GPU Boost feature specific to Tesla that allows you to convert power headroom into performance. GPU Boost 2.0 raises the 745MHz base clock of your Tesla SMX CUDA cores to one of two user-selectable boost levels: Boost 1 at 810MHz and Boost 2 at 875MHz.
Should you approach TDP, Tesla K40 will automatically clock down to a lower boost state or to the base clock. Boost clock levels are always deterministic, unlike GPU Boost for GeForce GPUs. This is extremely important for HPC applications: non-deterministic boost behavior in GeForce cards can result in variable compute times for each run or application segment.
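To get a feel for what each boost level is worth, the sketch below scales theoretical peak throughput with clock speed. It assumes each CUDA core retires one fused multiply-add (2 FLOPs) per clock, and that double precision runs at one third of the single-precision rate, as implied by the 1.43 and 4.29 TFLOPS figures quoted earlier. These are theoretical peaks, not benchmark results; real gains depend on whether your application stays under TDP.

```python
CUDA_CORES = 2880            # Tesla K40 core count, from the text above
FLOPS_PER_CORE_PER_CLOCK = 2  # assumption: one fused multiply-add per clock
DP_TO_SP_RATIO = 1.0 / 3.0    # assumption inferred from 1.43 / 4.29 TFLOPS

# Clock levels in MHz, from the text above.
CLOCKS_MHZ = {"Base": 745, "Boost 1": 810, "Boost 2": 875}


def peak_sp_tflops(clock_mhz):
    """Theoretical single-precision peak at the given core clock."""
    return CUDA_CORES * FLOPS_PER_CORE_PER_CLOCK * clock_mhz * 1e6 / 1e12


def peak_dp_tflops(clock_mhz):
    """Theoretical double-precision peak at the given core clock."""
    return peak_sp_tflops(clock_mhz) * DP_TO_SP_RATIO


if __name__ == "__main__":
    for level, mhz in CLOCKS_MHZ.items():
        print(f"{level} ({mhz}MHz): "
              f"{peak_dp_tflops(mhz):.2f} TFLOPS DP, "
              f"{peak_sp_tflops(mhz):.2f} TFLOPS SP")
```

At the base clock this model reproduces the 1.43 and 4.29 TFLOPS figures in the table, which suggests the assumptions are reasonable for a first estimate.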
For more information on how to manipulate GPU Boost, you may wish to read NVIDIA’s Application Note.
Consider another Tesla GPU when your focus is exclusively
FLOPS per dollar (Double-Precision)
- In Servers and Clusters running mixed workloads: Tesla K20X
- In a Workstation or WhisperStation: Tesla K20
FLOPS per dollar (Single-Precision)
For servers only, the Tesla K10 GPU may offer better performance for single-precision applications with high GPU memory bandwidth requirements.
Performance without a need for extensive GPU memory
If your memory requirements do not dictate Tesla K40, Tesla K20 or K20X may offer the best balance of features, performance and price. We would be happy to advise you.
Need Help Selecting a GPU for Your Application?
Microway HPC specialists leverage their extensive application experience to help walk you through the hardware selection process. We can help you assess which GPU is right for you and develop a custom hardware configuration that suits your needs.