In-Depth Comparison of NVIDIA Tesla Kepler GPU Accelerators

This article provides in-depth details of the NVIDIA Tesla K-series GPU accelerators (codenamed “Kepler”). “Kepler” GPUs improve upon the previous-generation “Fermi” architecture.

For more information on other Tesla GPU architectures, please refer to our related articles.

Important changes available in the “Kepler” GPU architecture include:

  • Dynamic parallelism allows GPU threads to launch new grids of threads directly on the GPU. This simplifies parallel programming and avoids unnecessary communication between the GPU and the CPU.
  • Hyper-Q enables up to 32 work queues per GPU, allowing multiple CPU cores and MPI processes to address the GPU concurrently. This greatly improves utilization of GPU resources.
  • SMX architecture provides a new streaming multiprocessor design optimized for performance per watt. Each SM contains 192 CUDA cores (up from 32 cores in Fermi).
  • PCI-Express generation 3.0 doubles data transfer rates between the host and the GPU.
  • GPU Boost increases the clock speed of all CUDA cores, providing a 30+% performance boost for many common applications.
  • Each SM contains more than twice as many registers (with another 2X on Tesla K80). Each thread may address four times as many registers.
  • Shared memory bank width is doubled, as is shared memory bandwidth. Tesla K80 features an additional 2X increase in shared memory size.
  • Shuffle instructions allow threads to share data without use of shared memory.
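As a sketch of how dynamic parallelism looks in practice, a kernel running on a compute capability 3.5+ "Kepler" GPU can launch child kernels itself (the kernel names, tile sizes, and work shown here are illustrative, not from the original article):

```cuda
// Child kernel: processes one tile of the data (illustrative work).
__global__ void child_kernel(float *data, int offset)
{
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

// Parent kernel: with dynamic parallelism (compute capability >= 3.5,
// compiled with nvcc -arch=sm_35 -rdc=true), a GPU thread can launch
// child grids itself, so no round-trip to the CPU is needed to decide
// on further work. Assumes tile_size is a multiple of 256.
__global__ void parent_kernel(float *data, int tiles, int tile_size)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        for (int t = 0; t < tiles; ++t)
            child_kernel<<<tile_size / 256, 256>>>(data, t * tile_size);
        cudaDeviceSynchronize();  // device-side wait for the child grids
    }
}
```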
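Hyper-Q pays off when independent streams of work are issued concurrently. A minimal host-side sketch (buffer sizes and the kernel are placeholders): on GK110/GK210, each stream below gets its own hardware work queue, so the small grids can overlap instead of being serialized through a single queue as on Fermi.

```cuda
__global__ void small_kernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * x[i];
}

int main()
{
    const int n_streams = 8, n = 1 << 16;
    cudaStream_t streams[n_streams];
    float *buf[n_streams];

    // Issue one independent grid per stream; Hyper-Q lets them run concurrently.
    for (int s = 0; s < n_streams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buf[s], n * sizeof(float));
        small_kernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < n_streams; ++s) {
        cudaFree(buf[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```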
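The shuffle instructions can replace shared-memory staging for intra-warp exchange. A sketch of a warp-level sum reduction using `__shfl_down` (the pre-CUDA 9 name; later toolkits use `__shfl_down_sync`):

```cuda
// Warp-level sum: each of the 32 threads in a warp contributes `val`.
// __shfl_down reads `val` from the lane `offset` positions higher, with
// no shared memory or __syncthreads() required (compute capability >= 3.0).
__device__ float warp_reduce_sum(float val)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down(val, offset);
    return val;  // lane 0 ends up holding the warp's total
}
```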

“Kepler” Tesla GPU Specifications

The table below summarizes the features of the available Tesla GPU Accelerators. To learn more about any of these products, or to find out how best to leverage their capabilities, please speak with an HPC expert.

Feature                             | Tesla K80                             | Tesla K40
------------------------------------|---------------------------------------|----------------------------------
GPU Chip(s)                         | 2x Kepler GK210                       | Kepler GK110b
Peak Single Precision (base clocks) | 5.60 TFLOPS (both GPUs combined)      | 4.29 TFLOPS
Peak Double Precision (base clocks) | 1.87 TFLOPS (both GPUs combined)      | 1.43 TFLOPS
Peak Single Precision (GPU Boost)   | 8.73 TFLOPS (both GPUs combined)      | 5.04 TFLOPS
Peak Double Precision (GPU Boost)   | 2.91 TFLOPS (both GPUs combined)      | 1.68 TFLOPS
Onboard GDDR5 Memory¹               | 24 GB (12 GB per GPU)                 | 12 GB
Memory Bandwidth¹                   | 480 GB/s (240 GB/s per GPU)           | 288 GB/s
PCI-Express Generation              | 3.0                                   | 3.0
Achievable PCI-E transfer bandwidth | 12 GB/s                               | 12 GB/s
# of SMX Units                      | 26 (13 per GPU)                       | 15
# of CUDA Cores                     | 4992 (2496 per GPU)                   | 2880
Memory Clock                        | 2500 MHz                              | 3004 MHz
GPU Base Clock                      | 560 MHz                               | 745 MHz
GPU Boost Support                   | Yes – Dynamic                         | Yes – Static
GPU Boost Clocks                    | 23 levels between 562 MHz and 875 MHz | 810 MHz, 875 MHz
Architecture features               | SMX, Dynamic Parallelism, Hyper-Q     | SMX, Dynamic Parallelism, Hyper-Q
Compute Capability                  | 3.7                                   | 3.5
Workstation Support                 | Yes                                   | Yes
Server Support                      | Yes                                   | Yes
Wattage (TDP)                       | 300W (plus Zero Power Idle)           | 235W

¹ Measured with ECC disabled. Memory capacity and performance are reduced with ECC enabled.

The models listed below are still available for sale in certain scenarios, but are not generally recommended. They offer lower performance than Tesla K40 or K80 (and do not cost any less).

Feature                             | Tesla K20X                        | Tesla K20                         | Tesla K10
------------------------------------|-----------------------------------|-----------------------------------|-------------------
GPU Chip(s)                         | Kepler GK110                      | Kepler GK110                      | 2x Kepler GK104
Peak Single Precision               | 3.95 TFLOPS                       | 3.52 TFLOPS                       | 2.3 TFLOPS per GPU
Peak Double Precision               | 1.32 TFLOPS                       | 1.17 TFLOPS                       | 95 GFLOPS per GPU
Onboard GDDR5 Memory¹               | 6 GB                              | 5 GB                              | 4 GB per GPU
Memory Bandwidth¹                   | 250 GB/s                          | 208 GB/s                          | 160 GB/s per GPU
PCI-Express Generation              | 2.0                               | 2.0                               | 3.0
Achievable PCI-E transfer bandwidth | 6 GB/s                            | 6 GB/s                            | 11 GB/s
# of SMX Units                      | 14                                | 13                                | 8 per GPU
# of CUDA Cores                     | 2688                              | 2496                              | 1536 per GPU
Memory Clock                        | 2600 MHz                          | 2600 MHz                          | 2500 MHz
GPU Base Clock                      | 732 MHz                           | 705 MHz                           | 745 MHz
GPU Boost Support                   | Limited                           | Limited                           | —
GPU Boost Clocks                    | 758 MHz, 784 MHz                  | —                                 | —
Architecture features               | SMX, Dynamic Parallelism, Hyper-Q | SMX, Dynamic Parallelism, Hyper-Q | SMX
Compute Capability                  | 3.5                               | 3.5                               | 3.0
Workstation Support                 | Yes                               | Yes                               | Yes
Server Support                      | Yes                               | Yes                               | Yes
Wattage (TDP)                       | 235W                              | 225W                              | 225W

¹ Measured with ECC disabled. Memory capacity and performance are reduced with ECC enabled.

Comparison between “Fermi” and “Kepler” GPU Architectures

Feature                            | Fermi GF100 | Fermi GF104 | Kepler GK104 | Kepler GK110(b) | Kepler GK210
-----------------------------------|-------------|-------------|--------------|-----------------|-------------
Compute Capability                 | 2.0         | 2.1         | 3.0          | 3.5             | 3.7
Threads per Warp                   | 32          | 32          | 32           | 32              | 32
Max Warps per SM                   | 48          | 48          | 64           | 64              | 64
Max Threads per SM                 | 1536        | 1536        | 2048         | 2048            | 2048
Max Thread Blocks per SM           | 8           | 8           | 16           | 16              | 16
32-bit Registers per SM            | 32 K        | 32 K        | 64 K         | 64 K            | 128 K
Max Registers per Thread Block     | 32 K        | 32 K        | 64 K         | 64 K            | 64 K
Max Registers per Thread           | 63          | 63          | 63           | 255             | 255
Max Threads per Thread Block       | 1024        | 1024        | 1024         | 1024            | 1024
Shared Memory Configurations (remainder is configured as L1 Cache) | 16KB + 48KB L1, or 48KB + 16KB L1 (64KB total) | 16KB + 48KB L1, or 48KB + 16KB L1 (64KB total) | 16KB + 48KB L1, 32KB + 32KB L1, or 48KB + 16KB L1 (64KB total) | 16KB + 48KB L1, 32KB + 32KB L1, or 48KB + 16KB L1 (64KB total) | 16KB + 112KB L1, 32KB + 96KB L1, or 48KB + 80KB L1 (128KB total)
Max Shared Memory per Thread Block | 48KB        | 48KB        | 48KB         | 48KB            | 48KB
Max X Grid Dimension               | 2¹⁶−1       | 2¹⁶−1       | 2³²−1        | 2³²−1           | 2³²−1
Hyper-Q                            | No          | No          | No           | Yes             | Yes
Dynamic Parallelism                | No          | No          | No           | Yes             | Yes
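Rather than hard-coding the per-architecture limits tabulated above, applications can query them at runtime through the CUDA runtime API. A short host-side sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Print a few of the per-device properties discussed above for every
// GPU in the system, using the standard cudaGetDeviceProperties() call.
int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        printf("Device %d: %s (compute capability %d.%d)\n",
               d, p.name, p.major, p.minor);
        printf("  SMs: %d, 32-bit registers per SM: %d\n",
               p.multiProcessorCount, p.regsPerMultiprocessor);
        printf("  max threads per block: %d, max grid x dimension: %d\n",
               p.maxThreadsPerBlock, p.maxGridSize[0]);
    }
    return 0;
}
```

On a Tesla K40, for example, this reports compute capability 3.5 and 15 SMs, matching the tables above.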