Important changes available in the “Kepler” GPU architecture include:
- Dynamic Parallelism allows GPU threads to launch new kernels directly on the GPU. This simplifies parallel programming and avoids unnecessary round trips between the GPU and the CPU.
- Hyper-Q enables up to 32 hardware work queues per GPU. Multiple CPU cores and MPI processes are therefore able to address the GPU concurrently, greatly improving utilization of GPU resources.
- SMX architecture provides a new streaming multiprocessor design optimized for performance per watt. Each SMX contains 192 CUDA cores (up from 32 cores per SM in Fermi).
- PCI-Express generation 3.0 doubles data transfer rates between the host and the GPU.
- GPU Boost increases the clock speed of all CUDA cores, providing a 30%+ performance improvement for many common applications.
- Each SMX contains more than twice as many registers as a Fermi SM (with another 2X on Tesla K80), and each thread may address four times as many registers (255, up from 63).
- Shared memory bank width is doubled, and shared memory bandwidth doubles with it. Tesla K80 features an additional 2X increase in shared memory size.
- Shuffle instructions allow threads within a warp to exchange data without using shared memory.
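For compute-bound kernels, the headroom GPU Boost offers can be estimated from the base and maximum boost clocks in the specification tables below. This is only an upper bound; actual boost levels depend on power and thermal headroom. A rough sketch:

```python
# Upper bound on the clock-rate uplift GPU Boost can provide, using the
# base and top boost clocks from the Tesla K80 and K40 specification tables.
k80_base, k80_boost = 560, 875   # MHz
k40_base, k40_boost = 745, 875   # MHz

k80_uplift = k80_boost / k80_base - 1   # fractional clock increase
k40_uplift = k40_boost / k40_base - 1

print(f"K80 max clock uplift: {k80_uplift:.0%}")   # ~56%
print(f"K40 max clock uplift: {k40_uplift:.0%}")   # ~17%
```

The dynamic boosting on K80 explains its unusually low base clock: the card idles low and clocks up whenever the power budget allows.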
“Kepler” Tesla GPU Specifications
The table below summarizes the features of the available Tesla GPU Accelerators. To learn more about any of these products, or to find out how best to leverage their capabilities, please speak with an HPC expert.
Currently-shipping Tesla 'Kepler' GPUs
Feature | Tesla K80 | Tesla K40 |
---|---|---|
GPU Chip(s) | 2x Kepler GK210 | Kepler GK110b |
Peak Single Precision (base clocks) | 5.60 TFLOPS (both GPUs combined) | 4.29 TFLOPS |
Peak Double Precision (base clocks) | 1.87 TFLOPS (both GPUs combined) | 1.43 TFLOPS |
Peak Single Precision (GPU Boost) | 8.73 TFLOPS (both GPUs combined) | 5.04 TFLOPS |
Peak Double Precision (GPU Boost) | 2.91 TFLOPS (both GPUs combined) | 1.68 TFLOPS |
Onboard GDDR5 Memory¹ | 24 GB (12 GB per GPU) | 12 GB |
Memory Bandwidth¹ | 480 GB/s (240 GB/s per GPU) | 288 GB/s |
PCI-Express Generation | 3.0 | 3.0 |
Achievable PCI-E transfer bandwidth | 12 GB/s | 12 GB/s |
# of SMX Units | 26 (13 per GPU) | 15 |
# of CUDA Cores | 4992 (2496 per GPU) | 2880 |
Memory Clock | 2500 MHz | 3004 MHz |
GPU Base Clock | 560 MHz | 745 MHz |
GPU Boost Support | Yes – Dynamic | Yes – Static |
GPU Boost Clocks | 23 levels between 562 MHz and 875 MHz | 810 MHz, 875 MHz |
Architecture features | SMX, Dynamic Parallelism, Hyper-Q | SMX, Dynamic Parallelism, Hyper-Q |
Compute Capability | 3.7 | 3.5 |
Workstation Support | – | Yes |
Server Support | Yes | Yes |
Wattage (TDP) | 300W (plus Zero Power Idle) | 235W |
1. Measured with ECC disabled. Memory capacity and performance are reduced with ECC enabled.
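The peak-throughput rows above follow from core count × clock × 2 FLOPs per cycle (each CUDA core can issue one fused multiply-add per cycle), with GK110/GK210 executing double precision at one third the single-precision rate. A back-of-envelope check against the Tesla K40 row:

```python
def peak_tflops(cuda_cores, clock_mhz, flops_per_cycle=2):
    """Peak throughput in TFLOPS: cores x clock x FLOPs/cycle (one FMA = 2 FLOPs)."""
    return cuda_cores * clock_mhz * 1e6 * flops_per_cycle / 1e12

# Tesla K40 (GK110b): 2880 CUDA cores at a 745 MHz base clock
k40_sp = peak_tflops(2880, 745)        # single precision
k40_dp = peak_tflops(2880 // 3, 745)   # DP throughput is 1/3 of SP on GK110

print(f"K40 peak SP: {k40_sp:.2f} TFLOPS")   # ~4.29, matching the table
print(f"K40 peak DP: {k40_dp:.2f} TFLOPS")   # ~1.43, matching the table
```

The same arithmetic with the boost clocks reproduces the GPU Boost rows.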
Previous Tesla 'Kepler' GPU Models
Feature | Tesla K20X | Tesla K20 | Tesla K10 |
---|---|---|---|
GPU Chip(s) | Kepler GK110 | Kepler GK110 | 2x Kepler GK104 |
Peak Single Precision | 3.95 TFLOPS | 3.52 TFLOPS | 2.3 TFLOPS per GPU |
Peak Double Precision | 1.32 TFLOPS | 1.17 TFLOPS | 95 GFLOPS per GPU |
Onboard GDDR5 Memory¹ | 6 GB | 5 GB | 4 GB per GPU |
Memory Bandwidth¹ | 250 GB/s | 208 GB/s | 160 GB/s per GPU |
PCI-Express Generation | 2.0 | 2.0 | 3.0 |
Achievable PCI-E transfer bandwidth | 6 GB/s | 6 GB/s | 11 GB/s |
# of SMX Units | 14 | 13 | 8 per GPU |
# of CUDA Cores | 2688 | 2496 | 1536 per GPU |
Memory Clock | 2600 MHz | 2600 MHz | 2500 MHz |
GPU Base Clock | 732 MHz | 705 MHz | 745 MHz |
GPU Boost Support | Limited | – | – |
GPU Boost Clocks | 758 MHz, 784 MHz | – | – |
Architecture features | SMX, Dynamic Parallelism, Hyper-Q | SMX, Dynamic Parallelism, Hyper-Q | SMX |
Compute Capability | 3.5 | 3.5 | 3.0 |
Workstation Support | – | Yes | – |
Server Support | Yes | Yes | Yes |
Wattage (TDP) | 235W | 225W | 225W |
1. Measured with ECC disabled. Memory capacity and performance are reduced with ECC enabled.
Comparison between “Fermi” and “Kepler” GPU Architectures
Feature | Fermi GF100 | Fermi GF104 | Kepler GK104 | Kepler GK110(b) | Kepler GK210 |
---|---|---|---|---|---|
Compute Capability | 2.0 | 2.1 | 3.0 | 3.5 | 3.7 |
Threads per Warp | 32 | 32 | 32 | 32 | 32 |
Max Warps per SM | 48 | 48 | 64 | 64 | 64 |
Max Threads per SM | 1536 | 1536 | 2048 | 2048 | 2048 |
Max Thread Blocks per SM | 8 | 8 | 16 | 16 | 16 |
32-bit Registers per SM | 32 K | 32 K | 64 K | 64 K | 128 K |
Max Registers per Thread Block | 32 K | 32 K | 64 K | 64 K | 64 K |
Max Registers per Thread | 63 | 63 | 63 | 255 | 255 |
Max Threads per Thread Block | 1024 | 1024 | 1024 | 1024 | 1024 |
Shared Memory Configurations (remainder is configured as L1 Cache) | 16KB + 48KB L1; 48KB + 16KB L1 (64KB total) | 16KB + 48KB L1; 48KB + 16KB L1 (64KB total) | 16KB + 48KB L1; 32KB + 32KB L1; 48KB + 16KB L1 (64KB total) | 16KB + 48KB L1; 32KB + 32KB L1; 48KB + 16KB L1 (64KB total) | 16KB + 112KB L1; 32KB + 96KB L1; 48KB + 80KB L1 (128KB total) |
Max Shared Memory per Thread Block | 48KB | 48KB | 48KB | 48KB | 48KB |
Max X Grid Dimension | 2¹⁶-1 | 2¹⁶-1 | 2³²-1 | 2³²-1 | 2³²-1 |
Hyper-Q | – | – | – | Yes | Yes |
Dynamic Parallelism | – | – | – | Yes | Yes |
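The register limits above translate directly into occupancy constraints: every resident thread's registers must fit in the SM's register file. A minimal sketch using the GK110 figures from the comparison table (register allocation granularity and other limits, such as shared memory, are ignored here):

```python
def max_resident_threads(regs_per_thread, regfile_size=64 * 1024, hw_thread_limit=2048):
    """Upper bound on threads resident per SMX: limited either by the 64 K
    register file or by the 2048-thread hardware limit (GK110 figures)."""
    return min(regfile_size // regs_per_thread, hw_thread_limit)

print(max_resident_threads(32))    # 2048 -> full occupancy is possible
print(max_resident_threads(255))   # 257  -> register pressure limits occupancy
```

This is why the jump from 63 to 255 registers per thread on GK110/GK210 is a trade-off: register-heavy kernels avoid spilling to memory, but fewer threads can be resident at once.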