In-Depth Comparison of NVIDIA Tesla “Maxwell” GPU Accelerators

This article provides in-depth details of the NVIDIA Tesla M-series GPU accelerators (codenamed “Maxwell”). “Maxwell” GPUs improve upon the previous-generation “Kepler” architecture, although they do not necessarily replace all “Kepler” models.

Important changes available in the “Maxwell” GPU architecture include:

  • Energy efficiency – Maxwell GPUs deliver nearly twice the power efficiency of Kepler GPUs.
  • SMM architecture – the Maxwell Multiprocessor (SMM) provides power-efficient performance, with 40% higher performance per CUDA core. Each SMM contains 128 CUDA cores (changed from 192 cores in Kepler).
  • Larger, dedicated shared memory (96KB) in each SMM. The L1 cache is now separate from shared memory; on Kepler the two competed for space in a single on-chip pool.
  • Larger L2 caches are available on Maxwell GPUs (ranging from 2MB to 3MB, which is two to four times the size of L2 on Kepler).
  • Reduced latencies on GPU instructions improve utilization and throughput. The throughput of many integer instructions has also been improved.
  • Native shared memory atomics – Maxwell performs atomic operations on shared memory locations in hardware, whereas Kepler had native atomics only for device memory and handled shared memory atomics with a slower lock/update/unlock sequence.
  • The maximum number of active thread blocks per multiprocessor increases from 16 to 32.
  • Dual NVENC H.264 encoders for increased throughput of video workloads. H.265 support is also added.
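
As a rough illustration of the SMM bullet above, the per-multiprocessor trade-off can be worked out from the figures quoted in the list (128 cores with 40% higher per-core performance, against 192 Kepler cores). This is a back-of-the-envelope sketch only: the function name is hypothetical, and real throughput also depends on clock speeds and the workload.

```python
# Back-of-the-envelope comparison of one Maxwell SMM against one Kepler SMX,
# using only the figures quoted in the list above.

KEPLER_CORES_PER_SM = 192
MAXWELL_CORES_PER_SM = 128
MAXWELL_PER_CORE_GAIN = 1.40  # "40% higher performance per CUDA core"


def relative_sm_throughput(cores, per_core_factor=1.0):
    """Throughput of one multiprocessor, in 'Kepler-core equivalents'."""
    return cores * per_core_factor


smx = relative_sm_throughput(KEPLER_CORES_PER_SM)
smm = relative_sm_throughput(MAXWELL_CORES_PER_SM, MAXWELL_PER_CORE_GAIN)

print(f"SMM / SMX throughput ratio: {smm / smx:.2f}")
print(f"SMM / SMX core-count ratio: {MAXWELL_CORES_PER_SM / KEPLER_CORES_PER_SM:.2f}")
```

Under these assumptions an SMM delivers roughly 93% of an SMX's throughput with two-thirds of the cores, which is where much of the efficiency gain comes from.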

“Maxwell” Tesla GPU Specifications

The table below summarizes the features of the available Tesla GPU Accelerators. To learn more about any of these products, or to find out how best to leverage their capabilities, please speak with an HPC expert.

Feature | Tesla M40 | Tesla M60
GPU Chip(s) | Maxwell GM200 | 2x Maxwell GM204
Recommended Workload | Machine Learning & Single-Precision apps | Virtualized Desktops (VDI)
Peak Single Precision (GPU Boost) | 6.84 TFLOPS | 9.64 TFLOPS (both GPUs combined)
Peak Double Precision (GPU Boost) | 0.213 TFLOPS | 0.301 TFLOPS (both GPUs combined)
Onboard GDDR5 Memory (1) | 12 GB or 24 GB | 16 GB (8 GB per GPU)
Memory Bandwidth (1) | 288 GB/s | 160 GB/s per GPU
L2 Cache | 3 MB | 2 MB per GPU
PCI-Express Generation | 3.0 | 3.0
Achievable PCI-E transfer bandwidth | 12 GB/s | 12 GB/s
# of SMM Units | 24 | 32 (16 per GPU)
# of CUDA Cores | 3072 | 4096 (2048 per GPU)
Memory Clock | 3004 MHz | 2505 MHz
GPU Base Clock | 948 MHz | 899 MHz
GPU Boost Support | Yes – Dynamic | Yes – Dynamic
GPU Boost Clocks | 23 levels between 532 MHz and 1114 MHz | 25 levels between 532 MHz and 1177 MHz
Compute Capability | 5.2 | 5.2
Workstation Support | |
Server Support | Yes | Yes
Wattage (TDP) | 250W | 300W

1. Measured with ECC disabled. Memory capacity and performance are reduced by 6.25% with ECC enabled.
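
The peak figures in the table can be approximately reproduced from the core counts and maximum boost clocks. The sketch below assumes two floating-point operations per CUDA core per cycle (one fused multiply-add) and a 1/32 double-precision rate; neither assumption is stated in this article, and the helper names are made up for illustration. It also applies the 6.25% ECC reduction from footnote 1.

```python
# Approximate the table's peak-throughput figures and the ECC penalty.
# Assumptions (not from the article): 2 FLOPs per core per cycle (one FMA)
# and a double-precision rate of 1/32 of single precision.

FLOPS_PER_CORE_PER_CYCLE = 2
DP_RATIO = 1 / 32
ECC_OVERHEAD = 0.0625  # footnote 1: 6.25% of capacity and bandwidth


def peak_tflops(cuda_cores, boost_mhz):
    """Peak single-precision TFLOPS at the given boost clock."""
    return cuda_cores * FLOPS_PER_CORE_PER_CYCLE * boost_mhz * 1e6 / 1e12


def with_ecc(value):
    """Capacity or bandwidth remaining once ECC is enabled."""
    return value * (1 - ECC_OVERHEAD)


# Core counts and top boost clocks taken from the table above.
for name, cores, boost_mhz in [("Tesla M40", 3072, 1114),
                               ("Tesla M60 (both GPUs)", 4096, 1177)]:
    sp = peak_tflops(cores, boost_mhz)
    print(f"{name}: {sp:.2f} TFLOPS single, {sp * DP_RATIO:.3f} TFLOPS double")

# ECC impact on the 12 GB Tesla M40.
print(f"M40 with ECC: {with_ecc(12):.2f} GB, {with_ecc(288):.0f} GB/s")
```

The single-precision results match the table after rounding, and the double-precision results land within rounding distance of it, which suggests the published peaks are simply cores × 2 × boost clock.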

Comparison between “Kepler” and “Maxwell” GPU Architectures

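Feature | Kepler GK104 | Kepler GK110(b) | Kepler GK210 | Maxwell GM200 | Maxwell GM204
Compute Capability | 3.0 | 3.5 | 3.7 | 5.2 | 5.2
Threads per Warp | 32 | 32 | 32 | 32 | 32
Max Warps per SM | 64 | 64 | 64 | 64 | 64
Max Threads per SM | 2048 | 2048 | 2048 | 2048 | 2048
Max Thread Blocks per SM | 16 | 16 | 16 | 32 | 32
32-bit Registers per SM | 64 K | 64 K | 128 K | 64 K | 64 K
Max Registers per Thread Block | 64 K | 64 K | 64 K | 64 K | 64 K
Max Registers per Thread | 63 | 255 | 255 | 255 | 255
Max Threads per Thread Block | 1024 | 1024 | 1024 | 1024 | 1024
L1 Cache Configuration | split with shared memory | split with shared memory | split with shared memory | 24KB dedicated L1 cache | 24KB dedicated L1 cache
Shared Memory Configurations | 16KB + 48KB L1 Cache; 32KB + 32KB L1 Cache; 48KB + 16KB L1 Cache (64KB total) | 16KB + 48KB L1 Cache; 32KB + 32KB L1 Cache; 48KB + 16KB L1 Cache (64KB total) | 16KB + 112KB L1 Cache; 32KB + 96KB L1 Cache; 48KB + 80KB L1 Cache (128KB total) | 96KB dedicated | 96KB dedicated
Max Shared Memory per Thread Block | 48KB | 48KB | 48KB | 48KB | 48KB
Max X Grid Dimension | 2^31-1 | 2^31-1 | 2^31-1 | 2^31-1 | 2^31-1
Hyper-Q | No | Yes | Yes | Yes | Yes
Dynamic Parallelism | No | Yes | Yes | Yes | Yes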
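
To see how the per-SM limits in the table interact, the following sketch estimates how many thread blocks can be resident on one multiprocessor for a given kernel configuration. It is a simplified model, not NVIDIA's occupancy calculator: it ignores register and shared-memory allocation granularity, the `LIMITS` dictionary and `resident_blocks` function are hypothetical names, and the Kepler entry assumes the 48KB shared memory configuration.

```python
# Simplified estimate of resident thread blocks per SM, using limits from the
# table above. Allocation granularity is ignored, so treat results as an
# upper bound rather than an exact occupancy figure.

WARP_SIZE = 32

LIMITS = {
    # shared_kb: largest shared-memory capacity per SM listed in the table
    "Kepler GK110": {"max_blocks": 16, "max_warps": 64,
                     "regs": 64 * 1024, "shared_kb": 48},
    "Maxwell GM200": {"max_blocks": 32, "max_warps": 64,
                      "regs": 64 * 1024, "shared_kb": 96},
}


def resident_blocks(limits, threads_per_block, regs_per_thread,
                    shared_kb_per_block):
    """Blocks that fit on one SM under the block, warp, register and
    shared-memory limits (whichever is most restrictive)."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    by_warps = limits["max_warps"] // warps_per_block
    by_regs = limits["regs"] // (regs_per_thread * warps_per_block * WARP_SIZE)
    if shared_kb_per_block > 0:
        by_shared = limits["shared_kb"] // shared_kb_per_block
    else:
        by_shared = limits["max_blocks"]  # shared memory is not a constraint
    return min(limits["max_blocks"], by_warps, by_regs, by_shared)


# Example kernel: 64 threads per block, 32 registers per thread, 4 KB shared.
for gpu, limits in LIMITS.items():
    blocks = resident_blocks(limits, 64, 32, 4)
    occupancy = blocks * 2 / limits["max_warps"]  # 2 warps per 64-thread block
    print(f"{gpu}: {blocks} blocks resident, {occupancy:.0%} of max warps")
```

For small thread blocks like this one, the higher block limit and larger shared memory on Maxwell are what allow more warps to be resident; with larger blocks the register file tends to become the binding limit on both architectures.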

Additional Tesla “Maxwell” GPU products

NVIDIA has also released the Tesla M4, Tesla M6, and Tesla M10 GPUs. These products target embedded and hyperscale deployments and are not expected to be used in the HPC space.
