In-Depth Comparison of NVIDIA Tesla “Maxwell” GPU Accelerators

Eliot Eshelman

March 4, 2016

This article provides in-depth details of the NVIDIA Tesla M-series GPU accelerators (codenamed “Maxwell”). “Maxwell” GPUs improve upon the previous-generation “Kepler” architecture, although they do not necessarily replace all “Kepler” models.

Important changes available in the “Maxwell” GPU architecture include:

Energy-efficiency – Maxwell GPUs deliver nearly twice the power-efficiency of Kepler GPUs.
SMM architecture – the Maxwell Multiprocessor (SMM) provides power-efficient performance, with 40% higher performance per CUDA core. Each SMM contains 128 CUDA cores (changed from 192 cores in Kepler).
Larger, dedicated shared memory in each SMM. The L1 cache is now separate from Shared Memory (they competed for space on Kepler).
Larger L2 caches are available on Maxwell GPUs (ranging from 2MB to 3MB, which is two to four times the size of L2 on Kepler).
Reduced latencies on GPU instructions improve utilization and throughput. Furthermore, the throughput of many Integer instructions has been improved.
Native shared memory atomics – Maxwell adds hardware support for atomic operations on locations in shared memory, complementing the device memory atomics available on Kepler.
Maximum active thread blocks are increased from 16 to 32 per SMM.
Dual NVENC H.264 encoders for increased throughput of video workloads. H.265 support is also added.
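
To illustrate why the higher thread-block limit matters, the sketch below estimates how many threads can be resident on one multiprocessor when only the block-count and thread-count limits are considered. It uses the limit of 2048 threads per multiprocessor listed in the architecture comparison table later in this article. The function name `resident_threads` is hypothetical, and register and shared memory limits are deliberately ignored.

```python
def resident_threads(threads_per_block, max_blocks_per_sm, max_threads_per_sm=2048):
    """Estimate how many threads can be resident on one multiprocessor.

    Simplified model: only the block-count and thread-count limits apply.
    Register and shared memory usage are ignored (an assumption of this sketch).
    """
    blocks_by_threads = max_threads_per_sm // threads_per_block
    blocks = min(max_blocks_per_sm, blocks_by_threads)
    return blocks * threads_per_block


# A kernel launched with small, 64-thread blocks
kepler = resident_threads(64, max_blocks_per_sm=16)   # limited by the 16-block cap
maxwell = resident_threads(64, max_blocks_per_sm=32)  # limited by the 2048-thread cap

print(f"Kepler:  {kepler} threads ({kepler / 2048:.0%} of the 2048-thread limit)")
print(f"Maxwell: {maxwell} threads ({maxwell / 2048:.0%} of the 2048-thread limit)")
```

Under this simplified model, kernels that use small thread blocks are the ones most likely to benefit; with 128 or more threads per block, the 16-block limit already allows 2048 resident threads.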

“Maxwell” Tesla GPU Specifications

The table below summarizes the features of the available Tesla GPU Accelerators. To learn more about any of these products, or to find out how best to leverage their capabilities, please speak with an HPC expert.

Feature	Tesla M40	Tesla M60
GPU Chip(s)	Maxwell GM200	2x Maxwell GM204
Recommended Workload	Machine Learning & Single-Precision apps	Virtualized Desktops (VDI)
Peak Single Precision (GPU Boost)	6.84 TFLOPS	9.64 TFLOPS (both GPUs combined)
Peak Double Precision (GPU Boost)	0.213 TFLOPS	0.301 TFLOPS (both GPUs combined)
Onboard GDDR5 Memory¹	12 GB or 24 GB	16 GB (8 GB per GPU)
Memory Bandwidth¹	288 GB/s	160 GB/s per GPU
L2 Cache	3 MB	2 MB per GPU
PCI-Express Generation	3.0	3.0
Achievable PCI-E transfer bandwidth	12 GB/s	12 GB/s
# of SMM Units	24	32 (16 per GPU)
# of CUDA Cores	3072	4096 (2048 per GPU)
Memory Clock	3004 MHz	2505 MHz
GPU Base Clock	948 MHz	899 MHz
GPU Boost Support	Yes – Dynamic	Yes – Dynamic
GPU Boost Clocks	23 levels between 532 MHz and 1114 MHz	25 levels between 532 MHz and 1177 MHz
Compute Capability	5.2	5.2
Workstation Support	–	–
Server Support	Yes	Yes
Wattage (TDP)	250W	300W

1. Measured with ECC disabled. Memory capacity and performance are reduced by 6.25% with ECC enabled.
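
The peak throughput figures in the table can be reproduced from the core counts and the top GPU Boost clocks. The sketch below assumes two floating-point operations per CUDA core per clock (one fused multiply-add) and a double-precision rate of 1/32 of single precision, which is the ratio implied by the table's own figures. The helper names `peak_tflops` and `with_ecc` are hypothetical, and the results agree with the table only to within rounding.

```python
def peak_tflops(cuda_cores, boost_clock_mhz, flops_per_core_per_clock=2):
    """Peak throughput in TFLOPS: cores x clock x FLOPs per clock (one FMA = 2 FLOPs)."""
    return cuda_cores * boost_clock_mhz * 1e6 * flops_per_core_per_clock / 1e12


def with_ecc(value, ecc_overhead=0.0625):
    """Apply the 6.25% ECC reduction quoted in the table footnote."""
    return value * (1.0 - ecc_overhead)


m40_sp = peak_tflops(3072, 1114)  # Tesla M40: one GM200 at its top boost clock
m60_sp = peak_tflops(4096, 1177)  # Tesla M60: both GM204 GPUs combined

# Assumption: double precision runs at 1/32 of the single-precision rate.
m40_dp = m40_sp / 32
m60_dp = m60_sp / 32

print(f"M40: {m40_sp:.2f} TFLOPS single precision, {m40_dp:.3f} TFLOPS double precision")
print(f"M60: {m60_sp:.2f} TFLOPS single precision, {m60_dp:.3f} TFLOPS double precision")

# Effect of enabling ECC on the 12 GB Tesla M40
print(f"M40 with ECC: {with_ecc(12):.2f} GB usable, {with_ecc(288):.0f} GB/s")
```

These are theoretical peaks at the highest boost clock; sustained application performance depends on the clocks the GPU can actually hold under load.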

Comparison between “Kepler” and “Maxwell” GPU Architectures

Feature	Kepler GK104	Kepler GK110(b)	Kepler GK210	Maxwell GM200
Compute Capability	3.0	3.5	3.7	5.2
Threads per Warp	32	32	32	32
Max Warps per SM	64	64	64	64
Max Threads per SM	2048	2048	2048	2048
Max Thread Blocks per SM	16	16	16	32
32-bit Registers per SM	64 K	64 K	128 K	64 K
Max Registers per Thread Block	64 K	64 K	64 K	64 K
Max Registers per Thread	63	255	255	255
Max Threads per Thread Block	1024	1024	1024	1024
L1 Cache Configuration	split with shared memory	split with shared memory	split with shared memory	24KB dedicated L1 cache
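Shared Memory Configurations (shared memory + L1 cache)	16KB + 48KB L1 Cache; 32KB + 32KB L1 Cache; 48KB + 16KB L1 Cache (64KB total)	16KB + 48KB L1 Cache; 32KB + 32KB L1 Cache; 48KB + 16KB L1 Cache (64KB total)	80KB + 48KB L1 Cache; 96KB + 32KB L1 Cache; 112KB + 16KB L1 Cache (128KB total)	96KB dedicated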
Max Shared Memory per Thread Block	48KB	48KB	48KB	48KB
Max X Grid Dimension	2^31-1	2^31-1	2^31-1	2^31-1
Hyper-Q	No	Yes	Yes	Yes
Dynamic Parallelism	No	Yes	Yes	Yes
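
The per-multiprocessor limits in this table translate directly into resource budgets for a kernel. The sketch below works through two of them: the number of registers each thread may use while still allowing 2048 resident threads, and the number of thread blocks that fit when shared memory is the limiting resource. The function names and the 24KB-per-block kernel are hypothetical, and hardware allocation granularity is ignored.

```python
KB = 1024


def registers_per_thread_at_full_occupancy(registers_per_sm, max_threads_per_sm=2048):
    """Largest per-thread register count that still allows max_threads_per_sm resident threads."""
    return registers_per_sm // max_threads_per_sm


def blocks_limited_by_shared_memory(shared_kb_per_sm, shared_kb_per_block):
    """Number of thread blocks whose shared memory fits on one multiprocessor."""
    return shared_kb_per_sm // shared_kb_per_block


# Register file sizes (32-bit registers per multiprocessor) from the table
register_files = [
    ("Kepler GK110(b)", 64 * KB),
    ("Kepler GK210", 128 * KB),
    ("Maxwell GM200", 64 * KB),
]
for name, regs in register_files:
    budget = registers_per_thread_at_full_occupancy(regs)
    print(f"{name}: {budget} registers per thread at 2048 resident threads")

# A hypothetical kernel whose blocks each use 24KB of shared memory
kepler_blocks = blocks_limited_by_shared_memory(48, 24)   # GK104/GK110 in the 48KB shared memory configuration
maxwell_blocks = blocks_limited_by_shared_memory(96, 24)  # GM200 with 96KB of dedicated shared memory
print(f"Blocks per multiprocessor at 24KB each: Kepler {kepler_blocks}, Maxwell {maxwell_blocks}")
```

The 48KB limit per thread block is unchanged on Maxwell, so the larger shared memory helps by fitting more blocks on each multiprocessor rather than by allowing a single block to use more.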

Additional Tesla “Maxwell” GPU products

NVIDIA has also released the Tesla M4, Tesla M6, and Tesla M10 GPUs. These products are primarily intended for embedded and hyperscale deployments and are not expected to be used in the HPC space.
