Important changes available in the “Pascal” GPU architecture include:
- Exceptional performance: up to 5.3 TFLOPS double-precision and 10.6 TFLOPS single-precision floating-point throughput.
- NVLink enables a 5X increase in bandwidth between Tesla Pascal GPUs and from GPUs to supported system CPUs (compared with PCI-E).
- High-bandwidth HBM2 memory provides a 3X improvement in memory performance compared to Kepler and Maxwell GPUs.
- Pascal Unified Memory allows GPU applications to directly access the memory of all GPUs as well as all of system memory (up to 512TB).
- Up to 4MB L2 caches are available on Pascal GPUs (compared to 1.5MB on Kepler and 3MB on Maxwell).
- Native ECC Memory detects and corrects memory errors without any capacity or performance overhead.
- Energy-efficiency – Pascal GPUs deliver nearly twice the FLOPS per Watt as Kepler GPUs.
- Efficient SM units – each Pascal GP100 SM contains fewer CUDA cores than its Maxwell counterpart, effectively doubling the register resources available to each thread.
- Improved atomics – Pascal adds a hardware atomic add instruction for double-precision (FP64) values in global memory (earlier GPUs had to emulate FP64 atomicAdd with compare-and-swap loops). Atomics can also target memory on other GPUs in the system.
- Half-precision (FP16) support improves performance for low-precision floating-point operations, frequently used in neural network training.
- INT8 support improves performance for low-precision integer operations, frequently used in neural network inference.
- Compute Preemption allows higher-priority tasks to interrupt currently running tasks at instruction-level granularity.
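Pascal's Unified Memory model can be exercised with a single managed allocation that both CPU and GPU dereference through the same pointer. A minimal sketch (the kernel and array size are illustrative, not from the original article):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Simple kernel: each thread increments one element.
__global__ void increment(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *data = nullptr;

    // One allocation visible to both CPU and GPU. On Pascal, pages migrate
    // on demand, and the allocation may exceed the GPU's physical memory.
    cudaMallocManaged(&data, n * sizeof(float));

    for (int i = 0; i < n; ++i) data[i] = 0.0f;   // CPU writes directly

    increment<<<(n + 255) / 256, 256>>>(data, n); // GPU uses the same pointer
    cudaDeviceSynchronize();                      // wait before the CPU reads

    printf("data[0] = %.1f\n", data[0]);
    cudaFree(data);
    return 0;
}
```

No explicit `cudaMemcpy` calls appear anywhere: the driver migrates pages between host and device as each processor touches them.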
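The doubled FP16 rate on GP100 is reached by packing two half-precision values into each 32-bit register and operating on them with the `half2` intrinsics from `cuda_fp16.h`. A hypothetical AXPY sketch (requires compute capability 5.3+, e.g. `nvcc -arch=sm_60`; all names and sizes here are illustrative):

```cuda
#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Fill an array of packed half2 values (conversion done on the device,
// where the FP16 intrinsics are guaranteed to be available).
__global__ void fill(__half2 *p, float v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = __float2half2_rn(v);
}

// y = a*x + y: one __hfma2 performs two FP16 fused multiply-adds,
// which is how GP100 reaches twice its FP32 rate.
__global__ void axpy_half2(float a, const __half2 *x, __half2 *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    __half2 a2 = __float2half2_rn(a);   // broadcast a into both halves
    if (i < n) y[i] = __hfma2(a2, x[i], y[i]);
}

// Unpack the low half of each element so the host can inspect results.
__global__ void unpack(const __half2 *p, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = __low2float(p[i]);
}

int main() {
    const int n = 1024;                 // n half2 elements = 2n FP16 values
    __half2 *x, *y;
    float *check;
    cudaMalloc(&x, n * sizeof(__half2));
    cudaMalloc(&y, n * sizeof(__half2));
    cudaMallocManaged(&check, n * sizeof(float));

    const int g = (n + 255) / 256, b = 256;
    fill<<<g, b>>>(x, 1.0f, n);
    fill<<<g, b>>>(y, 2.0f, n);
    axpy_half2<<<g, b>>>(3.0f, x, y, n);   // y = 3*1 + 2
    unpack<<<g, b>>>(y, check, n);
    cudaDeviceSynchronize();
    printf("y[0] = %.1f\n", check[0]);
    return 0;
}
```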
Tesla “Pascal” GPU Specifications
The table below summarizes the features of the available Tesla Pascal GPU Accelerators. To learn more about any of these products, or to find out how best to leverage their capabilities, please speak with an HPC expert.
HPC Applications
Feature | Tesla P100 SXM2 16GB | Tesla P100 PCI-E 16GB | Tesla P100 PCI-E 12GB
---|---|---|---
GPU Chip(s) | Pascal GP100 | Pascal GP100 | Pascal GP100
Integer Operations (INT8)* | – | – | –
Half Precision (FP16)* | 21.2 TFLOPS | 18.7 TFLOPS | 18.7 TFLOPS
Single Precision (FP32)* | 10.6 TFLOPS | 9.3 TFLOPS | 9.3 TFLOPS
Double Precision (FP64)* | 5.3 TFLOPS | 4.7 TFLOPS | 4.7 TFLOPS
On-package HBM2 Memory | 16GB | 16GB | 12GB
Memory Bandwidth | 732 GB/s | 732 GB/s | 549 GB/s
L2 Cache | 4 MB | 4 MB | 4 MB
Interconnect | NVLink + PCI-E 3.0 | PCI-Express 3.0 | PCI-Express 3.0
Theoretical transfer bandwidth | 80 GB/s | 16 GB/s | 16 GB/s
Achievable transfer bandwidth | ~66 GB/s | ~12 GB/s | ~12 GB/s
# of SM Units | 56 | 56 | 56
# of single-precision CUDA Cores | 3584 | 3584 | 3584
# of double-precision CUDA Cores | 1792 | 1792 | 1792
GPU Base Clock | 1328 MHz | 1126 MHz | 1126 MHz
GPU Boost Support | Yes – Dynamic | Yes – Dynamic | Yes – Dynamic
GPU Boost Clock | 1480 MHz | 1303 MHz | 1303 MHz
Compute Capability | 6.0 | 6.0 | 6.0
Workstation Support | – | – | –
Server Support | Yes | Yes | Yes
Wattage (TDP) | 300W | 250W | 250W
* Measured with GPU Boost enabled
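The gap between theoretical and achievable transfer bandwidth can be checked empirically by timing a large host-to-device copy. A rough sketch (error checking omitted; the 256MB transfer size is an arbitrary choice, large enough to amortize launch and timing overhead):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256UL << 20;   // 256MB transfer
    void *host, *dev;
    cudaMallocHost(&host, bytes);       // pinned host memory is required to
    cudaMalloc(&dev, bytes);            // approach the bus's peak rate

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // bytes / (ms * 1e6) converts B/ms to GB/s
    printf("Host-to-device: %.1f GB/s\n", (double)bytes / ms / 1e6);

    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}
```

On a PCI-E 3.0 x16 link, expect a figure near the ~12 GB/s "achievable" row above rather than the 16 GB/s theoretical value; replacing `cudaMallocHost` with ordinary `malloc` drops throughput further because unpinned transfers are staged through a driver buffer.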
Deep Learning Applications
Feature | Tesla P40 PCI-E 24GB
---|---
GPU Chip(s) | Pascal GP102
Integer Operations (INT8)* | 47 TOPS
Half Precision (FP16)* | –
Single Precision (FP32)* | 12 TFLOPS
Double Precision (FP64)* | –
Onboard GDDR5 Memory | 24GB
Memory Bandwidth | 346 GB/s
L2 Cache | 3 MB
Interconnect | PCI-Express 3.0
Theoretical transfer bandwidth | 16 GB/s
Achievable transfer bandwidth | ~12 GB/s
# of SM Units | 30
# of single-precision CUDA Cores | 3840
GPU Base Clock | 1303 MHz
GPU Boost Support | Yes – Dynamic
GPU Boost Clock | 1531 MHz
Compute Capability | 6.1
Workstation Support | –
Server Support | Yes
Wattage (TDP) | 250W
* Measured with GPU Boost enabled
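The P40's 47 TOPS INT8 figure is exposed through the `__dp4a` intrinsic (compute capability 6.1+), which performs a four-way 8-bit dot product with 32-bit accumulation in a single instruction. A sketch of its use (the kernel shape and test values are illustrative; compile with `nvcc -arch=sm_61`):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Four-way int8 dot product with 32-bit accumulate: each int argument
// packs four signed 8-bit values, and __dp4a issues as one instruction.
__global__ void dot_int8(const int *a, const int *b, int *out, int n) {
    int acc = 0;
    for (int i = 0; i < n; ++i)
        acc = __dp4a(a[i], b[i], acc);
    *out = acc;
}

int main() {
    // 0x01010101 packs the bytes {1,1,1,1}; 0x02020202 packs {2,2,2,2}.
    int ha = 0x01010101, hb = 0x02020202, result = 0;
    int *a, *b, *out;
    cudaMalloc(&a, sizeof(int));
    cudaMalloc(&b, sizeof(int));
    cudaMalloc(&out, sizeof(int));
    cudaMemcpy(a, &ha, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(b, &hb, sizeof(int), cudaMemcpyHostToDevice);

    dot_int8<<<1, 1>>>(a, b, out, 1);   // four 1*2 products accumulated
    cudaMemcpy(&result, out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("dot = %d\n", result);
    return 0;
}
```

Inference frameworks use this instruction under the hood when quantizing networks to 8-bit weights and activations.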
Comparison between “Kepler”, “Maxwell”, and “Pascal” GPU Architectures
Feature | Kepler GK210 | Maxwell GM200 | Maxwell GM204 | Pascal GP100 | Pascal GP102
---|---|---|---|---|---
Compute Capability | 3.7 | 5.2 | 5.2 | 6.0 | 6.1
Threads per Warp | 32 | 32 | 32 | 32 | 32
Max Warps per SM | 64 | 64 | 64 | 64 | 64
Max Threads per SM | 2048 | 2048 | 2048 | 2048 | 2048
Max Thread Blocks per SM | 16 | 32 | 32 | 32 | 32
Max Concurrent Kernels | 32 | 32 | 32 | 128 | 32
32-bit Registers per SM | 128 K | 64 K | 64 K | 64 K | 64 K
Max Registers per Thread Block | 64 K | 64 K | 64 K | 64 K | 64 K
Max Registers per Thread | 255 | 255 | 255 | 255 | 255
Max Threads per Thread Block | 1024 | 1024 | 1024 | 1024 | 1024
L1 Cache Configuration | split with shared memory | 24KB dedicated L1 cache | 24KB dedicated L1 cache | 24KB dedicated L1 cache | 24KB dedicated L1 cache
Shared Memory Configurations | 16KB + 112KB L1, 32KB + 96KB L1, or 48KB + 80KB L1 (128KB total) | 96KB dedicated | 96KB dedicated | 64KB dedicated | 96KB dedicated
Max Shared Memory per Thread Block | 48KB | 48KB | 48KB | 48KB | 48KB
Max X Grid Dimension | 2^31 − 1 | 2^31 − 1 | 2^31 − 1 | 2^31 − 1 | 2^31 − 1
Hyper-Q | Yes | Yes | Yes | Yes | Yes
Dynamic Parallelism | Yes | Yes | Yes | Yes | Yes
For a complete listing of Compute Capabilities, reference the NVIDIA CUDA documentation.
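Many of the per-architecture limits above can also be read at runtime with `cudaGetDeviceProperties`, which is a practical way to confirm which column applies to an installed GPU. For example:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        // Report the fields that map onto the comparison table above.
        printf("%s: compute capability %d.%d, %d SMs, "
               "%zu KB shared memory/block, %d KB L2 cache\n",
               p.name, p.major, p.minor, p.multiProcessorCount,
               p.sharedMemPerBlock / 1024, p.l2CacheSize / 1024);
    }
    return 0;
}
```

On a Tesla P100 this reports compute capability 6.0 with 56 SMs; on a Tesla P40, 6.1 with 30 SMs, matching the specification tables above.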
Additional Tesla “Pascal” GPU products
NVIDIA has also released Tesla P4 GPUs. These GPUs are primarily for embedded and hyperscale deployments, and are not expected to be used in the HPC space.
Hardware-accelerated video encoding and decoding
All NVIDIA “Pascal” GPUs include one or more hardware units for video encoding and decoding (NVENC / NVDEC). For complete hardware details, reference NVIDIA’s encoder/decoder support matrix. To learn more about GPU-accelerated video encode/decode, see NVIDIA’s Video Codec SDK.