In-Depth Comparison of NVIDIA Tesla “Pascal” GPU Accelerators

This article provides in-depth details of the NVIDIA Tesla P-series GPU accelerators (codenamed “Pascal”). “Pascal” GPUs improve upon the previous-generation “Kepler” and “Maxwell” architectures. Pascal GPUs were announced at GTC 2016 and began shipping in September 2016. Note: these have since been superseded by the NVIDIA Volta GPU architecture.

Important changes available in the “Pascal” GPU architecture include:

  • Exceptional performance with up to 5.3 TFLOPS double- and 10.6 TFLOPS single-precision floating-point performance.
  • NVLink enables a 5X increase in bandwidth between Tesla Pascal GPUs and from GPUs to supported system CPUs (compared with PCI-E).
  • High-bandwidth HBM2 memory provides a 3X improvement in memory performance compared to Kepler and Maxwell GPUs.
  • Pascal Unified Memory allows GPU applications to directly access the memory of all GPUs as well as all of system memory (up to 512TB).
  • Up to 4MB L2 caches are available on Pascal GPUs (compared to 1.5MB on Kepler and 3MB on Maxwell).
  • Native ECC Memory detects and corrects memory errors without any capacity or performance overhead.
  • Energy-efficiency – Pascal GPUs deliver nearly twice the FLOPS per Watt as Kepler GPUs.
  • Efficient SM units – Pascal’s architecture doubles the number of registers available per thread.
  • Improved atomics in Pascal add a native double-precision atomic add instruction in global memory (previous GPUs had to emulate FP64 atomic adds with compare-and-swap loops). Atomics can also be performed within the memory of other GPUs in the system.
  • Half-precision FP support improves performance for low-precision operations (frequently used in neural network training); full-rate FP16 is available on GP100-based GPUs.
  • INT8 support improves performance for low-precision integer operations (frequently used in neural network inference); it is available on GP102-based GPUs such as the Tesla P40.
  • Compute Preemption allows higher-priority tasks to interrupt currently-running tasks.
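The half-precision point above can be illustrated without any GPU: Python’s `struct` module understands the same IEEE 754 binary16 layout that FP16 hardware uses (format code `e`). This is a sketch of the number format only, not of Pascal’s FP16 arithmetic units:

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision (binary16)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

print(to_fp16(1.0))      # exactly representable in FP16
print(to_fp16(0.1))      # rounded: FP16 has only a 10-bit mantissa
print(to_fp16(65504.0))  # the largest finite FP16 value
```

The reduced mantissa and range are why FP16 is generally paired with careful scaling (or FP32 accumulation) in neural-network training.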

Tesla “Pascal” GPU Specifications

The table below summarizes the features of the available Tesla Pascal GPU Accelerators. To learn more about any of these products, or to find out how best to leverage their capabilities, please speak with an HPC expert.

HPC Applications

| Feature | Tesla P100 SXM2 16GB | Tesla P100 PCI-E 16GB | Tesla P100 PCI-E 12GB |
|---|---|---|---|
| GPU Chip(s) | Pascal GP100 | Pascal GP100 | Pascal GP100 |
| Integer Operations (INT8)* | — | — | — |
| Half Precision (FP16)* | 21.2 TFLOPS | 18.7 TFLOPS | 18.7 TFLOPS |
| Single Precision (FP32)* | 10.6 TFLOPS | 9.3 TFLOPS | 9.3 TFLOPS |
| Double Precision (FP64)* | 5.3 TFLOPS | 4.7 TFLOPS | 4.7 TFLOPS |
| On-die HBM2 Memory | 16GB | 16GB | 12GB |
| Memory Bandwidth | 732 GB/s | 732 GB/s | 549 GB/s |
| L2 Cache | 4 MB | 4 MB | 4 MB |
| Interconnect | NVLink + PCI-E 3.0 | PCI-Express 3.0 | PCI-Express 3.0 |
| Theoretical transfer bandwidth | 80 GB/s | 16 GB/s | 16 GB/s |
| Achievable transfer bandwidth | ~66 GB/s | ~12 GB/s | ~12 GB/s |
| # of SM Units | 56 | 56 | 56 |
| # of single-precision CUDA Cores | 3584 | 3584 | 3584 |
| # of double-precision CUDA Cores | 1792 | 1792 | 1792 |
| GPU Base Clock | 1328 MHz | 1126 MHz | 1126 MHz |
| GPU Boost Support | Yes – Dynamic | Yes – Dynamic | Yes – Dynamic |
| GPU Boost Clock | 1480 MHz | 1303 MHz | 1303 MHz |
| Compute Capability | 6.0 | 6.0 | 6.0 |
| Workstation Support | — | — | — |
| Server Support | yes | yes | yes |
| Wattage (TDP) | 300W | 250W | 250W |

* Measured with GPU Boost enabled
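The peak figures in the table follow directly from the core counts and boost clocks: each CUDA core executes one fused multiply-add (two floating-point operations) per clock, and GP100’s FP16 path issues two half-precision FMAs per FP32 core per clock. A quick sanity check (the helper name is ours, for illustration):

```python
def peak_tflops(cores: int, boost_clock_mhz: float, ops_per_core_per_clock: int = 2) -> float:
    """Theoretical peak throughput in TFLOPS: cores x ops/clock x clock rate."""
    return cores * ops_per_core_per_clock * boost_clock_mhz * 1e6 / 1e12

# Tesla P100 SXM2: 3584 FP32 cores, 1792 FP64 cores, 1480 MHz boost clock
fp64 = peak_tflops(1792, 1480)     # ~5.3 TFLOPS
fp32 = peak_tflops(3584, 1480)     # ~10.6 TFLOPS
fp16 = peak_tflops(3584, 1480, 4)  # ~21.2 TFLOPS (2 FP16 FMAs per core per clock)
print(f"FP64 {fp64:.1f}, FP32 {fp32:.1f}, FP16 {fp16:.1f} TFLOPS")
```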

Deep Learning Applications

| Feature | Tesla P40 PCI-E 24GB |
|---|---|
| GPU Chip(s) | Pascal GP102 |
| Integer Operations (INT8)* | 47 TOPS |
| Half Precision (FP16)* | — |
| Single Precision (FP32)* | 12 TFLOPS |
| Double Precision (FP64)* | — |
| Onboard GDDR5 Memory | 24GB |
| Memory Bandwidth | 346 GB/s |
| L2 Cache | 3 MB |
| Interconnect | PCI-Express 3.0 |
| Theoretical transfer bandwidth | 16 GB/s |
| Achievable transfer bandwidth | ~12 GB/s |
| # of SM Units | 30 |
| # of single-precision CUDA Cores | 3840 |
| GPU Base Clock | 1303 MHz |
| GPU Boost Support | Yes – Dynamic |
| GPU Boost Clock | 1531 MHz |
| Compute Capability | 6.1 |
| Workstation Support | — |
| Server Support | yes |
| Wattage (TDP) | 250W |

* Measured with GPU Boost enabled
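The P40’s 47 TOPS INT8 figure reflects the dp4a instruction, which performs a four-element 8-bit dot product with 32-bit accumulate, i.e. eight integer operations per CUDA core per clock. A rough check (the helper name is ours, for illustration):

```python
def peak_tops(cores: int, boost_clock_mhz: float, ops_per_core_per_clock: int) -> float:
    """Theoretical peak throughput in tera-operations per second."""
    return cores * ops_per_core_per_clock * boost_clock_mhz * 1e6 / 1e12

# Tesla P40: 3840 CUDA cores at 1531 MHz boost clock
fp32 = peak_tops(3840, 1531, 2)  # FMA = 2 FLOPs/clock -> ~11.8, listed as 12 TFLOPS
int8 = peak_tops(3840, 1531, 8)  # dp4a = 8 integer ops/clock -> ~47 TOPS
print(f"FP32 {fp32:.1f} TFLOPS, INT8 {int8:.1f} TOPS")
```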


Comparison between “Kepler”, “Maxwell”, and “Pascal” GPU Architectures

| Feature | Kepler GK210 | Maxwell GM200 | Maxwell GM204 | Pascal GP100 | Pascal GP102 |
|---|---|---|---|---|---|
| Compute Capability | 3.7 | 5.2 | 5.2 | 6.0 | 6.1 |
| Threads per Warp | 32 | 32 | 32 | 32 | 32 |
| Max Warps per SM | 64 | 64 | 64 | 64 | 64 |
| Max Threads per SM | 2048 | 2048 | 2048 | 2048 | 2048 |
| Max Thread Blocks per SM | 16 | 32 | 32 | 32 | 32 |
| Max Concurrent Kernels | 32 | 32 | 32 | 128 | 32 |
| 32-bit Registers per SM | 128 K | 64 K | 64 K | 64 K | 64 K |
| Max Registers per Thread Block | 64 K | 64 K | 64 K | 64 K | 64 K |
| Max Registers per Thread | 255 | 255 | 255 | 255 | 255 |
| Max Threads per Thread Block | 1024 | 1024 | 1024 | 1024 | 1024 |
| L1 Cache Configuration | split with shared memory | 24KB dedicated | 24KB dedicated | 24KB dedicated | 24KB dedicated |
| Shared Memory Configurations | 16KB + 112KB L1, 32KB + 96KB L1, or 48KB + 80KB L1 (128KB total) | 96KB dedicated | 96KB dedicated | 64KB dedicated | 96KB dedicated |
| Max Shared Memory per Thread Block | 48KB | 48KB | 48KB | 48KB | 48KB |
| Max X Grid Dimension | 2^32 − 1 | 2^32 − 1 | 2^32 − 1 | 2^32 − 1 | 2^32 − 1 |
| Hyper-Q | Yes | Yes | Yes | Yes | Yes |
| Dynamic Parallelism | Yes | Yes | Yes | Yes | Yes |

For a complete listing of Compute Capabilities, reference the NVIDIA CUDA documentation.
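The per-SM limits above determine kernel occupancy. As a simplified sketch, using the Maxwell/Pascal values of 64K registers and 64 warps per SM (and ignoring register allocation granularity, shared-memory limits, and block-count limits):

```python
def max_warps_per_sm(regs_per_thread: int,
                     regs_per_sm: int = 64 * 1024,
                     max_warps: int = 64,
                     warp_size: int = 32) -> int:
    """Estimate the warp-occupancy limit imposed by register usage alone."""
    register_limit = regs_per_sm // (regs_per_thread * warp_size)
    return min(register_limit, max_warps)

print(max_warps_per_sm(32))   # registers are not the bottleneck: full 64 warps
print(max_warps_per_sm(128))  # heavy register use cuts occupancy to 16 warps
print(max_warps_per_sm(255))  # the per-thread maximum leaves only 8 warps
```

Kepler GK210’s doubled register file (128 K per SM) relaxes exactly this limit, which is why register-heavy HPC kernels could reach higher occupancy on it.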


Additional Tesla “Pascal” GPU products

NVIDIA has also released Tesla P4 GPUs. These GPUs are primarily for embedded and hyperscale deployments, and are not expected to be used in the HPC space.

Hardware-accelerated video encoding and decoding

All NVIDIA “Pascal” GPUs include one or more hardware units for video encoding and decoding (NVENC / NVDEC). For complete hardware details, reference NVIDIA’s encoder/decoder support matrix. To learn more about GPU-accelerated video encode/decode, see NVIDIA’s Video Codec SDK.
