Important changes available in the “Pascal” GPU architecture include:
- Exceptional performance: up to 5.3 TFLOPS double-precision and 10.6 TFLOPS single-precision floating-point throughput.
- NVLink enables a 5X increase in bandwidth between Tesla Pascal GPUs and from GPUs to supported system CPUs (compared with PCI-E).
- High-bandwidth HBM2 memory provides a 3X improvement in memory performance compared to Kepler and Maxwell GPUs.
- Pascal Unified Memory allows GPU applications to directly access the memory of all GPUs as well as all of system memory (up to 512TB).
- Up to 4MB L2 caches are available on Pascal GPUs (compared to 1.5MB on Kepler and 3MB on Maxwell).
- Native ECC Memory detects and corrects memory errors without any capacity or performance overhead.
- Energy-efficiency – Pascal GPUs deliver nearly twice the FLOPS per Watt as Kepler GPUs.
- Efficient SM units – each Pascal GP100 SM contains fewer CUDA cores than its Maxwell counterpart, effectively doubling the register resources available to each thread.
- Improved atomics – Pascal adds a hardware atomic add instruction for double-precision (FP64) values in global memory (earlier GPUs had to emulate FP64 atomicAdd with compare-and-swap loops). Atomics can also target memory on other GPUs in the system.
- Half-precision (FP16) support improves performance for low-precision floating-point operations, frequently used in neural network training.
- INT8 support improves performance for low-precision integer operations, frequently used in neural network inference.
- Compute Preemption allows higher-priority tasks to interrupt currently running tasks at instruction-level granularity.
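Pascal's Unified Memory model can be exercised with a single managed allocation that both CPU and GPU dereference through the same pointer. A minimal sketch (the kernel and array size are illustrative, not from the original article):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Simple kernel: each thread increments one element.
__global__ void increment(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *data = nullptr;

    // One allocation visible to both CPU and GPU. On Pascal, pages migrate
    // on demand, and the allocation may exceed the GPU's physical memory.
    cudaMallocManaged(&data, n * sizeof(float));

    for (int i = 0; i < n; ++i) data[i] = 0.0f;   // CPU writes directly

    increment<<<(n + 255) / 256, 256>>>(data, n); // GPU uses the same pointer
    cudaDeviceSynchronize();                      // wait before the CPU reads

    printf("data[0] = %.1f\n", data[0]);
    cudaFree(data);
    return 0;
}
```

No explicit `cudaMemcpy` calls appear anywhere: the driver migrates pages between host and device as each processor touches them.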
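The doubled FP16 rate on GP100 is reached by packing two half-precision values into each 32-bit register and operating on them with the `half2` intrinsics from `cuda_fp16.h`. A hypothetical AXPY sketch (requires compute capability 5.3+, e.g. `nvcc -arch=sm_60`; all names and sizes here are illustrative):

```cuda
#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Fill an array of packed half2 values (conversion done on the device,
// where the FP16 intrinsics are guaranteed to be available).
__global__ void fill(__half2 *p, float v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = __float2half2_rn(v);
}

// y = a*x + y: one __hfma2 performs two FP16 fused multiply-adds,
// which is how GP100 reaches twice its FP32 rate.
__global__ void axpy_half2(float a, const __half2 *x, __half2 *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    __half2 a2 = __float2half2_rn(a);   // broadcast a into both halves
    if (i < n) y[i] = __hfma2(a2, x[i], y[i]);
}

// Unpack the low half of each element so the host can inspect results.
__global__ void unpack(const __half2 *p, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = __low2float(p[i]);
}

int main() {
    const int n = 1024;                 // n half2 elements = 2n FP16 values
    __half2 *x, *y;
    float *check;
    cudaMalloc(&x, n * sizeof(__half2));
    cudaMalloc(&y, n * sizeof(__half2));
    cudaMallocManaged(&check, n * sizeof(float));

    const int g = (n + 255) / 256, b = 256;
    fill<<<g, b>>>(x, 1.0f, n);
    fill<<<g, b>>>(y, 2.0f, n);
    axpy_half2<<<g, b>>>(3.0f, x, y, n);   // y = 3*1 + 2
    unpack<<<g, b>>>(y, check, n);
    cudaDeviceSynchronize();
    printf("y[0] = %.1f\n", check[0]);
    return 0;
}
```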
Tesla “Pascal” GPU Specifications
The table below summarizes the features of the available Tesla Pascal GPU Accelerators. To learn more about any of these products, or to find out how best to leverage their capabilities, please speak with an HPC expert.
HPC Applications
Feature | Tesla P100 SXM2 16GB | Tesla P100 PCI-E 16GB | Tesla P100 PCI-E 12GB
---|---|---|---
GPU Chip(s) | Pascal GP100 | Pascal GP100 | Pascal GP100
Integer Operations (INT8)* | – | – | –
Half Precision (FP16)* | 21.2 TFLOPS | 18.7 TFLOPS | 18.7 TFLOPS
Single Precision (FP32)* | 10.6 TFLOPS | 9.3 TFLOPS | 9.3 TFLOPS
Double Precision (FP64)* | 5.3 TFLOPS | 4.7 TFLOPS | 4.7 TFLOPS
On-package HBM2 Memory | 16GB | 16GB | 12GB
Memory Bandwidth | 732 GB/s | 732 GB/s | 549 GB/s
L2 Cache | 4 MB | 4 MB | 4 MB
Interconnect | NVLink + PCI-E 3.0 | PCI-Express 3.0 | PCI-Express 3.0
Theoretical transfer bandwidth | 80 GB/s | 16 GB/s | 16 GB/s
Achievable transfer bandwidth | ~66 GB/s | ~12 GB/s | ~12 GB/s
# of SM Units | 56 | 56 | 56
# of single-precision CUDA Cores | 3584 | 3584 | 3584
# of double-precision CUDA Cores | 1792 | 1792 | 1792
GPU Base Clock | 1328 MHz | 1126 MHz | 1126 MHz
GPU Boost Support | Yes – Dynamic | Yes – Dynamic | Yes – Dynamic
GPU Boost Clock | 1480 MHz | 1303 MHz | 1303 MHz
Compute Capability | 6.0 | 6.0 | 6.0
Workstation Support | – | – | –
Server Support | Yes | Yes | Yes
Wattage (TDP) | 300W | 250W | 250W
* Measured with GPU Boost enabled
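The gap between theoretical and achievable transfer bandwidth can be checked empirically by timing a large host-to-device copy. A rough sketch (error checking omitted; the 256MB transfer size is an arbitrary choice, large enough to amortize launch and timing overhead):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256UL << 20;   // 256MB transfer
    void *host, *dev;
    cudaMallocHost(&host, bytes);       // pinned host memory is required to
    cudaMalloc(&dev, bytes);            // approach the bus's peak rate

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // bytes / (ms * 1e6) converts B/ms to GB/s
    printf("Host-to-device: %.1f GB/s\n", (double)bytes / ms / 1e6);

    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}
```

On a PCI-E 3.0 x16 link, expect a figure near the ~12 GB/s "achievable" row above rather than the 16 GB/s theoretical value; replacing `cudaMallocHost` with ordinary `malloc` drops throughput further because unpinned transfers are staged through a driver buffer.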
Deep Learning Applications
Feature | Tesla P40 PCI-E 24GB
---|---
GPU Chip(s) | Pascal GP102
Integer Operations (INT8)* | 47 TOPS
Half Precision (FP16)* | –
Single Precision (FP32)* | 12 TFLOPS
Double Precision (FP64)* | –
Onboard GDDR5 Memory | 24GB
Memory Bandwidth | 346 GB/s
L2 Cache | 3 MB
Interconnect | PCI-Express 3.0
Theoretical transfer bandwidth | 16 GB/s
Achievable transfer bandwidth | ~12 GB/s
# of SM Units | 30
# of single-precision CUDA Cores | 3840
GPU Base Clock | 1303 MHz
GPU Boost Support | Yes – Dynamic
GPU Boost Clock | 1531 MHz
Compute Capability | 6.1
Workstation Support | –
Server Support | Yes
Wattage (TDP) | 250W
* Measured with GPU Boost enabled
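The P40's 47 TOPS INT8 figure is exposed through the `__dp4a` intrinsic (compute capability 6.1+), which performs a four-way 8-bit dot product with 32-bit accumulation in a single instruction. A sketch of its use (the kernel shape and test values are illustrative; compile with `nvcc -arch=sm_61`):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Four-way int8 dot product with 32-bit accumulate: each int argument
// packs four signed 8-bit values, and __dp4a issues as one instruction.
__global__ void dot_int8(const int *a, const int *b, int *out, int n) {
    int acc = 0;
    for (int i = 0; i < n; ++i)
        acc = __dp4a(a[i], b[i], acc);
    *out = acc;
}

int main() {
    // 0x01010101 packs the bytes {1,1,1,1}; 0x02020202 packs {2,2,2,2}.
    int ha = 0x01010101, hb = 0x02020202, result = 0;
    int *a, *b, *out;
    cudaMalloc(&a, sizeof(int));
    cudaMalloc(&b, sizeof(int));
    cudaMalloc(&out, sizeof(int));
    cudaMemcpy(a, &ha, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(b, &hb, sizeof(int), cudaMemcpyHostToDevice);

    dot_int8<<<1, 1>>>(a, b, out, 1);   // four 1*2 products accumulated
    cudaMemcpy(&result, out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("dot = %d\n", result);
    return 0;
}
```

Inference frameworks use this instruction under the hood when quantizing networks to 8-bit weights and activations.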
Comparison between “Kepler”, “Maxwell”, and “Pascal” GPU Architectures
Feature | Kepler GK210 | Maxwell GM200 | Maxwell GM204 | Pascal GP100 | Pascal GP102
---|---|---|---|---|---
Compute Capability | 3.7 | 5.2 | 5.2 | 6.0 | 6.1
Threads per Warp | 32 | 32 | 32 | 32 | 32
Max Warps per SM | 64 | 64 | 64 | 64 | 64
Max Threads per SM | 2048 | 2048 | 2048 | 2048 | 2048
Max Thread Blocks per SM | 16 | 32 | 32 | 32 | 32
Max Concurrent Kernels | 32 | 32 | 32 | 128 | 32
32-bit Registers per SM | 128 K | 64 K | 64 K | 64 K | 64 K
Max Registers per Thread Block | 64 K | 64 K | 64 K | 64 K | 64 K
Max Registers per Thread | 255 | 255 | 255 | 255 | 255
Max Threads per Thread Block | 1024 | 1024 | 1024 | 1024 | 1024
L1 Cache Configuration | split with shared memory | 24KB dedicated L1 cache | 24KB dedicated L1 cache | 24KB dedicated L1 cache | 24KB dedicated L1 cache
Shared Memory Configurations | 16KB + 112KB L1, 32KB + 96KB L1, or 48KB + 80KB L1 (128KB total) | 96KB dedicated | 96KB dedicated | 64KB dedicated | 96KB dedicated
Max Shared Memory per Thread Block | 48KB | 48KB | 48KB | 48KB | 48KB
Max X Grid Dimension | 2^31 − 1 | 2^31 − 1 | 2^31 − 1 | 2^31 − 1 | 2^31 − 1
Hyper-Q | Yes | Yes | Yes | Yes | Yes
Dynamic Parallelism | Yes | Yes | Yes | Yes | Yes
For a complete listing of Compute Capabilities, reference the NVIDIA CUDA documentation.
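Many of the per-architecture limits above can also be read at runtime with `cudaGetDeviceProperties`, which is a practical way to confirm which column applies to an installed GPU. For example:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        // Report the fields that map onto the comparison table above.
        printf("%s: compute capability %d.%d, %d SMs, "
               "%zu KB shared memory/block, %d KB L2 cache\n",
               p.name, p.major, p.minor, p.multiProcessorCount,
               p.sharedMemPerBlock / 1024, p.l2CacheSize / 1024);
    }
    return 0;
}
```

On a Tesla P100 this reports compute capability 6.0 with 56 SMs; on a Tesla P40, 6.1 with 30 SMs, matching the specification tables above.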
Additional Tesla “Pascal” GPU products
NVIDIA has also released Tesla P4 GPUs. These GPUs are primarily for embedded and hyperscale deployments, and are not expected to be used in the HPC space.
Hardware-accelerated video encoding and decoding
All NVIDIA “Pascal” GPUs include one or more hardware units for video encoding and decoding (NVENC / NVDEC). For complete hardware details, reference NVIDIA’s encoder/decoder support matrix. To learn more about GPU-accelerated video encode/decode, see NVIDIA’s Video Codec SDK.