Important features available in the “Volta” GPU architecture include:
- Exceptional HPC performance with up to 7.8 TFLOPS double- and 15.7 TFLOPS single-precision floating-point performance.
- Deep Learning training performance with up to 125 TFLOPS FP16 half-precision floating-point performance.
- Deep Learning inference performance with up to 62.8 TeraOPS INT8 8-bit integer performance.
- Simultaneous execution of FP32 and INT32 operations improves the overall computational throughput of the GPU.
- NVLink enables an 8X to 10X increase in bandwidth between Tesla GPUs, and between GPUs and supported system CPUs, compared with PCI-E.
- High-bandwidth HBM2 memory provides a 3X improvement in memory performance compared to previous-generation GPUs.
- Enhanced Unified Memory allows GPU applications to directly access the memory of all GPUs as well as all of system memory (up to 512TB).
- Native ECC Memory detects and corrects memory errors without any capacity or performance overhead.
- Combined L1 Cache and Shared Memory provides additional flexibility and higher performance than Pascal.
- Cooperative Groups – a new programming model, introduced in CUDA 9, for organizing groups of communicating threads.
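To give a flavor of the last item, here is a minimal sketch of Cooperative Groups usage: partitioning a thread block into warp-sized tiles and reducing within each tile. The kernel name and reduction pattern are our own illustration, not taken from NVIDIA's documentation; the `cooperative_groups` API calls themselves (`this_thread_block`, `tiled_partition`, `shfl_down`, `thread_rank`) are part of CUDA 9+.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Each 32-thread tile computes a partial sum with a shuffle-based
// tree reduction, then one thread per tile adds it to the result.
__global__ void tileSum(const float *in, float *out, int n) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;

    // Halve the active lanes each step; shfl_down stays within the tile.
    for (int offset = tile.size() / 2; offset > 0; offset /= 2)
        v += tile.shfl_down(v, offset);

    if (tile.thread_rank() == 0)
        atomicAdd(out, v);  // one atomic per 32-thread tile, not per thread
}
```

Launched as, e.g., `tileSum<<<blocks, 256>>>(d_in, d_out, n)` with `*out` zero-initialized; grouping communication into explicit tiles is what lets the compiler and hardware keep the synchronization scope narrow.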
Tesla “Volta” GPU Specifications
The table below summarizes the features of the available Tesla Volta GPU Accelerators. To learn more about these products, or to find out how best to leverage their capabilities, please speak with an HPC expert.
| Feature | Tesla V100 SXM2 16GB/32GB | Tesla V100 PCI-E 16GB/32GB | Quadro GV100 32GB |
|---|---|---|---|
| GPU Chip(s) | Volta GV100 | Volta GV100 | Volta GV100 |
| TensorFLOPS | 125 TFLOPS | 112 TFLOPS | 118.5 TFLOPS |
| Integer Operations (INT8)* | 62.8 TOPS | 56.0 TOPS | 59.3 TOPS |
| Half Precision (FP16)* | 31.4 TFLOPS | 28.0 TFLOPS | 29.6 TFLOPS |
| Single Precision (FP32)* | 15.7 TFLOPS | 14.0 TFLOPS | 14.8 TFLOPS |
| Double Precision (FP64)* | 7.8 TFLOPS | 7.0 TFLOPS | 7.4 TFLOPS |
| On-die HBM2 Memory | 16GB or 32GB | 16GB or 32GB | 32GB |
| Memory Bandwidth | 900 GB/s | 900 GB/s | 870 GB/s |
| L2 Cache | 6 MB | 6 MB | 6 MB |
| Interconnect | NVLink 2.0 (6 bricks) + PCI-E 3.0 | PCI-Express 3.0 | NVLink 2.0 (4 bricks) + PCI-E 3.0 |
| Theoretical transfer bandwidth (bidirectional) | 300 GB/s | 32 GB/s | 200 GB/s |
| Achievable transfer bandwidth | 143.5 GB/s | ~12 GB/s | TBM |
| # of SM Units | 80 | 80 | 80 |
| # of Tensor Cores | 640 | 640 | 640 |
| # of INT32 (integer) CUDA Cores | 5120 | 5120 | 5120 |
| # of FP32 (single-precision) CUDA Cores | 5120 | 5120 | 5120 |
| # of FP64 (double-precision) CUDA Cores | 2560 | 2560 | 2560 |
| GPU Base Clock | not published | 1245 MHz | not published |
| GPU Boost Support | Yes – Dynamic | Yes – Dynamic | Yes – Dynamic |
| GPU Boost Clock | 1530 MHz | ~1380 MHz | TBM |
| Server Support | yes | yes | specific server models only |
* theoretical peak performance with GPU Boost enabled
Comparison between “Kepler”, “Pascal”, and “Volta” GPU Architectures
| Feature | Kepler GK210 | Pascal GP100 | Volta GV100 |
|---|---|---|---|
| Compute Capability ^ | 3.7 | 6.0 | 7.0 |
| Threads per Warp | 32 | 32 | 32 |
| Max Warps per SM | 64 | 64 | 64 |
| Max Threads per SM | 2048 | 2048 | 2048 |
| Max Thread Blocks per SM | 16 | 32 | 32 |
| Max Concurrent Kernels | 32 | 128 | 128 |
| 32-bit Registers per SM | 128 K | 64 K | 64 K |
| Max Registers per Thread Block | 64 K | 64 K | 64 K |
| Max Registers per Thread | 255 | 255 | 255 |
| Max Threads per Thread Block | 1024 | 1024 | 1024 |
| L1 Cache Configuration | split with shared memory | 24KB dedicated L1 cache | 32KB ~ 128KB (dynamic with shared memory) |
| Shared Memory Configurations | 16KB + 112KB L1 Cache<br>32KB + 96KB L1 Cache<br>48KB + 80KB L1 Cache | 64KB | configurable up to 96KB; remainder used as L1 Cache |
| Max Shared Memory per Thread Block | 48KB | 48KB | 96KB* |
| Max X Grid Dimension | 2³²-1 | 2³²-1 | 2³²-1 |
^ For a complete listing of Compute Capabilities, reference the NVIDIA CUDA Documentation
* above 48 KB requires dynamic shared memory
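The last footnote deserves a concrete example: on Volta, a kernel can use up to 96KB of shared memory per block, but anything beyond the default 48KB must be dynamically sized (`extern __shared__`) and explicitly opted into via `cudaFuncSetAttribute`. A minimal sketch (the kernel and buffer sizes are illustrative; the attribute API is the standard CUDA runtime call):

```cuda
#include <cuda_runtime.h>

__global__ void scaleShared(float *data, int n) {
    extern __shared__ float smem[];  // size supplied at launch time
    int i = threadIdx.x;
    if (i < n) smem[i] = data[i];
    __syncthreads();
    if (i < n) data[i] = 2.0f * smem[i];
}

int main() {
    const int maxBytes = 96 * 1024;  // Volta: up to 96 KB per thread block
    // Without this opt-in, launches requesting more than 48 KB of dynamic
    // shared memory fail with an invalid-argument error.
    cudaFuncSetAttribute(scaleShared,
                         cudaFuncAttributeMaxDynamicSharedMemorySize, maxBytes);

    float *d;
    cudaMalloc(&d, 1024 * sizeof(float));
    // Third launch parameter requests the full 96 KB dynamically.
    scaleShared<<<1, 1024, maxBytes>>>(d, 1024);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```

Statically declared `__shared__` arrays remain capped at 48KB per block even on Volta, which is why the table's 96KB entry carries the asterisk.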
Hardware-accelerated video encoding and decoding
All NVIDIA “Volta” GPUs include one or more hardware units for video encoding and decoding (NVENC / NVDEC). For complete hardware details, reference NVIDIA’s encoder/decoder support matrix. To learn more about GPU-accelerated video encode/decode, see NVIDIA’s Video Codec SDK.