In-Depth Comparison of NVIDIA Tesla “Volta” GPU Accelerators

This article provides in-depth details of the NVIDIA Tesla V-series GPU accelerators (codenamed “Volta”). “Volta” GPUs improve upon the previous-generation “Pascal” architecture. Volta GPUs began shipping in September 2017 and were updated to 32GB of memory in March 2018; Tesla V100S was released in late 2019. Note: these have since been superseded by the NVIDIA Ampere GPU architecture.

This page is intended to be a fast and easy reference of key specs for these GPUs. You may wish to browse our Tesla V100 Price Analysis and Tesla V100 GPU Review for more extended discussion.

Important features available in the “Volta” GPU architecture include:

  • Exceptional HPC performance with up to 8.2 TFLOPS double- and 16.4 TFLOPS single-precision floating-point performance.
  • Deep Learning training performance with up to 130 TFLOPS FP16 half-precision floating-point performance.
  • Deep Learning inference performance with up to 65 TeraOPS INT8 8-bit integer performance.
  • Simultaneous execution of FP32 and INT32 operations improves the overall computational throughput of the GPU.
  • NVLink enables an 8~10X increase in bandwidth between the Tesla GPUs and from GPUs to supported system CPUs (compared with PCI-E).
  • High-bandwidth HBM2 memory provides a 3X improvement in memory performance compared to previous-generation GPUs.
  • Enhanced Unified Memory allows GPU applications to directly access the memory of all GPUs as well as all of system memory (up to 512TB).
  • Native ECC Memory detects and corrects memory errors without any capacity or performance overhead.
  • Combined L1 Cache and Shared Memory provides additional flexibility and higher performance than Pascal.
  • Cooperative Groups – a new programming model introduced in CUDA 9 for organizing groups of communicating threads.
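
The NVLink bandwidth advantage quoted in the list above can be checked against the theoretical transfer figures in the specification table on this page. The sketch below assumes 50 GB/s of bidirectional bandwidth per NVLink 2.0 brick (the value implied by 300 GB/s over 6 bricks and 200 GB/s over 4 bricks) and uses 32 GB/s as the bidirectional PCI-E 3.0 figure. The function and constant names are illustrative only.

```python
# Rough check of the NVLink vs. PCI-E bandwidth ratio.
# Assumption: 50 GB/s bidirectional per NVLink 2.0 brick, inferred from the
# spec table (300 GB/s over 6 bricks, 200 GB/s over 4 bricks).
NVLINK2_GBPS_PER_BRICK = 50.0
PCIE3_GBPS = 32.0  # theoretical bidirectional PCI-E 3.0 figure from the table


def nvlink_bandwidth(bricks: int) -> float:
    """Theoretical bidirectional NVLink bandwidth in GB/s."""
    return bricks * NVLINK2_GBPS_PER_BRICK


def speedup_over_pcie(bricks: int) -> float:
    """Ratio of theoretical NVLink bandwidth to theoretical PCI-E 3.0."""
    return nvlink_bandwidth(bricks) / PCIE3_GBPS


if __name__ == "__main__":
    for name, bricks in [("Tesla V100 SXM2", 6), ("Quadro GV100", 4)]:
        print(f"{name}: {nvlink_bandwidth(bricks):.0f} GB/s, "
              f"{speedup_over_pcie(bricks):.1f}x PCI-E 3.0")
```

For the six-brick SXM2 part the theoretical ratio comes out at roughly 9.4X, which falls inside the 8~10X range quoted above. The ratio of achievable bandwidths can differ, since both links deliver less than their theoretical peak in practice.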

Tesla “Volta” GPU Specifications

The table below summarizes the features of the available Tesla Volta GPU Accelerators. To learn more about these products, or to find out how best to leverage their capabilities, please speak with an HPC expert.

HPC and Deep Learning Applications

Feature | Tesla V100 SXM2 16GB/32GB | Tesla V100 PCI-E 16GB/32GB | Tesla V100S PCI-E 32GB | Quadro GV100 32GB
GPU Chip(s) | Volta GV100
TensorFLOPS | 125 TFLOPS | 112 TFLOPS | 130 TFLOPS | 118.5 TFLOPS
Integer Operations (INT8)* | 62.8 TOPS | 56.0 TOPS | 65 TOPS | 59.3 TOPS
Half Precision (FP16)* | 31.4 TFLOPS | 28 TFLOPS | 32.8 TFLOPS | 29.6 TFLOPS
Single Precision (FP32)* | 15.7 TFLOPS | 14.0 TFLOPS | 16.4 TFLOPS | 14.8 TFLOPS
Double Precision (FP64)* | 7.8 TFLOPS | 7.0 TFLOPS | 8.2 TFLOPS | 7.4 TFLOPS
On-die HBM2 Memory | 16GB or 32GB | 32GB
Memory Bandwidth | 900 GB/s | 1,134 GB/s | 870 GB/s
L2 Cache | 6 MB
Interconnect | NVLink 2.0 (6 bricks) + PCI-E 3.0 | PCI-Express 3.0 | NVLink 2.0 (4 bricks) + PCI-E 3.0
Theoretical transfer bandwidth (bidirectional) | 300 GB/s | 32 GB/s | 200 GB/s
Achievable transfer bandwidth | 143.5 GB/s | ~12 GB/s
# of SM Units | 80
# of Tensor Cores | 640
# of integer INT32 CUDA Cores | 5120
# of single-precision FP32 CUDA Cores | 5120
# of double-precision FP64 CUDA Cores | 2560
GPU Base Clock | not published | 1245 MHz | not published
GPU Boost Support | Yes – Dynamic
GPU Boost Clock | 1530 MHz | ~1380 MHz | TBM
Compute Capability | 7.0
Workstation Support | yes
Server Support | yes | specific server models only
Cooling Type | Passive | Active
Wattage (TDP) | 300W | 250W

* theoretical peak performance with GPU Boost enabled
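
The theoretical peaks in the table follow from the unit counts and the boost clock. The sketch below reproduces the Tesla V100 SXM2 figures under two assumptions that the table does not state: a fused multiply-add (FMA) counts as two floating-point operations, and each Volta Tensor Core performs 64 FMAs per clock, as described in NVIDIA's Volta architecture whitepaper. The helper name `peak_tflops` is hypothetical.

```python
# Reproduce theoretical peak throughput from unit counts and boost clock.
# Assumptions (not stated in the table): one FMA counts as 2 floating-point
# operations, and each Volta Tensor Core performs 64 FMAs per clock.
FLOPS_PER_FMA = 2
FMAS_PER_TENSOR_CORE_PER_CLOCK = 64


def peak_tflops(units: int, boost_mhz: float, fmas_per_clock: int = 1) -> float:
    """Theoretical peak in TFLOPS for `units` execution units."""
    flops_per_second = units * fmas_per_clock * FLOPS_PER_FMA * boost_mhz * 1e6
    return flops_per_second / 1e12


if __name__ == "__main__":
    boost_mhz = 1530  # Tesla V100 SXM2 boost clock from the table
    fp32 = peak_tflops(5120, boost_mhz)   # FP32 CUDA cores
    fp64 = peak_tflops(2560, boost_mhz)   # FP64 CUDA cores
    tensor = peak_tflops(640, boost_mhz, FMAS_PER_TENSOR_CORE_PER_CLOCK)
    print(f"FP32:   {fp32:.1f} TFLOPS")
    print(f"FP64:   {fp64:.1f} TFLOPS")
    print(f"Tensor: {tensor:.0f} TFLOPS")
```

With the SXM2 boost clock of 1530 MHz this gives about 15.7 TFLOPS FP32, 7.8 TFLOPS FP64, and 125 TFLOPS on the Tensor Cores, in line with the table. Sustained application performance depends on the clock the GPU actually holds under load, so these remain upper bounds.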

Comparison between “Kepler”, “Pascal”, and “Volta” GPU Architectures

Feature | Kepler GK210 | Pascal GP100 | Volta GV100
Compute Capability ^ | 3.7 | 6.0 | 7.0
Threads per Warp | 32 | 32 | 32
Max Warps per SM | 64 | 64 | 64
Max Threads per SM | 2048 | 2048 | 2048
Max Thread Blocks per SM | 16 | 32 | 32
Max Concurrent Kernels | 32 | 128 | 128
32-bit Registers per SM | 128 K | 64 K | 64 K
Max Registers per Thread Block | 64 K | 64 K | 64 K
Max Registers per Thread | 255 | 255 | 255
Max Threads per Thread Block | 1024 | 1024 | 1024
L1 Cache Configuration | split with shared memory | 24KB dedicated L1 cache | 32KB ~ 128KB (dynamic with shared memory)
Shared Memory Configurations | 16KB + 112KB L1 Cache, 32KB + 96KB L1 Cache, or 48KB + 80KB L1 Cache (128KB total) | 64KB | configurable up to 96KB; remainder for L1 Cache (128KB total)
Max Shared Memory per Thread Block | 48KB | 48KB | 96KB*
Max X Grid Dimension | 2^31-1 | 2^31-1 | 2^31-1
Hyper-Q | Yes | Yes | Yes
Dynamic Parallelism | Yes | Yes | Yes
Unified Memory | No | Yes | Yes
Pre-Emption | No | Yes | Yes

^ For a complete listing of Compute Capabilities, reference the NVIDIA CUDA Documentation
* above 48 KB requires dynamic shared memory
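
The per-SM limits in the comparison table determine how many thread blocks can be resident on an SM at once. The sketch below is a simplified occupancy estimate for Volta GV100 built only from those limits. It assumes the largest (96KB) shared memory configuration and ignores the hardware's allocation granularity for registers and shared memory, so real occupancy can be somewhat lower. The `SmLimits` class and both function names are hypothetical and not part of any NVIDIA API.

```python
# Simplified occupancy estimate for one Volta GV100 SM, using the per-SM
# limits from the comparison table. Assumption: allocation granularity for
# registers and shared memory is ignored, so this is an upper estimate.
from dataclasses import dataclass


@dataclass(frozen=True)
class SmLimits:
    max_threads: int = 2048             # Max Threads per SM
    max_blocks: int = 32                # Max Thread Blocks per SM
    registers: int = 64 * 1024          # 32-bit Registers per SM
    shared_mem_bytes: int = 96 * 1024   # largest shared memory configuration


def resident_blocks(threads_per_block: int, regs_per_thread: int,
                    smem_per_block: int, sm: SmLimits = SmLimits()) -> int:
    """Thread blocks that fit on one SM under the simplified limits."""
    limits = [
        sm.max_blocks,
        sm.max_threads // threads_per_block,
        sm.registers // (regs_per_thread * threads_per_block),
    ]
    if smem_per_block > 0:
        limits.append(sm.shared_mem_bytes // smem_per_block)
    return min(limits)


def occupancy(threads_per_block: int, regs_per_thread: int,
              smem_per_block: int, sm: SmLimits = SmLimits()) -> float:
    """Fraction of the SM's maximum resident threads that are active."""
    blocks = resident_blocks(threads_per_block, regs_per_thread,
                             smem_per_block, sm)
    return blocks * threads_per_block / sm.max_threads


if __name__ == "__main__":
    # 256 threads per block, 64 registers per thread, 48KB shared memory
    print(resident_blocks(256, 64, 48 * 1024), occupancy(256, 64, 48 * 1024))
    # Same block size with 32 registers per thread and no shared memory
    print(resident_blocks(256, 32, 0), occupancy(256, 32, 0))
```

In the first case shared memory is the limiting resource: only two 48KB blocks fit in 96KB, so a quarter of the SM's thread slots are used. Halving the register count and dropping shared memory lets eight blocks fit, filling all 2048 thread slots. For real kernels, the occupancy functions in the CUDA runtime account for allocation granularity that this sketch omits.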

Hardware-accelerated video encoding and decoding

All NVIDIA “Volta” GPUs include one or more hardware units for video encoding and decoding (NVENC / NVDEC). For complete hardware details, reference NVIDIA’s encoder/decoder support matrix. To learn more about GPU-accelerated video encode/decode, see NVIDIA’s Video Codec SDK.
