Important features available in the “Volta” GPU architecture include:
- Exceptional HPC performance with up to 7.8 TFLOPS double- and 15.7 TFLOPS single-precision floating-point performance.
- Deep Learning training performance with up to 125 TFLOPS FP16 half-precision floating-point performance.
- Deep Learning inference performance with up to 62.8 TeraOPS INT8 8-bit integer performance.
- Simultaneous execution of FP32 and INT32 operations improves the overall computational throughput of the GPU.
- NVLink enables an 8X to 10X increase in bandwidth between Tesla GPUs, and between GPUs and supported system CPUs, compared with PCI-E.
- High-bandwidth HBM2 memory provides a 3X improvement in memory performance compared to previous-generation GPUs.
- Enhanced Unified Memory allows GPU applications to directly access the memory of all GPUs as well as all of system memory (up to 512TB).
- Native ECC Memory detects and corrects memory errors without any capacity or performance overhead.
- Combined L1 Cache and Shared Memory provides additional flexibility and higher performance than Pascal.
- Cooperative Groups – a new programming model, introduced in CUDA 9, for organizing groups of communicating threads.
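To give a flavor of the last item, here is a minimal sketch of Cooperative Groups usage: partitioning a thread block into warp-sized tiles and reducing within each tile. The kernel name and reduction pattern are our own illustration, not taken from NVIDIA's documentation; the `cooperative_groups` API calls themselves (`this_thread_block`, `tiled_partition`, `shfl_down`, `thread_rank`) are part of CUDA 9+.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Each 32-thread tile computes a partial sum with a shuffle-based
// tree reduction, then one thread per tile adds it to the result.
__global__ void tileSum(const float *in, float *out, int n) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;

    // Halve the active lanes each step; shfl_down stays within the tile.
    for (int offset = tile.size() / 2; offset > 0; offset /= 2)
        v += tile.shfl_down(v, offset);

    if (tile.thread_rank() == 0)
        atomicAdd(out, v);  // one atomic per 32-thread tile, not per thread
}
```

Launched as, e.g., `tileSum<<<blocks, 256>>>(d_in, d_out, n)` with `*out` zero-initialized; grouping communication into explicit tiles is what lets the compiler and hardware keep the synchronization scope narrow.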
Tesla “Volta” GPU Specifications
The table below summarizes the features of the available Tesla Volta GPU Accelerators. To learn more about these products, or to find out how best to leverage their capabilities, please speak with an HPC expert.
| Feature | Tesla V100 SXM2 16GB/32GB | Tesla V100 PCI-E 16GB/32GB | Quadro GV100 32GB |
|---|---|---|---|
| GPU Chip(s) | Volta GV100 | Volta GV100 | Volta GV100 |
| TensorFLOPS | 125 TFLOPS | 112 TFLOPS | 118.5 TFLOPS |
| Integer Operations (INT8)* | 62.8 TOPS | 56.0 TOPS | 59.3 TOPS |
| Half Precision (FP16)* | 31.4 TFLOPS | 28.0 TFLOPS | 29.6 TFLOPS |
| Single Precision (FP32)* | 15.7 TFLOPS | 14.0 TFLOPS | 14.8 TFLOPS |
| Double Precision (FP64)* | 7.8 TFLOPS | 7.0 TFLOPS | 7.4 TFLOPS |
| On-die HBM2 Memory | 16GB or 32GB | 16GB or 32GB | 32GB |
| Memory Bandwidth | 900 GB/s | 900 GB/s | 870 GB/s |
| L2 Cache | 6 MB | 6 MB | 6 MB |
| Interconnect | NVLink 2.0 (6 bricks) + PCI-E 3.0 | PCI-Express 3.0 | NVLink 2.0 (4 bricks) + PCI-E 3.0 |
| Theoretical transfer bandwidth (bidirectional) | 300 GB/s | 32 GB/s | 200 GB/s |
| Achievable transfer bandwidth | 143.5 GB/s | ~12 GB/s | TBM |
| # of SM Units | 80 | 80 | 80 |
| # of Tensor Cores | 640 | 640 | 640 |
| # of INT32 (integer) CUDA Cores | 5120 | 5120 | 5120 |
| # of FP32 (single-precision) CUDA Cores | 5120 | 5120 | 5120 |
| # of FP64 (double-precision) CUDA Cores | 2560 | 2560 | 2560 |
| GPU Base Clock | not published | 1245 MHz | not published |
| GPU Boost Support | Yes – Dynamic | Yes – Dynamic | Yes – Dynamic |
| GPU Boost Clock | 1530 MHz | ~1380 MHz | TBM |
| Server Support | yes | yes | specific server models only |
* theoretical peak performance with GPU Boost enabled
Comparison between “Kepler”, “Pascal”, and “Volta” GPU Architectures
| Feature | Kepler GK210 | Pascal GP100 | Volta GV100 |
|---|---|---|---|
| Compute Capability ^ | 3.7 | 6.0 | 7.0 |
| Threads per Warp | 32 | 32 | 32 |
| Max Warps per SM | 64 | 64 | 64 |
| Max Threads per SM | 2048 | 2048 | 2048 |
| Max Thread Blocks per SM | 16 | 32 | 32 |
| Max Concurrent Kernels | 32 | 128 | 128 |
| 32-bit Registers per SM | 128 K | 64 K | 64 K |
| Max Registers per Thread Block | 64 K | 64 K | 64 K |
| Max Registers per Thread | 255 | 255 | 255 |
| Max Threads per Thread Block | 1024 | 1024 | 1024 |
| L1 Cache Configuration | split with shared memory | 24KB dedicated L1 cache | 32KB ~ 128KB (dynamic with shared memory) |
| Shared Memory Configurations | 16KB + 112KB L1 Cache<br>32KB + 96KB L1 Cache<br>48KB + 80KB L1 Cache | 64KB | configurable up to 96KB; remainder used as L1 Cache |
| Max Shared Memory per Thread Block | 48KB | 48KB | 96KB* |
| Max X Grid Dimension | 2³²-1 | 2³²-1 | 2³²-1 |
^ For a complete listing of Compute Capabilities, reference the NVIDIA CUDA Documentation
* above 48 KB requires dynamic shared memory
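The last footnote deserves a concrete example: on Volta, a kernel can use up to 96KB of shared memory per block, but anything beyond the default 48KB must be dynamically sized (`extern __shared__`) and explicitly opted into via `cudaFuncSetAttribute`. A minimal sketch (the kernel and buffer sizes are illustrative; the attribute API is the standard CUDA runtime call):

```cuda
#include <cuda_runtime.h>

__global__ void scaleShared(float *data, int n) {
    extern __shared__ float smem[];  // size supplied at launch time
    int i = threadIdx.x;
    if (i < n) smem[i] = data[i];
    __syncthreads();
    if (i < n) data[i] = 2.0f * smem[i];
}

int main() {
    const int maxBytes = 96 * 1024;  // Volta: up to 96 KB per thread block
    // Without this opt-in, launches requesting more than 48 KB of dynamic
    // shared memory fail with an invalid-argument error.
    cudaFuncSetAttribute(scaleShared,
                         cudaFuncAttributeMaxDynamicSharedMemorySize, maxBytes);

    float *d;
    cudaMalloc(&d, 1024 * sizeof(float));
    // Third launch parameter requests the full 96 KB dynamically.
    scaleShared<<<1, 1024, maxBytes>>>(d, 1024);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```

Statically declared `__shared__` arrays remain capped at 48KB per block even on Volta, which is why the table's 96KB entry carries the asterisk.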
Hardware-accelerated video encoding and decoding
All NVIDIA “Volta” GPUs include one or more hardware units for video encoding and decoding (NVENC / NVDEC). For complete hardware details, reference NVIDIA’s encoder/decoder support matrix. To learn more about GPU-accelerated video encode/decode, see NVIDIA’s Video Codec SDK.