In-Depth Comparison of NVIDIA “Ampere” GPU Accelerators

This article provides details on the NVIDIA A-series GPUs (codenamed “Ampere”). “Ampere” GPUs improve upon the previous-generation “Volta” and “Turing” architectures. Ampere A100 GPUs began shipping in May 2020 (with other variants shipping by end of 2020).

Note that not all “Ampere” generation GPUs provide the same capabilities and feature sets. Broadly speaking, there is one version dedicated solely to computation and a second dedicated to a mixture of graphics/visualization and compute. The specifications of both versions are shown below; speak with one of our GPU experts for a personalized summary of the options best suited to your needs.

Computational “Ampere” GPU architecture – important features and changes:

  • Exceptional HPC performance:
    • 9.7 TFLOPS FP64 double-precision floating-point performance
    • Up to 19.5 TFLOPS FP64 double-precision via Tensor Core FP64 instruction support
    • 19.5 TFLOPS FP32 single-precision floating-point performance
  • Exceptional AI deep learning training and inference performance:
    • TensorFloat-32 (TF32) instructions improve performance without loss of model accuracy (see the cuBLAS sketch after this list)
    • Sparse matrix optimizations can potentially double training and inference performance
    • Speedups of 3x~20x for network training with sparse TF32 Tensor Cores (vs Tesla V100)
    • Speedups of 7x~20x for inference with sparse INT8 Tensor Cores (vs Tesla V100)
    • Tensor Cores support many data types: FP64, TF32, BF16, FP16, INT8, INT4, and INT1 (binary)
  • High-speed HBM2/HBM2e memory delivers 40GB or 80GB of capacity at 1.6TB/s or 2TB/s of throughput
  • Multi-Instance GPU (MIG) allows each A100 GPU to run up to seven separate, fully isolated applications
  • 3rd-generation NVLink doubles transfer speeds between GPUs
  • 4th-generation PCI-Express doubles transfer speeds between the system and each GPU
  • Native ECC Memory detects and corrects memory errors without any capacity or performance overhead
  • Larger and Faster L1 Cache and Shared Memory for improved performance
  • Improved L2 Cache is twice as fast and nearly seven times as large as L2 on Tesla V100
  • Compute Data Compression accelerates compressible data patterns, resulting in up to 4x faster DRAM bandwidth, up to 4x faster L2 read bandwidth, and up to 2x increase in L2 capacity.
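
TF32 is exposed through cuBLAS and the major deep learning frameworks. Below is a minimal sketch, assuming CUDA 11 or later, that opts a standard FP32 matrix multiply into TF32 Tensor Core math via cuBLAS; the matrix size and contents are purely illustrative.

```cpp
// Minimal sketch: opting an FP32 GEMM into TF32 Tensor Core math with
// cuBLAS (CUDA 11+). Error checking is omitted for brevity.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int n = 1024;
    std::vector<float> hA(n * n, 1.0f), hB(n * n, 2.0f);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dB, n * n * sizeof(float));
    cudaMalloc(&dC, n * n * sizeof(float));
    cudaMemcpy(dA, hA.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Request TF32: inputs and outputs remain FP32; the Tensor Cores
    // round the mantissa internally. Omit this call for classic FP32 math.
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```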

Visualization “Ampere” GPU architecture – important features and changes:

  • Doubled FP32 processing throughput, with upgraded Streaming Multiprocessors (SMs) that support FP32 computation on both datapaths
    (previous generations provided one dedicated FP32 path and one dedicated Integer path)
  • 2nd-generation RT cores provide up to a 2x increase in raytracing performance
  • 3rd-generation Tensor Cores with TF32 and support for sparsity optimizations
  • 3rd-generation NVLink provides up to 56.25 GB/s of bandwidth in each direction between pairs of GPUs (see the peer-access sketch after this list)
  • GDDR6 memory providing up to 768 GB/s of GPU memory throughput
  • 4th-generation PCI-Express doubles transfer speeds between the system and each GPU
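
To illustrate the NVLink pairing noted above, the sketch below enables CUDA peer-to-peer access between two directly-linked GPUs and copies a buffer between them. Device IDs 0 and 1 and the buffer size are assumptions; on NVLink-connected pairs this transfer bypasses PCI-Express.

```cpp
// Minimal sketch: direct GPU-to-GPU copies between a pair of
// NVLink-linked GPUs (devices 0 and 1 assumed).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int ok01 = 0, ok10 = 0;
    cudaDeviceCanAccessPeer(&ok01, 0, 1);
    cudaDeviceCanAccessPeer(&ok10, 1, 0);
    if (!ok01 || !ok10) {
        printf("Peer access between GPUs 0 and 1 is not available\n");
        return 1;
    }

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);  // second argument is a reserved flag
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);

    // Allocate a buffer on each GPU, then copy directly between them.
    const size_t bytes = 64 << 20;     // 64 MiB, illustrative
    void *buf0, *buf1;
    cudaSetDevice(0); cudaMalloc(&buf0, bytes);
    cudaSetDevice(1); cudaMalloc(&buf1, bytes);
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();

    cudaFree(buf1);
    cudaSetDevice(0); cudaFree(buf0);
    return 0;
}
```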

As stated above, the feature sets vary between the “computational” and the “visualization” GPU models. Additional details on each are shared in the tabs below, and the best choice will depend upon your mix of workloads. Please contact our team for additional review and discussion.

NVIDIA “Ampere” GPU Specifications

High Performance Computing & Deep Learning GPUs

The table below summarizes the features of the NVIDIA Ampere GPU accelerators designed for computation and deep learning/AI/ML. Note that the PCI-Express versions of the NVIDIA A100 GPU feature a much lower TDP than the SXM4 version (250W or 300W vs 400W). For this reason, the PCI-Express GPUs are not able to sustain peak performance in the same way as the higher-power part, so the performance values of the PCI-E A100 GPUs are shown as a range; actual performance will vary by workload.

To learn more about these products, or to find out how best to leverage their capabilities, please speak with an HPC/AI expert.

| Feature | NVIDIA A30 PCI-E | NVIDIA A100 40GB PCI-E | NVIDIA A100 80GB PCI-E | NVIDIA A100 SXM4 |
|---|---|---|---|---|
| GPU Chip | Ampere GA100 | Ampere GA100 | Ampere GA100 | Ampere GA100 |
| TensorCore FP64 Performance* | 10.3 TFLOPS | 17.6 ~ 19.5 TFLOPS | 17.6 ~ 19.5 TFLOPS | 19.5 TFLOPS |
| TensorCore TF32 Performance*† | 82 TFLOPS | 140 ~ 156 TFLOPS | 140 ~ 156 TFLOPS | 156 TFLOPS |
| TensorCore FP16/BF16 Performance*† | 165 TFLOPS | 281 ~ 312 TFLOPS | 281 ~ 312 TFLOPS | 312 TFLOPS |
| TensorCore INT8 Performance*† | 330 TOPS | 562 ~ 624 TOPS | 562 ~ 624 TOPS | 624 TOPS |
| TensorCore INT4 Performance*† | 661 TOPS | 1,123 ~ 1,248 TOPS | 1,123 ~ 1,248 TOPS | 1,248 TOPS |
| Double Precision (FP64) Performance* | 5.2 TFLOPS | 8.7 ~ 9.7 TFLOPS | 8.7 ~ 9.7 TFLOPS | 9.7 TFLOPS |
| Single Precision (FP32) Performance* | 10.3 TFLOPS | 17.6 ~ 19.5 TFLOPS | 17.6 ~ 19.5 TFLOPS | 19.5 TFLOPS |
| Half Precision (FP16) Performance* | 41 TFLOPS | 70 ~ 78 TFLOPS | 70 ~ 78 TFLOPS | 78 TFLOPS |
| Brain Floating Point (BF16) Performance* | 20 TFLOPS | 35 ~ 39 TFLOPS | 35 ~ 39 TFLOPS | 39 TFLOPS |
| GPU Memory | 24GB HBM2 | 40GB HBM2 | 80GB HBM2e | 40GB HBM2 or 80GB HBM2e |
| Memory Bandwidth | 933 GB/s | 1,555 GB/s | 1,940 GB/s | 1,555 GB/s (40GB) or 2,039 GB/s (80GB) |
| L2 Cache | 40MB | 40MB | 40MB | 40MB |
| Interconnect | NVLink 3.0 (4 bricks) + PCI-E 4.0; NVLink limited to pairs of directly-linked cards | NVLink 3.0 (12 bricks) + PCI-E 4.0; NVLink limited to pairs of directly-linked cards | NVLink 3.0 (12 bricks) + PCI-E 4.0; NVLink limited to pairs of directly-linked cards | NVLink 3.0 (12 bricks) + PCI-E 4.0 |
| GPU-to-GPU Transfer Bandwidth (bidirectional) | 200 GB/s | 600 GB/s | 600 GB/s | 600 GB/s |
| Host-to-GPU Transfer Bandwidth (bidirectional) | 64 GB/s | 64 GB/s | 64 GB/s | 64 GB/s |
| # of MIG Instances Supported | up to 4 | up to 7 | up to 7 | up to 7 |
| # of SM Units | 56 | 108 | 108 | 108 |
| # of Tensor Cores | 224 | 432 | 432 | 432 |
| # of INT32 CUDA Cores | 3,584 | 6,912 | 6,912 | 6,912 |
| # of FP32 CUDA Cores | 3,584 | 6,912 | 6,912 | 6,912 |
| # of FP64 CUDA Cores | 1,792 | 3,456 | 3,456 | 3,456 |
| GPU Base Clock | 930 MHz | 765 MHz | 1065 MHz | 1095 MHz |
| GPU Boost Support | Yes – Dynamic | Yes – Dynamic | Yes – Dynamic | Yes – Dynamic |
| GPU Boost Clock | 1440 MHz | 1410 MHz | 1410 MHz | 1410 MHz |
| Compute Capability | 8.0 | 8.0 | 8.0 | 8.0 |
| Workstation Support | no | no | no | no |
| Server Support | yes | yes | yes | yes |
| Cooling Type | Passive | Passive | Passive | Passive |
| Wattage (TDP) | 165W | 250W | 300W | 400W |

* theoretical peak performance based on GPU boost clock
† an additional 2X performance can be achieved via NVIDIA’s new sparsity feature
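
The starred peak figures can be reproduced by hand: theoretical peak equals the core count times two floating-point operations per clock (one fused multiply-add) times the boost clock. A quick check of the A100 SXM4 column, using only values from the table above:

```cpp
// Reproducing the table's theoretical FP32/FP64 peaks for the A100 SXM4:
// peak = cores x 2 FLOP/clock (fused multiply-add) x boost clock.
#include <cstdio>

int main() {
    const double boostGHz = 1.410;                 // GPU Boost Clock (table)
    const int fp32Cores = 6912, fp64Cores = 3456;  // CUDA core counts (table)

    printf("FP32 peak: %.1f TFLOPS\n", fp32Cores * 2.0 * boostGHz / 1000.0);  // ~19.5
    printf("FP64 peak: %.1f TFLOPS\n", fp64Cores * 2.0 * boostGHz / 1000.0);  // ~9.7
    return 0;
}
```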

Visualization & Ray Tracing GPUs

The table below summarizes the features of the NVIDIA Ampere GPU accelerators designed for visualization and ray tracing. Note that these GPUs are not necessarily connected directly to a display device; they may instead perform remote rendering from a datacenter.

To learn more about these GPUs and to review which are the best options for you, please speak with a GPU expert.

| Feature | NVIDIA RTX A5000 | NVIDIA RTX A6000 | NVIDIA A40 |
|---|---|---|---|
| GPU Chip | Ampere GA102 | Ampere GA102 | Ampere GA102 |
| TensorCore TF32 Performance*† | 55.6 TFLOPS | 77.4 TFLOPS | 74.8 TFLOPS |
| TensorCore FP16/BF16 Performance*† | 111.1 TFLOPS | 154.8 TFLOPS | 149.7 TFLOPS |
| TensorCore INT8 Performance*† | 222.2 TOPS | 309.7 TOPS | 299.3 TOPS |
| TensorCore INT4 Performance*† | 444.4 TOPS | 619.3 TOPS | 598.7 TOPS |
| Double Precision (FP64) Performance* | 0.4 TFLOPS | 0.6 TFLOPS | 0.6 TFLOPS |
| Single Precision (FP32) Performance* | 27.8 TFLOPS | 38.7 TFLOPS | 37.4 TFLOPS |
| Integer (INT32) Performance* | 13.9 TOPS | 19.4 TOPS | 18.7 TOPS |
| GPU Memory | 24GB GDDR6 | 48GB GDDR6 | 48GB GDDR6 |
| Memory Bandwidth | 768 GB/s | 768 GB/s | 696 GB/s |
| L2 Cache | 6MB | 6MB | 6MB |
| Interconnect | NVLink 3.0 + PCI-E 4.0; NVLink limited to pairs of directly-linked cards | NVLink 3.0 + PCI-E 4.0; NVLink limited to pairs of directly-linked cards | NVLink 3.0 + PCI-E 4.0; NVLink limited to pairs of directly-linked cards |
| GPU-to-GPU Transfer Bandwidth (bidirectional) | 112.5 GB/s | 112.5 GB/s | 112.5 GB/s |
| Host-to-GPU Transfer Bandwidth (bidirectional) | 64 GB/s | 64 GB/s | 64 GB/s |
| # of MIG Instances Supported | N/A | N/A | N/A |
| # of SM Units | 64 | 84 | 84 |
| # of RT Cores | 64 | 84 | 84 |
| # of Tensor Cores | 256 | 336 | 336 |
| # of INT32 CUDA Cores | 8,192 | 10,752 | 10,752 |
| # of FP32 CUDA Cores | 8,192 | 10,752 | 10,752 |
| # of FP64 CUDA Cores | 128 | 168 | 168 |
| GPU Base Clock | not published | not published | not published |
| GPU Boost Support | Yes – Dynamic | Yes – Dynamic | Yes – Dynamic |
| GPU Boost Clock | not published | not published | not published |
| Compute Capability | 8.6 | 8.6 | 8.6 |
| Workstation Support | yes | yes | no |
| Server Support | no | yes | yes |
| Cooling Type | Active | Active | Passive |
| Wattage (TDP) | 230W | 300W | 300W |

* theoretical peak performance based on GPU boost clock
† an additional 2X performance can be achieved via NVIDIA’s new sparsity feature
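
Several of the rows above (compute capability, SM count, L2 cache, memory capacity) can be verified on a running system with the CUDA runtime API; a minimal sketch:

```cpp
// Minimal sketch: reading back GPU properties that appear in the tables
// above (compute capability 8.0 = GA100-class, 8.6 = GA102-class).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("GPU %d: %s\n", dev, prop.name);
        printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("  SM units:           %d\n", prop.multiProcessorCount);
        printf("  L2 cache:           %d MB\n", prop.l2CacheSize >> 20);
        printf("  Memory:             %zu GB\n", prop.totalGlobalMem >> 30);
    }
    return 0;
}
```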

Several lower-end graphics cards and datacenter GPUs are also available, including RTX A2000, RTX A4000, A10, and A16. These GPUs offer similar capabilities, but at lower performance levels and lower price points.

Comparison between “Pascal”, “Volta”, and “Ampere” GPU Architectures

| Feature | Pascal GP100 | Volta GV100 | Ampere GA100 |
|---|---|---|---|
| Compute Capability* | 6.0 | 7.0 | 8.0 |
| Threads per Warp | 32 | 32 | 32 |
| Max Warps per SM | 64 | 64 | 64 |
| Max Threads per SM | 2048 | 2048 | 2048 |
| Max Thread Blocks per SM | 32 | 32 | 32 |
| Max Concurrent Kernels | 128 | 128 | 128 |
| 32-bit Registers per SM | 64 K | 64 K | 64 K |
| Max Registers per Block | 64 K | 64 K | 64 K |
| Max Registers per Thread | 255 | 255 | 255 |
| Max Threads per Block | 1024 | 1024 | 1024 |
| L1 Cache Configuration | 24KB dedicated cache | 32KB ~ 128KB, dynamic with shared memory | 28KB ~ 192KB, dynamic with shared memory |
| Shared Memory Configurations | 64KB | configurable up to 96KB; remainder for L1 Cache (128KB total) | configurable up to 164KB; remainder for L1 Cache (192KB total) |
| Max Shared Memory per SM | 64KB | 96KB | 164KB |
| Max Shared Memory per Thread Block | 48KB | 96KB | 163KB |
| Max X Grid Dimension | 2^32 - 1 | 2^32 - 1 | 2^32 - 1 |
| Tensor Cores | No | Yes | Yes |
| Mixed Precision Warp-Matrix Functions | No | Yes | Yes |
| Hardware-accelerated async-copy | No | No | Yes |
| L2 Cache Residency Management | No | No | Yes |
| Dynamic Parallelism | Yes | Yes | Yes |
| Unified Memory | Yes | Yes | Yes |
| Preemption | Yes | Yes | Yes |

* For a complete listing of Compute Capabilities, reference the NVIDIA CUDA Documentation
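
The hardware-accelerated async-copy row refers to Ampere's ability to stage data from global memory into shared memory without a round trip through the register file. Below is a minimal kernel sketch using the cooperative groups API (CUDA 11+); the tile size is illustrative and the input length is assumed to be a multiple of 256.

```cpp
// Minimal sketch: Ampere hardware-accelerated async-copy (global -> shared)
// via cooperative groups. Pre-Ampere GPUs fall back to a software path.
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

__global__ void scaleTiles(const float* in, float* out) {
    __shared__ float tile[256];
    cg::thread_block block = cg::this_thread_block();

    // Asynchronously stage one 256-float tile into shared memory;
    // on Ampere this maps to the dedicated copy (LDGSTS) instruction.
    cg::memcpy_async(block, tile, in + blockIdx.x * 256, sizeof(float) * 256);
    cg::wait(block);  // wait until the staged tile is visible to the block

    unsigned i = block.thread_rank();
    out[blockIdx.x * 256 + i] = tile[i] * 2.0f;
}

int main() {
    const int n = 1 << 20;  // assumed to be a multiple of 256
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    scaleTiles<<<n / 256, 256>>>(in, out);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```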

Hardware-accelerated raytracing, video encoding, video decoding, and image decoding

The NVIDIA “Ampere” Datacenter GPUs that are designed for computational workloads do not include graphics acceleration features such as RT cores and hardware-accelerated video encoders. For example, RT cores for accelerated raytracing are not included in the A30 and A100 GPUs. Similarly, video encoding units (NVENC) are not included in these GPUs.

To accelerate computational workloads that require processing of image or video files, five JPEG decoding (NVJPG) units and five video decoding units (NVDEC) are included in the A100. Details are described in NVIDIA’s A100 for computer vision blog post.
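
As a rough illustration of reaching these decode engines from application code, the sketch below decodes a single JPEG into GPU memory using the nvJPEG library bundled with CUDA 11+. The input path is hypothetical, error checking is omitted, and the library dispatches to the hardware engine where it is supported.

```cpp
// Minimal sketch: decoding one JPEG to interleaved RGB in GPU memory with
// nvJPEG (CUDA 11+). The file path is illustrative; no error checking.
#include <nvjpeg.h>
#include <cuda_runtime.h>
#include <fstream>
#include <vector>

int main() {
    std::ifstream f("input.jpg", std::ios::binary);  // hypothetical path
    std::vector<unsigned char> jpeg((std::istreambuf_iterator<char>(f)),
                                    std::istreambuf_iterator<char>());

    nvjpegHandle_t handle;
    nvjpegJpegState_t state;
    nvjpegCreateSimple(&handle);
    nvjpegJpegStateCreate(handle, &state);

    // Query image dimensions to size the output buffer.
    int nComponents;
    nvjpegChromaSubsampling_t subsampling;
    int widths[NVJPEG_MAX_COMPONENT], heights[NVJPEG_MAX_COMPONENT];
    nvjpegGetImageInfo(handle, jpeg.data(), jpeg.size(),
                       &nComponents, &subsampling, widths, heights);

    // Decode to an interleaved RGB buffer in device memory.
    nvjpegImage_t out = {};
    cudaMalloc(reinterpret_cast<void**>(&out.channel[0]),
               static_cast<size_t>(widths[0]) * heights[0] * 3);
    out.pitch[0] = widths[0] * 3;
    nvjpegDecode(handle, state, jpeg.data(), jpeg.size(),
                 NVJPEG_OUTPUT_RGBI, &out, 0 /* default stream */);
    cudaDeviceSynchronize();

    cudaFree(out.channel[0]);
    nvjpegJpegStateDestroy(state);
    nvjpegDestroy(handle);
    return 0;
}
```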

For additional details on NVENC and NVDEC, reference NVIDIA’s encoder/decoder support matrix. To learn more about GPU-accelerated video encode/decode, see NVIDIA’s Video Codec SDK.
