In-Depth Comparison of NVIDIA “Ampere” GPU Accelerators

This article provides details on the NVIDIA A-series GPUs (codenamed “Ampere”). “Ampere” GPUs improve upon the previous-generation “Volta” and “Turing” architectures. Ampere A100 GPUs began shipping in May 2020 (with other variants shipping by end of 2020).

Note that not all “Ampere” generation GPUs provide the same capabilities and feature sets. Broadly speaking, there is one version dedicated solely to computation and a second dedicated to a mixture of graphics/visualization and compute. The specifications of both versions are shown below; speak with one of our GPU experts for a personalized summary of the options best suited to your needs.

Computational “Ampere” GPU architecture – important features and changes:

  • Exceptional HPC performance:
    • 9.7 TFLOPS FP64 double-precision floating-point performance
    • Up to 19.5 TFLOPS FP64 double-precision via Tensor Core FP64 instruction support
    • 19.5 TFLOPS FP32 single-precision floating-point performance
  • Exceptional AI deep learning training and inference performance:
    • TensorFloat-32 (TF32) instructions improve performance without loss of model accuracy (see the cuBLAS sketch after this list)
    • Sparse matrix optimizations can potentially double training and inference performance
    • Speedups of 3x~20x for network training with sparse TF32 Tensor Cores (vs Tesla V100)
    • Speedups of 7x~20x for inference with sparse INT8 Tensor Cores (vs Tesla V100)
    • Tensor Cores support many data types: FP64, TF32, BF16, FP16, INT8, INT4, and INT1 (binary)
  • High-speed HBM2/HBM2e memory delivers 40GB or 80GB of capacity at 1.6TB/s or 2TB/s of throughput
  • Multi-Instance GPU (MIG) allows each A100 GPU to run up to seven separate, fully isolated applications
  • 3rd-generation NVLink doubles transfer speeds between GPUs
  • 4th-generation PCI-Express doubles transfer speeds between the system and each GPU
  • Native ECC Memory detects and corrects memory errors without any capacity or performance overhead
  • Larger and Faster L1 Cache and Shared Memory for improved performance
  • Improved L2 Cache is twice as fast and nearly seven times as large as L2 on Tesla V100
  • Compute Data Compression accelerates compressible data patterns, resulting in up to 4x faster DRAM bandwidth, up to 4x faster L2 read bandwidth, and up to 2x increase in L2 capacity.
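
TF32 is exposed through cuBLAS and the major deep learning frameworks. Below is a minimal sketch, assuming CUDA 11 or later, that opts a standard FP32 matrix multiply into TF32 Tensor Core math via cuBLAS; the matrix size and contents are purely illustrative.

```cpp
// Minimal sketch: opting an FP32 GEMM into TF32 Tensor Core math with
// cuBLAS (CUDA 11+). Error checking is omitted for brevity.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int n = 1024;
    std::vector<float> hA(n * n, 1.0f), hB(n * n, 2.0f);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dB, n * n * sizeof(float));
    cudaMalloc(&dC, n * n * sizeof(float));
    cudaMemcpy(dA, hA.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Request TF32: inputs and outputs remain FP32; the Tensor Cores
    // round the mantissa internally. Omit this call for classic FP32 math.
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```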

Visualization “Ampere” GPU architecture – important features and changes:

  • Doubled FP32 processing throughput, with upgraded Streaming Multiprocessors (SMs) that support FP32 computation on both datapaths
    (previous generations provided one dedicated FP32 path and one dedicated Integer path)
  • 2nd-generation RT cores provide up to a 2x increase in raytracing performance
  • 3rd-generation Tensor Cores with TF32 and support for sparsity optimizations
  • 3rd-generation NVLink provides up to 56.25 GB/s of bandwidth in each direction between pairs of GPUs (see the peer-access sketch after this list)
  • GDDR6 memory providing up to 768 GB/s of GPU memory throughput
  • 4th-generation PCI-Express doubles transfer speeds between the system and each GPU
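
To illustrate the NVLink pairing noted above, the sketch below enables CUDA peer-to-peer access between two directly-linked GPUs and copies a buffer between them. Device IDs 0 and 1 and the buffer size are assumptions; on NVLink-connected pairs this transfer bypasses PCI-Express.

```cpp
// Minimal sketch: direct GPU-to-GPU copies between a pair of
// NVLink-linked GPUs (devices 0 and 1 assumed).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int ok01 = 0, ok10 = 0;
    cudaDeviceCanAccessPeer(&ok01, 0, 1);
    cudaDeviceCanAccessPeer(&ok10, 1, 0);
    if (!ok01 || !ok10) {
        printf("Peer access between GPUs 0 and 1 is not available\n");
        return 1;
    }

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);  // second argument is a reserved flag
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);

    // Allocate a buffer on each GPU, then copy directly between them.
    const size_t bytes = 64 << 20;     // 64 MiB, illustrative
    void *buf0, *buf1;
    cudaSetDevice(0); cudaMalloc(&buf0, bytes);
    cudaSetDevice(1); cudaMalloc(&buf1, bytes);
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();

    cudaFree(buf1);
    cudaSetDevice(0); cudaFree(buf0);
    return 0;
}
```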

As stated above, the feature sets vary between the “computational” and the “visualization” GPU models. Additional details on each are shared in the tabs below, and the best choice will depend upon your mix of workloads. Please contact our team for additional review and discussion.

NVIDIA “Ampere” GPU Specifications

High Performance Computing & Deep Learning GPUs

The table below summarizes the features of the NVIDIA Ampere GPU accelerators designed for computation and deep learning/AI/ML. Note that the PCI-Express versions of the NVIDIA A100 GPU feature a much lower TDP than the SXM4 version (250W or 300W vs 400W). For this reason, the PCI-Express GPUs are not able to sustain peak performance in the same way as the higher-power part, so the performance values of the PCI-E A100 GPUs are shown as a range; actual performance will vary by workload.

To learn more about these products, or to find out how best to leverage their capabilities, please speak with an HPC/AI expert.

| Feature | NVIDIA A30 PCI-E | NVIDIA A100 40GB PCI-E | NVIDIA A100 80GB PCI-E | NVIDIA A100 SXM4 |
|---|---|---|---|---|
| GPU Chip | Ampere GA100 | Ampere GA100 | Ampere GA100 | Ampere GA100 |
| TensorCore FP64 Performance* | 10.3 TFLOPS | 17.6 ~ 19.5 TFLOPS | 17.6 ~ 19.5 TFLOPS | 19.5 TFLOPS |
| TensorCore TF32 Performance*† | 82 TFLOPS | 140 ~ 156 TFLOPS | 140 ~ 156 TFLOPS | 156 TFLOPS |
| TensorCore FP16/BF16 Performance*† | 165 TFLOPS | 281 ~ 312 TFLOPS | 281 ~ 312 TFLOPS | 312 TFLOPS |
| TensorCore INT8 Performance*† | 330 TOPS | 562 ~ 624 TOPS | 562 ~ 624 TOPS | 624 TOPS |
| TensorCore INT4 Performance*† | 661 TOPS | 1,123 ~ 1,248 TOPS | 1,123 ~ 1,248 TOPS | 1,248 TOPS |
| Double Precision (FP64) Performance* | 5.2 TFLOPS | 8.7 ~ 9.7 TFLOPS | 8.7 ~ 9.7 TFLOPS | 9.7 TFLOPS |
| Single Precision (FP32) Performance* | 10.3 TFLOPS | 17.6 ~ 19.5 TFLOPS | 17.6 ~ 19.5 TFLOPS | 19.5 TFLOPS |
| Half Precision (FP16) Performance* | 41 TFLOPS | 70 ~ 78 TFLOPS | 70 ~ 78 TFLOPS | 78 TFLOPS |
| Brain Floating Point (BF16) Performance* | 20 TFLOPS | 35 ~ 39 TFLOPS | 35 ~ 39 TFLOPS | 39 TFLOPS |
| GPU Memory | 24GB HBM2 | 40GB HBM2 | 80GB HBM2e | 40GB HBM2 or 80GB HBM2e |
| Memory Bandwidth | 933 GB/s | 1,555 GB/s | 1,940 GB/s | 1,555 GB/s (40GB) or 2,039 GB/s (80GB) |
| L2 Cache | 40MB | 40MB | 40MB | 40MB |
| Interconnect | NVLink 3.0 (4 bricks) + PCI-E 4.0; NVLink limited to pairs of directly-linked cards | NVLink 3.0 (12 bricks) + PCI-E 4.0; NVLink limited to pairs of directly-linked cards | NVLink 3.0 (12 bricks) + PCI-E 4.0; NVLink limited to pairs of directly-linked cards | NVLink 3.0 (12 bricks) + PCI-E 4.0 |
| GPU-to-GPU Transfer Bandwidth (bidirectional) | 200 GB/s | 600 GB/s | 600 GB/s | 600 GB/s |
| Host-to-GPU Transfer Bandwidth (bidirectional) | 64 GB/s | 64 GB/s | 64 GB/s | 64 GB/s |
| # of MIG Instances Supported | up to 4 | up to 7 | up to 7 | up to 7 |
| # of SM Units | 56 | 108 | 108 | 108 |
| # of Tensor Cores | 224 | 432 | 432 | 432 |
| # of INT32 CUDA Cores | 3,584 | 6,912 | 6,912 | 6,912 |
| # of FP32 CUDA Cores | 3,584 | 6,912 | 6,912 | 6,912 |
| # of FP64 CUDA Cores | 1,792 | 3,456 | 3,456 | 3,456 |
| GPU Base Clock | 930 MHz | 765 MHz | 1065 MHz | 1095 MHz |
| GPU Boost Support | Yes – Dynamic | Yes – Dynamic | Yes – Dynamic | Yes – Dynamic |
| GPU Boost Clock | 1440 MHz | 1410 MHz | 1410 MHz | 1410 MHz |
| Compute Capability | 8.0 | 8.0 | 8.0 | 8.0 |
| Workstation Support | no | no | no | no |
| Server Support | yes | yes | yes | yes |
| Cooling Type | Passive | Passive | Passive | Passive |
| Wattage (TDP) | 165W | 250W | 300W | 400W |

* theoretical peak performance based on GPU boost clock
† an additional 2X performance can be achieved via NVIDIA’s new sparsity feature
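
The starred peak figures can be reproduced by hand: theoretical peak equals the core count times two floating-point operations per clock (one fused multiply-add) times the boost clock. A quick check of the A100 SXM4 column, using only values from the table above:

```cpp
// Reproducing the table's theoretical FP32/FP64 peaks for the A100 SXM4:
// peak = cores x 2 FLOP/clock (fused multiply-add) x boost clock.
#include <cstdio>

int main() {
    const double boostGHz = 1.410;                 // GPU Boost Clock (table)
    const int fp32Cores = 6912, fp64Cores = 3456;  // CUDA core counts (table)

    printf("FP32 peak: %.1f TFLOPS\n", fp32Cores * 2.0 * boostGHz / 1000.0);  // ~19.5
    printf("FP64 peak: %.1f TFLOPS\n", fp64Cores * 2.0 * boostGHz / 1000.0);  // ~9.7
    return 0;
}
```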

Visualization & Ray Tracing GPUs

The table below summarizes the features of the NVIDIA Ampere GPU accelerators designed for visualization and ray tracing. Note that these GPUs are not necessarily connected directly to a display device; they may instead perform remote rendering from a datacenter.

To learn more about these GPUs and to review which are the best options for you, please speak with a GPU expert.

| Feature | NVIDIA RTX A5000 | NVIDIA RTX A6000 | NVIDIA A40 |
|---|---|---|---|
| GPU Chip | Ampere GA102 | Ampere GA102 | Ampere GA102 |
| TensorCore TF32 Performance*† | 55.6 TFLOPS | 77.4 TFLOPS | 74.8 TFLOPS |
| TensorCore FP16/BF16 Performance*† | 111.1 TFLOPS | 154.8 TFLOPS | 149.7 TFLOPS |
| TensorCore INT8 Performance*† | 222.2 TOPS | 309.7 TOPS | 299.3 TOPS |
| TensorCore INT4 Performance*† | 444.4 TOPS | 619.3 TOPS | 598.7 TOPS |
| Double Precision (FP64) Performance* | 0.4 TFLOPS | 0.6 TFLOPS | 0.6 TFLOPS |
| Single Precision (FP32) Performance* | 27.8 TFLOPS | 38.7 TFLOPS | 37.4 TFLOPS |
| Integer (INT32) Performance* | 13.9 TOPS | 19.4 TOPS | 18.7 TOPS |
| GPU Memory | 24GB GDDR6 | 48GB GDDR6 | 48GB GDDR6 |
| Memory Bandwidth | 768 GB/s | 768 GB/s | 696 GB/s |
| L2 Cache | 6MB | 6MB | 6MB |
| Interconnect | NVLink 3.0 + PCI-E 4.0; NVLink limited to pairs of directly-linked cards | NVLink 3.0 + PCI-E 4.0; NVLink limited to pairs of directly-linked cards | NVLink 3.0 + PCI-E 4.0; NVLink limited to pairs of directly-linked cards |
| GPU-to-GPU Transfer Bandwidth (bidirectional) | 112.5 GB/s | 112.5 GB/s | 112.5 GB/s |
| Host-to-GPU Transfer Bandwidth (bidirectional) | 64 GB/s | 64 GB/s | 64 GB/s |
| # of MIG Instances Supported | N/A | N/A | N/A |
| # of SM Units | 64 | 84 | 84 |
| # of RT Cores | 64 | 84 | 84 |
| # of Tensor Cores | 256 | 336 | 336 |
| # of INT32 CUDA Cores | 8,192 | 10,752 | 10,752 |
| # of FP32 CUDA Cores | 8,192 | 10,752 | 10,752 |
| # of FP64 CUDA Cores | 128 | 168 | 168 |
| GPU Base Clock | not published | not published | not published |
| GPU Boost Support | Yes – Dynamic | Yes – Dynamic | Yes – Dynamic |
| GPU Boost Clock | not published | not published | not published |
| Compute Capability | 8.6 | 8.6 | 8.6 |
| Workstation Support | yes | yes | no |
| Server Support | no | yes | yes |
| Cooling Type | Active | Active | Passive |
| Wattage (TDP) | 230W | 300W | 300W |

* theoretical peak performance based on GPU boost clock
† an additional 2X performance can be achieved via NVIDIA’s new sparsity feature
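
Several of the rows above (compute capability, SM count, L2 cache, memory capacity) can be verified on a running system with the CUDA runtime API; a minimal sketch:

```cpp
// Minimal sketch: reading back GPU properties that appear in the tables
// above (compute capability 8.0 = GA100-class, 8.6 = GA102-class).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("GPU %d: %s\n", dev, prop.name);
        printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("  SM units:           %d\n", prop.multiProcessorCount);
        printf("  L2 cache:           %d MB\n", prop.l2CacheSize >> 20);
        printf("  Memory:             %zu GB\n", prop.totalGlobalMem >> 30);
    }
    return 0;
}
```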

Several lower-end graphics cards and datacenter GPUs are also available, including RTX A2000, RTX A4000, A10, and A16. These GPUs offer similar capabilities, but at lower performance levels and lower price points.

Comparison between “Pascal”, “Volta”, and “Ampere” GPU Architectures

| Feature | Pascal GP100 | Volta GV100 | Ampere GA100 |
|---|---|---|---|
| Compute Capability* | 6.0 | 7.0 | 8.0 |
| Threads per Warp | 32 | 32 | 32 |
| Max Warps per SM | 64 | 64 | 64 |
| Max Threads per SM | 2048 | 2048 | 2048 |
| Max Thread Blocks per SM | 32 | 32 | 32 |
| Max Concurrent Kernels | 128 | 128 | 128 |
| 32-bit Registers per SM | 64 K | 64 K | 64 K |
| Max Registers per Block | 64 K | 64 K | 64 K |
| Max Registers per Thread | 255 | 255 | 255 |
| Max Threads per Block | 1024 | 1024 | 1024 |
| L1 Cache Configuration | 24KB dedicated cache | 32KB ~ 128KB, dynamic with shared memory | 28KB ~ 192KB, dynamic with shared memory |
| Shared Memory Configurations | 64KB | configurable up to 96KB; remainder for L1 Cache (128KB total) | configurable up to 164KB; remainder for L1 Cache (192KB total) |
| Max Shared Memory per SM | 64KB | 96KB | 164KB |
| Max Shared Memory per Thread Block | 48KB | 96KB | 163KB |
| Max X Grid Dimension | 2^32 - 1 | 2^32 - 1 | 2^32 - 1 |
| Tensor Cores | No | Yes | Yes |
| Mixed Precision Warp-Matrix Functions | No | Yes | Yes |
| Hardware-accelerated async-copy | No | No | Yes |
| L2 Cache Residency Management | No | No | Yes |
| Dynamic Parallelism | Yes | Yes | Yes |
| Unified Memory | Yes | Yes | Yes |
| Preemption | Yes | Yes | Yes |

* For a complete listing of Compute Capabilities, reference the NVIDIA CUDA Documentation
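
The hardware-accelerated async-copy row refers to Ampere's ability to stage data from global memory into shared memory without a round trip through the register file. Below is a minimal kernel sketch using the cooperative groups API (CUDA 11+); the tile size is illustrative and the input length is assumed to be a multiple of 256.

```cpp
// Minimal sketch: Ampere hardware-accelerated async-copy (global -> shared)
// via cooperative groups. Pre-Ampere GPUs fall back to a software path.
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

__global__ void scaleTiles(const float* in, float* out) {
    __shared__ float tile[256];
    cg::thread_block block = cg::this_thread_block();

    // Asynchronously stage one 256-float tile into shared memory;
    // on Ampere this maps to the dedicated copy (LDGSTS) instruction.
    cg::memcpy_async(block, tile, in + blockIdx.x * 256, sizeof(float) * 256);
    cg::wait(block);  // wait until the staged tile is visible to the block

    unsigned i = block.thread_rank();
    out[blockIdx.x * 256 + i] = tile[i] * 2.0f;
}

int main() {
    const int n = 1 << 20;  // assumed to be a multiple of 256
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    scaleTiles<<<n / 256, 256>>>(in, out);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```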

Hardware-accelerated raytracing, video encoding, video decoding, and image decoding

The NVIDIA “Ampere” Datacenter GPUs that are designed for computational workloads do not include graphics acceleration features such as RT cores and hardware-accelerated video encoders. For example, RT cores for accelerated raytracing are not included in the A30 and A100 GPUs. Similarly, video encoding units (NVENC) are not included in these GPUs.

To accelerate computational workloads that require processing of image or video files, five JPEG decoding (NVJPG) units and five video decoding units (NVDEC) are included in the A100. Details are described in NVIDIA’s A100 for computer vision blog post.
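
As a rough illustration of reaching these decode engines from application code, the sketch below decodes a single JPEG into GPU memory using the nvJPEG library bundled with CUDA 11+. The input path is hypothetical, error checking is omitted, and the library dispatches to the hardware engine where it is supported.

```cpp
// Minimal sketch: decoding one JPEG to interleaved RGB in GPU memory with
// nvJPEG (CUDA 11+). The file path is illustrative; no error checking.
#include <nvjpeg.h>
#include <cuda_runtime.h>
#include <fstream>
#include <vector>

int main() {
    std::ifstream f("input.jpg", std::ios::binary);  // hypothetical path
    std::vector<unsigned char> jpeg((std::istreambuf_iterator<char>(f)),
                                    std::istreambuf_iterator<char>());

    nvjpegHandle_t handle;
    nvjpegJpegState_t state;
    nvjpegCreateSimple(&handle);
    nvjpegJpegStateCreate(handle, &state);

    // Query image dimensions to size the output buffer.
    int nComponents;
    nvjpegChromaSubsampling_t subsampling;
    int widths[NVJPEG_MAX_COMPONENT], heights[NVJPEG_MAX_COMPONENT];
    nvjpegGetImageInfo(handle, jpeg.data(), jpeg.size(),
                       &nComponents, &subsampling, widths, heights);

    // Decode to an interleaved RGB buffer in device memory.
    nvjpegImage_t out = {};
    cudaMalloc(reinterpret_cast<void**>(&out.channel[0]),
               static_cast<size_t>(widths[0]) * heights[0] * 3);
    out.pitch[0] = widths[0] * 3;
    nvjpegDecode(handle, state, jpeg.data(), jpeg.size(),
                 NVJPEG_OUTPUT_RGBI, &out, 0 /* default stream */);
    cudaDeviceSynchronize();

    cudaFree(out.channel[0]);
    nvjpegJpegStateDestroy(state);
    nvjpegDestroy(handle);
    return 0;
}
```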

For additional details on NVENC and NVDEC, reference NVIDIA’s encoder/decoder support matrix. To learn more about GPU-accelerated video encode/decode, see NVIDIA’s Video Codec SDK.
