In-Depth Comparison of NVIDIA “Ampere” GPU Accelerators

This article provides details on the NVIDIA A-series GPU accelerators (codenamed “Ampere”), which improve upon the previous-generation “Volta” architecture. Ampere A100 GPUs began shipping in May 2020, and the NVIDIA A100 80GB GPU was announced in November 2020.

Important features and changes in the “Ampere” GPU architecture include:

  • Exceptional HPC performance:
    • 9.7 TFLOPS FP64 double-precision floating-point performance
    • Up to 19.5 TFLOPS FP64 double-precision via Tensor Core FP64 instruction support
    • 19.5 TFLOPS FP32 single-precision floating-point performance
  • Exceptional AI deep learning training and inference performance:
    • TensorFloat-32 (TF32) instructions improve performance without loss of accuracy
    • Sparse matrix optimizations can potentially double training and inference performance
    • Speedups of 3x~20x for network training with sparse TF32 Tensor Cores (vs. Tesla V100)
    • Speedups of 7x~20x for inference with sparse INT8 Tensor Cores (vs. Tesla V100)
    • Tensor Cores support many data types: FP64, TF32, BF16, FP16, INT8, INT4, and binary (1-bit)
  • High-speed HBM2 Memory delivers 40GB or 80GB capacity at 1.6TB/s or 2TB/s throughput
  • Multi-Instance GPU (MIG) allows each A100 GPU to run up to seven separate, isolated applications
  • 3rd-generation NVLink doubles transfer speeds between GPUs
  • 4th-generation PCI-Express doubles transfer speeds between the system and each GPU
  • Native ECC Memory detects and corrects memory errors without any capacity or performance overhead
  • Larger and Faster L1 Cache and Shared Memory for improved performance
  • Improved L2 Cache is twice as fast and nearly seven times as large as L2 on Tesla V100
  • Compute Data Compression accelerates compressible data patterns, resulting in up to 4x faster DRAM bandwidth, up to 4x faster L2 read bandwidth, and up to 2x increase in L2 capacity.
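To make the TF32 bullet above concrete: TF32 keeps FP32's 8-bit exponent (so it covers the same dynamic range) but stores only 10 explicit mantissa bits, matching FP16's precision. The sketch below emulates that rounding in pure Python by clearing the low 13 mantissa bits of a float32 encoding; it is an illustration of the format, not NVIDIA's hardware rounding (which rounds to nearest rather than truncating).

```python
import struct

def round_to_tf32(x: float) -> float:
    """Emulate TF32 precision: float32's 8-bit exponent, but only 10
    explicit mantissa bits. Implemented by truncating the 13 low
    mantissa bits of the float32 bit pattern (hardware rounds-to-nearest;
    truncation is close enough for illustration)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= ~0x1FFF  # clear the low 13 of float32's 23 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Values representable in 10 mantissa bits pass through unchanged:
print(round_to_tf32(1.5))        # 1.5
# Precision beyond 10 mantissa bits is dropped:
print(round_to_tf32(1.0000001))  # 1.0
# Magnitudes that would overflow FP16 still fit, thanks to the FP32 exponent:
print(round_to_tf32(1e30) != float("inf"))  # True
```

This is why TF32 can often stand in for FP32 in training: the lost mantissa bits cost little accuracy in practice, while the preserved exponent range avoids the overflow/underflow issues that make FP16 training require loss scaling.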

NVIDIA “Ampere” A100 GPU Specifications

The table below summarizes the features of the available NVIDIA Ampere GPU Accelerators. Note that the PCI-Express version of the NVIDIA A100 GPU features a much lower TDP than the SXM4 version of the A100 GPU (250W vs 400W). For this reason, the PCI-Express GPU is not able to sustain peak performance in the same way as the higher-power part. Thus, the performance values of the PCI-E A100 GPU are shown as a range and actual performance will vary by workload.

To learn more about these products, or to find out how best to leverage their capabilities, please speak with an HPC expert.
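The headline numbers in the table are straightforward arithmetic from the core counts and boost clock it lists: peak throughput is cores × 2 FLOPs per FMA × clock, and each third-generation Tensor Core sustains 256 FP16 FMAs (512 FLOPs) per clock. A quick sanity-check sketch:

```python
# Reproduce the A100 SXM4 peak-performance figures from core counts
# and the 1410 MHz boost clock listed in the specification table.
BOOST_CLOCK_HZ = 1410e6

fp32_cores = 6912    # 108 SMs x 64 FP32 CUDA cores
fp64_cores = 3456    # 108 SMs x 32 FP64 CUDA cores
tensor_cores = 432   # 108 SMs x 4 Tensor Cores

# Each CUDA core retires one FMA (2 FLOPs) per clock;
# each 3rd-gen Tensor Core retires 256 FP16 FMAs (512 FLOPs) per clock.
fp32_tflops = fp32_cores * 2 * BOOST_CLOCK_HZ / 1e12
fp64_tflops = fp64_cores * 2 * BOOST_CLOCK_HZ / 1e12
fp16_tensor_tflops = tensor_cores * 512 * BOOST_CLOCK_HZ / 1e12

print(f"FP32: {fp32_tflops:.1f} TFLOPS")                # 19.5
print(f"FP64: {fp64_tflops:.1f} TFLOPS")                # 9.7
print(f"FP16 Tensor: {fp16_tensor_tflops:.0f} TFLOPS")  # 312
```

The same arithmetic with the (unpublished) sustained clock of the 250W PCI-Express card explains why its figures are quoted as a range rather than a single peak value.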

| Feature | NVIDIA A100 SXM4 | NVIDIA A100 40GB PCI-Express |
|---|---|---|
| GPU Chip | Ampere GA100 | Ampere GA100 |
| Tensor Core FP64 Performance* | 19.5 TFLOPS | 17.6 ~ 19.5 TFLOPS |
| Tensor Core TF32 Performance* | 156 TFLOPS † | 140 ~ 156 TFLOPS † |
| Tensor Core FP16/BF16 Performance* | 312 TFLOPS † | 281 ~ 312 TFLOPS † |
| Tensor Core INT8 Performance* | 624 TOPS † | 562 ~ 624 TOPS † |
| Tensor Core INT4 Performance* | 1,248 TOPS † | 1,123 ~ 1,248 TOPS † |
| Double Precision (FP64) Performance* | 9.7 TFLOPS | 8.7 ~ 9.7 TFLOPS |
| Single Precision (FP32) Performance* | 19.5 TFLOPS | 17.6 ~ 19.5 TFLOPS |
| Half Precision (FP16) Performance* | 78 TFLOPS | 70 ~ 78 TFLOPS |
| Brain Floating Point (BF16) Performance* | 39 TFLOPS | 35 ~ 39 TFLOPS |
| GPU Memory | 40GB HBM2 or 80GB HBM2e | 40GB HBM2 |
| Memory Bandwidth | 1,555 GB/s (40GB); 2,039 GB/s (80GB) | 1,555 GB/s |
| L2 Cache | 40MB | 40MB |
| Interconnect | NVLink 3.0 (12 bricks) + PCI-E 4.0 | NVLink 3.0 (12 bricks) + PCI-E 4.0; NVLink limited to pairs of directly-linked cards |
| GPU-to-GPU transfer bandwidth (bidirectional) | 600 GB/s | 600 GB/s |
| Host-to-GPU transfer bandwidth (bidirectional) | 64 GB/s | 64 GB/s |
| # of MIG instances supported | up to 7 | up to 7 |
| # of SM Units | 108 | 108 |
| # of Tensor Cores | 432 | 432 |
| # of integer INT32 CUDA Cores | 6,912 | 6,912 |
| # of single-precision FP32 CUDA Cores | 6,912 | 6,912 |
| # of double-precision FP64 CUDA Cores | 3,456 | 3,456 |
| GPU Base Clock | 1095 MHz | not published |
| GPU Boost Support | Yes – Dynamic | Yes – Dynamic |
| GPU Boost Clock | 1410 MHz | 1410 MHz |
| Compute Capability | 8.0 | 8.0 |
| Workstation Support | no | no |
| Server Support | yes | yes |
| Cooling Type | Passive | Passive |
| Wattage (TDP) | 400W | 250W |

* theoretical peak performance based on GPU boost clock
† an additional 2X performance can be achieved via NVIDIA’s new sparsity feature
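The † sparsity speedup relies on 2:4 structured sparsity: the sparse Tensor Core path requires that, in every group of four consecutive weights, at most two are nonzero. The sketch below shows the magnitude-based pruning that produces such a pattern; it is illustrative only — in practice, pruning is done during or after training by a framework tool (for example, NVIDIA's ASP library) so accuracy can be preserved.

```python
def prune_2_4(row):
    """Prune a weight vector to the 2:4 structured-sparsity pattern used
    by Ampere's sparse Tensor Cores: in every group of 4 consecutive
    values, keep the 2 largest magnitudes and zero the other 2."""
    assert len(row) % 4 == 0
    pruned = list(row)
    for g in range(0, len(row), 4):
        group = pruned[g:g + 4]
        # indices of the two smallest-magnitude entries in this group
        drop = sorted(range(4), key=lambda i: abs(group[i]))[:2]
        for i in drop:
            pruned[g + i] = 0.0
    return pruned

weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.01, 0.4]
print(prune_2_4(weights))  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, 0.0, 0.4]
```

Because exactly half of each group can be skipped, the hardware stores the surviving weights compactly plus a small index, and the sparse Tensor Core path delivers up to 2x the dense math throughput.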

Comparison between “Pascal”, “Volta”, and “Ampere” GPU Architectures

| Feature | Pascal GP100 | Volta GV100 | Ampere GA100 |
|---|---|---|---|
| Compute Capability* | 6.0 | 7.0 | 8.0 |
| Threads per Warp | 32 | 32 | 32 |
| Max Warps per SM | 64 | 64 | 64 |
| Max Threads per SM | 2048 | 2048 | 2048 |
| Max Thread Blocks per SM | 32 | 32 | 32 |
| Max Concurrent Kernels | 128 | 128 | 128 |
| 32-bit Registers per SM | 64 K | 64 K | 64 K |
| Max Registers per Block | 64 K | 64 K | 64 K |
| Max Registers per Thread | 255 | 255 | 255 |
| Max Threads per Block | 1024 | 1024 | 1024 |
| L1 Cache Configuration | 24KB dedicated cache | 32KB ~ 128KB, dynamic with shared memory | 28KB ~ 192KB, dynamic with shared memory |
| Shared Memory Configurations | 64KB | configurable up to 96KB; remainder for L1 Cache (128KB total) | configurable up to 164KB; remainder for L1 Cache (192KB total) |
| Max Shared Memory per SM | 64KB | 96KB | 164KB |
| Max Shared Memory per Thread Block | 48KB | 96KB | 160KB |
| Max X Grid Dimension | 2³²-1 | 2³²-1 | 2³²-1 |
| Tensor Cores | No | Yes | Yes |
| Mixed Precision Warp-Matrix Functions | No | Yes | Yes |
| Hardware-accelerated async-copy | No | No | Yes |
| L2 Cache Residency Management | No | No | Yes |
| Dynamic Parallelism | Yes | Yes | Yes |
| Unified Memory | Yes | Yes | Yes |
| Preemption | Yes | Yes | Yes |

* For a complete listing of Compute Capabilities, reference the NVIDIA CUDA Documentation
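The per-SM limits in the table above interact: a kernel's resident blocks per SM are capped by whichever resource (block slots, thread slots, registers, or shared memory) runs out first. A simplified sketch using the GA100 column's values, with a hypothetical kernel configuration chosen for illustration (it ignores the allocation-granularity rules that the real CUDA occupancy calculator models):

```python
def blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block):
    """Estimate resident thread blocks per SM on GA100 (compute
    capability 8.0), using the per-SM hardware limits from the
    architecture comparison table. Simplified: real occupancy also
    depends on register/shared-memory allocation granularity."""
    MAX_THREADS_PER_SM = 2048
    MAX_BLOCKS_PER_SM = 32
    REGISTERS_PER_SM = 64 * 1024
    MAX_SMEM_PER_SM = 164 * 1024

    limits = {
        "blocks": MAX_BLOCKS_PER_SM,
        "threads": MAX_THREADS_PER_SM // threads_per_block,
        "registers": REGISTERS_PER_SM // (threads_per_block * regs_per_thread),
        "shared memory": (MAX_SMEM_PER_SM // smem_per_block
                          if smem_per_block else MAX_BLOCKS_PER_SM),
    }
    limiter = min(limits, key=limits.get)
    return limits[limiter], limiter

# Hypothetical kernel: 256 threads/block, 64 registers/thread, 32KB smem/block
blocks, limiter = blocks_per_sm(256, 64, 32 * 1024)
print(blocks, limiter)                          # 4 registers
print(f"occupancy: {blocks * 256 / 2048:.0%}")  # occupancy: 50%
```

Here the register file is exhausted before thread slots or shared memory, capping the SM at 4 resident blocks (50% occupancy) even though Ampere's larger shared memory would have allowed 5.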

Hardware-accelerated raytracing, video encoding, video decoding, and image decoding

The NVIDIA “Ampere” Datacenter GPUs have been designed for computational workloads rather than graphics workloads. RT cores for accelerated raytracing are not included in A100. Similarly, video encoding units (NVENC) are not included.

To accelerate workloads that require processing of image or video files, five JPEG decoding (NVJPG) units and five video decoding units (NVDEC) are included in A100. Details are described on NVIDIA’s A100 for computer vision blog post.

For additional details on NVENC and NVDEC, reference NVIDIA’s encoder/decoder support matrix. To learn more about GPU-accelerated video encode/decode, see NVIDIA’s Video Codec SDK.
