In-Depth Comparison of NVIDIA “Ampere” GPU Accelerators

<em>This article provides details on the NVIDIA A-series GPUs (codenamed "Ampere"). "Ampere" GPUs improve upon the previous-generation <a href="https://www.microway.com/knowledge-center-articles/in-depth-comparison-of-nvidia-tesla-volta-gpu-accelerators/" rel="noopener noreferrer" target="_blank">"Volta"</a> architecture. Ampere A100 GPUs began shipping in May 2020; NVIDIA A100 80GB GPUs were announced in November 2020.</em>

<h2>Important features and changes in the "Ampere" GPU architecture include:</h2>
<ul>
<li><strong>Exceptional HPC performance:</strong>
<ul>
<li>9.7 TFLOPS FP64 double-precision floating-point performance</li>
<li>Up to 19.5 TFLOPS FP64 double-precision via Tensor Core FP64 instruction support</li>
<li>19.5 TFLOPS FP32 single-precision floating-point performance</li>
</ul></li>
<li><strong>Exceptional AI deep learning training and inference performance:</strong>
<ul>
<li><strong>TensorFloat-32 (TF32)</strong> instructions improve performance without loss of accuracy <em>(see the code sketch after this list)</em></li>
<li><strong>Sparse matrix optimizations</strong> potentially double training and inference performance</li>
<li>Speedups of 3x~20x for network training with sparse TF32 Tensor Cores (vs. Tesla V100)</li>
<li>Speedups of 7x~20x for inference with sparse INT8 Tensor Cores (vs. Tesla V100)</li>
<li>Tensor Cores support many instruction types: FP64, TF32, BF16, FP16, INT8, INT4, and binary (INT1)</li>
</ul></li>
<li><strong>High-speed HBM2 Memory</strong> delivers 40GB or 80GB capacity at 1.6TB/s or 2TB/s throughput</li>
<li><strong>Multi-Instance GPU</strong> allows each A100 GPU to run up to seven separate, fully-isolated applications</li>
<li><strong>3rd-generation NVLink</strong> doubles transfer speeds between GPUs</li>
<li><strong>4th-generation PCI-Express</strong> doubles transfer speeds between the system and each GPU</li>
<li><strong>Native ECC Memory</strong> detects and corrects memory errors without any capacity or performance overhead</li>
<li><strong>Larger and Faster L1 Cache and Shared Memory</strong> for improved performance</li>
<li><strong>Improved L2 Cache</strong> is twice as fast and nearly seven times as large as L2 on Tesla V100</li>
<li><strong>Compute Data Compression</strong> accelerates compressible data patterns, resulting in up to 4x faster DRAM bandwidth, up to 4x faster L2 read bandwidth, and up to a 2x increase in effective L2 capacity</li>
</ul>
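The TF32 mode referenced in the list above is exposed to programmers through standard libraries. Below is a minimal sketch of a cuBLAS FP32 matrix multiply that opts into TF32 Tensor Core math; the matrix size and fill values are placeholder assumptions, and error checking is omitted for brevity.

<pre><code>// Minimal sketch: opt a cuBLAS FP32 GEMM into TF32 Tensor Core math.
// Matrix size and fill values are placeholder assumptions; production
// code should check every cuBLAS/CUDA return status.
#include &lt;cublas_v2.h&gt;
#include &lt;cuda_runtime.h&gt;
#include &lt;vector&gt;

int main() {
    const int n = 1024;                       // assumed square-matrix size
    const float alpha = 1.0f, beta = 0.0f;

    float *dA, *dB, *dC;
    cudaMalloc(&amp;dA, n * n * sizeof(float));
    cudaMalloc(&amp;dB, n * n * sizeof(float));
    cudaMalloc(&amp;dC, n * n * sizeof(float));

    std::vector&lt;float&gt; hA(n * n, 1.0f), hB(n * n, 2.0f);
    cudaMemcpy(dA, hA.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&amp;handle);

    // Allow TF32 Tensor Core math for FP32 GEMM (CUDA 11+). Pre-Ampere
    // GPUs simply fall back to standard FP32 arithmetic.
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);

    // C = alpha * A * B + beta * C
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &amp;alpha, dA, n, dB, n, &amp;beta, dC, n);

    cudaDeviceSynchronize();
    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}</code></pre>

Compile with <code>nvcc tf32_gemm.cu -lcublas</code>. Note that the additional 2x sparsity speedup requires separately pruning weights to a 2:4 sparse pattern (e.g., via the cuSPARSELt library), which is beyond the scope of this sketch.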

<h2>NVIDIA "Ampere" A100 GPU Specifications</h2>
The table below summarizes the features of the available NVIDIA Ampere GPU Accelerators. Note that the PCI-Express version of the NVIDIA A100 features a much lower TDP than the SXM4 version (250W vs. 400W), so the PCI-Express GPU cannot sustain peak performance for as long as the higher-power part. The performance values of the PCI-E A100 GPU are therefore shown as a range; actual performance will vary by workload.

To learn more about these products, or to find out how best to leverage their capabilities, please speak with an <a href="http://www.microway.com/contact/" title="Talk to an Expert – Contact Microway">HPC expert</a>.

<table>
<thead>
<tr>
<th>Feature</th>
<th>NVIDIA A100 SXM4</th>
<th>NVIDIA A100 40GB PCI-Express</th>
</tr>
</thead>
<tbody>
<tr><td class="rowhead">GPU Chip</td><td colspan=2>Ampere GA100</td></tr>
<tr><td class="rowhead">TensorCore Performance*</td><td class="table-in-table">
<table><tr><td>19.5 TFLOPS</td><td>FP64</td></tr><tr><td>156 TFLOPS &dagger;</td><td>TF32</td></tr><tr><td>312 TFLOPS &dagger;</td><td>FP16/BF16</td></tr><tr><td>624 TOPS &dagger;</td><td>INT8</td></tr><tr><td>1,248 TOPS &dagger;</td><td>INT4</td></tr></table>
</td>
<td class="table-in-table">
<table><tr><td>17.6 ~ 19.5 TFLOPS</td><td>FP64</td></tr><tr><td>140 ~ 156 TFLOPS &dagger;</td><td>TF32</td></tr><tr><td>281 ~ 312 TFLOPS &dagger;</td><td>FP16/BF16</td></tr><tr><td>562 ~ 624 TOPS &dagger;</td><td>INT8</td></tr><tr><td>1,123 ~ 1,248 TOPS &dagger;</td><td>INT4</td></tr></table>
</td>
</tr>
<tr><td class="rowhead">Double Precision (FP64) Performance*</td><td>9.7 TFLOPS</td><td>8.7 ~ 9.7 TFLOPS</td></tr>
<tr><td class="rowhead">Single Precision (FP32) Performance*</td><td>19.5 TFLOPS</td><td>17.6 ~ 19.5 TFLOPS</td></tr>
<tr><td class="rowhead">Half Precision (FP16) Performance*</td><td>78 TFLOPS</td><td>70 ~ 78 TFLOPS</td></tr>
<tr><td class="rowhead">Brain Floating Point (BF16) Performance*</td><td>39 TFLOPS</td><td>35 ~ 39 TFLOPS</td></tr>
<tr><td class="rowhead">On-die Memory</td><td>40GB HBM2 or 80GB HBM2e</td><td>40GB HBM2</td></tr>
<tr><td class="rowhead">Memory Bandwidth</td><td>1,555 GB/s for 40GB, 2,039 GB/s for 80GB</td><td>1,555 GB/s</td></tr>
<tr><td class="rowhead">L2 Cache</td><td colspan=2>40MB</td></tr>
<tr><td class="rowhead">Interconnect</td><td>NVLink 3.0 (12 bricks) + PCI-E 4.0</td><td>NVLink 3.0 (12 bricks) + PCI-E 4.0<br /><em>NVLink is limited to pairs of directly-linked cards</em></td></tr>
<tr><td class="rowhead">GPU-to-GPU transfer bandwidth (bidirectional)</td><td colspan=2>600 GB/s</td></tr>
<tr><td class="rowhead">Host-to-GPU transfer bandwidth (bidirectional)</td><td colspan=2>64 GB/s</td></tr>
<tr><td class="rowhead"># of MIG instances supported</td><td colspan=2>up to 7</td></tr>
<tr><td class="rowhead"># of SM Units</td><td colspan=2>108</td></tr>
<tr><td class="rowhead"># of Tensor Cores</td><td colspan=2>432</td></tr>
<tr><td class="rowhead"># of integer INT32 CUDA Cores</td><td colspan=2>6,912</td></tr>
<tr><td class="rowhead"># of single-precision FP32 CUDA Cores</td><td colspan=2>6,912</td></tr>
<tr><td class="rowhead"># of double-precision FP64 CUDA Cores</td><td colspan=2>3,456</td></tr>
<tr><td class="rowhead">GPU Base Clock</td><td>1095 MHz</td><td>not published</td></tr>
<tr><td class="rowhead">GPU Boost Support</td><td colspan=2>Yes – Dynamic</td></tr>
<tr><td class="rowhead">GPU Boost Clock</td><td colspan=2>1410 MHz</td></tr>
<tr><td class="rowhead">Compute Capability</td><td colspan=2>8.0</td></tr>
<tr><td class="rowhead">Workstation Support</td><td colspan=2>no</td></tr>
<tr><td class="rowhead">Server Support</td><td colspan=2>yes</td></tr>
<tr><td class="rowhead">Cooling Type</td><td colspan=2>Passive</td></tr>
<tr><td class="rowhead">Wattage (TDP)</td><td>400W</td><td>250W</td></tr>
</tbody>
</table>
<em>* theoretical peak performance based on GPU boost clock</em>
<em>&dagger; an additional 2X performance can be achieved via NVIDIA’s new sparsity feature</em>
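Several of the values above (compute capability, SM count, memory capacity) can be verified on a running system through the CUDA runtime. The sketch below is a generic device-property query, not Microway tooling:

<pre><code>// Minimal sketch: query the CUDA runtime for the spec-table values it
// can confirm directly. Compile with: nvcc query.cu
#include &lt;cuda_runtime.h&gt;
#include &lt;cstdio&gt;

int main() {
    int count = 0;
    cudaGetDeviceCount(&amp;count);
    for (int dev = 0; dev &lt; count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&amp;prop, dev);
        // An A100 reports compute capability 8.0 and 108 SMs.
        printf("GPU %d: %s\n", dev, prop.name);
        printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("  SM count:           %d\n", prop.multiProcessorCount);
        printf("  Memory:             %.1f GB\n", prop.totalGlobalMem / 1e9);
        printf("  Memory bus width:   %d-bit\n", prop.memoryBusWidth);
    }
    return 0;
}</code></pre>

Clock speeds and peak-throughput figures in the table come from NVIDIA's published specifications rather than runtime queries.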

<h2>Comparison between "Pascal", "Volta", and "Ampere" GPU Architectures</h2>
<table>
<thead>
<tr>
<th>Feature</th>
<th>Pascal GP100</th>
<th>Volta GV100</th>
<th>Ampere GA100</th>
</tr>
</thead>
<tbody>
<tr><td class="rowhead">Compute Capability*</td><td>6.0</td><td>7.0</td><td>8.0</td></tr>
<tr><td class="rowhead">Threads per Warp</td><td colspan=3>32</td></tr>
<tr><td class="rowhead">Max Warps per SM</td><td colspan=3>64</td></tr>
<tr><td class="rowhead">Max Threads per SM</td><td colspan=3>2048</td></tr>
<tr><td class="rowhead">Max Thread Blocks per SM</td><td colspan=3>32</td></tr>
<tr><td class="rowhead">Max Concurrent Kernels</td><td colspan=3>128</td></tr>
<tr><td class="rowhead">32-bit Registers per SM</td><td colspan=3>64 K</td></tr>
<tr><td class="rowhead">Max Registers per Block</td><td colspan=3>64 K</td></tr>
<tr><td class="rowhead">Max Registers per Thread</td><td colspan=3>255</td></tr>
<tr><td class="rowhead">Max Threads per Block</td><td colspan=3>1024</td></tr>
<tr><td class="rowhead">L1 Cache Configuration</td><td>24KB<br /><em>dedicated cache</em></td><td>32KB ~ 128KB<br /><em>dynamic with shared memory</em></td><td>28KB ~ 192KB<br /><em>dynamic with shared memory</em></td></tr>
<tr><td class="rowhead">Shared Memory Configurations</td><td>64KB</td><td>configurable up to 96KB;<br />remainder for L1 Cache<br /><em>(128KB total)</em></td><td>configurable up to 164KB;<br />remainder for L1 Cache<br /><em>(192KB total)</em></td></tr>
<tr><td class="rowhead">Max Shared Memory per SM</td><td>64KB</td><td>96KB</td><td>164KB</td></tr>
<tr><td class="rowhead">Max Shared Memory per Thread Block</td><td>48KB</td><td>96KB</td><td>160KB</td></tr>
<tr><td class="rowhead">Max X Grid Dimension</td><td colspan=3>2<sup>32-1</sup></td></tr>
<tr><td class="rowhead">Tensor Cores</td><td>No</td><td colspan=2>Yes</td></tr>
<tr><td class="rowhead">Mixed Precision Warp-Matrix Functions</td><td>No</td><td colspan=2>Yes</td></tr>
<tr><td class="rowhead">Hardware-accelerated async-copy</td><td colspan=2>No</td><td>Yes</td></tr>
<tr><td class="rowhead">L2 Cache Residency Management</td><td colspan=2>No</td><td>Yes</td></tr>
<tr><td class="rowhead">Dynamic Parallelism</td><td colspan=3>Yes</td></tr>
<tr><td class="rowhead">Unified Memory</td><td colspan=3>Yes</td></tr>
<tr><td class="rowhead">Preemption</td><td colspan=3>Yes</td></tr>
</tbody>
</table>
<em>* For a complete listing of Compute Capabilities, reference the <a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications__technical-specifications-per-compute-capability" target="_blank" rel="noopener noreferrer">NVIDIA CUDA Documentation</a></em>
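Two of the Ampere-only rows above are directly programmer-visible. Hardware-accelerated async-copy is reached through the cooperative groups <code>memcpy_async</code> API (CUDA 11+); the kernel below is a minimal sketch with a deliberately trivial computation, and it degrades to a blocking copy on pre-Ampere GPUs:

<pre><code>// Minimal sketch: stage one tile of global memory into shared memory
// with memcpy_async. On compute capability 8.0 this uses Ampere's
// hardware async-copy path; older GPUs fall back to a blocking copy.
#include &lt;cooperative_groups.h&gt;
#include &lt;cooperative_groups/memcpy_async.h&gt;
namespace cg = cooperative_groups;

__global__ void scale(const float* in, float* out, float factor) {
    extern __shared__ float tile[];
    cg::thread_block block = cg::this_thread_block();

    // All threads in the block cooperatively issue the async copy.
    cg::memcpy_async(block, tile,
                     in + blockIdx.x * blockDim.x,
                     sizeof(float) * blockDim.x);
    cg::wait(block);  // wait for the tile to land in shared memory

    // Deliberately trivial computation on the staged data.
    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x] * factor;
}</code></pre>

At launch, the kernel expects <code>blockDim.x * sizeof(float)</code> bytes of dynamic shared memory. L2 cache residency management is likewise exposed via CUDA 11 stream attributes; the helper below is a sketch under assumed parameters (the window must fit within the device's <code>accessPolicyMaxWindowSize</code>):

<pre><code>// Minimal sketch: ask the driver to keep a buffer resident ("persisting")
// in the A100's 40MB L2 cache for work submitted to one stream.
#include &lt;cuda_runtime.h&gt;

void pin_in_l2(cudaStream_t stream, void* buffer, size_t bytes) {
    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = buffer;
    attr.accessPolicyWindow.num_bytes = bytes;  // must not exceed
                                                // accessPolicyMaxWindowSize
    attr.accessPolicyWindow.hitRatio  = 1.0f;   // treat all accesses as hits
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
    cudaStreamSetAttribute(stream,
                           cudaStreamAttributeAccessPolicyWindow, &amp;attr);
}</code></pre>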

<h2>Hardware-accelerated raytracing, video encoding, video decoding, and image decoding</h2>
The NVIDIA "Ampere" Datacenter GPUs have been designed for computational workloads rather than graphics workloads. RT cores for accelerated raytracing are not included in A100. Similarly, video encoding units (NVENC) are not included.

To accelerate workloads that require processing of image or video files, the A100 includes five JPEG decode (NVJPG) units and five video decode (NVDEC) units. Details are described in NVIDIA's <a href="https://devblogs.nvidia.com/improving-computer-vision-with-nvidia-a100-gpus/" rel="noopener noreferrer" target="_blank">A100 for computer vision</a> blog post.
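For a sense of the software path to those decode units, the sketch below uses the nvJPEG library (which fronts the NVJPG hardware) to parse a JPEG header. The filename is a placeholder, and whether a full decode is actually offloaded to the hardware units depends on the GPU, library version, and backend selected:

<pre><code>// Minimal sketch: use nvJPEG to parse a JPEG header. "input.jpg" is a
// placeholder filename and error handling is abbreviated.
// Compile with: nvcc jpeg_info.cu -lnvjpeg
#include &lt;nvjpeg.h&gt;
#include &lt;cstdio&gt;
#include &lt;vector&gt;

int main() {
    // Read the JPEG bitstream from disk (host side).
    FILE* f = fopen("input.jpg", "rb");
    if (!f) { perror("input.jpg"); return 1; }
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    std::vector&lt;unsigned char&gt; data(size);
    fread(data.data(), 1, size, f);
    fclose(f);

    nvjpegHandle_t handle;
    nvjpegCreateSimple(&amp;handle);

    // Parse the stream header to learn subsampling and dimensions.
    int nComponents = 0;
    nvjpegChromaSubsampling_t subsampling;
    int widths[NVJPEG_MAX_COMPONENT], heights[NVJPEG_MAX_COMPONENT];
    nvjpegGetImageInfo(handle, data.data(), data.size(),
                       &amp;nComponents, &amp;subsampling, widths, heights);
    printf("%d x %d, %d component(s)\n", widths[0], heights[0], nComponents);

    nvjpegDestroy(handle);
    return 0;
}</code></pre>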

For additional details on NVENC and NVDEC, reference NVIDIA’s <a href="https://developer.nvidia.com/video-encode-decode-gpu-support-matrix" target="_blank" rel="noopener noreferrer">encoder/decoder support matrix</a>. To learn more about GPU-accelerated video encode/decode, see NVIDIA’s <a href="https://developer.nvidia.com/nvidia-video-codec-sdk" target="_blank" rel="noopener noreferrer">Video Codec SDK</a>.


