In-Depth Comparison of NVIDIA Tesla “Volta” GPU Accelerators

<em>This article provides in-depth details of the NVIDIA Tesla V-series GPU accelerators (codenamed "Volta"). "Volta" GPUs improve upon the previous-generation <a href="https://www.microway.com/knowledge-center-articles/in-depth-comparison-of-nvidia-tesla-pascal-gpu-accelerators/" rel="noopener" target="_blank">"Pascal"</a> architecture. Volta GPUs began shipping in September 2017 and were updated to 32GB of memory in March 2018.

This page is intended to be a fast and easy reference of key specs for these GPUs. You may wish to browse our <a href="https://www.microway.com/hpc-tech-tips/nvidia-tesla-v100-price-analysis/" target="_blank" rel="noopener">Tesla V100 Price Analysis</a> and <a href="https://www.microway.com/hpc-tech-tips/tesla-v100-volta-gpu-review/" target="_blank" rel="noopener">Tesla V100 GPU Review</a> for more extended discussion.</em>

<h2>Important features available in the "Volta" GPU architecture include:</h2>
<ul>
<li><strong>Exceptional HPC performance</strong> with up to 7.8 TFLOPS double- and 15.7 TFLOPS single-precision floating-point performance.</li>
<li><strong>Deep Learning training performance</strong> with up to 125 TFLOPS FP16 half-precision floating-point performance.</li>
<li><strong>Deep Learning inference performance</strong> with up to 62.8 TeraOPS INT8 8-bit integer performance.</li>
<li><strong>Simultaneous execution of FP32 and INT32 operations</strong> improves the overall computational throughput of the GPU.</li>
<li><strong>NVLink</strong> enables an 8X to 10X increase in bandwidth between Tesla GPUs, and between GPUs and supported system CPUs (compared with PCI-E).</li>
<li><strong>High-bandwidth HBM2 memory</strong> provides a 3X improvement in memory performance compared to previous-generation GPUs.</li>
<li><strong>Enhanced Unified Memory</strong> allows GPU applications to directly access the memory of all GPUs as well as all of system memory (up to 512TB).</li>
<li><strong>Native ECC Memory</strong> detects and corrects memory errors without any capacity or performance overhead.</li>
<li><strong>Combined L1 Cache and Shared Memory</strong> provides additional flexibility and higher performance than Pascal.</li>
<li><strong>Cooperative Groups</strong> – a new programming model introduced in CUDA 9 for organizing groups of communicating threads.</li>
</ul>

<h2>Tesla "Volta" GPU Specifications</h2>
The table below summarizes the features of the available Tesla Volta GPU Accelerators. To learn more about these products, or to find out how best to leverage their capabilities, please speak with an <a href="http://www.microway.com/contact/" title="Talk to an Expert – Contact Microway">HPC expert</a>.

<table>
<thead>
<tr>
<th>Feature</th>
<th>Tesla V100 SXM2 16GB/32GB</th>
<th>Tesla V100 PCI-E 16GB/32GB</th>
<th>Quadro GV100 32GB</th>
</tr>
</thead>
<tbody>
<tr><td class="rowhead">GPU Chip(s)</td><td colspan=3>Volta GV100</td></tr>
<tr><td class="rowhead">TensorFLOPS</td><td>125 TFLOPS</td><td>112 TFLOPS</td><td>118.5 TFLOPS</td></tr>
<tr><td class="rowhead">Integer Operations (INT8)*</td><td>62.8 TOPS</td><td>56.0 TOPS</td><td>59.3 TOPS</td></tr>
<tr><td class="rowhead">Half Precision (FP16)*</td><td>31.4 TFLOPS</td><td>28 TFLOPS</td><td>29.6 TFLOPS</td></tr>
<tr><td class="rowhead">Single Precision (FP32)*</td><td>15.7 TFLOPS</td><td>14.0 TFLOPS</td><td>14.8 TFLOPS</td></tr>
<tr><td class="rowhead">Double Precision (FP64)*</td><td>7.8 TFLOPS</td><td>7.0 TFLOPS</td><td>7.4 TFLOPS</td></tr>
<tr><td class="rowhead">On-die HBM2 Memory</td><td colspan=2>16GB or 32GB</td><td>32GB</td></tr>
<tr><td class="rowhead">Memory Bandwidth</td><td colspan=2>900 GB/s</td><td>870 GB/s</td></tr>
<tr><td class="rowhead">L2 Cache</td><td colspan=3>6 MB</td></tr>
<tr><td class="rowhead">Interconnect</td><td>NVLink 2.0 (6 bricks) + PCI-E 3.0</td><td>PCI-Express 3.0</td><td>NVLink 2.0 (4 bricks) + PCI-E 3.0</td></tr>
<tr><td class="rowhead">Theoretical transfer bandwidth (bidirectional)</td><td>300 GB/s</td><td>32 GB/s</td><td>200 GB/s</td></tr>
<tr><td class="rowhead">Achievable transfer bandwidth</td><td>143.5 GB/s</td><td>~12 GB/s</td><td>TBM</td></tr>
<tr><td class="rowhead"># of SM Units</td><td colspan=3>80</td></tr>
<tr><td class="rowhead"># of Tensor Cores</td><td colspan=3>640</td></tr>
<tr><td class="rowhead"># of integer INT32 CUDA Cores</td><td colspan=3>5120</td></tr>
<tr><td class="rowhead"># of single-precision FP32 CUDA Cores</td><td colspan=3>5120</td></tr>
<tr><td class="rowhead"># of double-precision FP64 CUDA Cores</td><td colspan=3>2560</td></tr>
<tr><td class="rowhead">GPU Base Clock</td><td>not published</td><td>1245Mhz</td><td>not published</td></tr>
<tr><td class="rowhead">GPU Boost Support</td><td colspan=3>Yes – Dynamic</td></tr>
<tr><td class="rowhead">GPU Boost Clock</td><td>1530 MHz</td><td>~1380 MHz</td><td>TBM</td></tr>
<tr><td class="rowhead">Compute Capability</td><td colspan=3>7.0</td></tr>
<tr><td class="rowhead">Workstation Support</td><td colspan=2>-</td><td>yes</td></tr>
<tr><td class="rowhead">Server Support</td><td colspan=2>yes</td><td>specific server models only</td>
<tr><td class="rowhead">Cooling Type</td><td colspan=2>Passive</td><td>Active</td></tr>
</tr>
<tr><td class="rowhead">Wattage (TDP)</td><td>300W</td><td colspan=2>250W</td></tr>
</tbody>
</table>
<em>* theoretical peak performance with GPU Boost enabled</em>

<h2>Comparison between "Kepler", "Pascal", and "Volta" GPU Architectures</h2>
<table>
<thead>
<tr>
<th>Feature</th>
<th>Kepler GK210</th>
<th>Pascal GP100</th>
<th>Volta GV100</th>
</tr>
</thead>
<tbody>
<tr><td class="rowhead">Compute Capability &Hat;</td><td>3.7</td><td>6.0</td><td>7.0</td></tr>
<tr><td class="rowhead">Threads per Warp</td><td colspan=3>32</td></tr>
<tr><td class="rowhead">Max Warps per SM</td><td colspan=3>64</td></tr>
<tr><td class="rowhead">Max Threads per SM</td><td colspan=3>2048</td></tr>
<tr><td class="rowhead">Max Thread Blocks per SM</td><td>16</td><td colspan=2>32</td></tr>
<tr><td class="rowhead">Max Concurrent Kernels</td><td>32</td><td colspan=2>128</td></tr>
<tr><td class="rowhead">32-bit Registers per SM</td><td>128 K</td><td colspan=2>64 K</td></tr>
<tr><td class="rowhead">Max Registers per Thread Block</td><td colspan=3>64 K</td></tr>
<tr><td class="rowhead">Max Registers per Thread</td><td colspan=3>255</td></tr>
<tr><td class="rowhead">Max Threads per Thread Block</td><td colspan=3>1024</td></tr>
<tr><td class="rowhead">L1 Cache Configuration</td><td>split with shared memory</td><td>24KB dedicated L1 cache</td><td>32KB ~ 128KB<br />(dynamic with shared memory)</td></tr>
<tr><td class="rowhead">Shared Memory Configurations</td><td>16KB + 112KB L1 Cache<br /><br />32KB + 96KB L1 Cache<br /><br />48KB + 80KB L1 Cache<br /><br /><em>(128KB total)</em></td><td>64KB</td><td>configurable up to 96KB; remainder for L1 Cache<br /><br /><em>(128KB total)</em></td></tr>
<tr><td class="rowhead">Max Shared Memory per Thread Block</td><td colspan=2>48KB</td><td>96KB*</td></tr>
<tr><td class="rowhead">Max X Grid Dimension</td><td colspan=3>2<sup>32-1</sup></td></tr>
<tr><td class="rowhead">Hyper-Q</td><td colspan=3>Yes</td></tr>
<tr><td class="rowhead">Dynamic Parallelism</td><td colspan=3>Yes</td></tr>
<tr><td class="rowhead">Unified Memory</td><td>No</td><td colspan=2>Yes</td></tr>
<tr><td class="rowhead">Pre-Emption</td><td>No</td><td colspan=2>Yes</td></tr>
</tbody>
</table>
<em>&Hat; For a complete listing of Compute Capabilities, reference the <a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications__technical-specifications-per-compute-capability" target="_blank">NVIDIA CUDA Documentation</a></em>
<em>* above 48 KB requires dynamic shared memory</em>

<h2>Hardware-accelerated video encoding and decoding</h2>
All NVIDIA "Volta" GPUs include one or more hardware units for video encoding and decoding (NVENC / NVDEC). For complete hardware details, reference NVIDIA’s <a href="https://developer.nvidia.com/video-encode-decode-gpu-support-matrix" target="_blank">encoder/decoder support matrix</a>. To learn more about GPU-accelerated video encode/decode, see NVIDIA’s <a href="https://developer.nvidia.com/nvidia-video-codec-sdk" target="_blank">Video Codec SDK</a>.
