In-Depth Comparison of NVIDIA Tesla “Pascal” GPU Accelerators

<em>This article provides in-depth details of the NVIDIA Tesla P-series GPU accelerators (codenamed "Pascal"). "Pascal" GPUs improve upon the previous-generation <a href="https://www.microway.com/knowledge-center-articles/in-depth-comparison-of-nvidia-tesla-kepler-gpu-accelerators/" target="_blank" rel="noopener noreferrer">"Kepler"</a> and <a href="https://www.microway.com/knowledge-center-articles/in-depth-comparison-of-nvidia-tesla-maxwell-gpu-accelerators/" target="_blank" rel="noopener noreferrer">"Maxwell"</a> architectures. Pascal GPUs were announced at GTC 2016 and began shipping in September 2016. <strong>Note: these have since been superseded by the <a href="https://www.microway.com/knowledge-center-articles/in-depth-comparison-of-nvidia-tesla-volta-gpu-accelerators/" rel="noopener noreferrer" target="_blank">NVIDIA Volta GPU architecture</a>.</strong></em>

<h2>Important changes available in the "Pascal" GPU architecture include:</h2>
<ul>
<li><strong>Exceptional performance</strong> with up to 5.3 TFLOPS double- and 10.6 TFLOPS single-precision floating-point performance.</li>
<li><strong>NVLink</strong> enables a 5X increase in bandwidth between Tesla Pascal GPUs and from GPUs to supported system CPUs (compared with PCI-E).</li>
<li><strong>High-bandwidth HBM2 memory</strong> provides a 3X improvement in memory performance compared to Kepler and Maxwell GPUs.</li>
<li><strong>Pascal Unified Memory</strong> allows GPU applications to directly access the memory of all GPUs as well as all of system memory (up to 512TB).</li>
<li><strong>Up to 4MB of L2 cache</strong> is available on Pascal GPUs (compared to 1.5MB on Kepler and 3MB on Maxwell).</li>
<li><strong>Native ECC Memory</strong> detects and corrects memory errors without any capacity or performance overhead.</li>
<li><strong>Energy-efficiency</strong> – Pascal GPUs deliver nearly twice the FLOPS per Watt as Kepler GPUs.</li>
<li><strong>Efficient SM units</strong> – Pascal's architecture doubles the number of registers available per CUDA core (each GP100 SM has half as many cores as a Maxwell SM, but the same size register file).</li>
<li><strong>Improved atomics</strong> in Pascal provide a native atomic add instruction for FP64 values in global memory (earlier GPUs had to emulate double-precision atomic adds with compare-and-swap loops); see the sketch after this list. Atomics can also be performed on the memory of other GPUs in the system.</li>
<li><strong>Half-precision FP</strong> support improves performance for low-precision floating-point operations (frequently used in neural network training).</li>
<li><strong>INT8</strong> support improves performance for low-precision integer operations (frequently used in neural network inference).</li>
<li><strong>Compute Preemption</strong> allows higher-priority tasks to interrupt currently-running tasks.</li>
</ul>
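Several of these features can be exercised directly from CUDA C++. Below is a minimal sketch (our own illustration with our own names, not NVIDIA sample code) that allocates Unified Memory with <code>cudaMallocManaged</code>, performs packed FP16 arithmetic with <code>__hadd2</code>, and uses Pascal's native FP64 <code>atomicAdd</code> in global memory. Compile with <code>nvcc -arch=sm_60</code>:

<pre>
#include &lt;cstdio&gt;
#include &lt;cuda_fp16.h&gt;
#include &lt;cuda_runtime.h&gt;

// Illustrative kernel: packed FP16 math (half2) and a native
// double-precision atomicAdd in global memory (compute capability 6.0+).
__global__ void pascalFeatures(double *sum, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i &lt; n) {
        __half2 h = __floats2half2_rn(1.0f, 2.0f); // pack two FP16 values
        h = __hadd2(h, h);                         // two FP16 adds in one instruction
        out[i] = __low2float(h) + __high2float(h); // 2.0 + 4.0 = 6.0
        atomicAdd(sum, 1.0);                       // native FP64 atomic on Pascal
    }
}

int main()
{
    const int n = 1024;
    double *sum;
    float *out;
    // Unified Memory: a single pointer valid on both host and device;
    // Pascal's page-migration engine moves pages on demand.
    cudaMallocManaged(&amp;sum, sizeof(double));
    cudaMallocManaged(&amp;out, n * sizeof(float));
    *sum = 0.0;

    pascalFeatures&lt;&lt;&lt;(n + 255) / 256, 256&gt;&gt;&gt;(sum, out, n);
    cudaDeviceSynchronize();

    printf("sum = %.0f, out[0] = %.1f\n", *sum, out[0]); // 1024, 6.0
    cudaFree(sum);
    cudaFree(out);
    return 0;
}
</pre>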

<h2>Tesla "Pascal" GPU Specifications</h2>
The table below summarizes the features of the available Tesla Pascal GPU Accelerators. To learn more about any of these products, or to find out how best to leverage their capabilities, please speak with an <a href="https://www.microway.com/contact/" title="Talk to an Expert – Contact Microway">HPC expert</a>.

"HPC

<table>
<thead>
<tr>
<th>Feature</th>
<th>Tesla P100 SXM2 16GB</th>
<th>Tesla P100 PCI-E 16GB</th>
<th>Tesla P100 PCI-E 12GB</th>
</tr>
</thead>
<tbody>
<tr><td class="rowhead">GPU Chip(s)</td><td colspan=3>Pascal GP100</td></tr>
<tr><td class="rowhead">Integer Operations (INT8)*</td><td colspan=3>-</td></tr>
<tr><td class="rowhead">Half Precision (FP16)*</td><td>21.2 TFLOPS</td><td colspan=2>18.7 TFLOPS</td></tr>
<tr><td class="rowhead">Single Precision (FP32)*</td><td>10.6 TFLOPS</td><td colspan=2>9.3 TFLOPS</td></tr>
<tr><td class="rowhead">Double Precision (FP64)*</td><td>5.3 TFLOPS</td><td colspan=2>4.7 TFLOPS</td></tr>
<tr><td class="rowhead">On-die HBM2 Memory</td><td colspan=2>16GB</td><td>12GB</td></tr>
<tr><td class="rowhead">Memory Bandwidth</td><td colspan=2>732 GB/s</td><td>549 GB/s</td></tr>
<tr><td class="rowhead">L2 Cache</td><td colspan=3>4 MB</td></tr>
<tr><td class="rowhead">Interconnect</td><td>NVLink + PCI-E 3.0</td><td colspan=2>PCI-Express 3.0</td></tr>
<tr><td class="rowhead">Theoretical transfer bandwidth</td><td>80 GB/s</td><td colspan=2>16 GB/s</td></tr>
<tr><td class="rowhead">Achievable transfer bandwidth</td><td>~66 GB/s</td><td colspan=2>~12 GB/s</td></tr>
<tr><td class="rowhead"># of SM Units</td><td colspan=3>56</td></tr>
<tr><td class="rowhead"># of single-precision CUDA Cores</td><td colspan=3>3584</td></tr>
<tr><td class="rowhead"># of double-precision CUDA Cores</td><td colspan=3>1792</td></tr>
<tr><td class="rowhead">GPU Base Clock</td><td>1328 MHz</td><td colspan=2>1126 MHz</td></tr>
<tr><td class="rowhead">GPU Boost Support</td><td colspan=3>Yes – Dynamic</td></tr>
<tr><td class="rowhead">GPU Boost Clock</td><td>1480 MHz</td><td colspan=2>1303 MHz</td></tr>
<tr><td class="rowhead">Compute Capability</td><td colspan=3>6.0</td></tr>
<tr><td class="rowhead">Workstation Support</td><td colspan=3>-</td></tr>
<tr><td class="rowhead">Server Support</td><td colspan=3>yes</td></tr>
<tr><td class="rowhead">Wattage (TDP)</td><td>300W</td><td colspan=2>250W</td></tr>
</tbody>
</table>
<em>* Theoretical peak performance, assuming GPU Boost clocks</em>
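The starred figures follow directly from the core counts and boost clocks listed above: each CUDA core retires one fused multiply-add (two floating-point operations) per clock, so the P100 SXM2 peaks at 2 × 3584 × 1.480 GHz ≈ 10.6 TFLOPS FP32; FP64 runs at half that rate (1792 FP64 cores) and FP16 on GP100 at twice it. The sketch below reproduces this arithmetic from <code>cudaGetDeviceProperties</code>; note that the cores-per-SM lookup is our own assumption, since the CUDA runtime does not report CUDA core counts directly:

<pre>
#include &lt;cstdio&gt;
#include &lt;cuda_runtime.h&gt;

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&amp;prop, 0);

    // The runtime does not report cores per SM; this lookup assumes Pascal:
    // 64 FP32 cores/SM on GP100 (CC 6.0), 128 on GP102/GP104 (CC 6.1).
    int coresPerSM = (prop.major == 6 &amp;&amp; prop.minor == 0) ? 64 : 128;
    double clockGHz = prop.clockRate / 1e6;  // clockRate is reported in kHz

    double tflops = 2.0 * coresPerSM * prop.multiProcessorCount * clockGHz / 1000.0;
    printf("%s: %d SMs x %d cores @ %.0f MHz -&gt; ~%.1f TFLOPS FP32\n",
           prop.name, prop.multiProcessorCount, coresPerSM,
           clockGHz * 1000.0, tflops);
    return 0;
}
</pre>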

"Deep

<table>
<thead>
<tr>
<th>Feature</th>
<th>Tesla P40 PCI-E 24GB</th>
</tr>
</thead>
<tbody>
<tr><td class="rowhead">GPU Chip(s)</td><td>Pascal GP102</td></tr>
<tr><td class="rowhead">Integer Operations (INT8)*</td><td>47 TOPS</td></tr>
<tr><td class="rowhead">Half Precision (FP16)*</td><td>-</td></tr>
<tr><td class="rowhead">Single Precision (FP32)*</td><td>12 TFLOPS</td></tr>
<tr><td class="rowhead">Double Precision (FP64)*</td><td>-</td></tr>
<tr><td class="rowhead">Onboard GDDR5 Memory</td><td>24GB</td></tr>
<tr><td class="rowhead">Memory Bandwidth</td><td>346 GB/s</td></tr>
<tr><td class="rowhead">L2 Cache</td><td>3 MB</td></tr>
<tr><td class="rowhead">Interconnect</td><td>PCI-Express 3.0</td></tr>
<tr><td class="rowhead">Theoretical transfer bandwidth</td><td>16 GB/s</td></tr>
<tr><td class="rowhead">Achievable transfer bandwidth</td><td>~12 GB/s</td></tr>
<tr><td class="rowhead"># of SM Units</td><td>30</td></tr>
<tr><td class="rowhead"># of single-precision CUDA Cores</td><td colspan=3>3840</td></tr>
<tr><td class="rowhead">GPU Base Clock</td><td>1303 MHz</td></tr>
<tr><td class="rowhead">GPU Boost Support</td><td>Yes – Dynamic</td></tr>
<tr><td class="rowhead">GPU Boost Clock</td><td>1531 MHz</td></tr>
<tr><td class="rowhead">Compute Capability</td><td>6.1</td></tr>
<tr><td class="rowhead">Workstation Support</td><td>-</td></tr>
<tr><td class="rowhead">Server Support</td><td>yes</td></tr>
<tr><td class="rowhead">Wattage (TDP)</td><td>250W</td></tr>
</tbody>
</table>
<em>* Theoretical peak performance, assuming GPU Boost clocks</em>
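The P40's INT8 throughput comes from the <code>dp4a</code> instruction introduced with compute capability 6.1: each CUDA core can issue four 8-bit multiplies plus four accumulates (8 integer operations) per clock, so 3840 cores × 8 × 1.531 GHz ≈ 47 TOPS. A minimal kernel sketch using the <code>__dp4a</code> intrinsic (our own example; compile with <code>nvcc -arch=sm_61</code>):

<pre>
#include &lt;cuda_runtime.h&gt;

// Each int packs four signed 8-bit values. __dp4a multiplies the four
// byte pairs and accumulates the results into a 32-bit integer, giving
// 8 integer operations per instruction (the basis of the 47 TOPS figure).
__global__ void int8DotProduct(const int *a, const int *b, int *acc, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i &lt; n)
        acc[i] = __dp4a(a[i], b[i], acc[i]);
}
</pre>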

<hr />

<h2>Comparison between "Kepler", "Maxwell", and "Pascal" GPU Architectures</h2>
<table>
<thead>
<tr>
<th>Feature</th>
<th>Kepler GK210</th>
<th>Maxwell GM200</th>
<th>Maxwell GM204</th>
<th>Pascal GP100</th>
<th>Pascal GP102</th>
</tr>
</thead>
<tbody>
<tr><td class="rowhead">Compute Capability</td><td>3.7</td><td colspan=2>5.2</td><td>6.0</td><td>6.1</td></tr>
<tr><td class="rowhead">Threads per Warp</td><td colspan=5>32</td></tr>
<tr><td class="rowhead">Max Warps per SM</td><td colspan=5>64</td></tr>
<tr><td class="rowhead">Max Threads per SM</td><td colspan=5>2048</td></tr>
<tr><td class="rowhead">Max Thread Blocks per SM</td><td>16</td><td colspan=4>32</td></tr>
<tr><td class="rowhead">Max Concurrent Kernels</td><td colspan=3>32</td><td>128</td><td>32</td></tr>
<tr><td class="rowhead">32-bit Registers per SM</td><td>128 K</td><td colspan=4>64 K</td></tr>
<tr><td class="rowhead">Max Registers per Thread Block</td><td colspan=5>64 K</td></tr>
<tr><td class="rowhead">Max Registers per Thread</td><td colspan=5>255</td></tr>
<tr><td class="rowhead">Max Threads per Thread Block</td><td colspan=5>1024</td></tr>
<tr><td class="rowhead">L1 Cache Configuration</td><td colspan=3>split with shared memory</td><td colspan=2>24KB dedicated L1 cache</td></tr>
<tr><td class="rowhead">Shared Memory Configurations</td><td>16KB + 112KB L1 Cache<br /><br />32KB + 96KB L1 Cache<br /><br />48KB + 80KB L1 Cache<br /><br /><em>(128KB total)</em></td><td colspan=2>96KB dedicated</td><td>64KB dedicated</td><td>96KB dedicated</td></tr>
<tr><td class="rowhead">Max Shared Memory per Thread Block</td><td colspan=5>48KB</td></tr>
<tr><td class="rowhead">Max X Grid Dimension</td><td colspan=5>2<sup>32-1</sup></td></tr>
<tr><td class="rowhead">Hyper-Q</td><td colspan=5>Yes</td></tr>
<tr><td class="rowhead">Dynamic Parallelism</td><td colspan=5>Yes</td></tr>
</tbody>
</table>
<em>For a complete listing of Compute Capabilities, reference the <a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications__technical-specifications-per-compute-capability" target="_blank" rel="noopener noreferrer">NVIDIA CUDA Documentation</a></em>
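Portable CUDA code usually selects per-architecture paths at compile time with <code>__CUDA_ARCH__</code>. For example, the wrapper below (our naming) uses Pascal's native FP64 <code>atomicAdd</code> on compute capability 6.0 and newer, and otherwise falls back to the compare-and-swap emulation shown in the CUDA C Programming Guide for Kepler and Maxwell:

<pre>
#include &lt;cuda_runtime.h&gt;

// FP64 atomic add: native on CC &gt;= 6.0 (Pascal), emulated with a
// compare-and-swap loop on earlier architectures.
__device__ double atomicAddFP64(double *address, double val)
{
#if __CUDA_ARCH__ &gt;= 600
    return atomicAdd(address, val);
#else
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);  // retry if another thread updated the value
    return __longlong_as_double(old);
#endif
}
</pre>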
<hr />

<h2>Additional Tesla "Pascal" GPU products</h2>
NVIDIA has also released Tesla P4 GPUs. These GPUs are primarily for embedded and hyperscale deployments, and are not expected to be used in the HPC space.

<h2>Hardware-accelerated video encoding and decoding</h2>
All NVIDIA "Pascal" GPUs include one or more hardware units for video encoding and decoding (NVENC / NVDEC). For complete hardware details, reference NVIDIA’s <a href="https://developer.nvidia.com/video-encode-decode-gpu-support-matrix" target="_blank" rel="noopener noreferrer">encoder/decoder support matrix</a>. To learn more about GPU-accelerated video encode/decode, see NVIDIA’s <a href="https://developer.nvidia.com/nvidia-video-codec-sdk" target="_blank" rel="noopener noreferrer">Video Codec SDK</a>.


