In-Depth Comparison of NVIDIA Tesla Kepler GPU Accelerators

Revision for “In-Depth Comparison of NVIDIA Tesla Kepler GPU Accelerators” created on September 16, 2020 @ 12:25:09 [Autosave]

<em>This article provides in-depth details of the NVIDIA Tesla K-series GPU accelerators (codenamed "Kepler"). "Kepler" GPUs improve upon the previous-generation "Fermi" architecture.

For more information on other Tesla GPU architectures, please refer to:</em>
<ul>
<li><a href="">In-Depth Comparison of NVIDIA Tesla “Maxwell” GPU Accelerators</a></li>
<li><a href="">In-Depth Comparison of NVIDIA Tesla “Volta” GPU Accelerators</a></li>
</ul>

<h2>Important changes available in the "Kepler" GPU architecture include:</h2>
<ul>
<li><strong>Dynamic parallelism</strong> allows GPU threads to launch new kernels directly from the GPU. This simplifies parallel programming and avoids unnecessary communication between the GPU and the CPU.</li>
<li><strong>Hyper-Q</strong> enables up to 32 work queues per GPU, so multiple CPU cores and MPI processes can address the GPU concurrently. This improves utilization when a single process cannot keep the GPU busy.</li>
<li><strong>SMX architecture</strong> provides a new streaming multiprocessor design optimized for performance per watt. Each SMX contains 192 CUDA cores (up from 32 cores per SM in Fermi GF100).</li>
<li><strong>PCI-Express generation 3.0</strong> doubles data transfer rates between the host and the GPU (supported on Tesla K10, K40, and K80).</li>
<li><strong>GPU Boost</strong> raises the clock speed of all CUDA cores when power headroom allows, providing a 30+% performance boost for many common applications.</li>
<li>Each SMX contains twice as many registers as a Fermi SM (with another 2X on Tesla K80). On GK110 and GK210, each thread may address <strong>four times as many registers</strong> (255, up from 63).</li>
<li>Shared memory bank width is doubled. Likewise, <strong>shared memory bandwidth is doubled</strong>. Tesla K80 features an additional 2X increase in shared memory size.</li>
<li><strong>Shuffle instructions</strong> allow threads within a warp to share data without use of shared memory.</li>
</ul>
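The PCI-Express doubling noted in the list above can be checked with a short calculation. The sketch below is illustrative only and rests on assumptions the article does not state: a x16 link, 5 GT/s with 8b/10b encoding for generation 2.0, and 8 GT/s with 128b/130b encoding for generation 3.0. The function name is hypothetical.

```python
# Theoretical PCI-Express bandwidth in one direction.
# Assumptions (not from the article): x16 link, 5 GT/s with 8b/10b encoding
# for generation 2.0, 8 GT/s with 128b/130b encoding for generation 3.0.

def pcie_bandwidth_gb_s(transfer_rate_gt_s, payload_bits, encoded_bits, lanes=16):
    """Return the theoretical one-direction bandwidth in GB/s."""
    usable_gbit_per_lane = transfer_rate_gt_s * payload_bits / encoded_bits
    return usable_gbit_per_lane * lanes / 8  # 8 bits per byte

gen2 = pcie_bandwidth_gb_s(5.0, 8, 10)
gen3 = pcie_bandwidth_gb_s(8.0, 128, 130)

print(f"PCIe 2.0 x16: {gen2:.2f} GB/s")
print(f"PCIe 3.0 x16: {gen3:.2f} GB/s")
print(f"Gen 3.0 / Gen 2.0: {gen3 / gen2:.2f}x")
```

These are theoretical ceilings. The achievable transfer rates listed in the specification tables are lower because of protocol overhead.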

<h2>"Kepler" Tesla GPU Specifications</h2>
The table below summarizes the features of the available Tesla GPU Accelerators. To learn more about any of these products, or to find out how best to leverage their capabilities, please speak with an <a href="" title="Talk to an Expert – Contact Microway">HPC expert</a>.


<table>
<tr><th></th><th>Tesla K80</th><th>Tesla K40</th></tr>
<tr><td class="rowhead">GPU Chip(s)</td><td>2x Kepler GK210</td><td>Kepler GK110b</td></tr>
<tr><td class="rowhead">Peak Single Precision (base clocks)</td><td>5.60 TFLOPS (both GPUs combined)</td><td>4.29 TFLOPS</td></tr>
<tr><td class="rowhead">Peak Double Precision (base clocks)</td><td>1.87 TFLOPS (both GPUs combined)</td><td>1.43 TFLOPS</td></tr>
<tr><td class="rowhead">Peak Single Precision (GPU Boost)</td><td>8.73 TFLOPS (both GPUs combined)</td><td>5.04 TFLOPS</td></tr>
<tr><td class="rowhead">Peak Double Precision (GPU Boost)</td><td>2.91 TFLOPS (both GPUs combined)</td><td>1.68 TFLOPS</td></tr>
<tr><td class="rowhead">Onboard GDDR5 Memory<sup>1</sup></td><td>24 GB (12 GB per GPU)</td><td>12 GB</td></tr>
<tr><td class="rowhead">Memory Bandwidth<sup>1</sup></td><td>480 GB/s (240 GB/s per GPU)</td><td>288 GB/s</td></tr>
<tr><td class="rowhead">PCI-Express Generation</td><td colspan=2>3.0</td></tr>
<tr><td class="rowhead">Achievable PCI-E transfer bandwidth</td><td>12 GB/s</td><td>12 GB/s</td></tr>
<tr><td class="rowhead"># of SMX Units</td><td>26 (13 per GPU)</td><td>15</td></tr>
<tr><td class="rowhead"># of CUDA Cores</td><td>4992 (2496 per GPU)</td><td>2880</td></tr>
<tr><td class="rowhead">Memory Clock</td><td>2500 MHz</td><td>3004 MHz</td></tr>
<tr><td class="rowhead">GPU Base Clock</td><td>560 MHz</td><td>745 MHz</td></tr>
<tr><td class="rowhead">GPU Boost Support</td><td>Yes – Dynamic</td><td>Yes – Static</td></tr>
<tr><td class="rowhead">GPU Boost Clocks</td><td>23 levels between 562 MHz and 875 MHz</td><td>810 MHz<br />875 MHz</td></tr>
<tr><td class="rowhead">Architecture features</td><td colspan=2>SMX, Dynamic Parallelism, Hyper-Q</td></tr>
<tr><td class="rowhead">Compute Capability</td><td>3.7</td><td>3.5</td></tr>
<tr><td class="rowhead">Workstation Support</td><td>-</td><td>Yes</td></tr>
<tr><td class="rowhead">Server Support</td><td colspan=2>Yes</td></tr>
<tr><td class="rowhead">Wattage (TDP)</td><td>300W (plus Zero Power Idle)</td><td>235W</td></tr>
</table>
<em>1. Measured with ECC disabled. Memory capacity and performance are reduced with ECC enabled.</em>
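The peak throughput figures in the table follow from the core counts and clock speeds. The sketch below reproduces them under two assumptions that the article does not state explicitly: each CUDA core completes one fused multiply-add (two floating-point operations) per cycle, and GK110b and GK210 run double precision at one third of the single-precision rate. The function and dictionary names are hypothetical.

```python
# Peak throughput = CUDA cores x 2 FLOPs per cycle (one fused multiply-add) x clock.
# Assumption (not stated in the article): GK110b and GK210 execute double
# precision at one third of the single-precision rate.

CARDS = {
    # name: (CUDA cores, base clock in MHz, highest boost clock in MHz)
    "Tesla K80": (4992, 560, 875),
    "Tesla K40": (2880, 745, 875),
}

def peak_tflops(cores, clock_mhz, dp_ratio=1.0):
    """Return peak TFLOPS; pass dp_ratio=1/3 for double precision."""
    return cores * 2 * clock_mhz * 1e6 * dp_ratio / 1e12

for name, (cores, base, boost) in CARDS.items():
    print(name)
    print(f"  single precision: {peak_tflops(cores, base):.2f} base, "
          f"{peak_tflops(cores, boost):.2f} boost")
    print(f"  double precision: {peak_tflops(cores, base, 1 / 3):.2f} base, "
          f"{peak_tflops(cores, boost, 1 / 3):.2f} boost")
    print(f"  boost clock gain: {100 * (boost / base - 1):.0f}%")
```

The computed values agree with the table to within about 0.01 TFLOPS. The small differences come from rounding in the published figures.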


The models listed below are still available for sale in certain scenarios, but are not generally recommended. They offer lower performance than Tesla K40 or K80 (and do not cost any less).

<table>
<tr><th></th><th>Tesla K20X</th><th>Tesla K20</th><th>Tesla K10</th></tr>
<tr><td class="rowhead">GPU Chip(s)</td><td colspan=2>Kepler GK110</td><td>2x Kepler GK104</td></tr>
<tr><td class="rowhead">Peak Single Precision</td><td>3.95 TFLOPS</td><td>3.52 TFLOPS</td><td>2.3 TFLOPS per GPU</td></tr>
<tr><td class="rowhead">Peak Double Precision</td><td>1.32 TFLOPS</td><td>1.17 TFLOPS</td><td>95 GFLOPS per GPU</td></tr>
<tr><td class="rowhead">Onboard GDDR5 Memory<sup>1</sup></td><td>6 GB</td><td>5 GB</td><td>4 GB per GPU</td></tr>
<tr><td class="rowhead">Memory Bandwidth<sup>1</sup></td><td>250 GB/s</td><td>208 GB/s</td><td>160 GB/s per GPU</td></tr>
<tr><td class="rowhead">PCI-Express Generation</td><td colspan=2>2.0</td><td>3.0</td></tr>
<tr><td class="rowhead">Achievable PCI-E transfer bandwidth</td><td colspan=2>6 GB/s</td><td>11 GB/s</td></tr>
<tr><td class="rowhead"># of SMX Units</td><td>14</td><td>13</td><td>8 per GPU</td></tr>
<tr><td class="rowhead"># of CUDA Cores</td><td>2688</td><td>2496</td><td>1536 per GPU</td></tr>
<tr><td class="rowhead">Memory Clock</td><td>2600 MHz</td><td>2600 MHz</td><td>2500 MHz</td></tr>
<tr><td class="rowhead">GPU Base Clock</td><td>732 MHz</td><td>705 MHz</td><td>745 MHz</td></tr>
<tr><td class="rowhead">GPU Boost Support</td><td>Limited</td><td>-</td><td>-</td></tr>
<tr><td class="rowhead">GPU Boost Clocks</td><td>758 MHz<br />784 MHz</td><td>-</td><td>-</td></tr>
<tr><td class="rowhead">Architecture features</td><td colspan=2>SMX, Dynamic Parallelism, Hyper-Q</td><td>SMX</td></tr>
<tr><td class="rowhead">Compute Capability</td><td colspan=2>3.5</td><td>3.0</td></tr>
<tr><td class="rowhead">Workstation Support</td><td>-</td><td>Yes</td><td>-</td></tr>
<tr><td class="rowhead">Server Support</td><td colspan=3>Yes</td></tr>
<tr><td class="rowhead">Wattage (TDP)</td><td>235W</td><td colspan=2>225W</td></tr>
</table>
<em>1. Measured with ECC disabled. Memory capacity and performance are reduced with ECC enabled.</em>
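The memory bandwidth figures can be derived from the memory clocks in the table. The sketch below assumes that the GDDR5 data rate is twice the listed memory clock, and it uses memory bus widths (384, 320, and 256 bits) that come from NVIDIA's published specifications rather than from this article. The names are hypothetical.

```python
# Bandwidth = memory clock x 2 transfers per clock x bus width / 8 bits per byte.
# The bus widths below are assumptions taken from NVIDIA's published
# specifications; they do not appear in the article's tables.

BOARDS = {
    # name: (memory clock in MHz, memory bus width in bits)
    "Tesla K20X": (2600, 384),
    "Tesla K20": (2600, 320),
    "Tesla K10 (per GPU)": (2500, 256),
}

def memory_bandwidth_gb_s(mem_clock_mhz, bus_width_bits):
    """Return peak memory bandwidth in GB/s (ECC disabled)."""
    transfers_per_second = mem_clock_mhz * 1e6 * 2
    return transfers_per_second * bus_width_bits / 8 / 1e9

for name, (clock, width) in BOARDS.items():
    print(f"{name}: {memory_bandwidth_gb_s(clock, width):.1f} GB/s")
```

Under the same assumptions, a 384-bit bus gives the 240 GB/s per GPU listed for Tesla K80 (2500 MHz) and the 288 GB/s listed for Tesla K40 (3004 MHz).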

<h2>Comparison between "Fermi" and "Kepler" GPU Architectures</h2>
<table>
<tr><th></th><th>Fermi GF100</th><th>Fermi GF104</th><th>Kepler GK104</th><th>Kepler GK110(b)</th><th>Kepler GK210</th></tr>
<tr><td class="rowhead">Compute Capability</td><td>2.0</td><td>2.1</td><td>3.0</td><td>3.5</td><td>3.7</td></tr>
<tr><td class="rowhead">Threads per Warp</td><td colspan=5>32</td></tr>
<tr><td class="rowhead">Max Warps per SM</td><td colspan=2>48</td><td colspan=3>64</td></tr>
<tr><td class="rowhead">Max Threads per SM</td><td colspan=2>1536</td><td colspan=3>2048</td></tr>
<tr><td class="rowhead">Max Thread Blocks per SM</td><td colspan=2>8</td><td colspan=3>16</td></tr>
<tr><td class="rowhead">32-bit Registers per SM</td><td colspan=2>32 K</td><td colspan=2>64 K</td><td>128 K</td></tr>
<tr><td class="rowhead">Max Registers per Thread Block</td><td colspan=2>32 K</td><td colspan=3>64 K</td></tr>
<tr><td class="rowhead">Max Registers per Thread</td><td colspan=3>63</td><td colspan=2>255</td></tr>
<tr><td class="rowhead">Max Threads per Thread Block</td><td colspan=5>1024</td></tr>
<tr><td class="rowhead">Shared Memory Configurations<br /><em>(remainder is configured as L1 Cache)</em></td><td colspan=2>16KB + 48KB L1 Cache<br /><br />48KB + 16KB L1 Cache<br /><br /><em>(64KB total)</em></td><td colspan=2>16KB + 48KB L1 Cache<br /><br />32KB + 32KB L1 Cache<br /><br />48KB + 16KB L1 Cache<br /><br /><em>(64KB total)</em></td><td>80KB + 48KB L1 Cache<br /><br />96KB + 32KB L1 Cache<br /><br />112KB + 16KB L1 Cache<br /><br /><em>(128KB total)</em></td></tr>
<tr><td class="rowhead">Max Shared Memory per Thread Block</td><td colspan=5>48KB</td></tr>
<tr><td class="rowhead">Max X Grid Dimension</td><td colspan=2>2<sup>16</sup>-1</td><td colspan=3>2<sup>31</sup>-1</td></tr>
<tr><td class="rowhead">Hyper-Q</td><td>-</td><td>-</td><td>-</td><td colspan=2>Yes</td></tr>
<tr><td class="rowhead">Dynamic Parallelism</td><td>-</td><td>-</td><td>-</td><td colspan=2>Yes</td></tr>
</table>
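The per-SM limits in the table above determine how many warps can be resident at once, which is why the larger register file on GK210 matters for register-heavy kernels. The sketch below is a simplified estimate, not NVIDIA's occupancy calculator: it ignores register allocation granularity and shared memory usage, so real occupancy may be lower. The function name and the example kernel (64 registers per thread, 256 threads per block) are hypothetical.

```python
# Simplified occupancy estimate using the per-SM limits from the table above.
# Assumptions: register allocation granularity and shared memory usage are
# ignored, so real occupancy may be lower than this estimate.

WARP_SIZE = 32

LIMITS = {
    # chip: (registers per SM, max warps per SM, max thread blocks per SM,
    #        max registers per thread, max registers per thread block)
    "Fermi GF100": (32 * 1024, 48, 8, 63, 32 * 1024),
    "Kepler GK110": (64 * 1024, 64, 16, 255, 64 * 1024),
    "Kepler GK210": (128 * 1024, 64, 16, 255, 64 * 1024),
}

def resident_warps(chip, regs_per_thread, threads_per_block):
    """Estimate how many warps of one kernel fit on a single SM."""
    regs_per_sm, max_warps, max_blocks, max_regs, max_block_regs = LIMITS[chip]
    if regs_per_thread > max_regs:
        raise ValueError("exceeds the per-thread register limit for this chip")
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    regs_per_block = regs_per_thread * warps_per_block * WARP_SIZE
    if regs_per_block > max_block_regs:
        raise ValueError("exceeds the per-block register limit for this chip")
    blocks = min(max_blocks,
                 max_warps // warps_per_block,
                 regs_per_sm // regs_per_block)
    return blocks * warps_per_block

for chip in ("Kepler GK110", "Kepler GK210"):
    warps = resident_warps(chip, regs_per_thread=64, threads_per_block=256)
    print(f"{chip}: {warps} of {LIMITS[chip][1]} warps resident")
```

For this example kernel the register file limits GK110 to half of its maximum resident warps, while GK210 reaches the full 64.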

