Compute performance has been increasing exponentially for your entire life – it doesn’t matter what your age is. This week at NVIDIA’s GTC 2012 conference, we’ve seen that GPUs are still leading the charge. The new NVIDIA “Kepler” K10 and K20 GPU Accelerators will offer 4.58 TFLOPS single-precision and over 1 TFLOPS double-precision performance, respectively.
In today’s “Inside Kepler” session Lars Nyland, from NVIDIA’s architecture group, and Stephen Jones, from the NVIDIA CUDA group, dove into the improved architecture and programmability of the GK110 GPU.
The frame of reference was the existing “Fermi” GPUs. That architecture was out of headroom in terms of power consumption, so to achieve an increase in performance, NVIDIA’s team needed to improve the power efficiency of the architecture. The result is a 3x improvement in energy efficiency and a commensurate 3x increase in compute performance.
But Kepler is not just a redesigned Fermi. Substantial new features have been added to the architecture, many of which simplify the programmability of the GPU. At a high level, these are:
- Improvements to the Streaming Multiprocessor (now called SMX)
- Dynamic Parallelism – the ability to call library functions and spawn new CUDA kernels from within the GPU itself
- Hyper-Q – expanding the number of concurrent processes calling a GPU from 1 to 32
SMX Streaming Multiprocessor
Lars and Stephen stepped through a multitude of improvements made since the Fermi SM: twice as many blocks per SM, more threads per SM, twice the register file bandwidth, twice the register file size and doubled shared memory bandwidth. The new ISA encoding allows threads to access up to 255 registers (up from 63 in Fermi), lifting a limit that has been a stumbling block for many existing applications.
There are also additions and improvements to the SMX instructions. The new SHFL (shuffle) instruction allows data exchange between threads within a warp, avoiding trips through shared memory. Enhancements to the atomic (ATOM) instructions deliver a 2x-10x performance improvement, enough that atomics can now be used within the inner loops of kernels (greatly simplifying sections of code such as reductions).
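To make the two instruction improvements concrete, here is a hypothetical sketch of a sum reduction that uses the SHFL instruction for the intra-warp step and a single atomic per warp for the final accumulation. The kernel name and launch parameters are illustrative, not from NVIDIA’s session, and it assumes a Kepler-class device with the 2012-era `__shfl_down` intrinsic:

```cuda
// Sketch only: sum an array on a Kepler GPU using warp shuffles plus atomics.
__global__ void reduceSum(const float *in, float *out, int n)
{
    // Grid-stride loop: each thread accumulates a private partial sum.
    float v = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        v += in[i];

    // Down-shift reduction within the 32-thread warp via SHFL: no shared
    // memory and no __syncthreads() required.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down(v, offset);   // pre-CUDA 9 form of the intrinsic

    // Lane 0 of each warp folds its partial sum into the global result.
    // Kepler's faster atomics make this cheap enough for use like this.
    if ((threadIdx.x & 31) == 0)
        atomicAdd(out, v);
}
```

On Fermi, the shuffle step would instead round-trip through shared memory with explicit synchronization, and the final atomicAdd would often be replaced by a second reduction pass to avoid slow atomics.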
The L2 cache now offers double the capacity and double the bandwidth. Additionally, efficiency improvements to the DRAM ECC implementation have led to an average 66% reduction in ECC lookup overhead.
Dynamic Parallelism
In the conversations I’ve already had with GPU developers, dynamic parallelism has certainly garnered the most interest. The ability to make a library call (such as BLAS or FFT) from within a CUDA kernel is entirely new. Furthermore, it’s now possible to spawn new kernels from within the GPU, which enables a whole new set of parallel, batched and nested algorithms.
Upon questioning from the audience, Stephen quipped, “you can now write a fork bomb in CUDA.” But in all seriousness, this feature greatly increases the flexibility of the GPU and simplifies development of complex algorithms.
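A hypothetical sketch of what this looks like in code: a parent kernel inspects its data on the device and launches a child grid sized accordingly, without a round trip to the CPU. The kernel names and sizing are my own illustration (dynamic parallelism requires a compute capability 3.5 device and relocatable device code):

```cuda
// Sketch only: device-side kernel launch with GK110 dynamic parallelism.
__global__ void child(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

__global__ void parent(int *data, const int *counts)
{
    // Work size is discovered on the GPU itself, per block.
    int n = counts[blockIdx.x];
    if (threadIdx.x == 0 && n > 0) {
        // Spawn a nested grid directly from device code.
        child<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();   // device-side wait for the child grid
    }
}
```

On Fermi, the equivalent logic would require copying `counts` back to the host, deciding there, and issuing a new launch from the CPU for every step.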
Hyper-Q
A large effort has been made to improve the concurrency of jobs on the GPU. By allowing true 32-way concurrency, Kepler can keep much more of the GPU active at any given time. Fermi provides 16-way concurrency, but all streams are multiplexed into a single hardware queue. Removing this restriction was a significant challenge for the architects, but it results in significantly better efficiency and allows for finer-grained parallelism. When all the cores of the GPU are constantly occupied, jobs complete much faster.
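From the programmer’s side, Hyper-Q needs no new API – existing CUDA streams simply stop false-serializing. A minimal host-side sketch, with an illustrative kernel name, assuming independent work per stream:

```cuda
#include <cuda_runtime.h>

__global__ void smallKernel(float *buf) { /* independent work per stream */ }

int main()
{
    const int kStreams = 32;     // matches Kepler's 32 hardware work queues
    cudaStream_t streams[kStreams];
    float *buf[kStreams];

    for (int i = 0; i < kStreams; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&buf[i], 1 << 20);
        // Each launch targets its own stream. On Fermi these would be
        // multiplexed into one queue and could serialize; on Kepler,
        // Hyper-Q lets the small kernels genuinely overlap.
        smallKernel<<<1, 256, 0, streams[i]>>>(buf[i]);
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < kStreams; ++i) {
        cudaFree(buf[i]);
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}
```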
There’s much more to the architecture that can be explored – NVIDIA is providing an in-depth white paper on the new Kepler architecture. With just a few small changes to your CUDA code, you will be able to take advantage of significant new capabilities. Read up on the changes, download the new CUDA 5 alpha release, and start considering how you’re going to take advantage of this new flexibility.
Update: the Kepler architecture white paper is available on NVIDIA’s site: