The debut of NVIDIA’s Kepler architecture in 2012 marked a significant milestone in the evolution of general-purpose GPU computing. In particular, Kepler GK110 (compute capability 3.5) brought a large increase in compute power and introduced a number of new features to enhance GPU programmability. NVIDIA’s Tesla K20 and K20X accelerators are based on the Kepler GK110 architecture. The higher-end K20X, which is used in the Titan and Blue Waters supercomputers, contains 2,688 CUDA cores and achieves a peak single-precision floating-point performance of 3.95 Tflops. By comparison, the Fermi-architecture Tesla M2090 (compute capability 2.0) has a peak single-precision performance of 1.3 Tflops, roughly one third of the K20X figure.
In addition to the increase in raw power, GK110 includes a number of features designed to facilitate efficient GPU utilization. Of these, Hyper-Q technology and support for dynamic parallelism have been particularly well publicized. Hyper-Q facilitates the concurrent execution of multiple kernels on a single device and also enables multiple CPU processes to launch work simultaneously on a single GPU. With dynamic parallelism, kernels can be launched directly from the device, which can greatly simplify the implementation and improve the performance of divide-and-conquer algorithms, among others. Other Kepler additions include support for bindless textures (texture objects), which offer greater flexibility and performance than texture references. Shared-memory banks can also be configured to be 8 bytes wide rather than 4, which increases shared-memory bandwidth and reduces bank conflicts in many applications, particularly those that access 64-bit data.
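To make the idea of a device-side launch concrete, the following is a minimal sketch of dynamic parallelism, not code taken from QUDA or any other source cited here. The kernel names `scaleSegment` and `processSegments`, the helper `countMismatches`, and the segment layout are all hypothetical and chosen only for illustration. Each parent thread inspects the length of its own segment and launches a child grid sized to match, so the launch configuration is decided on the device rather than on the host.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Child kernel (hypothetical): doubles every element of one segment.
__global__ void scaleSegment(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Parent kernel (hypothetical): one thread per segment. Each thread launches
// a child grid sized to its own segment, directly from the device.
__global__ void processSegments(float *data, const int *offsets, int numSegments)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= numSegments) return;

    int begin = offsets[s];
    int n = offsets[s + 1] - begin;
    if (n > 0) {
        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        scaleSegment<<<blocks, threads>>>(data + begin, n);
    }
}

// Runs the sketch and returns the number of incorrect elements,
// or -1 if the CUDA runtime reported an error.
int countMismatches()
{
    const int numSegments = 4;
    const int offsets[numSegments + 1] = {0, 10, 300, 301, 1000};
    const int total = offsets[numSegments];
    std::vector<float> host(total, 1.0f);

    float *dData = NULL;
    int *dOffsets = NULL;
    cudaMalloc(&dData, total * sizeof(float));
    cudaMalloc(&dOffsets, sizeof(offsets));
    cudaMemcpy(dData, host.data(), total * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dOffsets, offsets, sizeof(offsets), cudaMemcpyHostToDevice);

    processSegments<<<1, 32>>>(dData, dOffsets, numSegments);
    // A parent grid is not considered complete until its child grids have
    // finished, so a single host-side synchronization covers both levels.
    cudaError_t err = cudaDeviceSynchronize();

    cudaMemcpy(host.data(), dData, total * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dOffsets);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < total; ++i)
        if (host[i] != 2.0f) ++bad;
    return bad;
}

int main()
{
    int bad = countMismatches();
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Device-side launches require compute capability 3.5 or higher, relocatable device code, and linking against the device runtime, for example `nvcc -arch=sm_35 -rdc=true sketch.cu -lcudadevrt` on a toolkit that still supports that architecture. On Fermi, the same work would need either a host-side loop of launches or a single kernel sized for the largest segment.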
In this white paper, we provide an overview of the new features of Kepler GK110 and highlight the differences in functionality and performance between the new architecture and Fermi. We cite a number of examples drawn from disparate sources on the web, but also draw on our own experience with QUDA, an open-source library for performing lattice QCD calculations on GPUs. In general, codes developed on Fermi should see substantial performance gains on GK110 without any modification. In the case of data-dependent and recursive algorithms, however, far greater gains may be achieved by exploiting dynamic parallelism. More generally, relatively minor code modifications, such as switching to bindless textures or changing shared-memory accesses to take advantage of the increased bandwidth, can also result in significant improvements in performance.
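As an indication of how small such a modification can be, the sketch below reads a linear device buffer through a bindless texture (a texture object) instead of a file-scope texture reference. It is an illustrative example under stated assumptions, not code from QUDA: the kernel `copyThroughTexture` and the helper `countTextureMismatches` are hypothetical names, while the runtime calls (`cudaCreateTextureObject`, `tex1Dfetch`, `cudaDestroyTextureObject`) are the standard CUDA texture-object API. The key difference from a texture reference is that the texture is an ordinary kernel argument.

```cuda
#include <cstdio>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: the texture object arrives as a plain argument,
// so no file-scope texture reference or bind/unbind step is needed.
__global__ void copyThroughTexture(cudaTextureObject_t tex, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch<float>(tex, i);
}

// Runs the sketch and returns the number of incorrect elements,
// or -1 if the CUDA runtime reported an error.
int countTextureMismatches()
{
    const int n = 1024;
    std::vector<float> host(n);
    for (int i = 0; i < n; ++i) host[i] = static_cast<float>(i);

    float *dIn = NULL;
    float *dOut = NULL;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Describe the linear device buffer that backs the texture.
    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = dIn;
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = n * sizeof(float);

    // Return elements unmodified (no normalization or filtering).
    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaError_t err = cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    if (err != cudaSuccess) {
        cudaFree(dIn);
        cudaFree(dOut);
        return -1;
    }

    copyThroughTexture<<<(n + 255) / 256, 256>>>(tex, dOut, n);
    err = cudaDeviceSynchronize();

    std::vector<float> result(n, -1.0f);
    cudaMemcpy(result.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaDestroyTextureObject(tex);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (result[i] != host[i]) ++bad;
    return bad;
}

int main()
{
    int bad = countTextureMismatches();
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Texture objects require a Kepler-class device (compute capability 3.0 or higher). Because the object is passed by value, a kernel can receive as many textures as it needs, or select among them at run time, which is not possible with statically declared texture references.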