In my previous post, I presented a brief introduction to the CUB library of CUDA primitives written by Duane Merrill of NVIDIA. CUB provides a set of highly-configurable software components, which include warp- and block-level kernel components as well as device-wide primitives. This time around, we will actually look at performance figures for codes that utilize CUB primitives. We will also briefly compare the CUB-based codes to programs that use the analogous Thrust routines, from both a performance and a programmability perspective. These comparisons utilize the CUB v1.3.1 and Thrust v1.7.0 releases and CUDA 6.0.
Before we proceed, I need to issue one disclaimer: the examples below were written after a limited amount of experimentation with the CUB library, and they do not necessarily represent the most optimized implementations. However, these examples do illustrate the flexibility of the API and they give an idea of the kind of performance that can be achieved using CUB with only modest programming effort.
I’ve had several customers comment to me that it’s difficult to find someone who can speak with them intelligently about PCI-E root complex questions. And yet, it’s of vital importance when considering multi-CPU systems that have various PCI-Express devices (most often GPUs or coprocessors).
First, please feel free to contact one of Microway’s experts. We’d be happy to work with you on your project to ensure your design will function correctly (both in theory and in practice).
It is tempting to just look at the number of PCI-Express slots in the systems you’re evaluating and assume they’re all the same. Unfortunately, it’s not so simple, because each CPU only has a certain amount of bandwidth available. Additionally, certain high-performance features, such as NVIDIA’s GPUDirect technology, require that all components be attached to the same PCI-Express root complex. Servers and workstations with multiple processors have multiple PCI-Express root complexes.
CUB – a configurable C++ template library of high-performance CUDA primitives
Each new generation of NVIDIA GPUs brings with it a dramatic increase in compute power, and the pace of development over the past several years has been rapid. The Tesla M2090, based on the Fermi GF110 architecture announced in 2010, offered global memory bandwidth of up to 177 Gigabytes per second and peak double-precision floating-point performance of 665 Gigaflops. By comparison, today’s Tesla K40 (Kepler GK110b architecture) has peak memory bandwidth of 288 Gigabytes per second and provides reported peak double-precision performance of over 1.4 Teraflops. However, the K40’s reign as the most advanced GPGPU hardware is coming to an end, and Kepler will shortly be superseded by Maxwell-class cards.
Actually achieving optimal performance on diverse GPU architectures can be challenging, since it relies on carefully-crafted kernels that incorporate extensive knowledge of the underlying hardware and take full advantage of relevant features of the CUDA programming model. This places a considerable burden on the CUDA developer seeking to port her application to a new generation of GPUs or looking to ensure performance across a range of architectures.
Fortunately, many CUDA applications are formulated in terms of a small set of primitives, such as parallel reduce, scan, or sort. Before attempting to handcraft these primitive operations ourselves, we should consider using one of the libraries of optimized primitives available to CUDA developers. Such libraries include Thrust and CUDPP, but in this post, we will focus on the CUB library developed by Duane Merrill of NVIDIA Research. CUB – the name derives from “CUDA Unbound” – provides generic high-performance primitives targeting multiple levels of application development. For example, CUB supports a set of device-wide primitives, which are called from the host, and in this regard, the functionality provided by CUB overlaps with Thrust to some degree. However, unlike Thrust, CUB also provides a set of kernel components that operate at the thread-block and thread-warp levels.
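To make the primitives named above concrete, here is a minimal serial sketch of what a sum reduction and the two flavors of prefix sum (scan) compute. It is plain host-side C++ and not CUB code; the function names reduceSum, inclusiveScan and exclusiveScan are made up for illustration. A serial reference like this is the kind of thing one might use to check the output of a parallel library routine.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Serial reference for a sum reduction: folds all elements into one value.
int reduceSum(const std::vector<int>& in) {
    return std::accumulate(in.begin(), in.end(), 0);
}

// Serial reference for an inclusive prefix sum: out[i] = in[0] + ... + in[i].
std::vector<int> inclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    std::partial_sum(in.begin(), in.end(), out.begin());
    return out;
}

// Serial reference for an exclusive prefix sum: out[i] = in[0] + ... + in[i-1],
// with out[0] = 0.
std::vector<int> exclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;   // sum of everything strictly before element i
        running += in[i];
    }
    return out;
}
```

For the input {1, 2, 3, 4}, the reduction yields 10, the inclusive scan yields {1, 3, 6, 10}, and the exclusive scan yields {0, 1, 3, 6}.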
Many within the HPC community have been eagerly awaiting the new Intel Xeon E5-4600 v2 CPUs. To those already familiar with the “Ivy Bridge” architecture in the Xeon E5-2600 v2 processors, many of the updated features of these 4-socket Xeon E5-4600 v2 “Ivy Bridge” CPUs will look very similar. Read on to learn the details.
Important changes available in the Xeon E5-4600 v2 “Ivy Bridge” CPUs include:
- Up to 12 processor cores per socket (with options for 4, 6, 8 and 10 cores)
- Support for DDR3 memory speeds up to 1866MHz
- AVX has been extended to support F16C (16-bit Floating-Point conversion instructions) to accelerate data conversion between 16-bit and 32-bit floating point formats. These operations are of particular importance to graphics and image processing applications.
- Intel APIC Virtualization (APICv) provides increased virtualization performance
- Improved PCI-Express generation 3.0 support with superior compatibility and new features: atomics, x16 non-transparent bridge & quadrupled read buffers for point-to-point transfers
Intel Xeon E5-4600 v2 Series Specifications
HPC groups do not typically choose Intel’s “Basic” and “Low Power” models, so those SKUs are not shown.
The phrase “High Performance Computing” gets thrown around a lot, but HPC means different things to different people. Every group with compute needs has its own requirements and workflow. There are many groups that would never need 100+ TFLOPS compute and 1+ PB storage, but some need more! If your current system just isn’t sufficient, keep reading to learn what type of HPC might fit your needs.
Workstations and Servers: Closer Than Ever
If you don’t take into account their physical appearance, the distinction between rackmount servers and desktop workstations has become very blurred. It’s possible to fit 32 processor cores and 512GB of memory or four x16 gen 3.0 double-width GPUs within a workstation. There are, of course, tradeoffs when comparing high-end workstations with rackmount servers, but you may be surprised how much you can achieve without jumping to a traditional HPC cluster.
NVIDIA Tesla K40 is now the leading Tesla GPU for performance. Here are some important use cases where Tesla K40 might greatly speed up your GPU-accelerated applications:
Pick Tesla K40 for
Large Data Sets
GPU memory has always been at a greater premium compared to its CPU equivalent. If you have a large data set, the 12GB of GDDR5 on Tesla K40 could be an excellent match for your application. Many common CUDA codes break apart data into chunks explicitly sized to the GPU memory space and start their compute algorithm on each chunk. This was previously limited to 5GB (Tesla K20) or 6GB (Tesla K20X). With Tesla K40, your chunks are twice the size:
|Max Memory Bandwidth
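The effect of chunk size on the chunked approach described above can be sketched with a little arithmetic. The helper below is hypothetical (chunkCount is not from any library), and it assumes the entire GPU memory is usable for one chunk, which real codes will not quite achieve once working buffers are accounted for.

```cpp
#include <cassert>
#include <cstdint>

// Number of chunks needed to stream datasetBytes through a GPU when each
// chunk may occupy at most chunkBytes (ceiling division).
// Assumption: the full GPU memory is available to a single chunk.
std::uint64_t chunkCount(std::uint64_t datasetBytes, std::uint64_t chunkBytes) {
    assert(chunkBytes > 0);
    return (datasetBytes + chunkBytes - 1) / chunkBytes;
}
```

Taking a made-up 48GB data set as an example, this gives 10 chunks at 5GB, 8 chunks at 6GB, and 4 chunks at 12GB, so doubling the chunk size halves the number of passes.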
Frequent PCI-E Bus Transfers
Tesla K40 fully supports PCI-Express Gen3. CUDA codes that constantly move substantial data (>6GB/sec) across the PCI-E bus will benefit greatly from Gen3. We’ve seen Tesla K40 GPUs deliver up to 10GB/sec in NVIDIA’s CUDA bandwidth tests.
If you asked around at SC this year, some attendees might have told you there wasn’t much new going on. It’s true that not every company was launching new hardware with 2X performance, but there were significant announcements. Some are shipping now and some are looking forward into 2014 or 2015. See our top picks below.
NVIDIA’s latest Tesla accelerator is without a doubt the most powerful GPU available. With almost 3,000 CUDA cores and 12GB GDDR5 memory, it wins in practically every* performance test you’ll see. As with the “Kepler” K20 GPUs, the Tesla K40 supports NVIDIA’s latest SMX, Dynamic Parallelism and Hyper-Q capabilities (CUDA compute capability 3.5). It also introduces professional-level GPU Boost capability to squeeze every bit of performance your code can pull from the GPU’s 235W power envelope.
Maximum GPU Memory and Compute Performance: Tesla K40 GPU Accelerator
Integrated in Microway NumberSmasher GPU Servers and GPU Clusters
- 2880 CUDA GPU cores (GK110b)
- 4.2 TFLOPS single; 1.4 TFLOPS double-precision
- 12GB GDDR5 memory
- Memory bandwidth up to 288 GB/s
- PCI-E x16 Gen3 interface to system
- GPU Boost increased clock speeds
- Supports Dynamic Parallelism and Hyper-Q features
- Active and Passive heatsinks available for installation in workstations and specially-designed GPU servers
The new GPU also leverages PCI-E 3.0 to achieve over 10 gigabytes per second transfers between the host (CPUs) and the devices (GPUs):
The debut of NVIDIA’s Kepler architecture in 2012 marked a significant milestone in the evolution of general-purpose GPU computing. In particular, Kepler GK110 (compute capability 3.5) brought unrivaled compute power and introduced a number of new features to enhance GPU programmability. NVIDIA’s Tesla K20 and K20X accelerators are based on the Kepler GK110 architecture. The higher-end K20X, which is used in the Titan and Blue Waters supercomputers, contains a massive 2,688 CUDA cores and achieves peak single-precision floating-point performance of 3.95 Tflops. In contrast, the Fermi-architecture Tesla M2090 (compute capability 2.0) has peak single-precision performance of 1.3 Tflops.
This post is Topic #3 (post 3) in our series Parallel Code: Maximizing your Performance Potential.
Many applications contain algorithms that make use of multi-dimensional arrays (or matrices). When threads need to index the higher dimensions of the array, strided accesses can’t really be avoided. Wherever strided access is avoidable, though, every effort should be made to avoid accesses with a stride greater than one.
So all this advice is great and all, but I’m sure you’re wondering “What actually is strided memory access?” The following example will illustrate this phenomenon and outline its effect on the effective bandwidth:
__global__ void strideExample (float *outputData, float *inputData, int stride=2)
{
    // Adjacent threads read and write elements that are stride words apart.
    int index = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    outputData[index] = inputData[index];
}
In the above code, threads within a warp access data words in memory with a stride of 2. With 4-byte floats, the 32 threads of a warp therefore span 256 bytes, which leads to a load of two 128-byte L1 cache lines per warp. The actual memory accesses are shown below.
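The cache line count can also be cross-checked on the host. The following is a small model, not device code: it assumes 32 threads per warp, 4-byte floats, 128-byte cache lines, and an array whose base address is aligned to a cache line. The function name cacheLinesPerWarp is invented for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Counts the distinct cache lines touched when one warp reads float elements
// at the given stride.
// Assumptions: 32 threads per warp, 4-byte words, 128-byte lines, aligned base.
std::size_t cacheLinesPerWarp(std::size_t stride) {
    assert(stride >= 1);
    const std::size_t warpSize = 32;
    const std::size_t wordBytes = 4;
    const std::size_t lineBytes = 128;

    std::set<std::size_t> lines;
    for (std::size_t lane = 0; lane < warpSize; ++lane) {
        // Same index arithmetic as the kernel, expressed as a byte offset.
        std::size_t byteOffset = lane * stride * wordBytes;
        lines.insert(byteOffset / lineBytes);
    }
    return lines.size();
}
```

Under these assumptions a stride of 1 touches one line, a stride of 2 touches two, and a stride of 32 touches 32 lines, one per thread. Since the warp only uses 128 bytes of what it loads, a stride of 2 already halves the useful fraction of each transaction.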