Introducing CUDA UnBound (CUB)

justin.foley

·

April 14, 2014

CUB – a configurable C++ template library of high-performance CUDA primitives

Each new generation of NVIDIA GPUs brings with it a dramatic increase in compute power and the pace of development over the past several years has been rapid. The Tesla M2090, based on the Fermi GF110 architecture anounced in 2010, offered global memory bandwidth of up to 177 Gigabytes per second and peak double-precision floating-point performance of 665 Gigaflops. By comparison, today’s Tesla K40 (Kepler GK110b architecture) has peak memory bandwidth of 288 Gigabytes per second and provides reported peak double-precision performance of over 1.4 Teraflops. However, the K40’s reign as the most advanced GPGPU hardware is coming to an end, and Kepler will shortly be superseded by Maxwell-class cards.

Actually achieving optimal performance on diverse GPU architectures can be challenging, since it relies on the implementation of carefully-crafted kernels that incorporate extensive knowledge of the underlying hardware and which take full advantage of relevant features of the CUDA programming model. This places a considerable burden on the CUDA developer seeking to port her application to a new generation of GPUs or looking to ensure performance across a range of architectures.

Fortunately, many CUDA applications are formulated in terms of a small set of primitives, such as parallel reduce, scan, or sort. Before attempting to handcraft these primitive operations ourselves, we should consider using one of the libraries of optimized primitives available to CUDA developers. Such libraries include Thrust and CUDPP, but in this post, we will focus on the CUB library developed by Duane Merrill of NVIDIA Research. CUB – the name derives from “CUDA Unbound” – provides generic high-performance primitives targeting multiple levels of application development. For example, CUB supports a set of device-wide primitives, which are called from the host, and in this regard, the functionality provided by CUB overlaps with Thrust to some degree. However, unlike Thrust, CUB also provides a set of kernel components that operate at the thread-block and thread-warp levels.

Thread-block reduction – a simple CUB example

A key feature of the CUB library, and one that makes CUB an attractive option for a wide range of performance-critical applications, is the fact that software components are not specialized for a particular GPU architecture or problem type. CUB is a C++ template library which utilizes policy-based design to provide highly-configurable kernel components that can be tuned for different GPU architectures and applications. To see what exactly this means, let’s consider how we might implement a reduction kernel using CUB. The library includes a templated BlockReduce class to perform reduction operations across a single thread block. It is declared as follows:

template <typename T, int BLOCK_SIZE, BlockReduceAlgorithm ALGORITHM> class BlockReduce;

T denotes the type of data on which the reduction operation is performed, BLOCK_SIZE is the number of threads in the thread block, and BlockReduceAlgorithm is an enumeration of different algorithms that can be used to perform the reduction. Note that the binary operation that specifies the type of reduction being performed (which, more often than not, involves computing the sum or the maximum or minimum of a data set) is not included in the class declaration. Using the BlockReduce class and atomic operations, a kernel to compute the maximum value in an array of integers can be implemented as follows:

1)  template<int BLOCK_SIZE, BlockReduceAlgorithm ALGORITHM>
2)  __global__ 
3)  void maxKernel(int* max, int* input)
4)  { 
5)   int id=blockIdx.x*blockDim.x + threadIdx.x; 
6)   typedef cub::BlockReduce<int,BLOCK_SIZE,ALGORITHM> BlockReduceT; 
7)
8)   // Allocate temporary storage in shared memory 
9)   __shared__ typename BlockReduceT::Temp temp_storage; 
10)
11)  int val=input[id]; 
12)  int block_max=BlockReduceT(temp_storage).Reduce(val,cub::Max());
13)
14)  // update global max value
15)  if(threadIdx.x == 0) atomicMax(max,block_max); 
16)
17)  return;  
18) }

Line 9 of the kernel above allocates temporary storage in device shared memory for use in the reduction algorithm. The quantity and layout of this storage depend on the choice of algorithm, the type of the data, the number of threads per block, and the target GPU architecture. The optimal shared-memory configuration, which provides sufficient temporary storage and avoids unnecessary bank conflicts, is determined at compile time using the template arguments selected in the client code. However, the shared-memory configuration details themselves are hidden from the client application.

On line 12 of our kernel, the BlockReduceT constructor (which takes as an argument the temporary storage allocated above) is called, generating a temporary object, which then invokes its Reduce method. The second argument in the Reduce method is an instance of CUB’s Max functor class. This class is defined such that if maxObject is an instance of the class Max, then maxObject(a,b) returns the maximum of a and b. Other binary operations supported in CUB include binary addition, the binary min operation, and variants of max and min that identify the position of the first occurance of the maximum or minimum value in a data array. The result of a thread-block reduction is returned to the first thread of each block (which has threadIdx.x == 0). Finally, each thread block calls a single atomic operation to update the global maximum. Note that line 15 of the kernel assumes that the value pointed to by max is initialized to some minimum value before the kernel is launched.

Optimizing performance by limiting concurrency

Currently, CUB supports three different block-reduction algorithms, corresponding to the enumerators BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_REDUCE_RAKING, and BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY. The latter algorithm is specialized for commutative binary operations (such as the Max operation in our example, where the relative ordering of inputs does not affect the output), while the other algorithms also support non-commutative binary operators. The background to these algorithms is decribed in detail in a series of publications by Merrill and collaborators. As described in those papers, a core feature of the algorithms used in CUB is that they balance concurrency with serial computation in order maximize performance on GPU hardware. In contrast, earlier algorithms targeting GPU architectures tended to involve high levels of concurrency, where the number of logical threads assigned to a problem scales with the problem size. However, in reduction and scan calculations, logical threads have to share data and synchronize with each other, and the cost of this inter-thread cooperation scales with the amount of concurrency in the algorithm (i.e., it depends on the number of threads). Performance can be improved by choosing a level of concurrency that ensures that the GPU hardware is fully utilized while minimizing communication and synchronization overheads.

To understand how the CUB routines utilize serial processing, consider the raking block-reduction algorithms mentioned above. In these algorithms, after an initial step, which we discuss below, each thread in the block writes data to shared memory. At this point, a single warp of threads “rakes” the shared-memory array, with each thread in the warp performing a serial reduction on some segment of the data in shared memory. At the end of this step, a single warp-width of data remains to be reduced, and one warp-level reduction completes the calculation. Further serialization can be achieved by having each thread in the thread block perform a serial partial reduction in registers at the beginning of the block-level reduction routine. To do this, we modify our reduction kernel as follows:

template<int VALS_PER_THREAD, int BLOCK_SIZE, 
         BlockReduceAlgorithm ALGORITHM> 
__global__ 
void maxKernel(int* max, int* input) 
{ 
  int id=blockIdx.x*blockDim.x + threadIdx.x; 
  typedef cub::BlockReduce<int,BLOCK_SIZE,ALGORITHM> BlockReduceT;

  // Allocate temporary storage in shared memory 
  __shared__ typename BlockReduceT::Temp temp_storage; 

  // Assign multiple values to each block thread 
  int val[VALS_PER_THREAD]; 

  // Code to initialize the val array has been omitted 
  int block_max=BlockReduceT(temp_storage).Reduce(val,cub::Max()); 

  // update global max value 
  if(threadIdx.x == 0) atomicMax(max,block_max); 

  return; 
}

Thus, each thread in the block reduces VALS_PER_THREAD items in registers as an initial step in the block reduction.

It’s worth noting that on devices of compute capability 3.0 and above CUB will utilize CUDA’s shuffle feature to perform efficient warp-level reductions, but it reverts to a shared-memory implementation on older hardware (for a recent description of reductions using shuffle, see this Parallel ForAll blog post).

CUDA Unbound

The block reduction example illustrates the extreme configurability of CUB. In essence, CUB provides an outline of the reduction algorithm, but leaves performance-critical details, such as the exact choice of algorithm and the degree of concurrency unbound and in the hands of the user. These parameters can be tuned in order maximimize performance for a particular architecture and application. Since the parameter values are specified in the client application at compile time, this flexibility incurs no runtime performance penalty. The CUB library provides most benefit if integrated into a client-application auto-tuning procedure. In this case, on each new architecture and problem type, the client application would launch a series of short jobs to explore the CUB tuning space and determine the choice of template arguments that optimize performance.

Although we have focused solely on CUB’s support for block-wide reductions in this post, the library also includes highly-configurable scan and sort implementations. Among the other primitives implemented in CUB are block-wide data-exchange operations and parallel histogram calculations, and all of these implementations are flexible enough to ensure high performance in diverse applications running on a range of NVIDIA architectures.

Well, that completes our brief introduction to the CUB library. We’ll revisit CUB in our next post, when we’ll look at concrete performance figures for kernels utilizing different CUB routines on a variety of problems and different GPU architectures.

Introducing CUDA UnBound (CUB)

CUB – a configurable C++ template library of high-performance CUDA primitives

Thread-block reduction – a simple CUB example

Optimizing performance by limiting concurrency

CUDA Unbound

You May Also Like

DGX A100 review: Throughput and Hardware Summary

Deploying GPUs for Classroom and Remote Learning

What Can You Do with a $15k NVIDIA Data Science Workstation? – Change Healthcare Data Science