Optimize CUDA Host/Device Transfers

This post is Topic #2 (part 2) in our series Parallel Code: Maximizing your Performance Potential.

In my previous post, CUDA Host/Device Transfers and Data Movement, I provided an introduction to the bottlenecks associated with host/device transfers and data movement. This post delves a bit further into the subject and provides a few nifty ways to mitigate these very costly operations.

In every single CUDA application (well, any useful one, that is) there is at the very least one host-to-device transfer and one device-to-host transfer. More complicated applications often have many transfers between the host and device. In CUDA programming, these transfers are among the most expensive operations in terms of time.

So, if these host/device data transfers are so costly, how do you avoid them? Well, you can’t. But what you can do is minimize the number of transfers between host and device in your application, and mask their impact on the performance of your application.

First, any intermediate data structures that are used within your kernel should always be allocated and destroyed solely on the device. This removes the need to map these structures to host memory and removes the need to transfer this data between the host and device.
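As a minimal sketch of what that looks like (the kernel names, buffer names, size N, and launch configuration here are hypothetical), an intermediate buffer can be allocated, used, and freed entirely in device memory without ever touching the host:

// Hypothetical intermediate buffer: allocated, used, and freed on the device only
float *d_intermediate;
cudaMalloc((void**)&d_intermediate, N * sizeof(float));

firstKernel<<<gridSize, blockSize>>>(d_input, d_intermediate);   // produces intermediate data
secondKernel<<<gridSize, blockSize>>>(d_intermediate, d_output); // consumes it

cudaFree(d_intermediate);   // no host copy ever needed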

If your application has multiple host/device transfers, every effort should be made to batch these transfers into one large transfer. I like to think of this as if you were carrying groceries. Why make multiple trips out to the car when you can load up your arms and do it all at once? Most GPUs support transfer speeds between 5GB/sec and 11GB/sec.
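One way to do this (a rough sketch; the buffer names and sizes are hypothetical, and h_packed/d_packed are assumed to be byte (char*) buffers large enough to hold everything) is to pack several small arrays into one contiguous host buffer and issue a single copy:

// Hypothetical example: one large transfer instead of three small ones
memcpy(h_packed,                 h_a, sizeA);   // pack on the host first
memcpy(h_packed + sizeA,         h_b, sizeB);
memcpy(h_packed + sizeA + sizeB, h_c, sizeC);
cudaMemcpy(d_packed, h_packed, sizeA + sizeB + sizeC, cudaMemcpyHostToDevice);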

For situations where there is no way around transferring data between host and device, more advanced techniques can be employed to lessen the impact on your application: pinned (also known as page-locked) memory and asynchronous transfers.

Pinned Memory

The cudaHostAlloc() function allows you to allocate host memory that the device can read from and write to directly. Memory allocated this way is called pinned memory, and pinned memory transfers attain the highest bandwidth between the host and device. When operating through pinned memory, a block that requires host data only needs to wait for a small portion of the data to be transferred; a typical host-to-device copy makes all blocks wait until all of the data associated with the copy operation has been transferred. Keep in mind, however, that pinning too much memory can degrade overall system performance by reducing the amount of memory available to the system for paging operations. How much memory you can safely pin differs from system to system, so definitely experiment to find the optimal amount.
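Here is a minimal sketch (the array name, device buffer, and size N are hypothetical) showing that pinned memory is allocated with cudaHostAlloc() and released with cudaFreeHost() rather than malloc()/free():

float *h_pinned;
cudaHostAlloc((void**)&h_pinned, N * sizeof(float), cudaHostAllocDefault);   // page-locked allocation

// ... fill h_pinned on the host ...
cudaMemcpy(d_array, h_pinned, N * sizeof(float), cudaMemcpyHostToDevice);    // full-bandwidth transfer

cudaFreeHost(h_pinned);   // pinned memory must be freed with cudaFreeHost()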

Asynchronous Transfers

Standard host/device transfers are known as blocking transfers. Control of the main thread is returned only after the data transfer is complete. The cudaMemcpyAsync() function is effectively a non-blocking version of the standard cudaMemcpy(). When executing an asynchronous transfer via cudaMemcpyAsync(), control is returned immediately to the main thread. If you’re not jumping up and down with excitement after hearing that, you should be!

Asynchronous transfers require pinned memory and make use of CUDA streams. In CUDA, streams are essentially sequences of operations that are performed in order on the device. Creating multiple streams is a bit more of an advanced CUDA technique, but one that must be learned if you want the most bang for your buck. With multiple streams in a single application, operations within separate streams can be overlapped, providing a great way to mask the host/device transfer time. Let’s look at an example of how using multiple streams can benefit you and your application:

cudaMemcpyAsync(deviceArray, hostArray, size, cudaMemcpyHostToDevice, 0);
kernel<<<gridSize, blockSize>>>(deviceArray);   // gridSize/blockSize: your launch configuration
//your code

Here, both the transfer and the kernel are using the default stream, 0. During execution, the kernel will not begin executing until the entire copy operation is complete. This is because both the kernel and the memory copy were issued to the same stream. Now, let’s look at the code using multiple streams:

cudaStream_t mystream1, mystream2;
cudaStreamCreate(&mystream1);
cudaStreamCreate(&mystream2);
cudaMemcpyAsync(deviceArray, hostArray, size, cudaMemcpyHostToDevice, mystream1);
kernel<<<gridSize, blockSize, 0, mystream2>>>(otherDataArray);   // launched in a different stream
//your code

By defining two new streams, we are able to make use of concurrent copy and compute. The memory copy is executing in one stream while the kernel is off in another stream, asynchronous from one another. An important note is to make sure that your device supports concurrent copy and execute before you put this in all of your code. This can be done via the deviceOverlap field of the cudaDeviceProp structure.
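A quick sketch of that check (querying device 0 here, purely as an example):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);   // properties of device 0
if (prop.deviceOverlap) {
    // safe to overlap cudaMemcpyAsync() with kernel execution
}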

While this is an advanced technique, if your data can be broken into chunks and transferred in various stages, you can launch multiple kernel instances to operate on each chunk of data as it arrives on the device. Doing so will almost completely mask the transfer time between the host and device.
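Here is one possible sketch of that pattern, assuming the input has already been split into nChunks equal pieces, the host buffer h_pinned is page-locked, and streams[] holds streams created with cudaStreamCreate() (all of these names and sizes are hypothetical):

for (int i = 0; i < nChunks; i++) {
    int offset = i * chunkSize;
    // copy chunk i in its own stream...
    cudaMemcpyAsync(d_array + offset, h_pinned + offset,
                    chunkSize * sizeof(float), cudaMemcpyHostToDevice, streams[i]);
    // ...and launch a kernel on that chunk in the same stream, so the copy of one
    // chunk overlaps with the kernel working on an earlier chunk
    kernel<<<gridSize, blockSize, 0, streams[i]>>>(d_array + offset);
}
cudaDeviceSynchronize();   // wait for all streams to finish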

So, armed with the knowledge of streams, asynchronous transfers, and pinned memory, you now have some insight on how to squeeze out some more performance from your application. My next post will discuss how to efficiently make use of the available memory types accessible to you within your GPU application.

Justin McKennon (for Microway)
