CUDA Host/Device Transfers and Data Movement

This post is Topic #2 (part 1) in our series Parallel Code: Maximizing your Performance Potential.

In post #1, I discussed a few ways to optimize the performance of your application by controlling your threads, and provided some insight into fixing possible thread-related issues in your application. In this post and the following one, I will discuss another possible major performance bottleneck: Host/Device Transfers and Data Movement.

Profiling Your CUDA Code for Timing Data

In a standard CUDA application, several steps typically occur:

  1. Allocate memory on the device
  2. Copy data from host to device
  3. Perform some calculations
  4. Copy data from device to host
  5. Free the allocated device memory
  6. Rinse and Repeat

In the above list, steps 2 and 4 are an absolute necessity in every CUDA application, but they are also HUGE performance robbers. These transfers are the slowest portion of data movement involved in any aspect of GPU computing. The actual transfer speed (bandwidth) depends on the hardware you’re using, but regardless, these host/device copies remain the slowest link. The example code below illustrates this point:

int main()
{
    const unsigned int X = 1048576;         // 2^20 elements
    const unsigned int bytes = X * sizeof(int);
    int *hostArray = (int*)malloc(bytes);
    int *deviceArray;

    cudaMalloc((void**)&deviceArray, bytes);
    memset(hostArray, 0, bytes);

    cudaMemcpy(deviceArray, hostArray, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(hostArray, deviceArray, bytes, cudaMemcpyDeviceToHost);

    cudaFree(deviceArray);
    free(hostArray);
    return 0;
}
In this example, there are no operations being run on the device. The data is simply copied from the host to the device and back. I’ve named this program profilerExample.cu. To profile this code, it simply needs to be compiled with nvcc and then run with nvprof (nvprof is new in CUDA 5 – the older command line profiler can still be used in earlier versions of CUDA):

$ nvcc profilerExample.cu -o profileExample

$ nvprof ./profileExample
======== NVPROF is profiling profileExample...
======== Command: profileExample
======== Profiling result:
Time(%)     Time  Calls      Avg      Min      Max Name
  50.08 718.11us      1 718.11us 718.11us 718.11us [CUDA memcpy DtoH]
  49.92 715.94us      1 715.94us 715.94us 715.94us [CUDA memcpy HtoD]

On my desktop I run a GTX 680 graphics card. As you can see from the above results, a simple copy operation to/from the GPU takes in excess of 715 microseconds each way (a lifetime in terms of computation time). In complex applications with larger amounts of data going back and forth between the host and device many times, this can result in significant time being wasted on these transfers.

Alternative Profiling Options Using Timers

In addition to the nvprof profiler, any CPU timer can be used to measure the elapsed time of a CUDA call/function or kernel execution. It is important to note that if you’re using a CPU timer to measure the timing performance of a portion (or all) of your application, many of the CUDA functions are asynchronous. This means that the function returns control to the calling host thread prior to completing all of its work. If you’re using a CPU timer, you must synchronize the CPU thread associated with the timer with the device by calling cudaDeviceSynchronize() immediately before starting and stopping the CPU timer. cudaDeviceSynchronize() blocks the calling CPU thread until all the CUDA calls previously issued by that thread have completed. CUDA also provides its own method for timing using events. The following example code snippet illustrates how to use the CUDA event timers to profile your code:

cudaEvent_t startTime, stopTime;
float time;

cudaEventCreate(&startTime);
cudaEventCreate(&stopTime);

cudaEventRecord(startTime, 0);
kernel<<<gridSize, blockSize>>>(deviceArray);   // the work being timed
cudaEventRecord(stopTime, 0);
cudaEventSynchronize(stopTime);

cudaEventElapsedTime(&time, startTime, stopTime);

cudaEventDestroy(startTime);
cudaEventDestroy(stopTime);

In this example, the cudaEventRecord() function call places the startTime and stopTime events into the default execution stream, ‘0’. The device records a timestamp for the event when it reaches that event in the execution stream. cudaEventElapsedTime() simply returns the time in milliseconds (with roughly 0.5 µs resolution) between the events.
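The CPU-timer approach described earlier can be sketched as follows. This is a minimal sketch, not code from the original post: dummyKernel, the launch configuration, and the buffer size are all placeholders, and clock() stands in for whatever host timer you prefer. The essential pattern is the pair of cudaDeviceSynchronize() calls bracketing the timed region:

```cuda
#include <cstdio>
#include <ctime>

// Placeholder kernel so the snippet is complete; substitute your real work.
__global__ void dummyKernel(int *data) { }

int main()
{
    int *deviceArray;
    cudaMalloc((void**)&deviceArray, 1048576 * sizeof(int));

    cudaDeviceSynchronize();            // drain any previously issued GPU work
    clock_t begin = clock();

    dummyKernel<<<256, 256>>>(deviceArray);   // asynchronous launch

    cudaDeviceSynchronize();            // wait for the kernel to finish
    clock_t end = clock();

    printf("Elapsed: %.3f ms\n",
           1000.0 * (double)(end - begin) / CLOCKS_PER_SEC);

    cudaFree(deviceArray);
    return 0;
}
```

Without the second cudaDeviceSynchronize(), the timer would stop as soon as the asynchronous launch returned, and you would measure launch overhead rather than kernel execution time.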

Importance of Data Transfers in CUDA Applications

Analyzing these timing results can prove hugely beneficial in determining which portions of your application are the most expensive in terms of time. While a number of factors can make one portion of code costlier than another, a good way to increase the performance of your application is to minimize the host/device transfers.

The peak theoretical bandwidth between device memory and the device processor is significantly higher than the peak theoretical bandwidth between the host memory and device memory. Therefore, in order to get the most bang for your buck in your application, you really need to minimize these host<->device data transfers. Many programmers are unaware of the high overhead associated with these transfers and by intelligently reducing or eliminating them, you can see very large gains in performance. Try performing a ‘before and after’ type test with your code. If you have multiple transfers occurring throughout your application, try reducing this number and observe the results.
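As a concrete illustration of the ‘before and after’ test suggested above, one common way to reduce the transfer count is to batch several small copies into a single large one, since each cudaMemcpy carries a fixed overhead. The sketch below is hypothetical (the three arrays and their sizes are invented for illustration), but it shows both versions side by side:

```cuda
#include <cstdlib>
#include <cstring>

int main()
{
    const size_t N = 1 << 20;    // elements per array (hypothetical size)
    float *a = (float*)malloc(N * sizeof(float));
    float *b = (float*)malloc(N * sizeof(float));
    float *c = (float*)malloc(N * sizeof(float));
    float *d_buf;
    cudaMalloc((void**)&d_buf, 3 * N * sizeof(float));

    // Before: three transfers, each paying the fixed per-copy overhead
    cudaMemcpy(d_buf,         a, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_buf + N,     b, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_buf + 2 * N, c, N * sizeof(float), cudaMemcpyHostToDevice);

    // After: pack the arrays contiguously on the host, then copy once
    float *packed = (float*)malloc(3 * N * sizeof(float));
    memcpy(packed,         a, N * sizeof(float));
    memcpy(packed + N,     b, N * sizeof(float));
    memcpy(packed + 2 * N, c, N * sizeof(float));
    cudaMemcpy(d_buf, packed, 3 * N * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    free(a); free(b); free(c); free(packed);
    return 0;
}
```

Profiling both variants with nvprof, as shown earlier, is an easy way to quantify the difference on your own hardware.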

The next post in this series will identify effective ways to optimize your code and avoid numerous transfers between the host and device. Utilizing pinned/mapped memory, asynchronous transfers, and overlapping transfers with computations can yield lofty performance gains if you have many host/device transfers occurring in your application.

More information about nvprof can be located at NVIDIA’s Developer Zone:
CUDA Toolkit Documentation – Profiler User’s Guide
