
More Tips on OpenACC Acceleration

One blog post may not be enough to present all tips for performance acceleration using OpenACC. So here, more tips on OpenACC acceleration are provided, complementing our previous blog post on accelerating code with OpenACC.

Further tips discussed here are:

  • linearizing a 2D array
  • usage of contiguous memory
  • parallelizing loops
  • PGI compiler information reports
  • OpenACC general guidelines
  • the OpenACC runtime library

More Runtime Enhancements

Using a Linearized Array Instead of a 2D Array

Using a linearized array for dynamically allocated data structures having more than one dimension is a common approach, and it requires a simpler implementation than non-linearized 2D arrays. With OpenACC, however, the index arithmetic required to linearize the array index can lead to problems with compilation. The following code, for example, will lead to the warning messages shown below:

float* MatrixMult_linearIdx(int size, float *restrict A, float 
                          *restrict B, float *restrict C) {
   int idx1, idx2, idx3;
   float tmp;

   for (int i=0; i<size; ++i) {
     for (int j=0; j<size; ++j) {
       tmp = 0.;
       for (int k=0; k<size; ++k) {
          idx1 = i*size + k;
          idx2 = k*size + j;
          tmp += A[idx1] * B[idx2];
       }
       idx3 = i*size + j;
       C[idx3] = tmp;
     }
   }
   return C;
}
float* MakeMatrix_linearIdx(int size, float *restrict arr) {
    int i, j, idx;
    arr = (float *)malloc( sizeof(float) * size * size);
    for (i=0; i<size; i++){
       for (j=0; j<size; j++){
          idx = i*size + j;
          arr[idx] = ((float)i);
       }
    }
    return arr;
}
void fillMatrix_linearIdx(int size, float *restrict A) {
   int idx;
   for (int i = 0; i < size; ++i) {
      for (int j = 0; j < size; ++j) {
        idx = i*size + j;
        A[idx] = ((float)i);
      }
   }
}

During compilation, the compiler will print this message:
Complex loop carried dependence of '*(A)' prevents parallelization
Parallelization would require privatization of array 'A[:size*size-1]'

To correct this, one can either: 1.) choose not to linearize the array index, and use arrays with two indices, or 2.) use the independent clause to tell the compiler that there really is no dependence, even if it detects otherwise. In the case of the linearized index presented here, the dependence detected by the compiler turns out to be a removable obstacle to parallelization, since the loop iterations are actually independent. The two code samples below reflect these two solutions.

1.) choosing not to linearize the array index, and instead using two array indices

float** MatrixMult(int size, float **restrict A, float **restrict B,
        float **restrict C) {
#pragma acc kernels pcopyin(A[0:size][0:size],B[0:size][0:size]) \
  pcopyout(C[0:size][0:size])
{
   float tmp;
   for (int i=0; i<size; ++i) {
      for (int j=0; j<size; ++j) {
         tmp = 0.;
         for (int k=0; k<size; ++k) {
            tmp += A[i][k] * B[k][j];
         }
         C[i][j] = tmp;
      }
   }
}
   return C;
}

2.) using the independent clause to tell the compiler that there really is no dependence between loop iterations. When selecting this option, the programmer must be confident that the iterations are independent, or else the program will behave unexpectedly.

float* MatrixMult_linearIdx(int size, float *restrict A,
        float *restrict B, float *restrict C) {
   // uses linearized indices for matrices
   int idx1, idx2, idx3;
   float tmp;

#pragma acc kernels pcopyin(A[0:size][0:size],B[0:size][0:size]) \
  pcopyout(C[0:size][0:size])
{
#pragma acc for independent
   for (int i=0; i<size; ++i) {
#pragma acc for independent
     for (int j=0; j<size; ++j) {
       tmp = 0.;
#pragma acc for independent
       for (int k=0; k<size; ++k) {
          idx1 = i*size + k;
          idx2 = k*size + j;
          tmp += A[idx1] * B[idx2];
       }
       idx3 = i*size + j;
       C[idx3] = tmp;
     }
   }
} // pragma acc kernels
   return C;
} // MatrixMult_linearIdx()

The Use of Contiguous Memory for Dynamically Allocated Data Structures

The dynamic allocation of data structures, such as multi-dimensional arrays, to heap memory is an important memory-management feature of the C language. Through the use of pointers (and pointers to pointers, etc.), one- and two-dimensional data structures can be allocated memory in an adjustable region managed by the OS, called the heap. This flexible memory region allows for the dynamic allocation and removal of data structures as a program executes, giving the programmer finer control over the memory a program uses.

With OpenACC, the relevant consideration is 2D and higher-dimensional data structures, such as arrays. Data transfer for a 2D data structure can be slowed down considerably if its row pointers are allocated using separate calls to malloc(). Put another way, each call to malloc() returns a region of heap memory whose position is unrelated to the position of previously assigned heap memory. A memory assignment for a 2D array such as in the following procedure will therefore result in a collection of non-contiguous blocks of memory comprising the total array.

float** MakeMatrix_nonContiguous(int size, float **restrict arr) {
// assigns memory non-contiguously in a 2D array
    int i;
    arr = (float **)malloc( sizeof(float *) * size);
    for (i=0; i<size; i++){
       arr[i] = (float *)malloc( sizeof(float) * size );
    }
    return arr;
}

In the code snippet above, each iteration of the i loop makes an entirely separate call to malloc().

By making one call to malloc(), allocating the entire space needed for the 2D array, and then setting pointer positions explicitly in a loop, such as in the following procedure, a contiguous memory assignment is made.

float** MakeMatrix(int size, float **restrict arr) {
// assigns memory contiguously in a 2D array
    int i;
    arr = (float **)malloc( sizeof(float *) * size);
    arr[0] = (float *)malloc( sizeof(float) * size * size);
    for (i=1; i<size; i++){
       arr[i] = (float *)(arr[i-1] + size);
    }
    return arr;
}

Because this 2D array is assigned to a contiguous memory block, there will just be one memory transfer across the PCIe bus, when sending it to or from the GPU.

In the non-contiguous implementation, however, each of the N rows is assigned memory with a separate malloc() statement. This results in N copies being sent across the PCIe bus, instead of just one, so data transfer to and from the device will be faster for memory-contiguous arrays.

To get a sense of how much faster the memory transfer is, the transfer time for square 1000x1000 matrices was measured over five iterations, with contiguous and with non-contiguous arrays, on a Tesla M40 GPU. The transfer times are shown below in Table 1.

memory assignment    PCIe xfer time (data copyin)    PCIe xfer time (data copyout)
contiguous           702 us                          316 us
non-contiguous       8,302 us                        5,551 us

Table 1. PCIe transfer times for contiguous vs. non-contiguous arrays; contiguous assignment results in faster data transfers.

A speedup of at least 11x was achieved by assigning matrix data to contiguous arrays.

Runtime output with data copy times for contiguously assigned data:
[john@node6 MD_openmp]$ ./matrix_ex_float 1000 5
./matrix_ex_float total runtime 0.055342

Accelerator Kernel Timing data
/home/john/MD_openmp/./matrix_ex_float.c
  MatrixMult  NVIDIA  devicenum=0
    time(us): 52,456
    19: compute region reached 5 times
        26: kernel launched 5 times
            grid: [63x16]  block: [16x32]
             device time(us): total=52,456 max=10,497 min=10,488 avg=10,491
            elapsed time(us): total=52,540 max=10,514 min=10,504 avg=10,508
    19: data region reached 5 times
    35: data region reached 5 times
/home/john/MD_openmp/./matrix_ex_float.c
  main  NVIDIA  devicenum=0
    time(us): 1,037
    96: data region reached 1 time
        31: data copyin transfers: 2
             device time(us): total=702 max=358 min=344 avg=351
        31: kernel launched 3 times
            grid: [8]  block: [128]
             device time(us): total=19 max=7 min=6 avg=6
            elapsed time(us): total=489 max=422 min=28 avg=163
    128: data region reached 1 time
        128: data copyout transfers: 1
             device time(us): total=316 max=316 min=316 avg=316

Runtime output with data copy times for non-contiguously assigned data:
[john@node6 MD_openmp]$ ./matrix_ex_float_non_contig 1000 5
./matrix_ex_float_non_contig total runtime 0.059821

Accelerator Kernel Timing data
/home/john/MD_openmp/./matrix_ex_float_non_contig.c
  MatrixMult  NVIDIA  devicenum=0
    time(us): 51,869
    19: compute region reached 5 times
        26: kernel launched 5 times
            grid: [63x16]  block: [16x32]
             device time(us): total=51,869 max=10,378 min=10,369 avg=10,373
            elapsed time(us): total=51,967 max=10,398 min=10,389 avg=10,393
    19: data region reached 5 times
    35: data region reached 5 times
/home/john/MD_openmp/./matrix_ex_float_non_contig.c
  main  NVIDIA  devicenum=0
    time(us): 31,668
    113: data region reached 1 time
        31: data copyin transfers: 2000
             device time(us): total=8,302 max=27 min=3 avg=4
        31: kernel launched 3000 times
            grid: [1]  block: [128]
             device time(us): total=17,815 max=7 min=5 avg=5
            elapsed time(us): total=61,892 max=422 min=19 avg=20
    145: data region reached 1 time
        145: data copyout transfers: 1000
             device time(us): total=5,551 max=28 min=5 avg=5

Tips for Parallelizing Loops

Parallelizing iterative structures can trigger warnings, and re-expressing the loop code is sometimes required. If the programmer uses a directive such as kernels, parallel, or region, and the compiler sees any dependency between loop iterations, then the compiler will not parallelize that section of code. By expressing the same iteration differently, it may be possible to avoid the warnings and get the compiler to accelerate the loops. The code sample below illustrates a loop which will cause the compiler to complain.

#pragma acc region
{
   while (i<N && found == -1) {
      if (A[i] >= 102.0f) {
         found = i;
      }
      ++i;
   }
}

Compiling the above code will generate the following warning from the compiler:
Accelerator restriction: loop has multiple exits
Accelerator region ignored

The problem here is that i could take on different values when the while loop exits, depending on when an executing thread samples a value of A[i] greater than or equal to 102.0. The value of i would vary from run to run, and would not produce the result the programmer intended.

When the code is re-expressed as the for loop below, with branching logic inside, the compiler will see the first loop as parallelizable.

#pragma acc region
{
   for (i=0; i<N; ++i) {
      if (A[i] >= 102.0f) {
         found[i] = i;
      } 
      else {
         found[i] = -1;
      }
   }
}
i=0;
while (i < N && found[i] < 0) {
   ++i;
}

Although this code is slightly longer, with two loops, accelerating the first loop makes up for splitting one loop into two. Normally, splitting one loop into two is bad for performance, but when expressing parallel loops, the old rules may no longer apply.

Another potential problem with accelerating loops is that inner loop variables may sometimes have to be declared as private. The loops below will trigger a compiler warning:

#pragma acc region
{
   for (int i=0; i<N; ++i) {
      for (int j=0; j<M; ++j) {
         for (int k=0; k<10; ++k) {
            tmp[k] = k;
         }
         sum=0;
         for (int k=0; k<10; ++k) {
            sum+=tmp[k];
         }
         A[i][j] = sum;
      }
   }
} 

The warning message from the compiler will be:
Parallelization would require privatization of array tmp[0:9]
Here, the problem is that the array tmp is not declared within the parallel region, nor is it copied into the parallel region. The outer i loop will run in parallel but the inner j loop will run sequentially because the compiler does not know how to handle the array tmp.

Since tmp is not initialized before the parallel region, and it is re-initialized before re-use in the region, it can be declared as a private variable in the region. This means that every thread will retain its own copy of tmp, and that the inner loop j can now be parallelized:

#pragma acc region
{
   for (int i=0; i<N; ++i) {
#pragma acc for private(tmp[0:9])
      for (int j=0; j<M; ++j) {
         for (int ii=0; ii<10; ++ii) {
            tmp[ii] = ii;
         }
         sum=0;
         for (int ii=0; ii<10; ++ii) {
            sum += tmp[ii];
         }
         A[i][j] = sum;
      }
   }
}

Compiler Information Reports

Compiling with Time Target (target=nvidia,time)

Compiling the program with time as a target causes the executable to report data transfer times to and from the device, as well as kernel execution times. Data transfer times can be helpful in detecting whether large portions of the runtime are being spent on needless data transfers between host and device, which is a common source of lost speedup when first learning OpenACC. To compile with the “time” target option, use the compiler syntax:
pgcc -ta=nvidia,time -acc -Minfo -o test test.c

The data transfer times reported below are shown for one version of the program, where no data region is established around the loop in main(), while only kernels are used in MultMatrix().

Runtime output with kernel times, grid size, and block size:
[john@node6 openacc_ex]$ ./matrix_ex_float 1000 5
./matrix_ex_float total runtime  0.46526

Accelerator Kernel Timing data
/home/john/openacc_ex/./matrix_ex_float.c
  MatrixMult  NVIDIA  devicenum=0
    time(us): 114,979
    29: compute region reached 5 times
        32: kernel launched 5 times
            grid: [8x1000]  block: [128]
             device time(us): total=109,238 max=21,907 min=21,806 avg=21,847
            elapsed time(us): total=109,342 max=21,922 min=21,825 avg=21,868
    29: data region reached 5 times
        31: data copyin transfers: 10
             device time(us): total=3,719 max=394 min=356 avg=371
        31: kernel launched 15 times
            grid: [8]  block: [128]
             device time(us): total=76 max=6 min=5 avg=5
            elapsed time(us): total=812 max=444 min=23 avg=54
    40: data region reached 5 times
        40: data copyout transfers: 5
             device time(us): total=1,946 max=424 min=342 avg=389

With kernels used in MatrixMult(), and a data region declared around the iterative loop in main():

Runtime output:
[john@node6 openacc_ex]$ ./matrix_ex_float 1000 5
./matrix_ex_float total runtime  0.11946

Accelerator Kernel Timing data
/home/john/openacc_ex/./matrix_ex_float.c
  MatrixMult  NVIDIA  devicenum=0
    time(us): 111,186
    29: compute region reached 5 times
        32: kernel launched 5 times
            grid: [8x1000]  block: [128]
             device time(us): total=109,230 max=21,903 min=21,801 avg=21,846
            elapsed time(us): total=109,312 max=21,918 min=21,818 avg=21,862
    29: data region reached 5 times
        31: kernel launched 5 times
            grid: [8]  block: [128]
             device time(us): total=25 max=5 min=5 avg=5
            elapsed time(us): total=142 max=31 min=27 avg=28
    40: data region reached 5 times
        40: data copyout transfers: 5
             device time(us): total=1,931 max=398 min=372 avg=386
/home/john/openacc_ex/./matrix_ex_float.c
  main  NVIDIA  devicenum=0
    time(us): 790
    176: data region reached 1 time
        31: data copyin transfers: 2
             device time(us): total=779 max=398 min=381 avg=389
        31: kernel launched 2 times
            grid: [8]  block: [128]
             device time(us): total=11 max=6 min=5 avg=5
            elapsed time(us): total=522 max=477 min=45 avg=261
    194: data region reached 1 time

When the data region is established around the iterative loop in main(), the boundary of the parallel data context is pushed out, so that the A and B matrices are only copied into the larger region once, for a total of two copyin transfers. In the former case, with no data region, the A and B matrices are copied at each of five iterations, resulting in a total of 10 copyin transfers.

Compiling with PGI_ACC_* Environment Variables

The environment variables PGI_ACC_NOTIFY and PGI_ACC_TIME provide timing information in the shell of execution.

PGI_ACC_NOTIFY
If set to 1, the runtime prints a line of output each time a kernel is launched on the GPU.
If set to 2, the runtime prints a line of output for each data transfer.
If set to 3, the runtime prints a line of output for both kernel launches and data transfers.

PGI_ACC_TIME
By setting PGI_ACC_TIME to 1, during runtime, a summary will be made of time taken for data movement between the host and the GPU, as well as kernel computation on the GPU.
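For example, both variables can be set in the shell before a run; the binary name here follows the earlier matrix example and is only illustrative:

```shell
export PGI_ACC_NOTIFY=3   # one line per kernel launch and per data transfer
export PGI_ACC_TIME=1     # timing summary printed when the program exits
./matrix_ex_float 1000 5
```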

Other OpenACC General Guidelines

1.) Avoid using pointer arithmetic. Instead, use subscripted arrays rather than pointer-indexed arrays. Pointer arithmetic can confuse compilers.

2.) When accelerating C code, compilers will not parallelize a loop containing data structures that are referenced by a pointer, unless that pointer is declared with the restrict qualifier. This can be done in a routine's argument list; if the data is used in the main body instead of a routine, the original pointer declaration must be restricted. The restrict qualifier assures the compiler that two pointers do not point to the same memory, so the loop can be parallelized without producing unpredictable results. If pointers are not restricted, the compiler will report the warning "loop not parallelizable". The code will compile, but the parallel region declared around the unrestricted pointers will be ignored.

The example below, taken from the application code, shows a procedure in which three of the pass-in variables are declared with the float **restrict data type, a restricted pointer to a pointer. Without restrict in the list of routine arguments, the compiler would report that the loops are not parallelizable.

float** MatrixMult(int size, float **restrict A, float **restrict B,
        float **restrict C) {
#pragma acc kernels pcopyin(A[0:size][0:size],B[0:size][0:size]) \
  pcopyout(C[0:size][0:size])
{
   float tmp;
   for (int i=0; i<size; ++i) {
      for (int j=0; j<size; ++j) {
         tmp = 0.;
         for (int k=0; k<size; ++k) {
            tmp += A[i][k] * B[k][j];
         }
         C[i][j] = tmp;
      }
   }
} // #pragma
   return C;
}

3.) It is possible to do conditional compilation with the _OPENACC macro. When OpenACC support is enabled, the compiler defines _OPENACC to a value indicating the version of the OpenACC specification in use.

4.) Routines called within a directive region must be inlined at compile time. Commonplace C library functions, such as rand() from stdlib.h, will present a problem when they appear in a loop that needs to be parallelized. The only solution, other than removing the function call and replacing it with something that can be parallelized, is to inline the function when compiling at the command line. The loop can then be accelerated, with the proviso that the inlined call runs sequentially. For example, the following code region, if identified as a parallel region using compiler directives, would require inlining of the rand() call:

void initialize(int np, int nd, vnd_t box, vnd_t *restrict pos) {
   int i, j;
   double x;
   srand(4711L);
#pragma acc kernels 
   for (i = 0; i<np; i++) {
      for (j = 0; j<nd; j++) {
         x = rand() % 10000 / (double)10000.0;
         pos[i][j] = box[j] * x;
      }
   }
}

Here, the inner j loop would run sequentially (and slowly), but the outer i loop would run in parallel.
The inlining of rand() requires a command-line flag of the form:
pgcc -acc -ta=nvidia -Minfo -Minline=func1,func2 -o ./a.out ./a.c

Since we are inlining only one function, rand(), the inlining would be:
pgcc -acc -ta=nvidia -Minfo -Minline=rand -o ./a.out ./a.c

If, however, the function to be inlined is in a different file than the one with the accelerated code
region, then the inter-procedural optimizer must be used. Automatic inlining is enabled by specifying
-Mipa=inline on the command-line.

5.) Reduction operations resolve a global quantity across threads, such as the minimum value in a loop, or the sum total. When a reduction is done in a parallel region, the compiler will usually detect it automatically. Explicitly applying a reduction operation requires that a clause be placed on a loop or parallel directive:

#pragma acc parallel loop pcopyin(m, n, Anew[0:n][0:m], A[0:n][0:m]) reduction(max:error)
for( int j=1; j<n-1; j++){
	for( int i=1; i<m-1; i++){
		Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
					+ A[j-1][i] + A[j+1][i]);
		error = fmaxf( error, fabsf(Anew[j][i]-A[j][i]));
	}
}

Here, reduction(max:error) tells the compiler to find the maximum value of error across executing kernels. The reduction operators in OpenACC are: +, *, &&, ||, &, |, ^, max, and min.

OpenACC Runtime Library Routines

OpenACC features a number of runtime routines, declared in the openacc.h header file. An exhaustive list will not be given here, but some functionalities provided by the library include setting which device to use, getting the current device type, and dynamically allocating memory on the device. There are approximately two dozen OpenACC runtime routines, which are described in further detail in the OpenACC Reference Guide [ref. 1].

Background Reading

1.) OpenACC 2.6 Quick Reference
2.) OpenACC 1.0 Quick Reference
3.) The PGI Accelerator Programming Model on NVIDIA GPUs
4.) 11 Tips for Maximizing Performance with OpenACC Directives in Fortran
5.) 12 Tips for Maximum Performance with PGI Directives in C
6.) The OpenACC Application Programming Interface Version 2.0 (July, 2013)

John Murphy

About John Murphy

My background in HPC includes building two clusters at the University of Massachusetts, along with doing computational research in quantum chemistry on the facilities of the Massachusetts Green High Performance Computing Center. My personal interests and academic background encompass a range of topics across science and engineering. In recent work, I used the GAMESS quantum chemistry package in order to study theoretical highly strained hydrocarbon structures, derived from prismane building blocks. I also recently authored a small software application in Python for generating amorphous cellulose within a periodic space. This application was used for generating structures for further study in NAMD and LAMMPS. Prior to doing research in Quantum and Materials Chemistry, I worked on problems related to protein folding and docking. It is very exciting, especially, to be involved with applications of GPU computing to these, as well as to other scientific questions. For several years, while not doing research, I was a consulting software engineer and built a variety of internet and desktop software applications. As an HPC Sales Specialist at Microway, I greatly look forward to advising Microway's clients in order to provide them with well-configured, optimal HPC solutions.
