GPU Performance without GPU Coding

Eliot Eshelman · January 13, 2012

I think everyone in the HPC arena has heard plenty about GPUs. GPUs aren’t sophisticated like CPUs, but they provide raw performance for those who know how to use them. The question for those who have large computational workloads has been: Do I have the time, energy and know-how to take advantage of GPUs?

NVIDIA and PGI are hoping to demonstrate that almost anyone can succeed. The new OpenACC directives standard allows the compiler to accelerate code without a complete re-write or digging into CUDA. Your application can be running twice as fast (or more) with less than a month of work. And if you register with Microway, you can easily use our 8-GPU Tesla SimCluster to recompile your application for GPUs and see the speedup! This post will walk you through the process.

Get the PGI Accelerator Compilers
Install the PGI Compilers
Accelerate Your Application
Deploy Your Accelerated Application

Get the PGI Accelerator Compilers

PGI Compilers are now available in a Community Edition.

If you plan to use Microway’s GPU cluster for benchmarking, skip past this section. We’ll take care of these for you, so you can get right to accelerating your code.

If you’re not working with us you will need an NVIDIA GPU with CUDA capability, which is almost any graphics card or GPU released since 2008. Even many laptops include support, so most users should find that they already own a compatible GPU.

Install the PGI Compilers

As mentioned above, skip down to accelerating your code if you’ll be using Microway’s Tesla SimCluster for testing. We already have the PGI Accelerator Compilers installed.

Windows and MacOS installers should be very straightforward. The process on Linux is also fairly painless, although you’ll need root privileges:

[root@md ~]# mkdir pgi
[root@md ~]# cd pgi
[root@md pgi]# tar zxvf ../pgilinux-1110.tar.gz
[root@md pgi]# ./install

Once the installation script is running, follow these steps:

Accept the license terms.
Select ‘Single system install’ (#1).
Choose /opt/pgi/ for the installation directory.
Do not install ACML.
If you have not already installed NVIDIA CUDA, then select yes when asked whether to install the CUDA components.
Do not install Java.
Do create links in the 2011 directory.
Do not install MPICH1.
Do not generate license keys.
Pause here to take note that PGI has saved your system information, including hostid and hostname. You’ll need this information when you generate the trial license keys, so you can copy & paste the information now or get it later from the /opt/pgi/license.info text file.
There’s no need to make the installation directory read-only.

You then need to make the compilers accessible (these settings can also be written to a file /etc/profile.d/pgi.sh [or pgi.csh] to make them permanent for all users):

For csh:

  % setenv PGI /opt/pgi
  % set path=(/opt/pgi/linux86-64/11.10/bin $path)
  % setenv MANPATH "$MANPATH":/opt/pgi/linux86-64/11.10/man
  % setenv LM_LICENSE_FILE "$LM_LICENSE_FILE":/opt/pgi/license.dat

For bash, sh or ksh:

  $ PGI=/opt/pgi
  $ export PGI
  $ PATH=/opt/pgi/linux86-64/11.10/bin:$PATH
  $ export PATH
  $ MANPATH=$MANPATH:/opt/pgi/linux86-64/11.10/man
  $ export MANPATH
  $ LM_LICENSE_FILE=$LM_LICENSE_FILE:/opt/pgi/license.dat
  $ export LM_LICENSE_FILE

If all has been properly configured, you will be able to run a test compilation to verify that the compilers run and the license is detected. The source file x.c does not need to exist, so the final “Unable to open source input file” error is expected:

[root@md ~]# pgcc -V x.c

pgcc 11.10-0 64-bit target on x86-64 Linux -tp nehalem 
Copyright 1989-2000, The Portland Group, Inc.  All Rights Reserved.
Copyright 2000-2011, STMicroelectronics, Inc.  All Rights Reserved.
NOTE: your trial license will expire in 14 days, 0.578 hours.
PGC-F-0002-Unable to open source input file: x.c
PGC/x86-64 Linux 11.10-0: compilation aborted

Accelerate Your Application

If you plan to test using the Microway Tesla SimCluster then you get to skip the work of installing the PGI compilers, but you will need to submit a benchmark request.

The first step to acceleration can be performed before you have access to any GPUs: you must determine which parts of your application hog all the CPU time. In many cases, most of the code is supporting a fairly small section which performs the actual computation. If you have a large code base with which you’re not familiar, then a quick run with a profiler will give you details about which portion of the code does most of the work. Note that the amount of time spent in each section of code is much more important than how many times a particular portion is called.
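If no profiler is at hand, a rough first measurement can be made by timing candidate sections by hand. The following is only a minimal sketch: setup_data and compute_kernel are hypothetical stand-ins for your own supporting code and computational core, and cpu_seconds_since is an illustrative helper built on the standard C clock() function.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for supporting code: fills the input array. */
void setup_data(double *x, size_t n)
{
    assert(x != NULL);
    for (size_t i = 0; i < n; i++)
        x[i] = (double)i / (double)n;
}

/* Hypothetical stand-in for the computational core: repeated sweeps
   that accumulate the sum of squares of the input. */
double compute_kernel(const double *x, size_t n, int sweeps)
{
    assert(x != NULL);
    double sum = 0.0;
    for (int s = 0; s < sweeps; s++)
        for (size_t i = 0; i < n; i++)
            sum += x[i] * x[i];
    return sum;
}

/* Returns the CPU seconds consumed since the clock() value `start`. */
double cpu_seconds_since(clock_t start)
{
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Record clock() before each call, pass that value to cpu_seconds_since afterwards, and compare the totals per section. The section with the largest total time, not the largest call count, is the first candidate for acceleration.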

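The second step is to insert the OpenACC Directives (sometimes called pragmas) around the critical sections of code: in Fortran one directive opens the section and another closes it, while in C a single directive precedes the loop or block. The OpenACC standard allows for a large amount of information to be passed to the compiler, but in many cases a simple statement is sufficient. Here are two example pieces of code, one in Fortran and one in C. Both use the acc region spelling from PGI’s Accelerator programming model, which the 11.10 compilers accept; the OpenACC standard names the equivalent construct acc kernels.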

Accelerated Fortran Snippet (matrix multiplication):

!$acc region 
      do k = 1,n1
       do i = 1,n3
        c(i,k) = 0.0
        do j = 1,n2
         c(i,k) = c(i,k) + a(i,j) * b(j,k)
        enddo
       enddo
      enddo
!$acc end region
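For readers working in C, the same kind of loop nest might look as follows. This is a minimal sketch rather than code from PGI or NVIDIA: the function name matmul, the row-major layout and the argument order are assumptions made for illustration, and it uses the kernels directive from the OpenACC 1.0 specification. Because C pointers carry no size information, the copyin and copyout clauses state how much of each array to move to and from the GPU.

```c
#include <assert.h>

/* c (m x p) = a (m x n) * b (n x p), all stored row-major.
   matmul is a hypothetical name chosen for this sketch. */
void matmul(int m, int n, int p,
            const float *restrict a, const float *restrict b,
            float *restrict c)
{
    assert(m > 0 && n > 0 && p > 0);

    /* A compiler without OpenACC support ignores this pragma and
       runs the loops on the CPU. */
    #pragma acc kernels copyin(a[0:m*n], b[0:n*p]) copyout(c[0:m*p])
    {
        for (int i = 0; i < m; i++) {
            for (int k = 0; k < p; k++) {
                float sum = 0.0f;
                for (int j = 0; j < n; j++)
                    sum += a[i*n + j] * b[j*p + k];
                c[i*p + k] = sum;
            }
        }
    }
}
```

Accumulating into a local sum, rather than into c directly, keeps each output element independent of the others, which generally makes the outer loops easier for the compiler to parallelize.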

Accelerated C Snippet (calculating PI):

    double pi = 0.0; long i;
    #pragma acc region for
    for (i=0; i<N; i++)
    {
        double t = (double) ((i+0.5)/N);
        pi += 4.0/(1.0+t*t);
    }
    /* pi now holds N times the estimate; divide by N to obtain PI */
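As a sketch of how this loop could be packaged into a complete function, the version below takes the interval count as an argument and applies the final scaling by the interval width. The function name estimate_pi is hypothetical, and the directive is written in OpenACC 1.0 style (parallel loop with an explicit reduction clause) on the assumption that your compiler supports the standard.

```c
#include <assert.h>

/* Midpoint-rule estimate of PI as the integral of 4/(1+t^2) over [0,1].
   estimate_pi is a hypothetical name chosen for this sketch. */
double estimate_pi(long n)
{
    assert(n > 0);
    double sum = 0.0;

    /* Every iteration adds into sum, so the reduction clause tells the
       compiler to combine per-thread partial sums safely. */
    #pragma acc parallel loop reduction(+:sum)
    for (long i = 0; i < n; i++) {
        double t = (i + 0.5) / n;      /* midpoint of interval i */
        sum += 4.0 / (1.0 + t * t);
    }

    return sum / n;                    /* scale by the interval width 1/n */
}
```

Calling estimate_pi with a larger n gives a closer estimate, at the cost of more iterations for the accelerator to share out.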

There are already many good resources for those wishing to try out Directives, including tutorial videos/webinars, presentations, a case study of accelerating WRF, specification white papers and more. Start with OpenACC at NVIDIA’s Developer portal.

Astute readers may have already noticed that none of these Directives are specific to GPUs. They are nothing more than comments (in Fortran) or pragmas (in C), which a compiler without accelerator support simply ignores. Therefore, it’s possible to use a single code base for both CPU-only and CPU+GPU systems. You can put your newly-accelerated code into production now, but won’t get the real speedups until you add the Tesla GPUs.
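The single-code-base idea can be sketched in C as below. The names built_with_openacc and saxpy are hypothetical. The _OPENACC macro is defined by the OpenACC specification whenever directives are enabled, so the same source can report which kind of build it is.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 when the compiler had OpenACC enabled, 0 for a plain CPU build. */
int built_with_openacc(void)
{
#ifdef _OPENACC
    return 1;
#else
    return 0;
#endif
}

/* y = alpha * x + y. The source is identical for both builds; only the
   compiler's treatment of the pragma differs. */
void saxpy(size_t n, float alpha, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);

    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```

Both builds should produce the same results from saxpy. A check like built_with_openacc is mainly useful for logging which variant is running in production.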

Deploy Your Accelerated Application

Once you’ve seen the speedup you can achieve with GPUs, it’s a question of integrating GPUs into your current workflow and production systems. You may feel the most straightforward approach is simply to have Microway install a SimCluster for you – we can provide an exact match to what you tested with both hardware and software ready-to-go. We also offer a variety of customizable workstation and server systems to fit almost any need. Contact us to discuss your requirements and we’ll make a recommendation.
