Parallel Code: Maximizing your Performance Potential

justin.p.mckennon

June 3, 2013

No matter what the purpose of your application is, one thing is certain. You want to get the most bang for your buck. You see research papers being published and presented making claims of tremendous speed increases by running algorithms on the GPU (e.g. NVIDIA Tesla), in a cluster, or on a hardware accelerator (such as the Xeon Phi or Cell BE). These architectures allow for massively parallel execution of code that, if done properly, can yield lofty performance gains.

Unlike most aspects of programming, the actual writing of the programs is (relatively) simple. Most hardware accelerators support (or are very similar to) C based programming languages. This makes hitting the ground running with parallel coding an actually doable task. While mastering the development of massively parallel code is an entirely different matter, with a basic understanding of the principles behind efficient, parallel code, one can obtain substantial performance increases compared to traditional programming and serial execution of the same algorithms.

In order to ensure that you’re getting the most bang for your buck in terms of performance increases, you need to be aware of the bottlenecks associated with coprocessor/GPU programming. Fortunately for you, I’m here to make this an easier task. By simply avoiding these programming “No-No’s” you can optimize the performance of your algorithm without having to spend hundreds of hours learning about every nook and cranny of the architecture of your choice. This series will discuss and demystify these performance-robbing bottlenecks, and provide simple ways to make these a non-factor in your application.

Parallel Thread Management – Topic #1

First and foremost, the most important thing with regard to parallel programming is the proper management of threads. Threads are the smallest sequence of programmed instructions that are able to be utilized by an operating system scheduler. Your application’s threads must be kept busy (not waiting) and non-divergent. Properly scheduling and directing threads is imperative to avoid wasting precious computing time.

Read: CUDA Parallel Thread Management, Divergence and Profiling

Host/Device Transfers and Data Movement – Topic #2

Transferring data between the host and device is a very costly move. It is not uncommon to have code making multiple transactions between the host and device without the programmer’s knowledge. Cleverly structuring code can save tons of processing time! On top of that, it is imperative to understand the cost of these host device transfers. In some cases, it may be more beneficial to run certain algorithms or pieces of code on the host due to the costly transfer time associated with farming data to the device.

Read: Profile CUDA Host-to-Device Transfers and Data Movement
Read: Optimize CUDA Host-to-Device Transfers

Cache and Shared Memory Optimizations – Topic #3

In addition to managing the threads running in your application, properly utilizing the various memory types available on your device is paramount to ensuring that you’re squeezing every drop of performance from your application. Shared memory, local memory, and register memory all have their advantages and disadvantages and need to be used very carefully to avoid wasting valuable clock cycles. Phenomena such as bank conflicts, memory spilling (too much data being placed in registers and spilling into local memory), improper loop unrolling, as well as the amount of shared memory, all play pivotal roles in obtaining the greatest performance.

Read: GPU Memory Types and Memory Performance Comparison
Read: GPU Shared Memory Performance Optimization
Read: Avoiding GPU Memory Performance Bottlenecks

More to come…

All in all, utilizing devices like the NVIDIA GPU, Cell BE or Intel Xeon Phi to increase the performance of your application doesn’t have to be a daunting task. Over the next several posts, this blog will outline and identify effective techniques to make troubleshooting the performance leaks of your application an easy matter. Each of these common bottlenecks will be discussed in detail in an effort to provide programmers insight into how to make use of all the resources that many popular architectures provide.

Implementing NVIDIA AI Blueprint

Microway Achieves DGX SuperPOD Specialization Partner Status with NVIDIA

DGX A100 review: Throughput and Hardware Summary