Benchmark MATLAB GPU Acceleration on NVIDIA Tesla K40 GPUs

MATLAB solving a second order wave equation on Tesla GPUs

MATLAB is a well-known and widely used application, and for good reason. It functions as a powerful, yet easy-to-use, platform for technical computing. With support for a variety of parallel execution methods, MATLAB also performs well. Support for running MATLAB on GPUs has been built in for a couple of years, with better support in each release. If you haven’t tried it yet, take this opportunity to test MATLAB performance on GPUs. Microway’s GPU Test Drive makes the process quick and easy. As we’ll show in this post, you can expect to see 3X to 6X performance increases for many tasks (with 30X to 60X speedups on select workloads).

Access a Compute Node with GPU-accelerated MATLAB

Getting started with MATLAB on our GPU cluster is easy: complete this form to sign up for MATLAB GPU benchmarking. We will send you an e-mail with detailed instructions for logging in and starting up MATLAB. Once you’re in, all you need to do is click the MATLAB icon and the latest version of GPU-Accelerated MATLAB will pop up:
Mathworks MATLAB R2014b splashscreen

We use NoMachine to export the graphical sessions from our cluster to your local PC/laptop. This makes login extremely user-friendly, ensures your interactive session performs well, and provides a built-in method for file transfers in and out of the GPU cluster. MATLAB is known for performing sluggishly over standard Unix/Linux graphical sessions (e.g., X11 forwarding, VNC), but you’ll have no such issues here.

You’ll be dropped into a standard MATLAB workspace. A variety of parallelized demonstrations of GPU usage are included with MATLAB. Pick one and give it a try! You can type paralleldemo_gpu and then hit <TAB> to see the full list of options.

Main MATLAB R2014b window

Measure MATLAB GPU Speedups

Below we show the output from several of the built-in MATLAB parallel GPU demos. A few are text-only, but several include a graphical component or performance plot. The first example runs a quick test on memory transfer speeds and computational throughput. Results from both the GPU and the host (CPUs) are shown:

>> paralleldemo_gpu_benchmark
Using a Tesla K40m GPU.
Achieved peak send speed of 3.44069 GB/s
Achieved peak gather speed of 2.20036 GB/s
Achieved peak read+write speed on the GPU: 233.613 GB/s
Achieved peak read+write speed on the host: 12.9773 GB/s
Achieved peak calculation rates of 398.9 GFLOPS (host), 1345.8 GFLOPS (GPU)
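
The send and gather figures above describe how quickly data moves between host memory and GPU memory. A minimal sketch of how you might take a similar measurement yourself is shown here. It assumes the Parallel Computing Toolbox and a supported GPU are available; the array size is an arbitrary choice for illustration, and this is not the demo's own code.

```matlab
% Sketch: estimate host<->GPU transfer bandwidth (not the demo's own code).
numElements = 2^24;                  % arbitrary test size: about 134 MB of doubles
hostData = rand(numElements, 1);     % data in host (CPU) memory
numBytes = numElements * 8;          % 8 bytes per double

% Host -> GPU transfer ("send")
sendTime = gputimeit(@() gpuArray(hostData));

% GPU -> host transfer ("gather")
gpuData = gpuArray(hostData);
gatherTime = gputimeit(@() gather(gpuData));

fprintf('Send speed:   %.2f GB/s\n', numBytes / sendTime / 1e9);
fprintf('Gather speed: %.2f GB/s\n', numBytes / gatherTime / 1e9);
```

A single array size will not necessarily reach the peak rate, so expect the numbers to differ from those reported by the demo.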

Note that the host results are affected by the number of local workers available in the Parallel Computing Toolbox. Starting with R2011b, the default was limited to 12 threads/CPU cores; with the release of R2014a, MathWorks removed that limit. For these tests we changed the number of workers to 20 in the Parallel Preferences dialog box.
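
If you prefer the command line to the preferences dialog, the worker count can also be set on the cluster profile. The sketch below assumes the default profile is named 'local', as it was in releases of this era; the value 20 simply matches the setting used for these tests.

```matlab
% Sketch: set the number of local workers programmatically.
c = parcluster('local');   % assumes the default local profile is named 'local'
c.NumWorkers = 20;         % match the 20 workers used for these tests
saveProfile(c);            % persist the change to the profile
```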

The next demo, paralleldemo_gpu_backslash, plots the GPU speedup for solving dense linear systems with MATLAB’s backslash operator (A\b). It compares dual 10-core Xeon CPUs against a single NVIDIA Tesla K40 GPU. Both single-precision and double-precision floating-point calculations were run.

Linear System Solve (Backslash) Speedups

MATLAB paralleldemo_gpu_backslash single-precision GPU speedup

MATLAB paralleldemo_gpu_backslash double-precision GPU speedup

MATLAB GPU speedups for paralleldemo_gpu_backslash linear system solves
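
The demo takes its name from MATLAB's backslash operator, which solves a linear system A\b. A rough sketch of timing one such solve on the CPU and on the GPU follows. The matrix size, the diagonal shift, and the operation count used to convert time into gigaflops are assumptions made for illustration, not values taken from the demo.

```matlab
% Sketch: time A\b on the CPU and on the GPU for one matrix size.
n = 2048;                                        % arbitrary size for illustration
A = rand(n, 'single') + 100 * eye(n, 'single');  % diagonal shift keeps A comfortably nonsingular
b = rand(n, 1, 'single');

cpuTime = timeit(@() A \ b);                     % CPU solve

Agpu = gpuArray(A);                              % copy inputs to the GPU once,
bgpu = gpuArray(b);                              % outside the timed region
gpuTime = gputimeit(@() Agpu \ bgpu);            % GPU solve

% Assumed approximate operation count for an LU-based solve.
flops = (2/3) * n^3 + (3/2) * n^2;
fprintf('Gigaflops on CPU: %f\n', flops / cpuTime / 1e9);
fprintf('Gigaflops on GPU: %f\n', flops / gpuTime / 1e9);
fprintf('Speedup: %.1fx\n', cpuTime / gpuTime);
```

Copying the inputs to the GPU before timing means the comparison covers only the solve, not the transfer.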

Raw MATLAB Output

>> paralleldemo_gpu_backslash
Starting benchmarks with 8 different single-precision matrices of sizes
ranging from 1024-by-1024 to 29696-by-29696.
Creating a matrix of size 1024-by-1024.
Gigaflops on CPU: 66.278709
Gigaflops on GPU: 107.556334
Creating a matrix of size 5120-by-5120.
Gigaflops on CPU: 235.782899
Gigaflops on GPU: 988.360718
Creating a matrix of size 9216-by-9216.
Gigaflops on CPU: 345.775846
Gigaflops on GPU: 1411.722193
Creating a matrix of size 13312-by-13312.
Gigaflops on CPU: 430.923486
Gigaflops on GPU: 1631.047366
Creating a matrix of size 17408-by-17408.
Gigaflops on CPU: 493.923539
Gigaflops on GPU: 1708.917025
Creating a matrix of size 21504-by-21504.
Gigaflops on CPU: 529.809413
Gigaflops on GPU: 1754.558735
Creating a matrix of size 25600-by-25600.
Gigaflops on CPU: 567.786871
Gigaflops on GPU: 1804.538355
Creating a matrix of size 29696-by-29696.
Gigaflops on CPU: 597.913569
Gigaflops on GPU: 1842.050491
Starting benchmarks with 6 different double-precision matrices of sizes
ranging from 1024-by-1024 to 21504-by-21504.
Creating a matrix of size 1024-by-1024.
Gigaflops on CPU: 45.881347
Gigaflops on GPU: 84.044136
Creating a matrix of size 5120-by-5120.
Gigaflops on CPU: 112.758309
Gigaflops on GPU: 653.228694
Creating a matrix of size 9216-by-9216.
Gigaflops on CPU: 135.980895
Gigaflops on GPU: 883.155216
Creating a matrix of size 13312-by-13312.
Gigaflops on CPU: 223.848074
Gigaflops on GPU: 975.277154
Creating a matrix of size 17408-by-17408.
Gigaflops on CPU: 254.737638
Gigaflops on GPU: 1004.284010
Creating a matrix of size 21504-by-21504.
Gigaflops on CPU: 277.688546
Gigaflops on GPU: 1028.731291
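
To turn the raw output above into speedup factors, divide each GPU figure by the matching CPU figure. The short script below does that with the values copied from the output, rounded to one decimal place. It runs on any MATLAB installation, with no GPU required.

```matlab
% Gigaflops copied from the raw output above, rounded to one decimal place.
sizesSingle = [1024 5120 9216 13312 17408 21504 25600 29696];
cpuSingle   = [66.3 235.8 345.8 430.9 493.9 529.8 567.8 597.9];
gpuSingle   = [107.6 988.4 1411.7 1631.0 1708.9 1754.6 1804.5 1842.1];

sizesDouble = [1024 5120 9216 13312 17408 21504];
cpuDouble   = [45.9 112.8 136.0 223.8 254.7 277.7];
gpuDouble   = [84.0 653.2 883.2 975.3 1004.3 1028.7];

speedupSingle = gpuSingle ./ cpuSingle;   % element-wise GPU/CPU ratio
speedupDouble = gpuDouble ./ cpuDouble;

for k = 1:numel(sizesSingle)
    fprintf('single %5d-by-%-5d: %.1fx\n', sizesSingle(k), sizesSingle(k), speedupSingle(k));
end
for k = 1:numel(sizesDouble)
    fprintf('double %5d-by-%-5d: %.1fx\n', sizesDouble(k), sizesDouble(k), speedupDouble(k));
end
```

The ratios range from about 1.6x to 4.2x in single precision and from about 1.8x to 6.5x in double precision. The smallest matrices show the least benefit.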

GPU-Accelerated Stencil Operations

MATLAB also includes a couple of stencil operation demos that run on a GPU. These include both a “generic” gpuArray implementation and optimized MEX implementations using GPU shared and texture memory. As shown below, the optimized versions run roughly 30 to 60 times faster than the generic gpuArray version.

>> paralleldemo_gpu_mexstencil
Average time on the GPU: 1.119ms per generation
Average time of 0.038ms per generation (29.4x faster).
Average time of 0.019ms per generation (58.9x faster).
First version using gpuArray:  1.119ms per generation.
MEX with shared memory: 0.038ms per generation (29.4x faster).
MEX with texture memory: 0.019ms per generation (58.9x faster).
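
A stencil operation computes each element of an array from that element's neighbors. As a minimal sketch of the "generic" gpuArray style, the code below advances Conway's Game of Life by one generation. The choice of Game of Life, the helper names, and the use of conv2 are assumptions made for illustration; this is not the demo's own code.

```matlab
% Sketch: one generation of Conway's Game of Life as a stencil operation.
% Count the eight neighbors of every cell with a 3x3 convolution, then
% subtract the cell itself. round() guards against floating-point noise.
countNeighbors = @(g) round(conv2(single(g), ones(3, 'single'), 'same')) - single(g);

% A cell is alive in the next generation if it has exactly 3 live
% neighbors, or if it is alive now and has exactly 2.
lifeStep = @(g) (countNeighbors(g) == 3) | (g & (countNeighbors(g) == 2));

board = false(5);
board(2:4, 3) = true;                    % a vertical "blinker" pattern

gpuBoard = gpuArray(board);              % move the board to the GPU
nextBoard = gather(lifeStep(gpuBoard));  % compute on the GPU, copy result back
disp(nextBoard)                          % the blinker is now horizontal (row 3)
```

The same lifeStep handle also works on an ordinary logical array, which makes it easy to check GPU results against the CPU.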

Running your own test of MATLAB GPU speedups

To see a list of other useful demos, take a look at the GPU-accelerated examples on the MathWorks File Exchange. You’ll find a large number of useful demonstrations, including:

  • GPU acceleration for FFTs
  • Heat transfer equations
  • Navier-Stokes equations for incompressible fluids
  • Anisotropic Diffusion
  • Gradient Vector Flow (GVF) force field calculation
  • 3D linear and trilinear interpolation
  • more than 60 others

Also consider that hundreds of MATLAB’s standard functions support GPU acceleration. Utilizing these capabilities is quite straightforward: load your data into a gpuArray, then pass the gpuArray to any of the supported functions and the operations will be carried out on the GPU!
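
As a minimal sketch of that workflow, the example below moves a vector to the GPU, runs a few standard functions on it, and copies the result back. The input data and variable names are made up for illustration, and it assumes the Parallel Computing Toolbox and a supported GPU are available.

```matlab
% Sketch: run standard MATLAB functions on the GPU via gpuArray.
hostSignal = rand(1, 2^20);          % made-up input data in host memory

gpuSignal = gpuArray(hostSignal);    % copy the data into GPU memory
gpuSpectrum = fft(gpuSignal);        % fft runs on the GPU for gpuArray input
gpuPower = abs(gpuSpectrum) .^ 2;    % element-wise operations stay on the GPU

hostPower = gather(gpuPower);        % copy the result back to host memory
fprintf('Result class on GPU: %s, on host: %s\n', class(gpuPower), class(hostPower));
```

Keeping intermediate results on the GPU and calling gather only at the end avoids repeated transfers, which are far slower than on-GPU memory access.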

MATLAB paramSweep demo

Will GPU acceleration speed up your research?

With our pre-configured GPU cluster, running MATLAB on high-performance GPUs is as easy as running it on your own workstation. Find out for yourself how much faster you’ll be able to work if you add GPUs to your toolbelt. Sign up for a GPU Test Drive today!


    Featured Illustration:

    “Solving 2nd Order Wave Equation on the GPU Using Spectral Methods” by Jiro Doke
    Mathworks MATLAB Central

    Eliot Eshelman

    About Eliot Eshelman

    My interests span from astrophysics to bacteriophages; high-performance computers to small spherical magnets. I've been an avid Linux geek (with a focus on HPC) for more than a decade. I work as Microway's Vice President of Strategic Accounts and HPC Initiatives.