Benchmark MATLAB GPU Acceleration on NVIDIA Tesla K40 GPUs

MATLAB solving a second order wave equation on Tesla GPUs

MATLAB is a well-known and widely-used application – and for good reason. It functions as a powerful, yet easy-to-use, platform for technical computing. With support for a variety of parallel execution methods, MATLAB also performs well. Support for running MATLAB on GPUs has been built in for a couple of years, with better support in each release. If you haven’t tried it yet, take this opportunity to test MATLAB performance on GPUs. Microway’s GPU Test Drive makes the process quick and easy. As we’ll show in this post, you can expect to see 3X to 6X performance increases for many tasks (with 30X to 60X speedups on select workloads).

Continue reading

Running GPU Benchmarks of HOOMD-blue on a Tesla K40 GPU-Accelerated Cluster

Cropped shot of a HOOMD-blue micellar crystals simulation (visualized with VMD)

This short tutorial explains the usage of the GPU-accelerated HOOMD-blue particle simulation toolkit on our GPU-accelerated HPC cluster. Microway allows you to quickly test your codes on the latest high-performance systems – you are free to upload and run your own software, although we also provide a variety of pre-compiled applications with built-in GPU acceleration. Our GPU Test Drive Cluster is a useful resource for benchmarking the performance improvements which can be achieved with NVIDIA Tesla GPUs.

This post demonstrates HOOMD-blue, which comes out of the Glotzer group at the University of Michigan. HOOMD-blue supports a wide variety of integrators and potentials, as well as the capability to scale runs up to thousands of GPUs. We’ll demonstrate one server with dual NVIDIA® Tesla® K40 GPUs delivering speedups of over 13X!

Continue reading

Benchmarking NAMD on a GPU-Accelerated HPC Cluster with NVIDIA Tesla K40

Cropped shot of a NAMD stmv simulation (visualized with VMD)

This is a tutorial on the usage of GPU-accelerated NAMD for molecular dynamics simulations. We make it simple to test your codes on the latest high-performance systems – you are free to use your own applications on our cluster and we also provide a variety of pre-installed applications with built-in GPU support. Our GPU Test Drive Cluster acts as a useful resource for demonstrating the increased application performance which can be achieved with NVIDIA Tesla GPUs.

This post describes the scalable molecular dynamics software NAMD, which comes out of the Theoretical and Computational Biophysics Group at the University of Illinois at Urbana-Champaign. NAMD supports a variety of operational modes, including GPU-accelerated runs across large numbers of compute nodes. We’ll demonstrate how a single server with NVIDIA® Tesla® K40 GPUs can deliver speedups of over 4X!

Continue reading

Running AMBER on a GPU Cluster

Cropped shot of an AMBER nucleosome simulation (visualized with VMD)

Welcome to our tutorial on GPU-accelerated AMBER! We make it easy to benchmark your applications and problem sets on the latest hardware. Our GPU Test Drive Cluster provides developers, scientists, academics, and anyone else interested in GPU computing with the opportunity to test their code. While Test Drive users are given free rein to use their own applications on the cluster, Microway also provides a variety of pre-installed GPU-accelerated applications.

In this post, we will look at the molecular dynamics package AMBER. Collaboratively developed by professors at a variety of university labs, the latest versions of AMBER natively support GPU acceleration. We’ll demonstrate how NVIDIA® Tesla® K40 GPUs can deliver a speedup of up to 86X!

Continue reading

AVX2 Optimization and Haswell-EP (Xeon E5-2600v3) CPU Features

We’re very excited to be delivering systems with the new Xeon E5-2600v3 and E5-1600v3 CPUs. If you are the type who loves microarchitecture details and compiler optimization, there’s a lot to gain. If you haven’t explored the latest techniques and instructions for optimization, it’s never a bad time to start.

Many end users don’t see instruction set changes as consequential. However, they can be absolutely critical to achieving optimal application performance. Here’s a comparison of Theoretical Peak Performance of the latest CPUs with and without FMA3:
Plot of Xeon E5-2600v3 Theoretical Peak Performance (GFLOPS)

Only a small set of codes will be capable of issuing almost exclusively FMA instructions (e.g., LINPACK). Achieved performance for well-parallelized & optimized applications is likely to fall between the grey and colored bars. Still, without a compiler that targets FMA3 instructions, you are leaving significant performance from your Xeon E5-2600v3-based hardware purchase on the table.
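
To make the chart’s arithmetic concrete, below is a minimal sketch of how a theoretical peak figure is derived. The core count and clock speed are placeholders rather than the values of any particular SKU, and real applications will not sustain this number:

    // Sketch: theoretical peak double-precision GFLOPS for one Haswell-EP socket.
    // With AVX2 + FMA3, each core can retire two 256-bit FMA instructions per
    // clock: 2 units * 4 doubles * 2 ops (multiply + add) = 16 FLOPS/clock.
    // Without FMA, separate multiply and add issue yields 8 FLOPS/clock.
    #include <cstdio>

    int main() {
        const int    cores           = 12;    // placeholder core count
        const double clock_ghz       = 2.6;   // placeholder base clock (GHz)
        const double flops_per_clock = 16.0;  // DP FLOPS/clock/core with FMA3

        const double peak_gflops = cores * clock_ghz * flops_per_clock;
        std::printf("Theoretical peak: %.1f GFLOPS per socket\n", peak_gflops);
        return 0;
    }

Continue reading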

Intel Xeon E5-2600 v3 “Haswell” Processor Review

Update:

As of March 31, 2016 we recommend version four of these Intel Xeon CPUs. Please see our new post: Intel Xeon E5-2600 v4 “Broadwell” Processor Review.

Intel has launched brand new Xeon E5-2600 v3 CPUs with groundbreaking new features. These CPUs build upon the leading performance of their predecessors with a more robust microarchitecture, faster memory, wider buses, and increased core counts and clock speeds. The result is dramatically improved performance for HPC.

Important changes available in E5-2600 v3 “Haswell” include:

  • Support for brand new DDR4-2133 memory
  • Up to 18 processor cores per socket (with options for 6 to 16 cores)
  • Improved AVX 2.0 Instructions with:
    • New floating point FMA, with up to 2X the FLOPS per core (16 FLOPS/clock; see the sketch after this list)
    • 256-bit wide integer vector instructions
  • A revised C610 Series Chipset delivering substantially improved I/O for every server (SATA, USB 3.0)
  • Increased L1, L2 cache bandwidth and faster QPI links
  • Slightly tweaked “Grantley” socket (Socket R3) and platforms
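
As a concrete illustration of the FMA bullet above, here is a minimal sketch using the AVX2/FMA3 compiler intrinsics. The array contents are arbitrary, and the flags named in the comment are simply examples of Haswell-aware compiler options:

    // Minimal sketch of a 256-bit FMA3 fused multiply-add on double precision.
    // Build with a Haswell-aware flag (e.g. -march=haswell with GCC or
    // -xCORE-AVX2 with the Intel compiler) so FMA instructions may be emitted.
    #include <immintrin.h>
    #include <cstdio>

    int main() {
        alignas(32) double a[4] = {1.0, 2.0, 3.0, 4.0};
        alignas(32) double b[4] = {5.0, 6.0, 7.0, 8.0};
        alignas(32) double c[4] = {0.5, 0.5, 0.5, 0.5};
        alignas(32) double r[4];

        __m256d va = _mm256_load_pd(a);
        __m256d vb = _mm256_load_pd(b);
        __m256d vc = _mm256_load_pd(c);

        // A single instruction computes (a * b) + c across four doubles.
        __m256d vr = _mm256_fmadd_pd(va, vb, vc);
        _mm256_store_pd(r, vr);

        std::printf("%.1f %.1f %.1f %.1f\n", r[0], r[1], r[2], r[3]);
        return 0;
    }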

Continue reading

CUB in Action – some simple examples using the CUB template library

In my previous post, I presented a brief introduction to the CUB library of CUDA primitives written by Duane Merrill of NVIDIA. CUB provides a set of highly-configurable software components, which include warp- and block-level kernel components as well as device-wide primitives. This time around, we will actually look at performance figures for codes that utilize CUB primitives. We will also briefly compare the CUB-based codes to programs that use the analogous Thrust routines, both from a performance and programmability perspective. These comparisons utilize the CUB v1.3.1 and Thrust v1.7.0 releases and CUDA 6.0.

Before we proceed, I need to issue one disclaimer: the examples below were written after a limited amount of experimentation with the CUB library, and they do not necessarily represent the most optimized implementations. However, these examples do illustrate the flexibility of the API and they give an idea of the kind of performance that can be achieved using CUB with only modest programming effort.
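
As a taste of the device-wide API before the timing results, here is a minimal sketch (not taken from the benchmark codes themselves) of a sum reduction written with cub::DeviceReduce, alongside the analogous Thrust one-liner. The array size and contents are arbitrary:

    // Minimal sketch: summing an array on the GPU with CUB's device-wide
    // reduction, alongside the equivalent Thrust call. Compile with nvcc.
    #include <cub/cub.cuh>
    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <cstdio>

    int main() {
        const int n = 1 << 20;
        thrust::device_vector<int> d_in(n, 1);   // n ones
        thrust::device_vector<int> d_out(1);

        // CUB uses a two-phase pattern: the first call (with a NULL workspace)
        // only reports how much temporary storage the reduction requires.
        void  *d_temp     = NULL;
        size_t temp_bytes = 0;
        cub::DeviceReduce::Sum(d_temp, temp_bytes,
                               thrust::raw_pointer_cast(d_in.data()),
                               thrust::raw_pointer_cast(d_out.data()), n);
        cudaMalloc(&d_temp, temp_bytes);

        // The second call performs the actual reduction.
        cub::DeviceReduce::Sum(d_temp, temp_bytes,
                               thrust::raw_pointer_cast(d_in.data()),
                               thrust::raw_pointer_cast(d_out.data()), n);

        // The analogous Thrust one-liner:
        int thrust_sum = thrust::reduce(d_in.begin(), d_in.end());

        std::printf("CUB sum: %d   Thrust sum: %d\n", (int)d_out[0], thrust_sum);
        cudaFree(d_temp);
        return 0;
    }

The two-phase workspace query is the most visible programmability difference relative to Thrust; in exchange, the caller keeps control over how and when temporary storage is allocated.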

Continue reading

PCI-Express Root Complex Confusion?

I’ve had several customers comment to me that it’s difficult to find someone who can speak with them intelligently about PCI-E root complex questions. And yet, it’s of vital importance when considering multi-CPU systems that have various PCI-Express devices (most often GPUs or coprocessors).

First, please feel free to contact one of Microway’s experts. We’d be happy to work with you on your project to ensure your design will function correctly (both in theory and in practice). We also diagram most GPU platforms we sell, as well as explain their advantages, in Microway’s NVIDIA Tesla V100 GPU Solutions Guide.


It is tempting to just look at the number of PCI-Express slots in the systems you’re evaluating and assume they’re all the same. Unfortunately, it’s not so simple, because each CPU only has a certain amount of bandwidth available. Additionally, certain high-performance features – such as NVIDIA’s GPU Direct technology – require that all components be attached to the same PCI-Express root complex. Servers and workstations with multiple processors have multiple PCI-Express root complexes. We dive deeply into these issues in our post about Common PCI-Express Myths.
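
If you want to inspect a system you already have, recent NVIDIA drivers can print the PCI-Express topology with nvidia-smi topo -m. Alternatively, the minimal sketch below asks the CUDA runtime about peer-to-peer capability between each pair of GPUs; it assumes a machine with at least two CUDA-capable GPUs, and pairs split across separate root complexes will generally report that peer access is not possible:

    // Minimal sketch: query which GPU pairs support peer-to-peer (P2P) access.
    // On dual-socket PCI-Express systems, pairs that sit on different
    // root complexes will typically report 0 here.
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);

        for (int i = 0; i < count; ++i) {
            for (int j = 0; j < count; ++j) {
                if (i == j) continue;
                int can_access = 0;
                cudaDeviceCanAccessPeer(&can_access, i, j);
                std::printf("GPU %d -> GPU %d : peer access %s\n",
                            i, j, can_access ? "possible" : "not possible");
            }
        }
        return 0;
    }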

Continue reading

Introducing CUDA UnBound (CUB)

CUB – a configurable C++ template library of high-performance CUDA primitives

Each new generation of NVIDIA GPUs brings with it a dramatic increase in compute power, and the pace of development over the past several years has been rapid. The Tesla M2090, based on the Fermi GF110 architecture announced in 2010, offered global memory bandwidth of up to 177 Gigabytes per second and peak double-precision floating-point performance of 665 Gigaflops. By comparison, today’s Tesla K40 (Kepler GK110b architecture) has peak memory bandwidth of 288 Gigabytes per second and provides reported peak double-precision performance of over 1.4 Teraflops. However, the K40’s reign as the most advanced GPGPU hardware is coming to an end, and Kepler will shortly be superseded by Maxwell-class cards.

Actually achieving optimal performance on diverse GPU architectures can be challenging, since it relies on the implementation of carefully-crafted kernels that incorporate extensive knowledge of the underlying hardware and which take full advantage of relevant features of the CUDA programming model. This places a considerable burden on the CUDA developer seeking to port her application to a new generation of GPUs or looking to ensure performance across a range of architectures.

Fortunately, many CUDA applications are formulated in terms of a small set of primitives, such as parallel reduce, scan, or sort. Before attempting to handcraft these primitive operations ourselves, we should consider using one of the libraries of optimized primitives available to CUDA developers. Such libraries include Thrust and CUDPP, but in this post, we will focus on the CUB library developed by Duane Merrill of NVIDIA Research. CUB – the name derives from “CUDA Unbound” – provides generic high-performance primitives targeting multiple levels of application development. For example, CUB supports a set of device-wide primitives, which are called from the host, and in this regard, the functionality provided by CUB overlaps with Thrust to some degree. However, unlike Thrust, CUB also provides a set of kernel components that operate at the thread-block and thread-warp levels.
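
To make the block-level idea concrete, here is a minimal sketch of a kernel built around cub::BlockReduce. The kernel name, block size, and launch shown in the trailing comment are illustrative rather than code from any real application:

    // Minimal sketch: a block-level sum using cub::BlockReduce. Each thread
    // contributes one element; thread 0 of each block receives the aggregate.
    #include <cub/cub.cuh>

    template <int BLOCK_THREADS>
    __global__ void block_sum_kernel(const int *d_in, int *d_block_sums) {
        typedef cub::BlockReduce<int, BLOCK_THREADS> BlockReduce;

        // CUB components declare the shared memory they require.
        __shared__ typename BlockReduce::TempStorage temp_storage;

        int thread_data = d_in[blockIdx.x * BLOCK_THREADS + threadIdx.x];

        // Collective reduction across the thread block (result valid in thread 0).
        int aggregate = BlockReduce(temp_storage).Sum(thread_data);

        if (threadIdx.x == 0) {
            d_block_sums[blockIdx.x] = aggregate;
        }
    }

    // Example launch (assumes d_in holds num_blocks * 128 ints on the device):
    //   block_sum_kernel<128><<<num_blocks, 128>>>(d_in, d_block_sums);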

Continue reading

Intel Xeon E5-4600 v2 “Ivy Bridge” Processor Review

Many within the HPC community have been eagerly awaiting the new Intel Xeon E5-4600 v2 CPUs. If you are already familiar with the “Ivy Bridge” architecture of the Xeon E5-2600 v2 processors, many of the updated features of these 4-socket Xeon E5-4600 v2 CPUs will seem very familiar. Read on to learn the details.

Important changes available in the Xeon E5-4600 v2 “Ivy Bridge” CPUs include:

  • Up to 12 processor cores per socket (with options for 4, 6, 8, and 10 cores)
  • Support for DDR3 memory speeds up to 1866MHz
  • AVX has been extended to support F16C (16-bit floating-point conversion instructions) to accelerate data conversion between 16-bit and 32-bit floating point formats. These operations are of particular importance to graphics and image processing applications (see the sketch after this list).
  • Intel APIC Virtualization (APICv) provides increased virtualization performance
  • Improved PCI-Express generation 3.0 support with superior compatibility and new features: atomics, x16 non-transparent bridge & quadrupled read buffers for point-to-point transfers
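
As a concrete illustration of the F16C bullet above, here is a minimal sketch of the conversion intrinsics. The values are arbitrary, and the compile flag named in the comment is just one way to enable F16C:

    // Minimal sketch of the F16C path: pack eight floats into half precision
    // and expand them back. Build with an F16C-aware flag (e.g. -mf16c on GCC).
    #include <immintrin.h>
    #include <cstdio>

    int main() {
        alignas(32) float in[8]  = {0.5f, 1.0f, 1.5f, 2.0f, 2.5f, 3.0f, 3.5f, 4.0f};
        alignas(32) float out[8];

        __m256 v = _mm256_load_ps(in);

        // 32-bit -> 16-bit floats (eight values packed into one 128-bit register).
        __m128i half = _mm256_cvtps_ph(v, _MM_FROUND_TO_NEAREST_INT);

        // 16-bit -> 32-bit floats.
        __m256 restored = _mm256_cvtph_ps(half);
        _mm256_store_ps(out, restored);

        for (int i = 0; i < 8; ++i)
            std::printf("%.1f ", out[i]);
        std::printf("\n");
        return 0;
    }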

Intel Xeon E5-4600 v2 Series Specifications

Model       | Frequency | Turbo Boost | Core Count | Memory Speed | L3 Cache | QPI Speed | TDP (Watts)
E5-4657L v2 | 2.40 GHz  | 2.90 GHz    | 12         | 1866 MHz     | 30 MB    | 8.0 GT/s  | 115
E5-4650 v2  | 2.40 GHz  | 2.90 GHz    | 10         | 1866 MHz     | 25 MB    | 8.0 GT/s  | 95
E5-4640 v2  | 2.20 GHz  | 2.70 GHz    | 10         | 1866 MHz     | 20 MB    | 8.0 GT/s  | 95
E5-4627 v2  | 3.30 GHz  | 3.60 GHz    | 8          | 1866 MHz     | 16 MB    | 7.2 GT/s  | 130
E5-4620 v2  | 2.60 GHz  | 3.00 GHz    | 8          | 1600 MHz     | 20 MB    | 7.2 GT/s  | 95
E5-4610 v2  | 2.30 GHz  | 2.70 GHz    | 8          | 1600 MHz     | 16 MB    | 7.2 GT/s  | 95

HPC groups do not typically choose Intel’s “Basic” and “Low Power” models – those SKUs are not shown.

Continue reading