Workstations, Servers and Clusters – What is HPC in 2014?

The phrase “High Performance Computing” gets thrown around a lot, but HPC means different things to different people. Every group with compute needs has its own requirements and workflow. Many groups will never need 100+ TFLOPS of compute or 1+ PB of storage; some need even more. If your current system just isn’t sufficient, keep reading to learn what type of HPC might fit your needs.

Workstations and Servers: Closer Than Ever

If you don’t take into account their physical appearance, the distinction between rackmount servers and desktop workstations has become very blurred.  It’s possible to fit 32 processor cores and 512GB of memory, or four x16 Gen3 double-width GPUs, within a workstation.  There are, of course, tradeoffs when comparing high-end workstations with rackmount servers, but you may be surprised how much you can achieve without jumping to a traditional HPC cluster.
Continue reading

NVIDIA Tesla K40 GPUs, the High Performance Choice for Many Applications

NVIDIA Tesla K40 is now the leading Tesla GPU for performance.  Here are some important use cases where Tesla K40 might greatly accelerate your applications:

Pick Tesla K40 for

Large Data Sets

GPU memory has always been at a greater premium than its CPU equivalent. If you have a large data set, the 12GB of GDDR5 on Tesla K40 could be an excellent match for your application. Many common CUDA codes explicitly break their data into chunks sized to fit GPU memory and run their compute algorithm on each chunk in turn. Chunks were previously limited to 5GB (Tesla K20) or 6GB (Tesla K20X). With Tesla K40, your chunks can be twice the size:

GPU                    Tesla K20    Tesla K20X   Tesla K40
Memory Capacity        5GB GDDR5    6GB GDDR5    12GB GDDR5
Max Memory Bandwidth   208GB/sec    250GB/sec    288GB/sec
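As a back-of-the-envelope illustration (our own sketch, with an assumed 40GB data set and an assumed 10% memory reserve for the CUDA context and output buffers), here is how the larger memory reduces the number of chunk iterations:

```python
import math

def chunk_count(dataset_bytes, gpu_memory_bytes, overhead=0.10):
    """Number of chunks needed to stream a dataset through GPU memory.

    A fraction of GPU memory (overhead) is reserved for the CUDA
    context, output buffers, etc. -- 10% is an illustrative assumption.
    """
    usable = gpu_memory_bytes * (1.0 - overhead)
    return math.ceil(dataset_bytes / usable)

GB = 1024**3
dataset = 40 * GB  # hypothetical 40GB data set

print(chunk_count(dataset, 5 * GB))   # Tesla K20:  9 chunks
print(chunk_count(dataset, 12 * GB))  # Tesla K40:  4 chunks
```

Fewer chunks means fewer kernel launches and fewer round trips across the PCI-E bus, which leads directly into the next point.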

 

Frequent PCI-E Bus Transfers

Tesla K40 fully supports PCI-Express Gen3. CUDA codes that constantly move substantial data (>6GB/sec) across the PCI-E bus will benefit greatly from Gen3. We’ve seen Tesla K40 GPUs deliver up to 10GB/sec in NVIDIA’s CUDA bandwidth tests.
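To put those numbers in perspective, here is a quick sketch (idealized arithmetic, not a benchmark) of how long a full 12GB memory load takes to cross the bus at Gen2-class versus Gen3-class throughput:

```python
def transfer_seconds(gigabytes, gb_per_sec):
    """Idealized time to move a buffer across the PCI-E bus."""
    return gigabytes / gb_per_sec

payload = 12  # GB: a full Tesla K40 memory load (illustrative)

# ~6GB/sec is roughly what PCI-E Gen2 x16 sustains; ~10GB/sec is
# what we've seen on Tesla K40 in NVIDIA's CUDA bandwidth test.
print(transfer_seconds(payload, 6))   # 2.0 seconds
print(transfer_seconds(payload, 10))  # 1.2 seconds
```

For codes that stream data continuously, that ~40% reduction in transfer time compounds over every chunk.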
Continue reading

SC13 Highlights

If you asked around at SC this year, some attendees might have told you there wasn’t much new going on. It’s true that not every company was launching new hardware with 2X performance, but there were significant announcements. Some are shipping now and some are looking forward into 2014 or 2015. See our top picks below.

Continue reading

NVIDIA Tesla K40 “Atlas” GPU Accelerator (Kepler GK110b) Up Close

NVIDIA’s latest Tesla accelerator is without a doubt the most powerful GPU available. With almost 3,000 CUDA cores and 12GB of GDDR5 memory, it wins practically every performance test you’ll see. As with the “Kepler” K20 GPUs, the Tesla K40 supports NVIDIA’s latest SMX, Dynamic Parallelism and Hyper-Q capabilities (CUDA compute capability 3.5). It also introduces professional-level GPU Boost, which squeezes every bit of performance your code can pull from the GPU’s 235W power envelope.

Maximum GPU Memory and Compute Performance: Tesla K40 GPU Accelerator

Integrated in Microway NumberSmasher GPU Servers and GPU Clusters

Photograph of the new NVIDIA Tesla "Atlas" K40 "Kepler" GPU Accelerator

Specifications

  • 2880 CUDA GPU cores (GK110b)
  • 4.2 TFLOPS single; 1.4 TFLOPS double-precision
  • 12GB GDDR5 memory
  • Memory bandwidth up to 288 GB/s
  • PCI-E x16 Gen3 interface to system
  • GPU Boost increased clock speeds
  • Supports Dynamic Parallelism and Hyper-Q features
  • Active and Passive heatsinks available for installation in workstations and specially-designed GPU servers

The new GPU also leverages PCI-E 3.0 to achieve over 10 gigabytes per second transfers between the host (CPUs) and the devices (GPUs):

Continue reading

CUDA Code Migration (Fermi to Kepler Architecture) on Tesla GPUs

The debut of NVIDIA’s Kepler architecture in 2012 marked a significant milestone in the evolution of general-purpose GPU computing. In particular, Kepler GK110 (compute capability 3.5) brought unrivaled compute power and introduced a number of new features to enhance GPU programmability. NVIDIA’s Tesla K20 and K20X accelerators are based on the Kepler GK110 architecture. The higher-end K20X, which is used in the Titan and Blue Waters supercomputers, contains a massive 2,688 CUDA cores and achieves peak single-precision floating-point performance of 3.95 TFLOPS. In contrast, the Fermi-architecture Tesla M2090 (compute capability 2.0) has peak single-precision performance of 1.3 TFLOPS.
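Those peak figures follow directly from core counts and clock speeds: each CUDA core can retire one fused multiply-add (two floating-point operations) per cycle. A quick sanity check of the arithmetic, using the published core clocks:

```python
def peak_sp_tflops(cuda_cores, clock_ghz, flops_per_cycle=2):
    """Peak single-precision TFLOPS: cores x clock x FLOPs/cycle (FMA = 2)."""
    return cuda_cores * clock_ghz * flops_per_cycle / 1000.0

# Tesla K20X: 2,688 CUDA cores at a 732MHz core clock
print(round(peak_sp_tflops(2688, 0.732), 2))  # ~3.94

# Tesla M2090 (Fermi): 512 CUDA cores at a ~1.3GHz shader clock
print(round(peak_sp_tflops(512, 1.301), 2))   # ~1.33
```

Peak numbers assume every core issues an FMA every cycle; real codes rarely sustain that, which is why memory behavior (covered in the posts below) matters so much.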

Continue reading

Avoiding GPU Memory Performance Bottlenecks

This post is Topic #3 (post 3) in our series Parallel Code: Maximizing your Performance Potential.

Many applications contain algorithms which make use of multi-dimensional arrays (or matrices). For cases where threads need to index the higher dimensions of the array, strided accesses can’t really be avoided. Where strided access is avoidable, however, you should make every effort to avoid accesses with a stride greater than one.

All this advice is great, but I’m sure you’re wondering: “What actually is strided memory access?” The following example illustrates this phenomenon and outlines its effect on effective bandwidth:

// Copy with a fixed stride between consecutive threads' accesses.
// With stride = 2, each warp touches twice as many cache lines.
__global__ void strideExample(float *outputData, float *inputData, int stride)
{
    int index = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    outputData[index] = inputData[index];
}

In the above code, threads within a warp access data words in memory with a stride of 2. This leads to a load of two L1 cache lines per warp. The actual accessing of the memory is shown below.
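To see why stride 2 doubles the cache-line traffic, here is a small sketch (ours, not from the original post) that counts how many 128-byte L1 cache lines one 32-thread warp touches for a given stride, assuming 4-byte floats:

```python
def cache_lines_per_warp(stride, warp_size=32, word_bytes=4, line_bytes=128):
    """Count the distinct 128-byte cache lines touched by one warp's loads."""
    lines = {(tid * stride * word_bytes) // line_bytes
             for tid in range(warp_size)}
    return len(lines)

print(cache_lines_per_warp(1))   # 1  -- fully coalesced
print(cache_lines_per_warp(2))   # 2  -- the stride-2 kernel above
print(cache_lines_per_warp(32))  # 32 -- one cache line per thread
```

Each extra cache line per warp is wasted bandwidth: the hardware fetches the full line but the warp consumes only part of it.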

Continue reading

GPU Shared Memory Performance Optimization

This post is Topic #3 (post 2) in our series Parallel Code: Maximizing your Performance Potential.

In my previous post, I provided an introduction to the various types of memory available for use in a CUDA application. Now that you’re familiar with these types of memory, the more important topic can be addressed – accessing the memory.

Think for a moment: global memory is up to 150x slower than some of the other types of device memory available. If you can reduce the number of global memory accesses your application needs, you’ll realize a significant performance increase (especially if your application performs the same operations in a loop). The easiest way to obtain this performance gain is to coalesce your memory accesses to global memory. The number of concurrent global memory accesses by the threads in a given warp is equal to the number of cache lines needed to service all of the threads of the warp. So how do you coalesce your accesses, you ask? There are many ways.
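One concrete way is switching from an array-of-structures to a structure-of-arrays layout, so that consecutive threads read consecutive words. This sketch (our illustration, with an assumed 8-float struct, 4-byte floats and 128-byte cache lines) counts the cache lines one warp needs under each layout:

```python
def warp_cache_lines(byte_addresses, line_bytes=128):
    """Distinct cache lines needed to service one warp's loads."""
    return len({addr // line_bytes for addr in byte_addresses})

WARP = 32
FLOAT = 4  # bytes

# Array-of-structures: each thread reads field .x of an 8-float struct,
# so consecutive threads' loads are 32 bytes apart.
aos = [tid * 8 * FLOAT for tid in range(WARP)]

# Structure-of-arrays: all .x values packed contiguously.
soa = [tid * FLOAT for tid in range(WARP)]

print(warp_cache_lines(aos))  # 8 cache lines per warp
print(warp_cache_lines(soa))  # 1 cache line per warp
```

Same data, same threads -- but the SoA layout services the warp with one eighth of the memory transactions.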

Continue reading

Intel Xeon E5-2600v2 “Ivy Bridge” Processor Review

With the introduction of Intel’s new Xeon E5-2600v2 processors, there are exciting new choices for HPC users. Overall, the Xeon E5-2600 series processors have provided the most cost-effective HPC performance available to date. This new set of models builds upon that success to offer higher core counts and faster performance.

Important changes available in E5-2600v2 “Ivy Bridge” include:

  • Up to 12 processor cores per socket (with options for 4, 6, 8 and 10 cores)
  • Support for DDR3 memory speeds up to 1866MHz
  • Improved PCI-Express generation 3.0 support with improved compatibility and new features: atomics, x16 non-transparent bridge & quadrupled read buffers for point-to-point transfers
  • AVX has been extended to support F16C (16-bit Floating-Point conversion instructions) to accelerate data conversion between 16-bit and 32-bit floating point formats. These operations are of particular importance to graphics and image processing applications.
  • Intel APIC Virtualization (APICv) provides increased virtualization performance
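To illustrate what F16C accelerates, here is a sketch of the 16-bit/32-bit round trip such hardware performs (in Python rather than the VCVTPS2PH/VCVTPH2PS intrinsics, using the standard library’s half-precision pack format); note the precision loss inherent in the half format:

```python
import struct

def float32_to_float16_bytes(value):
    """Pack a float into IEEE 754 half precision (what VCVTPS2PH does in hardware)."""
    return struct.pack('<e', value)

def float16_bytes_to_float32(raw):
    """Unpack half precision back to full precision (cf. VCVTPH2PS)."""
    return struct.unpack('<e', raw)[0]

original = 3.14159
roundtrip = float16_bytes_to_float32(float32_to_float16_bytes(original))
print(roundtrip)  # 3.140625 -- half precision keeps only a 10-bit mantissa
```

Graphics and image pipelines tolerate that loss in exchange for halving memory footprint and bandwidth, which is why hardware-accelerated conversion matters there.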

Continue reading

Intel Xeon Phi 7120P “Knights Corner” x86 Coprocessor Up Close

Currently, Intel’s premier Xeon Phi product is the 7120P. It offers the most cores, the fastest clock speed, the most memory and the greatest performance. We’ve got these cards running in our lab – here’s a photo:

Photo of Intel Xeon Phi 7120P Coprocessor

The Phi 7120 is also the only model with Turbo Boost enabled, allowing its clock speed to rise from 1.238GHz to 1.333GHz.

Continue reading

NVIDIA Tesla K20 GPU Accelerator (Kepler GK110) Up Close

NVIDIA’s Tesla K20 GPU is currently the de facto standard for high-performance heterogeneous computing. Based upon the Kepler GK110 architecture, these are the GPUs you want if you’ll be taking advantage of the latest advancements available in CUDA 5.0 and CUDA 5.5. This generation was designed specifically for the exciting new features in CUDA such as dynamic parallelism.

With 5GB or 6GB of GDDR5 memory, they provide up to 3.95 TFLOPS single-precision and 1.33 TFLOPS double-precision floating point performance. Two variants of the GPU are available: K20 (available for workstations and servers) and K20X (available only for servers). Here are the full specifications:

Continue reading