At Microway we design a lot of GPU computing systems. One of the strengths of GPU-compute is the flexibility PCI-Express bus. Assuming the server has appropriate power and thermals, it enables us to attach GPUs with no special interface modifications. We can even swap to new GPUs under many circumstances. However, we encounter a lot of misinformation about PCI-Express and GPUs. Here are a number of myths about PCI-E:
1. PCI-Express is controlled through the chipset
No longer in modern Intel CPU-based platforms. Beginning with the Sandy Bridge CPU architecture in 2012 (Xeon E5 series CPUs, Xeon E3 series CPUs, Core i7-2xxx and newer) Intel integrated the PCI-Express controller into the the CPU die itself. Bringing PCI-Express onto the CPU die came with a substantial latency benefit. This was a major change in platform design, and Intel coupled it with the addition of PCI-Express Gen3 support.
There is a lot of material available on RAID, describing the technologies, the options, and the pitfalls. However, there isn’t a great deal on RAID from an HPC perspective. We’d like to provide an introduction to RAID, clear up a few misconceptions, share with you some best practices, and explain what sort of configurations we recommend for different use cases.
What is RAID?
Originally known as Redundant Array of Inexpensive Disks, the acronym is now more commonly considered to stand for Redundant Array of Independent Disks. The main benefits to RAID are improved disk read/write performance, increased redundancy, and the ability to increase logical volume sizes.
RAID is able to perform these functions primarily through striping, mirroring, and parity. Striping is when files are broken down into segments, which are then placed on different drives. Because the files are spread across multiple drives that are running in parallel, performance is improved. Mirroring is when data is duplicated on the fly across drives. Parity within the context of RAID refers to when data redundancy is distributed across all drives so that when one or more (depending on the RAID level) drives fail, the data can be reconstructed from the remaining drives. Continue reading
NVIDIA has once again raised the bar on GPU computing with the release of the new Tesla K80 GPU accelerator. With up to 8.74 TFLOPS of single-precision performance with GPU Boost, the Tesla K80 has massive capability and leading density.
Here are the important performance specifications:
- Two GK210 chips on a single PCB
- 4992 total SMX CUDA cores: 2496 on each chip!
- Total of 24GB GDDR5 memory; aggregate memory bandwidth of 480GB/sec
- 5.6 TFLOPS single precision, 1.87 TFLOPS double precision
- 8.74 TFLOPS single precision, 2.91 TFLOPS double precision with GPU Boost
- 300W TDP
To achieve this performance, Tesla K80 is really two GPUs in one. This Tesla K80 block diagram illustrates how each GK210 GPU has its own dedicated memory and how they communicate at x16 speeds with the PCIe bus using a PCIe switch:
We know that many of our readers are interested in seeing how molecular dynamics applications perform with GPUs, so we are continuing to highlight various packages. This time we will be looking at GROMACS, a well-established and free-to-use (under GNU GPL) application. GROMACS is a popular choice for scientists interested in simulating molecular interaction. With NVIDIA Tesla K40 GPUs, it’s common to see 2X and 3X speedups compared to the latest multi-core CPUs.
MATLAB is a well-known and widely-used application – and for good reason. It functions as a powerful, yet easy-to-use, platform for technical computing. With support for a variety of parallel execution methods, MATLAB also performs well. Support for running MATLAB on GPUs has been built-in for a couple years, with better support in each release. If you haven’t tried yet, take this opportunity to test MATLAB performance on GPUs. Microway’s GPU Test Drive makes the process quick and easy. As we’ll show in this post, you can expect to see 3X to 6X performance increases for many tasks (with 30X to 60X speedups on select workloads).
This short tutorial explains the usage of the GPU-accelerated HOOMD-blue particle simulation toolkit on our GPU-accelerated HPC cluster. Microway allows you to quickly test your codes on the latest high-performance systems – you are free to upload and run your own software, although we also provide a variety of pre-compiled applications with built-in GPU acceleration. Our GPU Test Drive Cluster is a useful resource for benchmarking the faster performance which can be achieved with NVIDIA Tesla GPUs.
This post demonstrate HOOMD-blue, which comes out of the Glotzer group at the University of Michigan. HOOMD blue supports a wide variety of integrators and potentials, as well as the capability to scale runs up to thousands of GPU compute processors. We’ll demonstrate one server with dual NVIDIA® Tesla® K40 GPUs delivering speedups over 13X!
This is a tutorial on the usage of GPU-accelerated NAMD for molecular dynamics simulations. We make it simple to test your codes on the latest high-performance systems – you are free to use your own applications on our cluster and we also provide a variety of pre-installed applications with built-in GPU support. Our GPU Test Drive Cluster acts as a useful resource for demonstrating the increased application performance which can be achieved with NVIDIA Tesla GPUs.
This post describes the scalable molecular dynamics software NAMD, which comes out of the Theoretical and Computational Biophysics Group at the University of Illinois Urbana-Champaign. NAMD supports a variety of operational modes, including GPU-accelerated runs across large numbers of compute nodes. We’ll demonstrate how a single server with NVIDIA® Tesla® K40 GPUs can deliver speedups over 4X!
Welcome to our tutorial on GPU-accelerated AMBER! We make it easy to benchmark your applications and problem sets on the latest hardware. Our GPU Test Drive Cluster provides developers, scientists, academics, and anyone else interested in GPU computing with the opportunity to test their code. While Test Drive users are given free reign to use their own applications on the cluster, Microway also provides a variety of pre-installed GPU accelerated applications.
In this post, we will look at the molecular dynamics package AMBER. Collaboratively developed by professors at a variety of university labs, the latest versions of AMBER natively support GPU acceleration. We’ll demonstrate how NVIDIA® Tesla® K40 GPUs can deliver a speedup of up to 86X!
We’re very excited to be delivering systems with the new Xeon E5-2600v3 and E5-1600v3 CPUs. If you are the type who loves microarchitecture details and compiler optimization, there’s a lot to gain. If you haven’t explored the latest techniques and instructions for optimization, it’s never a bad time to start.
Many end users don’t always see instruction changes as consequential. However, they can be absolutely critical to achieving optimal application performance. Here’s a comparison of Theoretical Peak Performance of the latest CPUs with and without FMA3:
Only a small set of codes will be capable of issuing almost exclusively FMA instructions (e.g., LINPACK). Achieved performance for well-parallelized & optimized applications is likely to fall between the grey and colored bars. Still, without employing a compiler optimized for FMA3 instructions, you are leaving significant potential performance of your Xeon E5-2600v3-based hardware purchase on the table. Continue reading
Intel has launched brand new Xeon E5-2600 v3 CPUs with groundbreaking new features. These CPUs build upon the leading performance of their predecessors with more a robust microarchitecture, faster memory, wider buses, and increased core counts and clock speed. The result is dramatically improved performance for HPC.
Important changes available in E5-2600 v3 “Haswell” include:
- Support for brand new DDR4-2133 memory
- Up to 18 processor cores per socket (with options for 6- to 16-cores)
- Improved AVX 2.0 Instructions with:
- New floating point FMA, with up to 2X the FLOPS per core (16 FLOPS/clock)
- 256-bit wide integer vector instructions
- A revised C610 Series Chipset delivering substantially improved I/O for every server (SATA, USB 3.0)
- Increased L1, L2 cache bandwidth and faster QPI links
- Slightly tweaked “Grantley” socket (Socket R3) and platforms