DGX A100 review: Throughput and Hardware Summary

When NVIDIA launched the Ampere GPU architecture, they also launched their new flagship system for HPC and deep learning – the DGX A100. This system offers not only exceptional performance but also new capabilities. We’ve seen immediate interest and have already shipped systems to some of the first adopters. Given our early access, we wanted to share a deeper dive into this impressive new system.

Photo of NVIDIA DGX A100 packaged, being lifted out of packaging, and being tested

The focus of this NVIDIA DGX™ A100 review is the hardware inside the system – the server includes a number of features & improvements not available in any other server on the market today. DGX will be the “go-to” server for 2020. But hardware only tells part of the story, particularly for NVIDIA’s DGX products. NVIDIA employs more software engineers than hardware engineers, so you can be certain that application and GPU library performance will continue to improve through updates to the DGX Operating System and to the whole catalog of software containers provided through the NGC hub. Expect more details as the year continues.

Continue reading

Deploying GPUs for Classroom and Remote Learning

As one of NVIDIA’s Elite partners, we see a lot of GPU deployments in higher education. GPUs have been proving themselves in HPC for over a decade, and they are the de facto standard for deep learning research. They’re also becoming essential for other types of machine learning and data science. But GPUs are not always available to students, particularly undergraduate students.

GPU-accelerated Classrooms at MSOE

Photo of MSOE’s ROSIE cluster, with artwork featuring a rose tattoo

One deployment I’m particularly proud of runs at the Milwaukee School of Engineering, where it is used for undergraduate education, as well as for faculty and industry research. This cluster combines NVIDIA’s Volta-generation DGX systems with NVIDIA Tesla T4 GPUs, Mellanox Ethernet, and NetApp storage.

Rather than having to learn a more arcane supercomputer interface, students are able to start GPU-accelerated Jupyter sessions with the click of a button in their web browser.

The cluster is connected to NVIDIA’s NGC hub, providing pre-built containers with the latest HPC & AI software stacks. The DGX systems do the heavy lifting, and the Tesla T4 systems serve less demanding needs (such as student sessions during class).
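For readers curious what sits behind that one-click Jupyter button, the flow is roughly "pull an NGC container, launch Jupyter inside it." Below is a minimal sketch of that idea driven from Python; the container tag and port are example values, and MSOE's actual portal handles this through a web front end rather than a script like this.

```python
# Rough sketch only: pull an NGC container and launch JupyterLab inside it.
# The image tag and port below are example values, not MSOE's configuration.
import subprocess

image = "nvcr.io/nvidia/tensorflow:20.03-tf2-py3"  # example NGC image tag

# Fetch the pre-built container from NGC
subprocess.run(["docker", "pull", image], check=True)

# Start JupyterLab with GPU access, exposed on port 8888
subprocess.run([
    "docker", "run", "--rm", "--gpus", "all",
    "-p", "8888:8888",
    image,
    "jupyter", "lab", "--ip=0.0.0.0", "--allow-root", "--no-browser",
], check=True)
```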

Microway’s team delivered all of this fully integrated and ready to run, allowing MSOE’s undergrads to get hands-on with the latest, highest-performing hardware and software tools. And they don’t have to dive into deeper levels of complexity until they’re ready.

Close-up photo of the DGX-1, servers, and storage in ROSIE

Continue reading

What Can You Do with a $15k NVIDIA Data Science Workstation? – Change Healthcare Data Science

NVIDIA’s Data Science Workstation Platform is designed to bring the power of accelerated computing to a broad set of data science workflows. Recently, we found out what happens when you lend a talented data scientist (with a serious appetite for after-hours projects + coffee) a $15k accelerated data science tool. You can recreate a massive PubMed literature search project on a Data Science WhisperStation in hours instead of weeks.

Kyle Gallatin, an engineer at Pfizer, has deep data science credentials. He’s been working on projects for over 10 years. At the end of 2019 we gave him special access to one of our Data Science WhisperStations in partnership with NVIDIA:

When NVIDIA asked if I wanted to try one of the latest data science workstations, I was stoked. However, a sobering thought followed the excitement: what in the world should I use this for?

I thought back to my first data science project: a massive, multilingual search engine for medical literature. If I had access to the compute and GPU libraries I have now in 2020 back in 2017, what might I have been able to accomplish? How much faster would I have accomplished it?

Experimentation, Performance, and GPU Accelerated Data Science Tooling

Gallatin used the Data Science WhisperStation to rapidly create an accelerated data science workflow for a healthcare use case, and he told us about his experience. And it was a remarkable one.
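The full post details the exact libraries involved; as a rough illustration of the style of GPU-accelerated tooling in play (here RAPIDS cuDF as a drop-in analogue to pandas), a minimal sketch might look like the following. The file name and column names are hypothetical placeholders, not Gallatin's actual dataset.

```python
# Minimal sketch of GPU-accelerated dataframe work with RAPIDS cuDF.
# File and column names are hypothetical placeholders.
import cudf

# Load a large table of article metadata directly into GPU memory
df = cudf.read_csv("pubmed_metadata.csv")

# Typical pandas-style operations, executed on the GPU
counts = (
    df[df["year"] >= 2010]
    .groupby("journal")
    .agg({"article_id": "count"})
    .sort_values("article_id", ascending=False)
)

print(counts.head(10))
```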

Not only was a previously impossible workflow made possible, but portions of the application were accelerated up to 39X!

Continue reading

Multi-GPU Scaling of MLPerf Benchmarks on NVIDIA DGX-1

In this post, we discuss how the training of deep neural networks scales on DGX-1. Considering 6 models across 4 of the 5 popular domains covered in the MLPerf v0.5 benchmarking suite, we examine the time to reach the state-of-the-art accuracy targets set by MLPerf. We also highlight the models that scale well and should be trained on larger numbers of GPUs. Models with poor scalability should be trained on fewer GPUs, which allows for resource sharing among multiple users. As such, we provide insight into common deep learning workloads and how to best leverage the multi-GPU DGX-1 deep learning system for training these models.
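For context on what "training across multiple GPUs" looks like in practice, here is a generic PyTorch DistributedDataParallel sketch (not the MLPerf reference code, and with a placeholder model and random data): one process is launched per GPU, and gradients are all-reduced across the DGX-1's GPUs at every step.

```python
# Minimal multi-GPU training sketch with PyTorch DistributedDataParallel.
# Launch with: torchrun --nproc_per_node=8 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")          # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1000).cuda(local_rank)  # stand-in for ResNet-50, etc.
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()

    for step in range(100):                          # stand-in for the real data loader
        x = torch.randn(256, 1024, device=local_rank)
        y = torch.randint(0, 1000, (256,), device=local_rank)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()                              # gradients all-reduced across GPUs here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```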

MLPerf – a benchmarking suite for deep learning applications

Just as HPC system design is evolving to achieve good performance for Deep Learning applications, there is also an ever-increasing need for a good set of benchmarks to quantify this performance. Many benchmarking tools have been proposed. For example, Baidu Research released DeepBench, which focuses on basic operations involved in neural networks, like convolution, GEMM, recurrent layers, and all-reduce. Yet it offers no provision for comparing different systems/workstations or even software frameworks. TensorFlow introduced TF_CNN_BENCH, which is single-domain and benchmarks only convolutional network-based deep learning workloads. With a diversity of workloads and a variety of hardware configurations, we need a more general approach to benchmarking deep learning applications.
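To make the "basic operations" idea concrete, a DeepBench-style micro-benchmark boils down to timing a single primitive in isolation. The sketch below (PyTorch, with arbitrary matrix sizes chosen for illustration, not DeepBench's own harness) times a large GEMM in single and half precision:

```python
# Sketch of a GEMM micro-benchmark in the spirit of DeepBench.
# Matrix size and iteration count are arbitrary illustrative values.
import torch

def time_gemm(dtype, n=8192, iters=20):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.matmul(a, b)                     # warm-up
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()

    ms = start.elapsed_time(end) / iters
    tflops = 2 * n**3 / (ms / 1e3) / 1e12  # 2*n^3 FLOPs per GEMM
    print(f"{dtype}: {ms:.1f} ms per GEMM, {tflops:.1f} TFLOPS")

for dt in (torch.float32, torch.float16):
    time_gemm(dt)
```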

Continue reading

2nd Gen AMD EPYC “Rome” CPU Review: A Groundbreaking Leap for HPC

The 2nd Generation AMD EPYC “Rome” CPUs are here! Rome brings greater core counts, faster memory, and PCI-E Gen4 all to deliver what really matters: up to a 2X increase in HPC application performance. We’re excited to present our thoughts on this advancement, and the return of x86 server CPU competition, in our detailed AMD EPYC Rome review. AMD is unquestionably back to compete for the performance crown in HPC.

2nd Generation AMD EPYC “Rome” CPUs are offered in 8-64 cores and clock speeds from 2.2-3.2GHz. They are available in dual-socket configurations as well as a select number of single-socket-only SKUs.

Important changes in AMD EPYC “Rome” CPUs include:

  • Up to 64 cores, 2X the max in the previous generation for a massive advancement in aggregate throughput
  • PCI-E Gen4 support, a first for an x86 server CPU, delivering 2X the I/O bandwidth of the x86 competition
  • 2X the FLOPS per core of the previous generation EPYC CPUs with the new Zen2 architecture
  • DDR4-3200 support for improved memory bandwidth across 8 channels, reaching up to 204.8GB/sec per socket (a quick check of this figure follows the list)
  • Next Generation Infinity Fabric delivering higher bandwidth for intra- and inter-die connections, with roots in PCI-E Gen4
  • New 14nm + 7nm chiplet architecture that separates the 14nm IO and 7nm compute core dies to yield the performance per watt benefits of the new TSMC 7nm process node
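The per-socket memory bandwidth figure falls straight out of the channel count and memory speed; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope peak memory bandwidth per EPYC "Rome" socket
channels = 8                # memory channels per socket
transfers_per_sec = 3.2e9   # DDR4-3200: 3200 MT/s
bytes_per_transfer = 8      # 64-bit channel width

peak_gb_per_s = channels * transfers_per_sec * bytes_per_transfer / 1e9
print(f"{peak_gb_per_s:.1f} GB/s per socket")   # 204.8 GB/s
```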

Continue reading

Improvements in scaling of Bowtie2 alignment software and implications for RNA-Seq pipelines

 
This is a guest post by Adam Marko, an IT and Biotech Professional with 10+ years of experience across diverse industries

What is Bowtie2?

Bowtie2 is a commonly used, open-source, fast, and memory-efficient application that forms part of a Next Generation Sequencing (NGS) workflow. It aligns sequencing reads, which are the genomic data output from an NGS device such as an Illumina HiSeq sequencer, to a reference genome. Applications like Bowtie2 serve as the first step in pipelines such as those for variant determination and for RNA-Seq, an area of continuously growing research interest.
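As a rough illustration of where Bowtie2 sits in such a pipeline, the sketch below drives a standard paired-end alignment from Python; the index and FASTQ file names are placeholders, and it assumes the bowtie2 binary is on the PATH.

```python
# Rough sketch: run a standard Bowtie2 paired-end alignment from Python.
# Index and FASTQ file names are placeholders.
import subprocess

subprocess.run([
    "bowtie2",
    "-x", "grch38_index",        # reference genome index (built with bowtie2-build)
    "-1", "sample_R1.fastq.gz",  # forward reads from the sequencer
    "-2", "sample_R2.fastq.gz",  # reverse reads
    "-S", "sample.sam",          # alignments written as SAM for downstream tools
    "-p", "16",                  # alignment threads
], check=True)
```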

What is RNA-Seq?

RNA Sequencing (RNA-Seq) is a type of NGS that seeks to identify the presence and quantity of RNA in a sample at a given point in time. This can be used to quantify changes in gene expression, which can be a result of time, external stimuli, healthy or diseased states, and other factors. Through this quantification, researchers can obtain a unique snapshot of the genomic status of the organism to identify genomic information previously undetectable with other technologies.

There is considerable research effort being put into RNA-Seq, and the number of publications has grown steadily since its first use in 2009.

Continue reading

CryoEM takes center stage: how compute, storage, and networking needs are growing with CryoEM research

 
This is a guest post by Adam Marko, an IT and Biotech Professional with 10+ years of experience across diverse industries

Background and history

Cryogenic Electron Microscopy (CryoEM) is a type of electron microscopy that images molecular samples embedded in a thin layer of non-crystalline ice, also called vitreous ice. Though CryoEM experiments have been performed since the 1980s, the majority of molecular structures have been determined with two other techniques, X-ray crystallography and Nuclear Magnetic Resonance (NMR). The primary advantage of X-ray crystallography and NMR has been that structures could be determined at very high resolution, severalfold better than historical CryoEM results.

However, recent advancements in CryoEM microscope detector technology and analysis software have greatly improved the capability of this technique. Before 2012, CryoEM structures could not achieve the resolution of X-ray crystallography and NMR structures. The imaging and analysis improvements since that time now allow researchers to image structures of large molecules and complexes at high resolution. The primary advantages of CryoEM over X-ray crystallography and NMR are:

  • Much larger structures can be determined than by X-ray or NMR
  • Structures can be determined in a more native state than by using X-ray

The ability to generate these high resolution large molecular structures through CryoEM enables better understanding of life science processes and improved opportunities for drug design. CryoEM has been considered so impactful that its developers won the 2017 Nobel Prize in Chemistry.

Continue reading

Intel Xeon Scalable “Cascade Lake SP” Processor Review

With the launch of the latest Intel Xeon Scalable processors (previously code-named “Cascade Lake SP”), a new standard is set for high performance computing hardware. These latest Xeon CPUs bring increased core counts, faster memory, and faster clock speeds. They are compatible with the existing workstation and server platforms that have been shipping since mid-2017. Starting today, Microway is shipping these new CPUs across our entire line of turn-key Xeon workstations, systems, and clusters.

Important changes in Intel Xeon Scalable “Cascade Lake SP” Processors include:

  • Higher CPU core counts for many SKUs in the product stack
  • Improved CPU clock speeds (with Turbo Boost up to 4.4GHz)
  • Introduction of the new AVX-512 VNNI instructions for Intel Deep Learning Boost, providing more efficient deep learning inference acceleration
  • Higher memory capacity & performance:
    • Most CPU models provide increased memory speeds
    • Support for DDR4 memory speeds up to 2933MHz
    • Large-memory capabilities with Intel Optane DC Persistent Memory
    • Support for up to 4.5TB-per-socket system memory
  • Integrated hardware-based security mitigations against side-channel attacks

Continue reading

NVIDIA “Turing” Tesla T4 HPC Performance Benchmarks

Performance benchmarks are an insightful way to compare new products on the market. With so many GPUs available, it can be difficult to assess which are suited to your needs. Various benchmarks provide information to compare performance on individual algorithms or operations. Since there are so many different algorithms to choose from, there is no shortage of benchmarking suites available.

For this comparison, the SHOC benchmark suite (https://github.com/vetter/shoc/) is used to compare the performance of the NVIDIA Tesla T4 with other GPUs commonly used for scientific computing: the NVIDIA Tesla P100 and Tesla V100.

The Scalable Heterogeneous Computing Benchmark Suite (SHOC) is a collection of benchmark programs testing the performance and stability of systems using computing devices with non-traditional architectures for general purpose computing, and the software used to program them. Its initial focus is on systems containing Graphics Processing Units (GPUs) and multi-core processors, and on the OpenCL programming standard. It can be used on clusters as well as individual hosts.

The SHOC benchmark suite includes options for many benchmarks relevant to a variety of scientific computations. Most of the benchmarks are provided in both single- and double-precision, and with and without PCI-E transfer consideration. This means there are up to four results for each benchmark. These benchmarks are organized into three levels and can be run individually or all together.
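The "with and without PCI-E transfer" distinction simply asks whether the time to move data between host and GPU is counted. The sketch below (PyTorch, not part of SHOC itself, with an arbitrary buffer size) shows the two flavors of measurement for a simple bandwidth-style test:

```python
# Illustration (not part of SHOC): a bandwidth-style measurement
# with and without the PCI-E host<->device transfer included.
import time
import torch

n = 256 * 1024 * 1024 // 4             # ~256 MB of float32
host = torch.randn(n, pin_memory=True)  # pinned host buffer

def timed(fn, iters=10):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

# "With PCI-E": copy from host memory to the GPU each iteration
with_pcie = timed(lambda: host.to("cuda", non_blocking=True))

# "Without PCI-E": data already resident on the GPU
device = host.to("cuda")
without_pcie = timed(lambda: device * 2.0)

gb = n * 4 / 1e9
print(f"host->device transfer: {gb / with_pcie:.1f} GB/s")
print(f"on-device operation:   {gb * 2 / without_pcie:.1f} GB/s (read + write)")
```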

The Tesla P100 and V100 GPUs are well-established accelerators for HPC and AI workloads. They typically offer the highest performance, consume the most power (250-300W), and have the highest price tag (~$10k). The Tesla T4 is a new product based on the latest “Turing” architecture, delivering increased efficiency along with new features. However, it is not a replacement for the bigger/more power-hungry GPUs. Instead, it offers good performance while consuming far less power (70W) at a lower price (~$2.5k). You’ll want to use the right tool for the job, which will depend upon your workload(s). A summary of each Tesla GPU is shown below.

Continue reading

NVIDIA Tesla V100 Price Analysis

Now that NVIDIA has launched their new Tesla V100 32GB GPUs, the next questions from many customers are “What is the Tesla V100 Price?” “How does it compare to Tesla P100?” “How about Tesla V100 16GB?” and “Which GPU should I buy?”

Tesla V100 32GB GPUs are shipping in volume, and our full line of Tesla V100 GPU-accelerated systems are ready for the new GPUs. If you’re planning a new project, we’d be happy to help steer you towards the right choices.

Tesla V100 Price

The table below gives a quick breakdown of the Tesla V100 GPU price, performance and cost-effectiveness:

Tesla GPU model | Price | Double-Precision Performance (FP64) | Dollars per TFLOPS | Deep Learning Performance (TensorFLOPS or 1/2 Precision) | Dollars per DL TFLOPS
Tesla V100 PCI-E 16GB or 32GB | $10,664* ($11,458* for 32GB) | 7 TFLOPS | $1,523 ($1,637 for 32GB) | 112 TFLOPS | $95.21 ($102.30 for 32GB)
Tesla P100 PCI-E 16GB | $7,374* | 4.7 TFLOPS | $1,569 | 18.7 TFLOPS | $394.33
Tesla V100 SXM2 16GB or 32GB | $10,664* ($11,458* for 32GB) | 7.8 TFLOPS | $1,367 ($1,469 for 32GB) | 125 TFLOPS | $85.31 ($91.66 for 32GB)
Tesla P100 SXM2 16GB | $9,428* | 5.3 TFLOPS | $1,779 | 21.2 TFLOPS | $444.72

* single-unit list price before any applicable discounts (ex: EDU, volume)
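The dollars-per-TFLOPS columns are simply list price divided by rated throughput; reproducing the Tesla V100 PCI-E 16GB row as a quick check:

```python
# Reproduce the $/TFLOPS figures for the Tesla V100 PCI-E 16GB row
price = 10664            # single-unit list price (USD)
fp64_tflops = 7          # double-precision throughput
dl_tflops = 112          # deep learning (TensorFLOPS) throughput

print(f"${price / fp64_tflops:,.0f} per FP64 TFLOPS")   # ~ $1,523
print(f"${price / dl_tflops:.2f} per DL TFLOPS")        # ~ $95.21
```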

Key Points

  • Tesla V100 delivers a big advance in absolute performance, in just 12 months
  • Tesla V100 PCI-E maintains similar price/performance value to Tesla P100 for Double Precision Floating Point, but it has a higher entry price
  • Tesla V100 delivers dramatic absolute performance & dramatic price/performance gains for AI
  • Tesla P100 remains a reasonable price/performance GPU choice, in select situations
  • Tesla P100 will still dramatically outperform a CPU-only configuration

Continue reading