When NVIDIA launched the Ampere GPU architecture, they also launched their new flagship system for HPC and deep learning – the DGX 100. This system offers exceptional performance, but also new capabilities. We’ve seen immediate interest and have already shipped to some of the first adopters. Given our early access, we wanted to share a deeper dive into this impressive new system.
The focus of this NVIDIA DGX™ A100 review is on the hardware inside the system – the server features a number of features & improvements not available in any other type of server at the moment. DGX will be the “go-to” server for 2020. But hardware only tells part of the story, particularly for NVIDIA’s DGX products. NVIDIA employs more software engineers than hardware engineers, so be certain that application and GPU library performance will continue to improve through updates to the DGX Operating System and to the whole catalog of software containers provided through the NGC hub. Expect more details as the year continues.
As one of NVIDIA’s Elite partners, we see a lot of GPU deployments in higher education. GPUs have been proving themselves in HPC for over a decade, and they are the de-facto standard for deep learning research. They’re also becoming essential for other types of machine learning and data science. But GPUs are not always available to students, particularly undergraduate students.
GPU-accelerated Classrooms at MSOE
Photo of MSOE’s ROSIE cluster
One deployment I’m particularly proud of runs at the Milwaukee School of Engineering, where it is used for undergraduate education, as well as for faculty and industry research. This cluster leverages a combination of NVIDIA’s Volta-generation DGX systems, as well as NVIDIA Tesla T4 GPUs, Mellanox Ethernet, and NetApp storage.
Rather than having to learn a more arcane supercomputer interface, students are able to start GPU-accelerated Jupyter sessions with the click of a button in their web browser.
The cluster is connected to NVIDIA’s NGC hub, providing pre-built containers with the latest HPC & AI software stacks. The DGX systems do the heavy lifting and the Tesla T4 systems service less demanding needs (such as student sessions during class).
Microway’s team delivered all of this fully integrated and ready-to-run, allowing MSOE’s undergrads to get hands on the latest, highest-performing hardware and software tools. And they don’t have to dive down into huge levels of complexity until they’re ready.
Close up photo of the DGX-1, servers, and storage in ROSIE
NVIDIA’s Data Science Workstation Platform is designed to bring the power of accelerated computing to a broad set of data science workflows. Recently, we found out what happens when you lend a talented data scientist (with a serious appetite for after-hours projects + coffee) a $15k accelerated data science tool. You can recreate a massive Pubmed literature search project on a Data Science WhisperStation in hours versus weeks.
Kyle Gallatin, an engineer at Pfizer, has deep data science credentials. He’s been working on projects for over 10 years. At the end of 2019 we gave him special access to one of our Data Science WhisperStations in partnership with NVIDIA:
When NVIDIA asked if I wanted to try one of the latest data science workstations, I was stoked. However, a sobering thought followed the excitement: what in the world should I use this for?
I thought back to my first data science project: a massive, multilingual search engine for medical literature. If I had access to the compute and GPU libraries I have now in 2020 back in 2017, what might I have been able to accomplish? How much faster would I have accomplished it?
Experimentation, Performance, and GPU Accelerated Data Science Tooling
Gallatin used Data Science WhisperStation to rapidly create an accelerated data science workflow for a healthcare—and tell us about his experience. And it was a remarkable one.
Not only was a previously impossible workflow made possible, but portions of the application were accelerated up to 39X! Continue reading
In this post, we discuss how the training of deep neural networks scales on DGX-1. Considering 6 models across 4 out of 5 popular domains covered in the MLPerf v0.5 benchmarking suite, we discuss the time to state-of-the-art accuracy as set by MLPerf. We also highlight the models that scale well and should be trained on larger numbers of GPUs. Models with poor scalability should be trained on fewer GPUs, which allows for resource sharing among multiple users. As such, we provide insight into common deep learning workloads and how to best leverage the multi-gpu DGX-1 deep learning system for training the models.
MLPerf – a benchmarking suite for deep learning applications
Just as HPC system design is evolving to achieve good performance for Deep Learning applications, there is also an ever-increasing need to have a good set of benchmarks to quantify this performance. Many benchmarking tools have been proposed. For example, Baidu Research released DeepBench which focuses on basic operations involved in neural networks like convolution, GEMM, Recurrent Layers, and All Reduce. Yet there is no provision to compare different systems/workstations or even software frameworks. Tensorflow introduced TF_CNN_BENCH which is only single-domain and benchmarks only convolutional network-based deep-learning workloads. With a diversity of workloads and a variety of different hardware configurations, we need a more general approach to benchmarking deep learning applications.
The 2nd Generation AMD EPYC “Rome” CPUs are here! Rome brings greater core counts, faster memory, and PCI-E Gen4 all to deliver what really matters: up to a 2X increase in HPC application performance. We’re excited to present our thoughts on this advancement, and the return of x86 server CPU competition, in our detailed AMD EPYC Rome review. AMD is unquestionably back to compete for the performance crown in HPC.
2nd Generation AMD EPYC “Rome” CPUs are offered in 8-64 cores and clock speeds from 2.2-3.2Ghz. They are available in dual socket as well as aselect number of single socket only SKUs.
Important changes in AMD EPYC “Rome” CPUs include:
- Up to 64 cores, 2X the max in the previous generation for a massive advancement in aggregate throughput
- PCI-E Gen 4 support for 2X the I/O bandwidth of the x86 competition— in a first for an x86 server CPU
- 2X the FLOPS per core of the previous generation EPYC CPUs with the new Zen2 architecture
- DDR4-3200 support for improved memory bandwidth across 8 channels, reaching up to 208GB/sec per socket
- Next Generation Infinity Fabric with higher bandwidth for intra and inter-die connection, with roots in PCI-E Gen4
- New 14nm + 7nm chiplet architecture that separates the 14nm IO and 7nm compute core dies to yield the performance per watt benefits of the new TSMC 7nm process node
This is a guest post by Adam Marko, an IT and Biotech Professional with 10+ years of experience across diverse industries
What is Bowtie2?
Bowtie2 is a commonly used, open-source, fast, and memory efficient application used as part of a Next Generation Sequencing (NGS) workflow. It aligns the sequencing reads, which are the genomic data output from an NGS device such as an Illumina HiSeq Sequencer, to a reference genome. Applications like Bowtie2 are used as the first step in pipelines such as those for variant determination, and an area of continuously growing research interest, RNA-Seq.
What is RNA-Seq?
RNA Sequencing (RNA-Seq) is a type of NGS that seeks to identify the presence and quantity of RNA in a sample at a given point in time. This can be used to quantify changes in gene expression, which can be a result of time, external stimuli, healthy or diseased states, and other factors. Through this quantification, researchers can obtain a unique snapshot of the genomic status of the organism to identify genomic information previously undetectable with other technologies.
There is considerable research effort being put into RNA-Seq, and the number of publications has grown steadily since its first use in 2009.
This is a guest post by Adam Marko, an IT and Biotech Professional with 10+ years of experience across diverse industries
Background and history
Cryogenic Electron Microscopy (CryoEM) is a type of electron microscopy that images molecular samples embedded in a thin layer of non-crystalline ice, also called vitreous ice. Though CryoEM experiments have been performed since the 1980s, the majority of molecular structures have been determined with two other techniques, X-ray crystallography and Nuclear Magnetic Resonance (NMR). The primary advantage of X-ray crystallography and NMR is that molecules were able to be determined at very high resolution, several fold better than historical CryoEM results.
However, recent advancements in CryoEM microscope detector technology and analysis software have greatly improved the capability of this technique. Before 2012, CryoEM structures could not achieve the resolution of X-ray Crystallography and NMR structures. The imaging and analysis improvements since that time now allow researchers to image structures of large molecules and complexes at high resolution. The primary advantages of Cryo-EM over X-ray Crystallography and NMR are:
- Much larger structures can be determined than by X-ray or NMR
- Structures can be determined in a more native state than by using X-ray
The ability to generate these high resolution large molecular structures through CryoEM enables better understanding of life science processes and improved opportunities for drug design. CryoEM has been considered so impactful, that the inventors won the 2017 Nobel Prize in chemistry.
With the launch of the latest Intel Xeon Scalable processors (previously code-named “Cascade Lake SP”), a new standard is set for high performance computing hardware. These latest Xeon CPUs bring increased core counts, faster memory, and faster clock speeds. They are compatible with the existing workstation and server platforms that have been shipping since mid-2017. Starting today, Microway is shipping these new CPUs across our entire line of turn-key Xeon workstations, systems, and clusters.
Important changes in Intel Xeon Scalable “Cascade Lake SP” Processors include:
- Higher CPU core counts for many SKUs in the product stack
- Improved CPU clock speeds (with Turbo Boost up to 4.4GHz)
- Introduction of the new AVX-512 VNNI instruction for Intel Deep Learning Boost (VNNI)
provides significant, more efficient deep learning inference acceleration
- Higher memory capacity & performance:
- Most CPU models provide increased memory speeds
- Support for DDR4 memory speeds up to 2933MHz
- Large-memory capabilities with Intel Optane DC Persistent Memory
- Support for up to 4.5TB-per-socket system memory
- Integrated hardware-based security mitigations against side-channel attacks
Performance benchmarks are an insightful way to compare new products on the market. With so many GPUs available, it can be difficult to assess which are suitable to your needs. Various benchmarks provide information to compare performance on individual algorithms or operations. Since there are so many different algorithms to choose from, there is no shortage of benchmarking suites available.
For this comparison, the SHOC benchmark suite (https://github.com/vetter/shoc/) is used to compare the performance of the NVIDIA Tesla T4 with other GPUs commonly used for scientific computing: the NVIDIA Tesla P100 and Tesla V100.
The Scalable Heterogeneous Computing Benchmark Suite (SHOC) is a collection of benchmark programs testing the performance and stability of systems using computing devices with non-traditional architectures for general purpose computing, and the software used to program them. Its initial focus is on systems containing Graphics Processing Units (GPUs) and multi-core processors, and on the OpenCL programming standard. It can be used on clusters as well as individual hosts.
The SHOC benchmark suite includes options for many benchmarks relevant to a variety of scientific computations. Most of the benchmarks are provided in both single- and double-precision and with and without PCIE transfer consideration. This means that for each test there are up to four results for each benchmark. These benchmarks are organized into three levels and can be run individually or all together.
The Tesla P100 and V100 GPUs are well-established accelerators for HPC and AI workloads. They typically offer the highest performance, consume the most power (250~300W), and have the highest price tag (~$10k). The Tesla T4 is a new product based on the latest “Turing” architecture, delivering increased efficiency along with new features. However, it is not a replacement for the bigger/more power-hungry GPUs. Instead, it offers good performance while consuming far less power (70W) at a lower price (~$2.5k). You’ll want to use the right tool for the job, which will depend upon your workload(s). A summary of each Tesla GPU is shown below.
Now that NVIDIA has launched their new Tesla V100 32GB GPUs, the next questions from many customers are “What is the Tesla V100 Price?” “How does it compare to Tesla P100?” “How about Tesla V100 16GB?” and “Which GPU should I buy?”
Tesla V100 32GB GPUs are shipping in volume, and our full line of Tesla V100 GPU-accelerated systems are ready for the new GPUs. If you’re planning a new project, we’d be happy to help steer you towards the right choices.
Tesla V100 Price
The table below gives a quick breakdown of the Tesla V100 GPU price, performance and cost-effectiveness:
|Tesla GPU model
||Double-Precision Performance (FP64)
||Dollars per TFLOPS
||Deep Learning Performance (TensorFLOPS or 1/2 Precision)
||Dollars per DL TFLOPS
|Tesla V100 PCI-E 16GB
$11,458* for 32GB
$1,637 for 32GB
$102.30 for 32GB
|Tesla P100 PCI-E 16GB
|Tesla V100 SXM 16GB
$11,458* for 32GB
$1,469 for 32GB
$91.66 for 32GB
|Tesla P100 SXM2 16GB
* single-unit list price before any applicable discounts (ex: EDU, volume)
- Tesla V100 delivers a big advance in absolute performance, in just 12 months
- Tesla V100 PCI-E maintains similar price/performance value to Tesla P100 for Double Precision Floating Point, but it has a higher entry price
- Tesla V100 delivers dramatic absolute performance & dramatic price/performance gains for AI
- Tesla P100 remains a reasonable price/performance GPU choice, in select situations
- Tesla P100 will still dramatically outperform a CPU-only configuration