In this post, we discuss how the training of deep neural networks scales on the DGX-1. Using 6 models from 4 of the 5 popular domains covered in the MLPerf v0.5 benchmarking suite, we examine the time needed to reach the state-of-the-art accuracy targets set by MLPerf. We also highlight the models that scale well and should be trained on larger numbers of GPUs. Models that scale poorly should be trained on fewer GPUs, which frees resources for sharing among multiple users. In this way, we provide insight into common deep learning workloads and how best to leverage the multi-GPU DGX-1 deep learning system for training them.
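One common way to judge whether a model "scales well" is to compare time-to-accuracy at different GPU counts and compute speedup and parallel efficiency. The sketch below shows that arithmetic. The function name and the timing values are hypothetical placeholders for illustration, not measured DGX-1 or MLPerf results.

```python
def scaling_efficiency(time_by_gpus):
    """Return {gpu_count: (speedup, efficiency)} relative to the smallest GPU count.

    time_by_gpus maps a GPU count to the time needed to reach the target accuracy.
    """
    base_gpus = min(time_by_gpus)
    base_time = time_by_gpus[base_gpus]
    result = {}
    for gpus, elapsed in sorted(time_by_gpus.items()):
        speedup = base_time / elapsed
        # Ideal speedup equals the growth in GPU count; efficiency is the fraction achieved.
        efficiency = speedup / (gpus / base_gpus)
        result[gpus] = (speedup, efficiency)
    return result


if __name__ == "__main__":
    # Hypothetical minutes-to-target-accuracy, NOT measured results.
    example_times = {1: 800.0, 2: 420.0, 4: 230.0, 8: 160.0}
    for gpus, (speedup, eff) in scaling_efficiency(example_times).items():
        print(f"{gpus} GPUs: speedup {speedup:.2f}x, efficiency {eff:.0%}")
```

A model whose efficiency stays close to 100% as GPUs are added is a good candidate for a full 8-GPU run, while one whose efficiency drops sharply is usually better served by fewer GPUs per job.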
MLPerf – a benchmarking suite for deep learning applications
Just as HPC system design is evolving to achieve good performance for deep learning applications, there is also an ever-increasing need for a good set of benchmarks to quantify this performance. Many benchmarking tools have been proposed. For example, Baidu Research released DeepBench, which focuses on the basic operations involved in neural networks, such as convolution, GEMM, recurrent layers, and All Reduce. However, it makes no provision for comparing different systems/workstations or even software frameworks. TensorFlow introduced TF_CNN_BENCH, which covers a single domain and benchmarks only convolutional network-based deep learning workloads. With a diversity of workloads and a variety of different hardware configurations, we need a more general approach to benchmarking deep learning applications.
The 2nd Generation AMD EPYC “Rome” CPUs are here! Rome brings greater core counts, faster memory, and PCI-E Gen4 all to deliver what really matters: up to a 2X increase in HPC application performance. We’re excited to present our thoughts on this advancement, and the return of x86 server CPU competition, in our detailed AMD EPYC Rome review. AMD is unquestionably back to compete for the performance crown in HPC.
2nd Generation AMD EPYC “Rome” CPUs are offered with 8 to 64 cores and clock speeds from 2.2 to 3.2GHz. They are available in dual-socket models as well as a select number of single-socket-only SKUs.
Important changes in AMD EPYC “Rome” CPUs include:
- Up to 64 cores, 2X the max in the previous generation for a massive advancement in aggregate throughput
- PCI-E Gen 4 support for 2X the I/O bandwidth of the x86 competition, a first for an x86 server CPU
- 2X the FLOPS per core of the previous generation EPYC CPUs with the new Zen2 architecture
- DDR4-3200 support for improved memory bandwidth across 8 channels, reaching up to 208GB/sec per socket
- Next Generation Infinity Fabric with higher bandwidth for intra and inter-die connection, with roots in PCI-E Gen4
- New 14nm + 7nm chiplet architecture that separates the 14nm IO and 7nm compute core dies to yield the performance per watt benefits of the new TSMC 7nm process node
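The memory bandwidth item in the list above can be sanity-checked with simple arithmetic: transfers per second, times bytes per transfer, times the number of channels. The sketch below assumes a standard 64-bit (8-byte) DDR4 data path per channel; the function name is made up for illustration. It yields a theoretical peak only, and the simple product comes out at about 205GB/sec, slightly under the figure quoted in the list.

```python
def peak_memory_bandwidth_gbs(mega_transfers_per_sec, channels, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/sec (1 GB = 10**9 bytes).

    Assumes each channel moves bytes_per_transfer bytes per transfer
    (8 bytes for a standard 64-bit DDR4 channel).
    """
    megabytes_per_sec = mega_transfers_per_sec * bytes_per_transfer * channels
    return megabytes_per_sec / 1000


if __name__ == "__main__":
    # DDR4-3200 across 8 channels, as described for EPYC "Rome"
    print(f"{peak_memory_bandwidth_gbs(3200, 8):.1f} GB/sec per socket")
```

Sustained bandwidth measured by real applications will be lower than this theoretical peak.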
This is a guest post by Adam Marko, an IT and Biotech Professional with 10+ years of experience across diverse industries
What is Bowtie2?
Bowtie2 is a commonly used, open-source, fast, and memory-efficient application used as part of a Next Generation Sequencing (NGS) workflow. It aligns the sequencing reads, which are the genomic data output from an NGS device such as an Illumina HiSeq Sequencer, to a reference genome. Applications like Bowtie2 are used as the first step in pipelines such as those for variant determination and for RNA-Seq, an area of continuously growing research interest.
What is RNA-Seq?
RNA Sequencing (RNA-Seq) is a type of NGS that seeks to identify the presence and quantity of RNA in a sample at a given point in time. This can be used to quantify changes in gene expression, which can be a result of time, external stimuli, healthy or diseased states, and other factors. Through this quantification, researchers can obtain a unique snapshot of the genomic status of the organism to identify genomic information previously undetectable with other technologies.
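To make "quantifying changes in gene expression" concrete, here is a deliberately simplified sketch: read counts per gene are normalized to counts per million (CPM) so samples of different sequencing depth can be compared, and the change between two conditions is expressed as a log2 fold change. The gene names and counts are hypothetical, and real analyses typically rely on dedicated statistical packages rather than this bare arithmetic.

```python
import math


def counts_per_million(counts):
    """Normalize raw read counts per gene to counts per million mapped reads."""
    total = sum(counts.values())
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


def log2_fold_change(cpm_a, cpm_b, pseudocount=1.0):
    """log2 ratio of condition B over condition A for each gene.

    The pseudocount avoids division by zero for genes with no reads.
    """
    return {
        gene: math.log2((cpm_b[gene] + pseudocount) / (cpm_a[gene] + pseudocount))
        for gene in cpm_a
    }


if __name__ == "__main__":
    # Hypothetical read counts, for illustration only.
    healthy = {"geneA": 500, "geneB": 1500, "geneC": 8000}
    diseased = {"geneA": 2000, "geneB": 1400, "geneC": 6600}
    changes = log2_fold_change(counts_per_million(healthy), counts_per_million(diseased))
    for gene, lfc in changes.items():
        print(f"{gene}: log2 fold change {lfc:+.2f}")
```

A positive value indicates higher expression in the second condition, and a negative value indicates lower expression.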
There is considerable research effort being put into RNA-Seq, and the number of publications has grown steadily since its first use in 2009.
This is a guest post by Adam Marko, an IT and Biotech Professional with 10+ years of experience across diverse industries
Background and history
Cryogenic Electron Microscopy (CryoEM) is a type of electron microscopy that images molecular samples embedded in a thin layer of non-crystalline ice, also called vitreous ice. Though CryoEM experiments have been performed since the 1980s, the majority of molecular structures have been determined with two other techniques, X-ray crystallography and Nuclear Magnetic Resonance (NMR). The primary advantage of X-ray crystallography and NMR has been that molecular structures could be determined at very high resolution, several fold better than historical CryoEM results.
However, recent advancements in CryoEM microscope detector technology and analysis software have greatly improved the capability of this technique. Before 2012, CryoEM structures could not achieve the resolution of X-ray crystallography and NMR structures. The imaging and analysis improvements since that time now allow researchers to image structures of large molecules and complexes at high resolution. The primary advantages of CryoEM over X-ray crystallography and NMR are:
- Much larger structures can be determined than by X-ray or NMR
- Structures can be determined in a more native state than by using X-ray
The ability to generate these high resolution large molecular structures through CryoEM enables better understanding of life science processes and improved opportunities for drug design. CryoEM has been considered so impactful that its inventors won the 2017 Nobel Prize in Chemistry.
With the launch of the latest Intel Xeon Scalable processors (previously code-named “Cascade Lake SP”), a new standard is set for high performance computing hardware. These latest Xeon CPUs bring increased core counts, faster memory, and faster clock speeds. They are compatible with the existing workstation and server platforms that have been shipping since mid-2017. Starting today, Microway is shipping these new CPUs across our entire line of turn-key Xeon workstations, systems, and clusters.
Important changes in Intel Xeon Scalable “Cascade Lake SP” Processors include:
- Higher CPU core counts for many SKUs in the product stack
- Improved CPU clock speeds (with Turbo Boost up to 4.4GHz)
- Introduction of the new AVX-512 VNNI instruction for Intel Deep Learning Boost (VNNI)
  - Provides significant, more efficient deep learning inference acceleration
- Higher memory capacity & performance:
  - Most CPU models provide increased memory speeds
  - Support for DDR4 memory speeds up to 2933MHz
  - Large-memory capabilities with Intel Optane DC Persistent Memory
  - Support for up to 4.5TB-per-socket system memory
- Integrated hardware-based security mitigations against side-channel attacks
Performance benchmarks are an insightful way to compare new products on the market. With so many GPUs available, it can be difficult to assess which are suitable for your needs. Various benchmarks provide information to compare performance on individual algorithms or operations. Since there are so many different algorithms to choose from, there is no shortage of benchmarking suites available.
For this comparison, the SHOC benchmark suite (https://github.com/vetter/shoc/) is used to compare the performance of the NVIDIA Tesla T4 with other GPUs commonly used for scientific computing: the NVIDIA Tesla P100 and Tesla V100.
The Scalable Heterogeneous Computing Benchmark Suite (SHOC) is a collection of benchmark programs testing the performance and stability of systems using computing devices with non-traditional architectures for general purpose computing, and the software used to program them. Its initial focus is on systems containing Graphics Processing Units (GPUs) and multi-core processors, and on the OpenCL programming standard. It can be used on clusters as well as individual hosts.
The SHOC benchmark suite includes options for many benchmarks relevant to a variety of scientific computations. Most of the benchmarks are provided in both single- and double-precision and with and without PCIE transfer consideration. This means that for each test there are up to four results for each benchmark. These benchmarks are organized into three levels and can be run individually or all together.
The Tesla P100 and V100 GPUs are well-established accelerators for HPC and AI workloads. They typically offer the highest performance, consume the most power (250~300W), and have the highest price tag (~$10k). The Tesla T4 is a new product based on the latest “Turing” architecture, delivering increased efficiency along with new features. However, it is not a replacement for the bigger/more power-hungry GPUs. Instead, it offers good performance while consuming far less power (70W) at a lower price (~$2.5k). You’ll want to use the right tool for the job, which will depend upon your workload(s). A summary of each Tesla GPU is shown below.
Now that NVIDIA has launched their new Tesla V100 32GB GPUs, the next questions from many customers are “What is the Tesla V100 Price?” “How does it compare to Tesla P100?” “How about Tesla V100 16GB?” and “Which GPU should I buy?”
Tesla V100 32GB GPUs are shipping in volume, and our full line of Tesla V100 GPU-accelerated systems is ready for the new GPUs. If you’re planning a new project, we’d be happy to help steer you towards the right choices.
Tesla V100 Price
The table below gives a quick breakdown of the Tesla V100 GPU price, performance and cost-effectiveness:
|Tesla GPU model|Price|Double-Precision Performance (FP64)|Dollars per TFLOPS|Deep Learning Performance (TensorFLOPS or 1/2 Precision)|Dollars per DL TFLOPS|
|Tesla V100 PCI-E 16GB|$11,458* for 32GB| |$1,637 for 32GB| |$102.30 for 32GB|
|Tesla P100 PCI-E 16GB| | | | | |
|Tesla V100 SXM 16GB|$11,458* for 32GB| |$1,469 for 32GB| |$91.66 for 32GB|
|Tesla P100 SXM2 16GB| | | | | |
* single-unit list price before any applicable discounts (ex: EDU, volume)
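The dollars-per-TFLOPS figures in the table are simply the list price divided by peak throughput. The sketch below reproduces the 32GB values from the $11,458 list price shown in the table. The peak throughput numbers (7 and 7.8 TFLOPS FP64, 112 and 125 Tensor TFLOPS) are taken from NVIDIA's published Tesla V100 specifications and are an assumption here, since they do not appear in the table as shown; the function name is illustrative.

```python
def dollars_per_tflops(price_usd, peak_tflops):
    """Cost-effectiveness: list price divided by peak throughput."""
    return price_usd / peak_tflops


# List price from the table (single unit, before discounts).
PRICE_V100_32GB = 11458

# Assumed peak specs from NVIDIA's published V100 figures: (FP64 TFLOPS, Tensor TFLOPS)
V100_PEAK = {
    "Tesla V100 PCI-E 32GB": (7.0, 112.0),
    "Tesla V100 SXM 32GB": (7.8, 125.0),
}

if __name__ == "__main__":
    for model, (fp64, tensor) in V100_PEAK.items():
        per_fp64 = dollars_per_tflops(PRICE_V100_32GB, fp64)
        per_dl = dollars_per_tflops(PRICE_V100_32GB, tensor)
        print(f"{model}: ${per_fp64:,.0f} per FP64 TFLOPS, ${per_dl:,.2f} per DL TFLOPS")
```

The same division can be applied to any quoted price, including discounted ones, to compare GPUs on equal footing.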
- Tesla V100 delivers a big advance in absolute performance, in just 12 months
- Tesla V100 PCI-E maintains similar price/performance value to Tesla P100 for Double Precision Floating Point, but it has a higher entry price
- Tesla V100 delivers dramatic absolute performance & dramatic price/performance gains for AI
- Tesla P100 remains a reasonable price/performance GPU choice, in select situations
- Tesla P100 will still dramatically outperform a CPU-only configuration
Managing an HPC server can be a tricky job, and managing multiple servers is even more complex. Adding GPUs brings even more power, but also new levels of granularity to manage. Luckily, there’s a powerful and effective tool available for managing multiple servers or a cluster of GPUs: NVIDIA Datacenter GPU Manager (DCGM).
Executing hardware or health checks
DCGM’s power comes from its ability to access all kinds of low level data from the GPUs in your system. Much of this data is reported by NVML (NVIDIA Management Library), and it may be accessible via IPMI on your system. But DCGM helps make it far easier to access and use the following:
Report what GPUs are installed, in which slots and PCI-E trees and make a group
Build a group of GPUs once you know which slots they are installed in and which PCI-E trees and NUMA nodes they are attached to. This is great for binding jobs and linking available capabilities.
Determine GPU link states, bandwidths
Provide a report of the PCI-Express link speed each GPU is running at. You may also perform device-to-device (D2D) and host-to-device (H2D) bandwidth tests inside your system, so that you can take action on the reports.
Read temps, boost states, power consumption, or utilization
Deliver data on the energy usage and utilization of your GPUs. This data can be used to control the cluster.
Driver versions and CUDA versions
Report on the versions of CUDA, NVML, and the NVIDIA GPU driver installed on your system
Run sample jobs and integrated validation
Run basic diagnostics and sample jobs that are built into the DCGM package.
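Several of the capabilities above are exposed through DCGM's `dcgmi` command-line tool. The sketch below wraps three of them (listing GPUs, creating a group, and running diagnostics) from a script. The wrapper functions and the group name are hypothetical, and the subcommands and flags should be checked against the documentation for your installed DCGM version. The script only invokes `dcgmi` if it is found on the PATH.

```python
import shutil
import subprocess


def dcgmi_command(task, group_name="hpc_gpus", diag_level=1):
    """Build the argument list for a few common dcgmi tasks.

    group_name is a hypothetical example name; diag_level selects how
    thorough the built-in diagnostic run should be.
    """
    commands = {
        "list_gpus": ["dcgmi", "discovery", "-l"],
        "create_group": ["dcgmi", "group", "-c", group_name],
        "diagnostics": ["dcgmi", "diag", "-r", str(diag_level)],
    }
    return commands[task]


def run_if_available(task):
    """Run the task and return its output, or None if dcgmi is not installed."""
    if shutil.which("dcgmi") is None:
        return None
    completed = subprocess.run(
        dcgmi_command(task), capture_output=True, text=True, check=False
    )
    return completed.stdout


if __name__ == "__main__":
    output = run_if_available("list_gpus")
    print(output if output is not None else "dcgmi not found on this system")
```

Building the argument list separately from running it makes the commands easy to log or review before anything is executed on a production cluster.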
DCGM provides a mechanism to set policies for a group of GPUs.
Artificial Intelligence (AI) and, more specifically, Deep Learning (DL) are revolutionizing the way businesses utilize the vast amounts of data they collect and how researchers accelerate their time to discovery. Some of the most significant examples come from the way AI has already impacted life as we know it such as smartphone speech recognition, search engine image classification, and cancer detection in biomedical imaging. Most businesses have collected troves of data or incorporated new avenues to collect data in recent years. Through the innovations of deep learning, that same data can be used to gain insight, make accurate predictions, and pave the path to discovery.
Developing a plan to integrate AI workloads into an existing business infrastructure or research group presents many challenges. However, there are two key elements that will drive the decisions in customizing an AI cluster. First, understanding the types and volumes of data is paramount to understanding the computational requirements of training the neural network. Second, understanding the business expectation for time to result is equally important. Each of these factors influences the first and second stages of the AI workload, respectively. Underestimating the data characteristics will result in insufficient computational and infrastructure resources to train the networks in a reasonable timeframe. Moreover, underestimating the value and requirement of time-to-results can fail to deliver ROI to the business or hamper research results.
Below are summaries of the different features of system design that must be evaluated when configuring an AI cluster.
The next generation NVIDIA Volta architecture is here. With it comes the new Tesla V100 “Volta” GPU, the most advanced datacenter GPU ever built.
Volta is NVIDIA’s 2nd GPU architecture in ~12 months, and it builds upon the massive advancements of the Pascal architecture. Whether your workload is in HPC, AI, or even remote visualization & graphics acceleration, Tesla V100 has something for you.
Two Flavors, one giant leap: Tesla V100 PCI-E & Tesla V100 with NVLink
For those who love speeds and feeds, here’s a summary of the key enhancements vs Tesla P100 GPUs
Performance of Tesla GPUs, Generation to Generation
| |Tesla V100 with NVLink|Tesla V100 PCI-E|Tesla P100 with NVLink|Tesla P100 PCI-E|Ratio Tesla V100:P100|
| | | |21.2 TFLOPS 1/2 Precision|18.7 TFLOPS 1/2 Precision| |
|Interface (bidirec. BW)| | | | | |
Selecting the right Tesla V100 for you:
With Tesla P100 “Pascal” GPUs, there was a substantial price premium to the NVLink-enabled SXM2.0 form factor GPUs. We’re excited to see things even out for Tesla V100.
However, that doesn’t mean selecting a GPU is as simple as picking one that matches a system design. Here’s some guidance to help you evaluate your options: