NVIDIA Tesla V100 Price Analysis

Now that NVIDIA has launched their new Tesla V100 32GB GPUs, the next questions from many customers are “What is the Tesla V100 Price?” “How does it compare to Tesla P100?” “How about Tesla V100 16GB?” and “Which GPU should I buy?”

Tesla V100 32GB GPUs are shipping in volume, and our full line of Tesla V100 GPU-accelerated systems is ready for the new GPUs. If you're planning a new project, we'd be happy to help steer you towards the right choices.

Tesla V100 Price

The table below gives a quick breakdown of the Tesla V100 GPU price, performance and cost-effectiveness:

Tesla GPU model | Price | Double-Precision Performance (FP64) | Dollars per FP64 TFLOPS | Deep Learning Performance (TensorFLOPS, or 1/2 precision) | Dollars per DL TFLOPS
Tesla V100 PCI-E 16GB or 32GB | $11,458* for 32GB | 7 TFLOPS | $1,523 ($1,637 for 32GB) | 112 TFLOPS | $95.21 ($102.30 for 32GB)
Tesla P100 PCI-E 16GB | $7,374* | 4.7 TFLOPS | $1,569 | 18.7 TFLOPS | $394.33
Tesla V100 SXM2 16GB or 32GB | $11,458* for 32GB | 7.8 TFLOPS | $1,367 ($1,469 for 32GB) | 125 TFLOPS | $85.31 ($91.66 for 32GB)
Tesla P100 SXM2 16GB | $9,428* | 5.3 TFLOPS | $1,779 | 21.2 TFLOPS | $444.72

* single-unit list price before any applicable discounts (e.g., EDU, volume)

Key Points

  • Tesla V100 delivers a big advance in absolute performance, in just 12 months
  • Tesla V100 PCI-E maintains similar price/performance value to Tesla P100 for Double Precision Floating Point, but it has a higher entry price
  • Tesla V100 delivers dramatic absolute performance & dramatic price/performance gains for AI
  • Tesla P100 remains a reasonable price/performance GPU choice, in select situations
  • Tesla P100 will still dramatically outperform a CPU-only configuration
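The dollars-per-TFLOPS figures in the table are simply the list price divided by peak throughput. A quick sketch (using the 32GB list prices and peak TFLOPS numbers from the table above) shows how they are derived:

```python
# Reproduce the price/performance math from the table above
# (32GB list prices; TFLOPS values are NVIDIA peak figures).
gpus = {
    "Tesla V100 PCI-E 32GB": {"price": 11458, "fp64": 7.0, "tensor": 112},
    "Tesla V100 SXM2 32GB":  {"price": 11458, "fp64": 7.8, "tensor": 125},
    "Tesla P100 PCI-E 16GB": {"price": 7374,  "fp64": 4.7, "tensor": 18.7},
    "Tesla P100 SXM2 16GB":  {"price": 9428,  "fp64": 5.3, "tensor": 21.2},
}

for name, g in gpus.items():
    per_fp64 = g["price"] / g["fp64"]      # dollars per FP64 TFLOPS
    per_dl = g["price"] / g["tensor"]      # dollars per deep learning TFLOPS
    print(f"{name}: ${per_fp64:,.0f}/FP64 TFLOPS, ${per_dl:,.2f}/DL TFLOPS")
```

Running the numbers confirms the table: for example, the Tesla V100 SXM2 32GB works out to $1,469 per FP64 TFLOPS and $91.66 per deep learning TFLOPS.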

Continue reading

NVIDIA Datacenter GPU Manager (DCGM) for More Effective GPU Management

Managing an HPC server can be a tricky job, and managing multiple servers is even more complex. Adding GPUs brings even more compute power, but also new levels of granularity to monitor. Luckily, there's a powerful and effective tool available for managing multiple servers or a cluster of GPUs: NVIDIA Datacenter GPU Manager (DCGM).

Executing hardware or health checks

DCGM's power comes from its ability to access all kinds of low-level data from the GPUs in your system. Much of this data is reported by NVML (the NVIDIA Management Library), and some of it may be accessible via IPMI on your system. But DCGM makes it far easier to access and use the following:

Report which GPUs are installed, in which slots and on which PCI-E trees, and make a group

Build a group of GPUs once you know which slots your GPUs are installed in and which PCI-E trees and NUMA nodes they sit on. This is great for binding jobs to the right GPUs and for tracking the capabilities available in each group.

Determine GPU link states, bandwidths

Provide a report of the PCI-Express link speed each GPU is running at. You may also perform device-to-device (D2D) and host-to-device (H2D) bandwidth tests inside your system, so you can act on the reports.

Read temps, boost states, power consumption, or utilization

Deliver data on the energy usage and utilization of your GPUs. This data can be used to manage and schedule work across the cluster.

Driver versions and CUDA versions

Report on the versions of CUDA, NVML, and the NVIDIA GPU driver installed on your system

Run sample jobs and integrated validation

Run basic diagnostics and sample jobs that are built into the DCGM package.

Set policies

DCGM provides a mechanism for setting policies across a group of GPUs.
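The capabilities above are all reachable from DCGM's `dcgmi` command-line tool. Below is a minimal sketch driving `dcgmi` from Python; it assumes DCGM is installed, and the group name, GPU IDs, and group ID used here are illustrative placeholders:

```python
# Sketch: exercising DCGM's dcgmi CLI (discovery, grouping, health, diag,
# monitoring). Group name/IDs below are illustrative, not required values.
import shutil
import subprocess

def run(cmd):
    """Run a dcgmi command if the CLI is present; otherwise just echo it."""
    if shutil.which("dcgmi") is None:
        print("skipped (dcgmi not installed):", " ".join(cmd))
        return None
    return subprocess.run(cmd, capture_output=True, text=True).stdout

run(["dcgmi", "discovery", "-l"])              # list GPUs and PCI-E placement
run(["dcgmi", "group", "-c", "ml_gpus",        # create a group named "ml_gpus"
     "-a", "0,1"])                             # ...containing GPUs 0 and 1
run(["dcgmi", "health", "-g", "1", "-s", "a"]) # enable all health watches
run(["dcgmi", "health", "-g", "1", "-c"])      # run a health check on the group
run(["dcgmi", "diag", "-r", "1"])              # short built-in diagnostic
run(["dcgmi", "dmon", "-e", "150,155",         # sample temperature (150) and
     "-c", "5"])                               # power draw (155), five readings
```

On a system with DCGM installed, each call prints a report; field IDs 150 and 155 correspond to GPU temperature and power usage in DCGM's field catalog.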

Continue reading

Solutions for Rackmounting Extra Depth Systems

Systems with socketed GPUs, such as the Tesla V100 SXM2 or the Tesla P100 SXM2, require an additional (or extended) system board onto which the socketed GPUs are seated. In this type of system, the presence of that board requires a chassis with more depth than most other rackmountable systems. The challenge is that most data centers have server cabinets which, in their most common configurations, cannot accommodate an extra depth chassis. For this reason, extra depth cabinets are usually required for rackmounting an extra depth system.

An extra depth cabinet is not always required, though. Workarounds can be implemented to fit an extra depth chassis into a regular depth cabinet, but an uncommon cabinet configuration is required.

Common Extra Depth Systems

Some common extra depth systems are described in Table 1. Most of the systems described in the table are GPU systems with socketed or PCIe Tesla GPUs.

Description | Height (rack units) | Chassis Depth
NumberSmasher 1U Tesla GPU Server with NVLink | 1U | 35.2″ (894mm); 39.3″ (997mm) with rails
NumberSmasher 1U Tesla GPU Server (4 PCIe GPUs), dual CPU sockets | 1U | 35.2″ (894mm); 39.3″ (997mm) with rails
NumberSmasher 1U Tesla GPU Server, up to 4 Tesla V100 or P100 PCIe GPUs, single CPU socket | 1U | 34.5″ (877mm)
NVIDIA DGX-1 | 3U | 34.1″ (867mm)
Octoputer 4U Tesla 8-GPU Server with NVLink | 4U | 31.7″ (805mm)

Table 1. Common extra depth systems

Commonly used extra depth cabinets are described in Table 2, ranging in height from 42U to 48U. Extra depth cabinets are the easiest solution for rackmounting extra depth systems. There are workarounds, however, for instances where a customer already has a regular depth cabinet on-site and would prefer to use the existing cabinet (due to scarcity of floor space, for example).

Make & Part No. | Height (rack units) | Dimensions | Description
APC AR3300 | 42U | 600mm wide × 1200mm deep | extra depth
APC AR3305 | 45U | 600mm wide × 1200mm deep | extra depth
APC AR3307 | 48U | 600mm wide × 1200mm deep | extra depth

Table 2. Common extra depth cabinets

Workaround Solutions

Workaround solutions for mounting extra depth systems in regular depth cabinets carry their own considerations and conditions.

Workaround Solution #1: Remove any vertical PDUs, and replace them with horizontal PDUs

An extra depth system will be obstructed by full height vertical ("zero U") PDUs as it is slid toward the back of the cabinet, preventing it from sliding fully into position. All full height vertical PDUs must therefore be removed from the cabinet and replaced with horizontal PDUs. Because extra depth systems usually have a secondary system board with socketed GPUs, they are usually power-dense. The NVIDIA DGX-1 GPU-accelerated system for deep learning, for example, requires 3.5kW of power at peak workload. Tri-phase power is recommended, whenever possible, for power-dense GPU systems; with some power-dense configurations, it will not be possible to meet peak power requirements with single phase power. Along with their unusually high power density, extra depth systems also require high airflow. Using the NVIDIA DGX-1 again as an example, each of its four chassis fans produces a maximum of 340 CFM of airflow, for a total of 1,360 CFM per DGX-1 system. For groups of systems with high airflow requirements, the cabinet doors must be perforated.
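To see why single phase power runs out quickly on a system like the DGX-1, compare its 3.5kW peak draw against a common 208V, 20 Amp circuit. This is a rough sketch; the 80% continuous-load derating used below is the usual North American electrical-code rule of thumb, not a figure from the text:

```python
# Rough check: can one 208V/20A single phase circuit feed a 3.5kW system?
peak_watts = 3500                   # NVIDIA DGX-1 peak power (from the text)
volts = 208                        # typical single phase datacenter circuit
breaker_amps = 20
usable_amps = breaker_amps * 0.8    # 80% continuous-load derating

required_amps = peak_watts / volts
print(f"required: {required_amps:.1f}A, usable: {usable_amps:.1f}A")
# ~16.8A required vs 16.0A usable -> a single 208V/20A circuit falls short
```

This is why tri-phase power, or at minimum higher-amperage single phase circuits, is recommended for power-dense GPU systems.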

Selecting a tri-phase horizontal PDU can be a challenge, since they offer fewer outlets than vertically mounted PDUs and will sometimes present an outlet type which is not compatible with the inlet type on the system(s). If the entire cabinet will not be needed for mounting extra depth systems, then using a half height PDU, possibly in addition to a horizontal PDU, may be a good choice.

APC currently offers only one tri-phase horizontal PDU. Geist offers a wider variety of tri-phase horizontal PDU types, which can be searched using the Geist PDU finder. Geist does not offer a tri-phase PDU for use with a 208V, 20 Amp source, but it does offer a variety of horizontal tri-phase PDUs for lines carrying 30 Amps or more. Like Geist, Server Technology offers a range of tri-phase horizontal PDUs, including a model which can be used with a 208V, 20 Amp power source and is compatible with the NEMA L21-20P plug type.

Workaround Solution #2: Replace any full height vertical PDUs with half height PDUs

Half height PDUs are typically used for shorter cabinets. But they can also be used in regular height cabinets, to allow for installation of an extra depth system chassis. If the power receptacle is below the floor, then it will be easier to mount the PDU under the extra depth systems, with the plug pointed downward. If the power receptacle is on the ceiling, then it will be easier to mount the PDU above the extra depth systems, with the plug pointed upward.

Half height vertical PDUs should only be used if the entire cabinet will not be needed for mounting extra depth systems. This is because half height vertical PDUs will still prevent installation of extra depth systems into approximately half of the cabinet’s rackspace. Some half height PDUs can be mounted on the exterior of the cabinet frame, so that they will not obstruct extra depth systems from being rackmounted.

Figure 1. Half height vertical PDU (shown horizontally)

Workaround Solution #3: Use an Extra Wide, Regular Depth Cabinet

If there is an extra wide cabinet onsite, it could possibly be used to install an extra depth system. Extra wide cabinets provide sufficient width such that vertical PDUs, mounted on the sides, will not obstruct an extra depth chassis from sliding all the way to the back of the cabinet. Removal of cabinet rear doors may still be required, however, depending on the depth of the system and cabinet. If right angle power connectors are used, then, in some cases, removal of rear doors will not be required (e.g., DGX-1 in the AR3100 cabinet). Using right angle power connectors for connecting to the PDU itself may also allow more horizontal clearance for extra depth systems. In cases where the horizontal clearance is a bit narrow, an extra depth system could be positioned vertically into another rack position, so that it will not have to squeeze between plugs connected to vertical PDUs. Positioning the system vertically to correspond with the position of a meter LCD panel on a metered PDU, for example, or between power banks, would allow for more horizontal clearance, since plugs will not be encroaching upon horizontal clearance at these vertical positions.

Make & Part No. | Height (rack units) | Dimensions | Description
APC AR3150 | 42U | 750mm wide × 1070mm deep | extra wide
APC AR3350 | 42U | 750mm wide × 1200mm deep | extra wide, extra depth

Table 3. Some extra wide cabinets

Some PDU types are deeper than others, meaning the plugs will encroach further into the horizontal clearance, since the outlets on the PDU will be at a greater distance from the side of the cabinet. Some Raritan PDUs, for example, have more depth than some APC PDUs.

Workaround Solution #4: Mount Systems at Height Corresponding to Space between PDU Power Banks

As mentioned in Workaround #3, it may be possible to mount an extra depth system at a height where it will not run into plugs connected to PDUs. This is possible only if the PDUs are sufficiently shallow and if the system is mounted at a height corresponding to the space between PDU power banks, where no plugs protrude.

For example, an IBM Power9 system (33.3″ depth) will still fit into an APC AR3100 regular depth cabinet, with four vertical AP7541 PDUs installed at the back of the cabinet, as long as it is installed at a height between the PDU power banks.
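As a rough sanity check of that fit, the depth budget can be worked out directly. This sketch assumes the AR3100's nominal 1070mm cabinet depth; the actual usable depth also depends on rails, doors, and cabling:

```python
# Depth budget for a 33.3" system in a nominal 1070mm-deep cabinet
cabinet_depth_in = 1070 / 25.4   # ~42.1 inches
system_depth_in = 33.3           # IBM Power9 server depth (from the text)

clearance = cabinet_depth_in - system_depth_in
print(f"{clearance:.1f} inches left for PDUs, plugs, and cabling")
```

Under nine inches of remaining depth is easily consumed by a vertical PDU plus protruding plugs, which is why the system must land between the power banks.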

Designing A Production-Class AI Cluster

Artificial Intelligence (AI) and, more specifically, Deep Learning (DL) are revolutionizing the way businesses utilize the vast amounts of data they collect and how researchers accelerate their time to discovery. Some of the most significant examples come from the way AI has already impacted life as we know it such as smartphone speech recognition, search engine image classification, and cancer detection in biomedical imaging. Most businesses have collected troves of data or incorporated new avenues to collect data in recent years. Through the innovations of deep learning, that same data can be used to gain insight, make accurate predictions, and pave the path to discovery.

Developing a plan to integrate AI workloads into an existing business infrastructure or research group presents many challenges. However, there are two key elements that will drive the decisions when customizing an AI cluster. First, understanding the types and volumes of data is paramount to understanding the computational requirements of training the neural network. Second, understanding the business expectation for time to result is equally important. These factors influence the first and second stages of the AI workload, respectively. Underestimating the data characteristics will result in insufficient computational and infrastructure resources to train the networks in a reasonable timeframe. Likewise, underestimating the value and requirement of time-to-results can fail to deliver ROI to the business or hamper research results.

Below are summaries of the different features of system design that must be evaluated when configuring an AI cluster.

Continue reading

Tesla V100 “Volta” GPU Review

The next generation NVIDIA Volta architecture is here. With it comes the new Tesla V100 “Volta” GPU, the most advanced datacenter GPU ever built.

Volta is NVIDIA’s 2nd GPU architecture in ~12 months, and it builds upon the massive advancements of the Pascal architecture. Whether your workload is in HPC, AI, or even remote visualization & graphics acceleration, Tesla V100 has something for you.

Two Flavors, one giant leap: Tesla V100 PCI-E & Tesla V100 with NVLink

For those who love speeds and feeds, here's a summary of the key enhancements vs. Tesla P100 GPUs:

Performance of Tesla GPUs, Generation to Generation
Metric | Tesla V100 with NVLink | Tesla V100 PCI-E | Tesla P100 with NVLink | Tesla P100 PCI-E | Ratio, Tesla V100:P100
TensorFLOPS | 125 TFLOPS | 112 TFLOPS | 21.2 TFLOPS (1/2 precision) | 18.7 TFLOPS (1/2 precision) | ~6X
Interface (bidirectional BW) | 300GB/sec | 32GB/sec | 160GB/sec | 32GB/sec | 1.88X NVLink (9.38X vs PCI-E)
Memory Bandwidth | 900GB/sec | 900GB/sec | 720GB/sec | 720GB/sec | 1.25X
CUDA Cores (Tensor Cores) | 5120 (640) | 5120 (640) | 3584 | 3584 | —
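The ratio column can be sanity-checked directly from the raw spec values, comparing like-for-like form factors:

```python
# Generation-over-generation ratios from the table's raw numbers
print(round(125 / 21.2, 2))   # TensorFLOPS, NVLink form factor: ~5.9X
print(round(112 / 18.7, 2))   # TensorFLOPS, PCI-E form factor:  ~6.0X
print(round(300 / 160, 2))    # NVLink bidirectional bandwidth:   1.88X
print(round(900 / 720, 2))    # HBM2 memory bandwidth:            1.25X
```

Both form factors land right around the ~6X deep learning figure, which is the headline improvement of the Volta generation.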

Selecting the right Tesla V100 for you:

With Tesla P100 "Pascal" GPUs, there was a substantial price premium for the NVLink-enabled SXM2 form factor GPUs. We're excited to see things even out for Tesla V100.

However, that doesn’t mean selecting a GPU is as simple as picking one that matches a system design. Here’s some guidance to help you evaluate your options:
Continue reading

One-shot Learning Methods Applied to Drug Discovery with DeepChem

Experimental data sets for drug discovery are sometimes limited in size, due to the difficulty of gathering this type of data. Drug discovery data sets are expensive to obtain, and some are the result of clinical trials, which might not be repeatable for ethical reasons. The ClinTox data set, for example, comprises data from FDA clinical trials of drug candidates, where some data sets are derived from failures due to toxic side effects [2]. For cases where training data is scarce, application of one-shot learning methods has demonstrated significantly improved performance over methods consisting only of graph convolutional networks. The performance of one-shot network architectures will be discussed here for several drug discovery data sets, which are described in Table 1. These data sets, along with one-shot learning methods, were integrated into the DeepChem deep learning framework as a result of research published by Altae-Tran, et al. [1]. While data remains scarce for some problem domains, such as drug discovery, one-shot learning methods could pose an important alternative network architecture, one which can far outperform methods using only graph convolution.

Continue reading

DeepChem – a Deep Learning Framework for Drug Discovery

A powerful new open source deep learning framework for drug discovery is now available for public download on GitHub. This new framework, called DeepChem, is Python-based and offers a feature-rich set of functionality for applying deep learning to problems in drug discovery and cheminformatics. Machine learning frameworks such as scikit-learn have been applied to cheminformatics before, but DeepChem is the first to accelerate computation with NVIDIA GPUs.

The framework uses Google TensorFlow, along with scikit-learn, for expressing neural networks for deep learning. It also makes use of the RDKit Python framework for performing more basic operations on molecular data, such as converting SMILES strings into molecular graphs. The framework is now in the alpha stage, at version 0.1. As the framework develops, it will move toward implementing more models in TensorFlow which use GPUs for training and inference. This new open source framework is poised to become an accelerating factor for innovation in drug discovery across industry and academia.
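To give a feel for the RDKit layer DeepChem builds on, here is a minimal sketch of SMILES parsing. It assumes the rdkit package is installed, and uses ethanol ("CCO") as a hypothetical stand-in for real drug-discovery inputs:

```python
# Parse a SMILES string into an RDKit molecule object, the same primitive
# DeepChem uses before constructing molecular-graph features.
atoms = None
try:
    from rdkit import Chem
    mol = Chem.MolFromSmiles("CCO")   # ethanol: C-C-O
    atoms = mol.GetNumAtoms()         # counts heavy atoms only
    print(f"{atoms} heavy atoms, {mol.GetNumBonds()} bonds")
except ImportError:
    print("rdkit is not installed; this sketch needs it to run")
```

From a molecule object like this, atoms become graph nodes and bonds become edges, which is the representation the graph convolution models consume.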

Continue reading

GPU-accelerated HPC Containers with Singularity

Fighting with application installations is frustrating and time consuming. It’s not what domain experts should be spending their time on. And yet, every time users move their project to a new system, they have to begin again with a re-assembly of their complex workflow.

This is a problem that containers can help to solve. HPC groups have had some success with more traditional containers (e.g., Docker), but there are security concerns that have made them difficult to use on HPC systems. Singularity, the new tool from the creator of CentOS and Warewulf, aims to resolve these issues.

Continue reading

NVIDIA Tesla P40 GPU Accelerator (Pascal GP102) Up Close

As NVIDIA's GPUs become increasingly vital to the fields of AI and intelligent machines, NVIDIA has produced GPU models specifically targeted to these applications. The new Tesla P40 GPU is NVIDIA's premier product for deep learning deployments. It is specifically designed for high-speed inference workloads, which means running data through pre-trained neural networks. However, it also offers significant processing performance for projects which do not require 64-bit double-precision floating point capability (many neural networks can be trained using the 32-bit single-precision floating point on the Tesla P40). For those cases, these GPUs can be used to accelerate both neural network training and inference.

Continue reading

Deep Learning Benchmarks of NVIDIA Tesla P100 PCIe, Tesla K80, and Tesla M40 GPUs

Sources of CPU benchmarks, used for estimating performance on similar workloads, have been available throughout the course of CPU development. For example, the Standard Performance Evaluation Corporation has compiled a large set of application benchmarks, running on a variety of CPUs, across a multitude of systems. There are certainly benchmarks for GPUs, but only during the past year has an organized set of deep learning benchmarks been published. Called DeepMarks, these deep learning benchmarks are available to all developers who want to get a sense of how their application might perform across various deep learning frameworks.

The benchmarking scripts used for the DeepMarks study are published at GitHub. The original DeepMarks study was run on a Titan X GPU (Maxwell microarchitecture), having 12GB of onboard video memory. Here we will examine the performance of several deep learning frameworks on a variety of Tesla GPUs, including the Tesla P100 16GB PCIe, Tesla K80, and Tesla M40 12GB GPUs.

Continue reading