NVIDIA Tesla V100 Price Analysis

Now that NVIDIA has launched their new Tesla V100 32GB GPUs, the next questions from many customers are “What is the Tesla V100 Price?” “How does it compare to Tesla P100?” “How about Tesla V100 16GB?” and “Which GPU should I buy?”

Tesla V100 32GB GPUs are shipping in volume, and our full line of Tesla V100 GPU-accelerated systems are ready for the new GPUs. If you’re planning a new project, we’d be happy to help steer you towards the right choices.

Tesla V100 Price

The table below gives a quick breakdown of the Tesla V100 GPU price, performance and cost-effectiveness:

Tesla GPU model Price Double-Precision Performance (FP64) Dollars per TFLOPS Deep Learning Performance (TensorFLOPS or 1/2 Precision) Dollars per DL TFLOPS
Tesla V100 PCI-E 16GB
or 32GB
$10,664*
$11,458* for 32GB
7 TFLOPS $1,523
$1,637 for 32GB
112 TFLOPS $95.21
$102.30 for 32GB
Tesla P100 PCI-E 16GB $7,374* 4.7 TFLOPS $1,569 18.7 TFLOPS $394.33
Tesla V100 SXM 16GB
or 32GB
$10,664*
$11,458* for 32GB
7.8 TFLOPS $1,367
$1,469 for 32GB
125 TFLOPS $85.31
$91.66 for 32GB
Tesla P100 SXM2 16GB $9,428* 5.3 TFLOPS $1,779 21.2 TFLOPS $444.72

* single-unit list price before any applicable discounts (ex: EDU, volume)

Key Points

  • Tesla V100 delivers a big advance in absolute performance, in just 12 months
  • Tesla V100 PCI-E maintains similar price/performance value to Tesla P100 for Double Precision Floating Point, but it has a higher entry price
  • Tesla V100 delivers dramatic absolute performance & dramatic price/performance gains for AI
  • Tesla P100 remains a reasonable price/performance GPU choice, in select situations
  • Tesla P100 will still dramatically outperform a CPU-only configuration

Continue reading

NVIDIA Datacenter Manager (DCGM) for More Effective GPU Management

Managing an HPC server can be a tricky job, and managing multiple servers even more complex. Adding GPUs adds even more power yet new levels of granularity. Luckily, there’s a powerful, and effective tool available for managing multiple servers or a cluster of GPUs: NVIDIA Datacenter GPU Manager.

Executing hardware or health checks

DCGM’s power comes from its ability to access all kinds of low level data from the GPUs in your system. Much of this data is reported by NVML (NVIDIA Management Library), and it may be accessible via IPMI on your system. But DCGM helps make it far easier to access and use the following:

Report what GPUs are installed, in which slots and PCI-E trees and make a group

Build a group of GPUs once you know which slots your GPUs are installed in and on which PCI-E trees and NUMA nodes they are on. This is great for binding jobs, linking available capabilities.

Determine GPU link states, bandwidths

Provide a report of the PCI-Express link speed each GPU is running at. You may also perform D2D and H2D bandwidth tests inside your system (to take action on the reports)

Read temps, boost states, power consumption, or utilization

Deliver data on the energy usage and utilization of your GPUs. This data can be used to control the cluster

Driver versions and CUDA versions

Report on the versions of CUDA, NVML, and the NVIDIA GPU driver installed on your system

Run sample jobs and integrated validation

Run basic diagnostics and sample jobs that are built into the DCGM package.

Set policies

DCGM provide a mechanism to set policies to a group of GPUs.

Continue reading

Designing A Production-Class AI Cluster

Artificial Intelligence (AI) and, more specifically, Deep Learning (DL) are revolutionizing the way businesses utilize the vast amounts of data they collect and how researchers accelerate their time to discovery. Some of the most significant examples come from the way AI has already impacted life as we know it such as smartphone speech recognition, search engine image classification, and cancer detection in biomedical imaging. Most businesses have collected troves of data or incorporated new avenues to collect data in recent years. Through the innovations of deep learning, that same data can be used to gain insight, make accurate predictions, and pave the path to discovery.

Developing a plan to integrate AI workloads into an existing business infrastructure or research group presents many challenges. However, there are two key elements that will drive the decisions to customizing an AI cluster. First, understanding the types and volumes of data is paramount to beginning to understand the computational requirements of training the neural network. Secondly, understanding the business expectation for time to result is equally important. Each of these factors influence the first and second stages of the AI workload, respectively. Underestimating the data characteristics will result in insufficient computational and infrastructure resources to train the networks in a reasonable timeframe. Moreover, underestimating the value and requirement of time-to-results can fail to deliver ROI to the business or hamper research results.

Below are summaries of the different features of system design that must be evaluated when configuring an AI cluster.

Continue reading

Tesla V100 “Volta” GPU Review

The next generation NVIDIA Volta architecture is here. With it comes the new Tesla V100 “Volta” GPU, the most advanced datacenter GPU ever built.

Volta is NVIDIA’s 2nd GPU architecture in ~12 months, and it builds upon the massive advancements of the Pascal architecture. Whether your workload is in HPC, AI, or even remote visualization & graphics acceleration, Tesla V100 has something for you.

Two Flavors, one giant leap: Tesla V100 PCI-E & Tesla V100 with NVLink

For those who love speeds and feeds, here’s a summary of the key enhancements vs Tesla P100 GPUs

Performance of Tesla GPUs, Generation to Generation
Tesla V100 with NVLink Tesla V100 PCI-E Tesla P100 with NVLink Tesla P100 PCI-E Ratio Tesla V100:P100
DP TFLOPS 7.8 TFLOPS 7.0 TFLOPS 5.3 TFLOPS 4.7 TFLOPS ~1.4-1.5X
SP TFLOPS 15.7 TFLOPS 14 TFLOPS 9.3 TFLOPS 8.74 TFLOPS ~1.4-1.5X
TensorFLOPS 125 TFLOPS 112 TFLOPS 21.2 TFLOPS 1/2 Precision 18.7 TFLOPS 1/2 Precision ~6X
Interface (bidirec. BW)
300GB/sec 32GB/sec 160GB/sec 32GB/sec 1.88X NVLink
9.38X PCI-E
Memory Bandwidth 900GB/sec 900GB/sec 720GB/sec 720GB/sec 1.25X
CUDA Cores
(Tensor Cores)
5120
(640)
5120
(640)
3584 3584

Selecting the right Tesla V100 for you:

With Tesla P100 “Pascal” GPUs, there was a substantial price premium to the NVLink-enabled SXM2.0 form factor GPUs. We’re excited to see things even out for Tesla V100.

However, that doesn’t mean selecting a GPU is as simple as picking one that matches a system design. Here’s some guidance to help you evaluate your options:
Continue reading

One-shot Learning Methods Applied to Drug Discovery with DeepChem

Experimental data sets for drug discovery are sometimes limited in size, due to the difficulty of gathering this type of data. Drug discovery data sets are expensive to obtain, and some are the result of clinical trials, which might not be repeatable for ethical reasons. The ClinTox data set, for example, is comprised of data from FDA clinical trials of drug candidates, where some data sets are derived from failures, due to toxic side effects [2]. For cases where training data is scarce, application of one-shot learning methods have demonstrated significantly improved performance over methods consisting only of graphical convolution networks. The performance of one-shot network architectures will be discussed here for several drug discovery data sets, which are described in Table 1. These data sets, along with one-shot learning methods, have been integrated into the DeepChem deep learning framework, as a result of research published by Altae-Tran, et al. [1]. While data remains scarce for some problem domains, such as drug discovery, one-shot learning methods could pose an important alternative network architecture, which can possibly far outperform methods which use only graphical convolution.

Continue reading

DeepChem – a Deep Learning Framework for Drug Discovery

A powerful new open source deep learning framework for drug discovery is now available for public download on github. This new framework, called DeepChem, is python-based, and offers a feature-rich set of functionality for applying deep learning to problems in drug discovery and cheminformatics. Previous deep learning frameworks, such as scikit-learn have been applied to chemiformatics, but DeepChem is the first to accelerate computation with NVIDIA GPUs.

The framework uses Google TensorFlow, along with scikit-learn, for expressing neural networks for deep learning. It also makes use of the RDKit python framework, for performing more basic operations on molecular data, such as converting SMILES strings into molecular graphs. The framework is now in the alpha stage, at version 0.1. As the framework develops, it will move toward implementing more models in TensorFlow, which use GPUs for training and inference. This new open source framework is poised to become an accelerating factor for innovation in drug discovery across industry and academia.

Continue reading

GPU-accelerated HPC Containers with Singularity

Fighting with application installations is frustrating and time consuming. It’s not what domain experts should be spending their time on. And yet, every time users move their project to a new system, they have to begin again with a re-assembly of their complex workflow.

This is a problem that containers can help to solve. HPC groups have had some success with more traditional containers (e.g., Docker), but there are security concerns that have made them difficult to use on HPC systems. Singularity, the new tool from the creator of CentOS and Warewulf, aims to resolve these issues.

Continue reading

NVIDIA Tesla P40 GPU Accelerator (Pascal GP102) Up Close

As NVIDIA’s GPUs become increasingly vital to the fields of AI and intelligent machines, NVIDIA has produced GPU models specifically targeted to these applications. The new Tesla P40 GPU is NVIDIA’s premiere product for deep learning deployments. It is specifically designed for high-speed inference workloads, which means running data through pre-trained neural networks. However, it also offers significant processing performance for projects which do not require 64-bit double-precision floating point capability (many neural networks can be trained using the 32-bit single-precision floating point on the Tesla P40). For those cases, these GPUs can be used to accelerate both the neural network training and the inference.

Continue reading

Deep Learning Benchmarks of NVIDIA Tesla P100 PCIe, Tesla K80, and Tesla M40 GPUs

Sources of CPU benchmarks, used for estimating performance on similar workloads, have been available throughout the course of CPU development. For example, the Standard Performance Evaluation Corporation has compiled a large set of applications benchmarks, running on a variety of CPUs, across a multitude of systems. There are certainly benchmarks for GPUs, but only during the past year has an organized set of deep learning benchmarks been published. Called DeepMarks, these deep learning benchmarks are available to all developers who want to get a sense of how their application might perform across various deep learning frameworks.

The benchmarking scripts used for the DeepMarks study are published at GitHub. The original DeepMarks study was run on a Titan X GPU (Maxwell microarchitecture), having 12GB of onboard video memory. Here we will examine the performance of several deep learning frameworks on a variety of Tesla GPUs, including the Tesla P100 16GB PCIe, Tesla K80, and Tesla M40 12GB GPUs.

Continue reading

Comparing NVLink vs PCI-E with NVIDIA Tesla P100 GPUs on OpenPOWER Servers

The new NVIDIA Tesla P100 GPUs are available with both PCI-Express and NVLink connectivity. How do these two types of connectivity compare? This post provides a rundown of NVLink vs PCI-E and explores the benefits of NVIDIA’s new NVLink technology.

Photo of NVIDIA Tesla P100 NVLink GPUs in an OpenPOWER server

Continue reading