Tesla V100 “Volta” GPU Review

Brett Newman

·

September 28, 2017

The next generation NVIDIA Volta architecture is here. With it comes the new Tesla V100 “Volta” GPU, the most advanced datacenter GPU ever built.

Volta is NVIDIA’s 2nd GPU architecture in ~12 months, and it builds upon the massive advancements of the Pascal architecture. Whether your workload is in HPC, AI, or even remote visualization & graphics acceleration, Tesla V100 has something for you.

Review Index:

Speeds and Feeds
Which GPU is for me?
Enhanced NVLink
Programming Improvements
What Volta Means for me?

Two Flavors, one giant leap: Tesla V100 PCI-E & Tesla V100 with NVLink

For those who love speeds and feeds, here’s a summary of the key enhancements vs Tesla P100 GPUs

	Tesla V100 with NVLink	Tesla V100 PCI-E	Tesla P100 with NVLink	Tesla P100 PCI-E	Ratio Tesla V100:P100
DP TFLOPS	7.8 TFLOPS	7.0 TFLOPS	5.3 TFLOPS	4.7 TFLOPS	~1.4-1.5X
SP TFLOPS	15.7 TFLOPS	14 TFLOPS	9.3 TFLOPS	8.74 TFLOPS	~1.4-1.5X
TensorFLOPS	125 TFLOPS	112 TFLOPS	21.2 TFLOPS 1/2 Precision	18.7 TFLOPS 1/2 Precision	~6X
Interface (bidirec. BW)	300GB/sec	32GB/sec	160GB/sec	32GB/sec	1.88X NVLink 9.38X PCI-E
Memory Bandwidth	900GB/sec	900GB/sec	720GB/sec	720GB/sec	1.25X
CUDA Cores (Tensor Cores)	5120 (640)	5120 (640)	3584	3584

Performance of Tesla GPUs, Generation to Generation

Selecting the right Tesla V100 for you:

With Tesla P100 “Pascal” GPUs, there was a substantial price premium to the NVLink-enabled SXM2.0 form factor GPUs. We’re excited to see things even out for Tesla V100.

However, that doesn’t mean selecting a GPU is as simple as picking one that matches a system design. Here’s some guidance to help you evaluate your options:

Tesla V100 PCI-E
Tesla V100 with NVLink

Tesla V100 with NVLink

Tesla V100 SXM 2.0 GPU

Where performance matters
Where GPU:GPU or CPU:GPU bandwidth are paramount
Where advanced system typologies are required
Where your priority is Coherence and CORAL system emulation (building mini-versions of Summit & Sierra)

Tesla V100 with NVLink is available today in DGX-1V, NumberSmasher 1U Tesla GPU Server with NVLink, and Power Systems AC922

Performance Summary

Increases in relative performance are widely workload dependent. But early testing demonstates HPC performance advancing approximately 50%, in just a 12 month period.
Tesla V100 HPC PerformanceTesla V100 HPC Performance
If you haven’t made the jump to Tesla P100 yet, Tesla V100 is an even more compelling proposition.

For Deep Learning, Tesla V100 delivers a massive leap in performance. This extends to training time, and it also brings unique advancements for inference as well.
Deep Learning Performance Summary -Tesla V100

Enhancements for Every Workload

Now that we’ve learned a bit about each GPU, let’s review the breakthrough advancements for Tesla V100.

Next Generation NVIDIA NVLink Technology (NVLink 2.0)

NVLink helps you to transcend the bottlenecks in PCI-Express based systems and dramatically improve the data flow to GPU accelerators.

The total data pipe for each Tesla V100 with NVLink GPUs is now 300GB/sec of bidirectional bandwidth—that’s nearly 10X the data flow of PCI-E x16 3.0 interfaces!

Bandwidth

NVIDIA has enhanced the bandwidth of each individual NVLink brick by 25%: from 40GB/sec (20+20GB/sec in each direction) to 50GB/sec (25+25GB/sec) of bidirectional bandwidth.

This improved signaling helps carry the load of data intensive applications.

Number of Bricks

But enhanced NVLink isn’t just about simple signaling improvements. Point to point NVLink connections are divided into “bricks” or links. Each brick delivers

From 4 bricks to 6

Over and above the signaling improvements, Tesla V100 with NVLink increases the number of NVLink “bricks” that are embedded into each GPU: from 4 bricks to 6.

This 50% increase in brick quantity delivers a large increase in bandwidth for the world’s most data intensive workloads. It also allows for more diverse set of system designs and configurations.

The NVLink Bank Account

NVIDIA NVLink technology continues to be a breakthrough for both HPC and Deep Learning workloads. But as with previous generations of NVLink, there’s a design choice related to a simple question:

Where do you want to spend your NVLink bandwidth?

Think about NVLink bricks as a “spending” or “bank account.” Each NVLink system-design strikes a different balance between where they “spend the funds.” You may wish to:

Spend links on GPU:GPU communication
Focus on increasing the number of GPUs
Broaden CPU:GPU bandwidth to overcome the PCI-E bottleneck
Create a balanced system, or prioritize design choices solely for a single workload

There are good reasons for each, or combinations, of these choices. DGX-1V, NumberSmasher 1U GPU Server with NVLink, and future IBM Power Systems products all set different priorities.

We’ll dive more deeply into these choices in a future post.

Net-net: the NVLink bandwidth increases from 160GB/sec to 300GB/sec (bidirectional), and a number of new, diverse HPC and AI hardware designs are now possible.

Programming Improvements

Tesla V100 and CUDA 9 bring a host of improvements for GPU programmers. These include:

Cooperative Groups
A new L1 cache + shared memory, that simplifies programming
A new SIMT model, that relieves the need to program to fit 32 thread warps

We won’t explore these in detail in this post, but we encourage you to read the following resources for more:

What does Tesla V100 mean for me?

Tesla V100 GPUs mean a massive change for many workloads. But each workload will see different impacts. Here’s a short summary of what we see:

An increase in performance on HPC applications (on paper FLOPS increase of 50%, diverse real-world impacts)
A massive leap for Deep Learning Training
1 GPU, many Deep Learning workloads
New system designs, better tuned to your applications
Radical new, and radically simpler, programming paradigms (coherence)

Tesla V100 PCI-E

Tesla V100 with NVLink

Bandwidth

Number of Bricks

From 4 bricks to 6

Implementing NVIDIA AI Blueprint

Microway Achieves DGX SuperPOD Specialization Partner Status with NVIDIA

DGX A100 review: Throughput and Hardware Summary

Tesla V100 “Volta” GPU Review

Two Flavors, one giant leap: Tesla V100 PCI-E & Tesla V100 with NVLink

Selecting the right Tesla V100 for you:

Performance Summary

Enhancements for Every Workload

Next Generation NVIDIA NVLink Technology (NVLink 2.0)

Bandwidth

Number of Bricks

From 4 bricks to 6

The NVLink Bank Account

Where do you want to spend your NVLink bandwidth?

Programming Improvements

What does Tesla V100 mean for me?

You May Also Like

Implementing NVIDIA AI Blueprint

Microway Achieves DGX SuperPOD Specialization Partner Status with NVIDIA

DGX A100 review: Throughput and Hardware Summary