Deep Learning Benchmarks of NVIDIA Tesla P100 PCIe, Tesla K80, and Tesla M40 GPUs

Sources of CPU benchmarks, used for estimating performance on similar workloads, have been available throughout the course of CPU development. For example, the Standard Performance Evaluation Corporation has compiled a large set of applications benchmarks, running on a variety of CPUs, across a multitude of systems. There are certainly benchmarks for GPUs, but only during the past year has an organized set of deep learning benchmarks been published. Called DeepMarks, these deep learning benchmarks are available to all developers who want to get a sense of how their application might perform across various deep learning frameworks.

The benchmarking scripts used for the DeepMarks study are published at GitHub. The original DeepMarks study was run on a Titan X GPU (Maxwell microarchitecture), having 12GB of onboard video memory. Here we will examine the performance of several deep learning frameworks on a variety of Tesla GPUs, including the Tesla P100 16GB PCIe, Tesla K80, and Tesla M40 12GB GPUs.

Data from Deep Learning Benchmarks

The deep learning frameworks covered in this benchmark study are TensorFlow, Caffe, Torch, and Theano. All deep learning benchmarks were single-GPU runs. The benchmarking scripts used in this study are the same as those found at DeepMarks. DeepMarks runs a series of benchmarking scripts which report the time required for a framework to process one forward propagation step, plus one backpropagation step. The sum of both comprises one training iteration. The times reported are the times required for one training iteration per batch, in milliseconds.

To start, we ran CPU-only trainings of each neural network. We then ran the same trainings on each type of GPU. The plot below depicts the ranges of speedup that were obtained via GPU acceleration.

Plot of deep learning benchmark results across Tesla K80, Tesla M40, and Tesla P100 16GB PCIe GPUs

Figure 1. GPU speedup ranges over CPU-only trainings – geometrically averaged across all four framework types and all four neural network types.

If we expand the plot and show the speedups for the different types of neural networks, we see that some types of networks undergo a larger speedup than others.

Plot of deep learning benchmark speedups (with geometric averages) for each network on Tesla K80, Tesla M40, and Tesla P100 16GB PCIe GPUs

Figure 2. GPU speedups over CPU-only trainings – geometrically averaged across all four deep learning frameworks. The speedup ranges from Figure 1 are uncollapsed into values for each neural network architecture.

If we take a step back and look at the ranges of speedups the GPUs provide, there is a fairly wide range of speedup. The plot below shows the full range of speedups measured (without geometrically averaging across the various deep learning frameworks). Note that the ranges are widened and become overlapped.

Plot of deep learning benchmark results (without geometric averages) across Tesla K80, Tesla M40, and Tesla P100 16GB PCIe GPUs

Figure 3. Speedup factor ranges without geometric averaging across frameworks. Range is taken across set of runtimes for all framework/network pairs.

We believe the ranges resulting from geometric averaging across frameworks (as shown in Figure 1) results in narrower distributions and appears to be a more accurate quality measure than is shown in Figure 3. However, it is instructive to expand the plot from Figure 3 to show each deep learning framework. Those ranges, as shown below, demonstrate that your neural network training time will strongly depend upon which deep learning framework you select.

Plot of deep learning benchmark results for each framework (without geometric averages) across Tesla K80, Tesla M40, and Tesla P100 16GB PCIe GPUs

Figure 4. GPU speedups over CPU-only trainings – showing the range of speedups when training four neural network types. The speedup ranges from Figure 3 are uncollapsed into values for each deep learning framework.

As shown in all four plots above, the Tesla P100 PCIe GPU provides the fastest speedups for neural network training. With that in mind, the plot below shows the raw training times for each type of neural network on each of the four deep learning frameworks.

Plot of deep learning benchmark training iteration times for each framework on Tesla P100 16GB PCIe GPUs

Figure 5. Training iteration times (in milliseconds) for each deep learning framework and neural network architecture (as measured on the Tesla P100 16GB PCIe GPU).

We provide more discussion below. For reference, we have listed the measurements from each set of tests.

Tesla P100 16GB PCIe Benchmark Results

AlexNet Overfeat GoogLeNet VGG (ver.a) Speedup Over CPU
Caffe 80 288 279 393 (35x ~ 70x speedups)
TensorFlow 46 144 253 277 (16x ~ 40x speedups)
Theano 161 482 624 2075 (19x ~ 43x speedups)
cuDNN-fp32 (Torch) 44 107 247 222 (33x ~ 41x speedups)
geometric average over frameworks 71 215 331 473 (29x ~ 42x speedups)

Table 1: Benchmarks were run on a single Tesla P100 16GB PCIe GPU. Times reported are in msec per batch. The batch size for all training iterations measured for runtime in this study is 128, except for VGG net, which uses a batch size of 64.

Tesla K80 Benchmark Results

AlexNet Overfeat GoogLeNet VGG (ver.a) Speedup Over CPU
Caffe 365 1,187 1,236 1,747 (9x ~ 15x speedups)
TensorFlow 181 622 979 1,104 (4x ~ 10x speedups)
Theano 515 1,716 1,793 (8x ~ 16x speedups)
cuDNN-fp32 (Torch) 171 379 914 743 (9x ~ 12x speedups)
geometric average over frameworks 276 832 1,187 1,127 (9x ~ 11x speedups)

Table 2: Benchmarks were run on a single Tesla K80 GPU chip. Times reported are in msec per batch.

Tesla M40 Benchmark Results

AlexNet Overfeat GoogLeNet VGG (ver.a) Speedup Over CPU
Caffe 128 448 468 637 (22x ~ 53x speedups)
TensorFlow 82 273 418 498 (10x ~ 22x speedups)
Theano 245 786 963 (17x ~ 28x speedups)
cuDNN-fp32 (Torch) 79 182 433 400 (19x ~ 22x speedups)
geometric average over frameworks 119 364 534 506 (20x ~ 27x speedups)

Table 3: Benchmarks were run on a single Tesla M40 GPU. Times reported are in msec per batch.

CPU-only Benchmark Results

AlexNet Overfeat GoogLeNet VGG (ver.a)
Caffe 4,529 10,350 18,545 14,010
TensorFlow 1,823 5,275 4,018 7,341
Theano 5,275 13,579 26,829 38,687
cuDNN-fp32 (Torch) 1,838 3,604 8,234 9,166
geometric average over frameworks 2,991 7,190 11,326 13,819

Table 4: Benchmarks were run on dual Xeon E5-2690v4 processors in a system with 256GB RAM. Times reported are in msec per batch.


When geometric averaging is applied across framework runtimes, a range of speedup values is derived for each GPU, as shown in Figure 1. CPU times are also averaged geometrically across framework type. These results indicate that the greatest speedups are realized with the Tesla P100, with the Tesla M40 ranking second, and the Tesla K80 yielding the lowest speedup factors. Figure 2 shows the range of speedup values by network architecture, uncollapsed from the ranges shown in Figure 1.

The speedup ranges for runtimes not geometrically averaged across frameworks are shown in Figure 3. Here the set of all runtimes corresponding to each framework/network pair is considered when determining the range of speedups for each GPU type. Figure 4 shows the speedup ranges by framework, uncollapsed from the ranges shown in figure 3. The degree of overlap in Figure 3 suggests that geometric averaging across framework type yields a better measure of GPU performance, with more narrow and distinct ranges resulting for each GPU type, as shown in Figure 1.

The greatest speedups were observed when comparing Caffe forward+backpropagation runtime to CPU runtime, when solving the GoogLeNet network model. Caffe generally showed speedups larger than any other framework for this comparison, ranging from 35x to ~70x (see Figure 4 and Table 1). Despite the higher speedups, Caffe does not turn out to be the best performing framework on these benchmarks (see Figure 5). When comparing runtimes on the Tesla P100, Torch performs best and has the shortest runtimes (see Figure 5). Note that although the VGG net tends to be the slowest of all, it does train faster then GooLeNet when run on the Torch framework (see Figure 5).

The data show that Theano and TensorFlow display similar speedups on GPUs (see Figure 4). Despite the fact that Theano sometimes has larger speedups than Torch, Torch and TensorFlow outperform Theano. While Torch and TensorFlow yield similar performance, Torch performs slightly better with most network / GPU combinations. However, TensorFlow outperforms Torch in most cases for CPU-only training (see Table 4).

Theano is outperformed by all other frameworks, across all benchmark measurements and devices (see Tables 1 – 4). Figure 5 shows the large runtimes for Theano compared to other frameworks run on the Tesla P100. It should be noted that since VGG net was run with a batch size of only 64, compared to 128 with all other network architectures, the runtimes can sometimes be faster with VGG net, than with GoogLeNet. See, for example, the runtimes for Torch, on GoogLeNet, compared to VGG net, across all GPU devices (Tables 1 – 3).

Deep Learning Benchmark Conclusions

The single-GPU benchmark results show that speedups over CPU increase from Tesla K80, to Tesla M40, and finally to Tesla P100, which yields the greatest speedups (Table 5, Figure 1) and fastest runtimes (Table 6).

Range of Speedups, by GPU type

Tesla P100 16GB PCIe Tesla M40 12GB Tesla K80
19x ~ 70x 10x ~ 53x 4x ~ 16x

Table 5: Measured speedups for running various deep learning frameworks on GPUs (see Table 1)

Fastest Runtime for VGG net, by GPU type

Tesla P100 16GB PCIe Tesla M40 12GB Tesla K80
222 408 743

Table 6: Absolute best runtimes (msec / batch) across all frameworks for VGG net (ver. a). The Torch framework provides the best VGG runtimes, across all GPU types.

The results show that of the tested GPUs, Tesla P100 16GB PCIe yields the absolute best runtime, and also offers the best speedup over CPU-only runs. Regardless of which deep learning framework you prefer, these GPUs offer valuable performance boosts.

Benchmark Setup

Microway’s GPU Test Drive compute nodes were used in this study. Each is configured with 256GB of system memory and dual 14-core Intel Xeon E5-2690v4 processors (with a base frequency of 2.6GHz and a Turbo Boost frequency of 3.5GHz). Identical benchmark workloads were run on the Tesla P100 16GB PCIe, Tesla K80, and Tesla M40 GPUs. The batch size is 128 for all runtimes reported, except for VGG net (which uses a batch size of 64). All deep learning frameworks were linked to the NVIDIA cuDNN library (v5.1), instead of their own native deep network libraries. This is because linking to cuDNN yields better performance than using the native library of each framework.

When running benchmarks of Theano, slightly better runtimes resulted when CNMeM, a CUDA memory manager, is used to manage the GPU’s memory. By setting lib.cnmem=0.95, the GPU device will have CNMeM manage 95% of its memory:
THEANO_FLAGS='floatX=float32,device=gpu0,lib.cnmem=0.95,allow_gc=True' python ...

Notes on Tesla M40 versus Tesla K80

The data demonstrate that Tesla M40 outperforms Tesla K80. When geometrically averaging runtimes across frameworks, the speedup of the Tesla K80 ranges from 9x to 11x, while for the Tesla M40, speedups range from 20x to 27x. The same relationship exists when comparing ranges without geometric averaging. This result is expected, considering that the Tesla K80 card consists of two separate GK210 GPU chips (connected by a PCIe switch on the GPU card). Since the benchmarks here were run on single GPU chips, the benchmarks reflect only half the throughput possible on a Tesla K80 GPU. If running a perfectly parallel job, or two separate jobs, the Tesla K80 should be expected to approach the throughput of a Tesla M40.

Singularity Containers

Logo image of the Singularity projectSingularity is a new type of container designed specifically for HPC environments. Singularity enables the user to define an environment within the container, which might include customized deep learning frameworks, NVIDIA device drivers, and the CUDA 8.0 toolkit. The user can copy and transport this container as a single file, bringing their customized environment to a different machine where the host OS and base hardware may be completely different. The container will process the workflow within it to execute in the host’s OS environment, just as it does in its internal container environment. The workflow is pre-defined inside of the container, including and necessary library files, packages, configuration files, environment variables, and so on.

In order to facilitate benchmarking of four different deep learning frameworks, Singularity containers were created separately for Caffe, TensorFlow, Theano, and Torch. Given its simplicity and powerful capabilities, you should expect to hear more about Singularity soon.


Deep Learning Benchmarks published on GitHub

Containers for Full User Control of Environment

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” Advances in neural information processing systems. 2012.

Sermanet, Pierre, et al. “Overfeat: Integrated recognition, localization and detection using convolutional networks.” arXiv preprint arXiv:1312.6229 (2013).

Szegedy, Christian, et al. “Going deeper with convolutions.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014).

John Murphy

About John Murphy

My background in HPC includes building two clusters at the University of Massachusetts, along with doing computational research in quantum chemistry on the facilities of the Massachusetts Green High Performance Computing Center. My personal interests and academic background encompass a range of topics across science and engineering. In recent work, I used the GAMESS quantum chemistry package in order to study theoretical highly strained hydrocarbon structures, derived from prismane building blocks. I also recently authored a small software application in Python for generating amorphous cellulose within a periodic space. This application was used for generating structures for further study in NAMD and LAMMPS. Prior to doing research in Quantum and Materials Chemistry, I worked on problems related to protein folding and docking. It is very exciting, especially, to be involved with applications of GPU computing to these, as well as to other scientific questions. For several years, while not doing research, I was a consulting software engineer and built a variety of internet and desktop software applications. As an HPC Sales Specialist at Microway, I greatly look forward to advising Microway's clients in order to provide them with well-configured, optimal HPC solutions.
This entry was posted in Benchmarking, Hardware and tagged , , , , . Bookmark the permalink.

One Response to Deep Learning Benchmarks of NVIDIA Tesla P100 PCIe, Tesla K80, and Tesla M40 GPUs

  1. Pingback: Deep Learning Research Directions: Computational Efficiency - Tim Dettmers

Leave a Reply

Your email address will not be published. Required fields are marked *