GPU-accelerated HPC Containers with Singularity

Fighting with application installations is frustrating and time-consuming. It’s not what domain experts should be spending their time on. And yet, every time users move a project to a new system, they have to start over, re-assembling their complex workflows from scratch.

This is a problem that containers can help solve. HPC groups have had some success with more traditional containers (e.g., Docker), but security concerns have made them difficult to use on shared HPC systems. Singularity, a newer tool from the creator of Warewulf and a co-founder of CentOS, aims to resolve these issues.

Singularity helps you step away from the complex dependencies of your software applications. It lets you assemble a complex toolchain into a single unified tool that you can use just as simply as any built-in Linux command, and that can be moved from system to system without effort.

Surprising Simplicity

Of course, HPC tools are traditionally quite complex, so users seem to expect Singularity containers to also be complex. Just as virtualization is hard for novices to wrap their heads around, the operation of Singularity containers can be disorienting. For that reason, I encourage you to think of your Singularity containers as a single file; a single tool. It’s an executable that you can use just like any other program. It just happens to have all its dependencies built in.

This means it’s not doing anything tricky with your data files. It’s not doing anything tricky with the network. It’s just a program that you’ll be running like any other. Just like any other program, it can read data from any of your files; it can write data to any local directory you specify. It can download data from the network; it can accept connections from the network. InfiniBand, Omni-Path and/or MPI are fully supported. Once you’ve created it, you really don’t think of it as a container anymore.
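
In practice, running a tool from inside a container is a single command. As a minimal sketch (the image and script names here are placeholders, not files from our cluster):

# Run your own script using the toolchain packaged inside the container
[eliot@node2 ~]$ singularity exec /path/to/my_tools.img python ./my_analysis.py

# Or open an interactive shell inside the container's environment
[eliot@node2 ~]$ singularity shell /path/to/my_tools.img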

GPU-accelerated HPC Containers

When it comes to using the GPUs, Singularity sees the same GPU devices as the host system. It respects any device selections or restrictions put in place by the workload manager (e.g., SLURM). You can package your applications into GPU-accelerated HPC containers and take advantage of the flexibility Singularity provides. For example, run Ubuntu containers on an HPC cluster that uses CentOS Linux, or run binaries built for CentOS on your Ubuntu system.
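
As an illustration, pulling a stock Ubuntu image from Docker Hub and using it on a CentOS host can look like the following (the resulting image filename may differ between Singularity versions):

# Fetch an Ubuntu image from Docker Hub and convert it into a Singularity image
[eliot@node2 ~]$ singularity pull docker://ubuntu:16.04

# The container reports Ubuntu, even though the host runs CentOS
[eliot@node2 ~]$ singularity exec ubuntu-16.04.img cat /etc/os-release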

As part of this effort, we have contributed a Singularity image for TensorFlow back to the Singularity community. This image is available pre-built for all users on our GPU Test Drive cluster. It’s a fantastically easy way to compare the performance of CPU-only and GPU-accelerated versions of TensorFlow. All one needs to do is switch between executables:

Executing the pre-built TensorFlow for CPUs

[eliot@node2 ~]$ tensorflow_cpu ./hello_world.py
Hello, TensorFlow!
42

Executing the pre-built TensorFlow with GPU acceleration

[eliot@node2 ~]$ tensorflow_gpu ./hello_world.py
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: Tesla P100-SXM2-16GB
major: 6 minor: 0 memoryClockRate (GHz) 1.4805
pciBusID 0000:06:00.0
Total memory: 15.89GiB
Free memory: 15.61GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 1 with properties:
name: Tesla P100-SXM2-16GB
major: 6 minor: 0 memoryClockRate (GHz) 1.4805
pciBusID 0000:07:00.0
Total memory: 15.89GiB
Free memory: 15.61GiB

[...]

I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1 2 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 2:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 3:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:06:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla P100-SXM2-16GB, pci bus id: 0000:07:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla P100-SXM2-16GB, pci bus id: 0000:84:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla P100-SXM2-16GB, pci bus id: 0000:85:00.0)
Hello, TensorFlow!
42

As shown above, the tensorflow_cpu and tensorflow_gpu executables include everything that’s needed for TensorFlow. You can just think of them as ready-to-run applications that have all their dependencies built in. All you need to know is where the Singularity container image is stored on the filesystem.
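
Under the hood, a convenience command like tensorflow_cpu could be as simple as a thin wrapper script around the container. A hypothetical sketch (the image path is illustrative, not the actual location on our cluster):

#!/bin/bash
# Hypothetical wrapper: run the caller's script with the Python
# interpreter and TensorFlow packaged inside the container image.
exec singularity exec /path/to/tensorflow_cpu.img python "$@"

The GPU-accelerated wrapper would add the flag described in the next section.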

Caveats of GPU-accelerated HPC containers with Singularity

In earlier versions of Singularity, the nature of NVIDIA GPU drivers required a couple of extra steps when configuring GPU-accelerated containers. Although GPU support is still listed as experimental, Singularity now offers a --nv flag which passes the appropriate driver/library files through to the container. In most cases, no additional steps are needed to access NVIDIA GPUs from a Singularity container. Give it a try!
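
For example, the same hello_world.py shown earlier could be launched directly against a TensorFlow container image with the --nv flag (the image name here is a placeholder):

[eliot@node2 ~]$ singularity exec --nv tensorflow.img python ./hello_world.py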

Taking the next step on GPU-accelerated HPC containers

There are still many use cases left to be discovered. Singularity containers open up a lot of exciting capabilities. As an example, we are leveraging Singularity on our OpenPOWER systems (which provide full NVLink connectivity between CPUs and GPUs). All the benefits of Singularity are just as relevant on these platforms. Singularity images cannot be transferred directly between x86 and POWER8 CPUs, but the same style of Singularity recipes may be used. Users can run a pre-built TensorFlow image on x86 nodes and a complementary image on POWER8 nodes, without having to keep all the internals and dependencies in mind as they build their workflows.

Generating reproducible results is another anticipated benefit of Singularity. Groups can publish complete and ready-to-run containers alongside their results. Singularity’s flexibility will allow those containers to continue operating flawlessly for years to come – even if they move to newer hardware or different operating system versions.

If you’d like to see Singularity in action for yourself, request an account on our GPU Test Drive cluster. For those looking to deploy systems and clusters leveraging Singularity, we provide fully-integrated HPC clusters with Singularity ready-to-run. We can also assist by building optimized libraries, applications, and containers. Contact an HPC expert.

This post was updated 2017-06-02 to reflect recent changes in GPU support.


About Eliot Eshelman

My interests span from astrophysics to bacteriophages; high-performance computers to small spherical magnets. I've been an avid Linux geek (with a focus on HPC) for more than a decade. I work as Microway's Vice President of Strategic Accounts and HPC Initiatives.