Managing an HPC server can be a tricky job, and managing multiple servers is more complex still. Adding GPUs brings even more power, but also new levels of granularity to manage. Luckily, there's a powerful and effective tool for managing multiple servers or a cluster of GPUs: NVIDIA Data Center GPU Manager (DCGM).
Executing hardware or health checks
DCGM's power comes from its ability to access all kinds of low-level data from the GPUs in your system. Much of this data is reported by NVML (the NVIDIA Management Library), and some of it may be accessible via IPMI on your system. DCGM makes it far easier to access and use the following:
Report which GPUs are installed, in which slots and on which PCI-E trees, and create groups
Build a group of GPUs once you know which slots your GPUs occupy and which PCI-E trees and NUMA nodes they sit on. This is valuable for binding jobs to the right resources and for cataloging available capabilities.
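A minimal sketch of discovery plus group creation (the group name, group ID, and GPU IDs below are example values; `dcgmi group -c` prints the ID of the newly created group, which you then reference):

```shell
# List all GPUs DCGM can see, including PCI bus IDs
dcgmi discovery -l

# Create a named group (prints the new group's ID), then add GPUs 0 and 1
# ("jobgpus", group ID 2, and the GPU IDs are example values)
dcgmi group -c jobgpus
dcgmi group -g 2 -a 0,1

# Confirm the group and its members
dcgmi group -l
```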
Determine GPU link states and bandwidths
Report the PCI-Express link speed each GPU is running at. You can also run device-to-device (D2D) and host-to-device (H2D) bandwidth tests inside your system and take action on the results.
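For example, the topology and link report can be pulled as follows (this assumes a GPU group with ID 1 has already been created):

```shell
# DCGM topology report for group 1: PCIe/NVLink paths and CPU affinity
dcgmi topo -g 1

# An alternative view of the same topology, as a matrix
nvidia-smi topo -m
```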
Read temps, boost states, power consumption, or utilization
Deliver data on the energy usage and utilization of your GPUs. This data can be used to steer the cluster.
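A sketch of live monitoring with `dcgmi dmon` (the field IDs below are assumptions drawn from DCGM's field list — 150 = GPU temperature, 155 = power usage, 203 = GPU utilization — verify them with `dcgmi dmon -l` on your system):

```shell
# Stream temperature, power draw, and utilization once per second, 10 samples
# (field IDs assumed; confirm with `dcgmi dmon -l`)
dcgmi dmon -e 150,155,203 -d 1000 -c 10
```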
Driver versions and CUDA versions
Report on the versions of CUDA, NVML, and the NVIDIA GPU driver installed on your system.
Run sample jobs and integrated validation
Run basic diagnostics and sample jobs that are built into the DCGM package.
DCGM also provides a mechanism for applying policies to a group of GPUs.
Policy driven management: elevating from “what’s happening” to “what can I do”
Simply accessing data about your GPUs is only of modest use. The power of DCGM is in how it arms you to act upon that data. DCGM allows administrators to take programmatic or preventative action when something isn't right.
Here are a few scenarios where data provided by DCGM enables both powerful control of your hardware and decisive action:
Scenario 1: Healthchecks – periodic or before the job
Run a check before each job, after a job, or daily/hourly to ensure a cluster is performing optimally.
This allows you to preemptively stop a run if diagnostics fail, or to move GPUs/nodes out of the scheduling queue before the next job.
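A hypothetical job-prologue sketch (the group ID and the drain command are assumptions for your environment, and we assume `dcgmi` exits nonzero when the health check reports a failure):

```shell
#!/bin/bash
# Enable all background health watches on GPU group 1, then check them.
# If the check fails, drain the node so the scheduler skips it.
dcgmi health -g 1 -s a
if ! dcgmi health -g 1 -c; then
    echo "GPU health check failed; draining node" >&2
    # e.g. scontrol update nodename=$(hostname) state=drain reason=gpu-health
    exit 1
fi
```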
Scenario 2: Resource Allocation
Jobs often need a certain class of node (ex: with >4 GPUs or with IB & GPUs on the same PCI-E tree). DCGM can be used to report on the capabilities of a node and help identify appropriate resources.
Users and schedulers can subsequently send jobs only to nodes where they are capable of being executed.
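A sketch of a node-capability probe that a scheduler hook might run (the InfiniBand sysfs path and the output format are assumptions):

```shell
# Count GPUs on this node and check for an InfiniBand HCA,
# then emit a tag the scheduler can match jobs against
NGPUS=$(nvidia-smi -L | wc -l)
HAS_IB=$([ -d /sys/class/infiniband ] && echo yes || echo no)
echo "gpus=${NGPUS} ib=${HAS_IB}"
```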
Scenario 3: “Personalities”
Some codes request specific CUDA or NVIDIA driver versions. DCGM can be used to probe the CUDA version/NVIDIA GPU driver version on a compute node.
Users can then script the deployment of alternate versions or launch containerized apps to support non-standard versions.
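A sketch of the version probe (the container image tag is a hypothetical example):

```shell
# Query the installed NVIDIA driver version on this node
DRIVER=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
echo "NVIDIA driver: ${DRIVER}"

# Based on the result, a script might launch the app in a container
# matched to the requested CUDA version, e.g.:
# docker run --gpus all nvcr.io/nvidia/cuda:12.4.0-base-ubuntu22.04 ...
```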
Scenario 4: Stress tests
Periodically stress test the GPUs in a cluster with DCGM's integrated functions.
Stress tests like Microway GPU Checker can tease out failing GPUs, and reading data via DCGM during or after a test can identify bad nodes to be sidelined.
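DCGM's most thorough built-in diagnostic doubles as a stress test (group ID 1 is assumed here; the level-3 run can take several minutes per GPU):

```shell
# Run the longest built-in diagnostic against GPU group 1
dcgmi diag -g 1 -r 3
```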
Scenario 5: Power Management
Programmatically set GPU Boost or maximum TDP levels for an application or run. This allows you to eke out extra performance.
Alternatively, set your GPUs to stay within a certain power band to reduce electricity costs when rates are high or lower total cluster consumption when there is insufficient generation capacity.
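A sketch of capping power via DCGM (group ID 1 and the 200 W value are examples; valid limits depend on the GPU model):

```shell
# Cap the power limit for all GPUs in group 1 at 200 watts
dcgmi config -g 1 --set -P 200

# Verify the enforced configuration
dcgmi config -g 1 --get
```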
Scenario 6: Logging for Validation
Script the collection of error logs and take action based on that data.
You can accumulate error logs over time and identify trends in your cluster. For example, a group of GPUs with consistently high temperatures may indicate a hotspot in your datacenter.
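A hypothetical cron-style snippet for building that history (the log path and group ID are examples):

```shell
# Append a timestamped health report for GPU group 1 to a running log
# for later trend analysis
echo "=== $(date -u +%FT%TZ) ===" >> /var/log/dcgm-health.log
dcgmi health -g 1 -c >> /var/log/dcgm-health.log 2>&1
```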
Getting Started with DCGM: Starting a Health Check
DCGM can be used in many ways. We won’t explore them all here, but it’s important to understand the ease of use of these capabilities.
Here’s the code for a simple health check and also for a basic diagnostic:
dcgmi health --check -g 1
dcgmi diag -g 1 -r 1
The syntax is very standard: dcgmi, the subcommand, and the group of GPUs to operate on (you must create a group first). For the diagnostic, you also specify the level of diagnostics requested (-r 1, the lowest level, here).
DCGM and Cluster Management
While the Microway team loves advanced scripting, you may prefer integrating DCGM or its capabilities with your existing schedulers or cluster managers. The following are supported or already leverage DCGM today:
What will you do with DCGM or DCGM-enabled tools? We’ve only scratched the surface. There are extensive resources on how to use DCGM and/or how it is integrated with other tools. We recommend this blog post and the GTC session embedded below: