Managing an HPC server can be a tricky job, and managing multiple servers is even more complex. Adding GPUs brings more compute power, but also more hardware to track at a finer level of detail. Luckily, there is a powerful and effective tool for managing GPUs across multiple servers or a whole cluster: NVIDIA Data Center GPU Manager (DCGM).
Executing hardware or health checks
DCGM’s power comes from its ability to access all kinds of low-level data from the GPUs in your system. Much of this data is reported by NVML (NVIDIA Management Library), and some of it may also be accessible via IPMI on your system. DCGM makes that data far easier to access and act on. It lets you do the following:
Report which GPUs are installed, in which slots and PCI-E trees, and group them
Build a group of GPUs once you know which slots they are installed in and which PCI-E trees and NUMA nodes they sit on. This is great for binding jobs to specific GPUs and matching them to the capabilities available.
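As a rough illustration of the grouping step, the sketch below builds `dcgmi group` commands that create one group per NUMA node. It assumes the `dcgmi` CLI is installed and the DCGM host engine is running. The GPU-to-NUMA mapping and the `numaN` group names are hypothetical inputs; in practice you would fill the mapping in from your own discovery output (for example `dcgmi discovery -l`).

```python
import subprocess
from collections import defaultdict


def group_by_numa(gpu_to_numa):
    """Map each NUMA node to the sorted list of GPU IDs attached to it."""
    groups = defaultdict(list)
    for gpu_id, numa_node in gpu_to_numa.items():
        groups[numa_node].append(gpu_id)
    return {node: sorted(ids) for node, ids in sorted(groups.items())}


def create_group_cmd(name, gpu_ids):
    """Build the argv for `dcgmi group -c NAME -a ID,ID,...`."""
    return ["dcgmi", "group", "-c", name, "-a", ",".join(map(str, gpu_ids))]


def create_numa_groups(gpu_to_numa, dry_run=True):
    """Create one DCGM group per NUMA node.

    With dry_run=True (the default) nothing is executed and the
    commands are only returned, so they can be reviewed first.
    """
    cmds = [
        create_group_cmd(f"numa{node}", ids)  # "numaN" is a made-up naming scheme
        for node, ids in group_by_numa(gpu_to_numa).items()
    ]
    if not dry_run:
        for cmd in cmds:
            # Requires dcgmi on PATH and a running host engine.
            subprocess.run(cmd, check=True)
    return cmds
```

Calling `create_numa_groups({0: 0, 1: 0, 2: 1, 3: 1})` returns the two commands without running them; pass `dry_run=False` once the mapping has been checked against the real topology.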
Determine GPU link states and bandwidths
Provide a report of the PCI-Express link speed each GPU is running at. You can also run device-to-device (D2D) and host-to-device (H2D) bandwidth tests inside your system, then act on the results.
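To show what acting on a link report might look like, here is a minimal sketch that reads the PCI-E link generation and width per GPU and flags links running narrower than the card supports. It uses `nvidia-smi` (which is backed by NVML) rather than DCGM itself, and it assumes every queried field comes back as a plain integer; the function names are illustrative.

```python
import subprocess

QUERY_FIELDS = [
    "index",
    "pcie.link.gen.current",
    "pcie.link.gen.max",
    "pcie.link.width.current",
    "pcie.link.width.max",
]


def query_cmd():
    """Build the nvidia-smi argv that prints one CSV row per GPU."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]


def parse_links(csv_text):
    """Parse rows such as '0, 4, 4, 16, 16' into dictionaries.

    Raises ValueError if a field is not numeric (for example when
    the driver reports a field as unavailable).
    """
    links = []
    for line in csv_text.strip().splitlines():
        idx, gen, gen_max, width, width_max = (int(f.strip()) for f in line.split(","))
        links.append({"gpu": idx, "gen": gen, "gen_max": gen_max,
                      "width": width, "width_max": width_max})
    return links


def narrow_links(links):
    """Return the GPU indices whose link width is below the supported maximum."""
    return [link["gpu"] for link in links if link["width"] < link["width_max"]]


def read_links():
    """Run the query on this host (needs an NVIDIA driver installed)."""
    out = subprocess.run(query_cmd(), capture_output=True, text=True, check=True).stdout
    return parse_links(out)
```

The check compares widths rather than generations on purpose: an idle GPU may drop to a lower link generation to save power, so a reduced generation alone does not necessarily indicate a fault.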
Read temps, boost states, power consumption, or utilization
Deliver data on the temperature, boost state, energy usage, and utilization of your GPUs. This data can feed the tools you use to control the cluster, such as a job scheduler or a monitoring system.
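One way to collect these readings is `dcgmi dmon`, which samples DCGM fields by numeric ID. The sketch below requests temperature (150), power usage (155), and GPU utilization (203) and parses the rows. The row layout it expects (`GPU <id> <value> ...`) is an assumption about the text output and may differ between DCGM releases, so check it against your own `dcgmi dmon` output before relying on it.

```python
import subprocess

# DCGM field IDs mapped to the (illustrative) names used in the parsed rows.
FIELDS = {150: "temp_c", 155: "power_w", 203: "util_pct"}


def dmon_cmd(samples=1):
    """Build the argv for `dcgmi dmon -e 150,155,203 -c SAMPLES`."""
    ids = ",".join(str(field_id) for field_id in FIELDS)
    return ["dcgmi", "dmon", "-e", ids, "-c", str(samples)]


def _to_float(token):
    """Convert a reading to float, or None when it is not numeric (e.g. 'N/A')."""
    try:
        return float(token)
    except ValueError:
        return None


def parse_dmon(text):
    """Extract per-GPU readings from dmon text output.

    Assumes data rows look like 'GPU 0  45  61.2  87'; header and
    other lines are skipped because they do not match that shape.
    """
    names = list(FIELDS.values())
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 + len(names) and parts[0] == "GPU" and parts[1].isdigit():
            row = {"gpu": int(parts[1])}
            row.update(zip(names, (_to_float(p) for p in parts[2:])))
            rows.append(row)
    return rows


def sample_gpus():
    """Take one sample on this host (needs dcgmi and a running host engine)."""
    out = subprocess.run(dmon_cmd(), capture_output=True, text=True, check=True).stdout
    return parse_dmon(out)
```

The parsed rows are plain dictionaries, so they can be pushed to whatever scheduler hook or monitoring pipeline the cluster already uses.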
Report driver and CUDA versions
Report on the versions of CUDA, NVML, and the NVIDIA GPU driver installed on your system.
Run sample jobs and integrated validation
Run basic diagnostics and sample jobs that are built into the DCGM package.
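The built-in diagnostics are run with `dcgmi diag -r <level>`, where higher levels run longer and more thorough tests. The sketch below builds that command and pulls pass/fail results out of the text report. The two-column `| test | result |` row layout is an assumption about the report format, and the set of status words is illustrative, so adjust both to what your DCGM version prints.

```python
import subprocess

STATUSES = {"Pass", "Fail", "Warn", "Skip"}


def diag_cmd(level=1, group_id=None):
    """Build the argv for `dcgmi diag -r LEVEL [-g GROUP]`."""
    if level not in (1, 2, 3, 4):
        raise ValueError("diag level must be between 1 and 4")
    cmd = ["dcgmi", "diag", "-r", str(level)]
    if group_id is not None:
        cmd += ["-g", str(group_id)]
    return cmd


def parse_diag(text):
    """Map each test name to its status word from rows like '| Name | Pass |'."""
    results = {}
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 2 or not cells[1]:
            continue  # borders, section banners, and blank rows
        status = cells[1].split()[0]
        if status in STATUSES:
            results[cells[0]] = status
    return results


def failed_tests(results):
    """Return the names of tests whose status is 'Fail'."""
    return [name for name, status in results.items() if status == "Fail"]


def run_diag(level=1, group_id=None):
    """Run the diagnostic on this host (needs dcgmi and a running host engine)."""
    proc = subprocess.run(diag_cmd(level, group_id), capture_output=True, text=True)
    return parse_diag(proc.stdout)
```

`run_diag` deliberately does not raise on a non-zero exit status, since a failing diagnostic is exactly the case where the report needs to be read.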
Set policies
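DCGM provides a mechanism to set policies on a group of GPUs, so you can be notified, or have an action taken, when a GPU in the group violates a condition you define.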
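As a hedged sketch of what setting a policy can look like from the command line, the helper below builds a `dcgmi policy` invocation for a temperature or power limit on a group. The flags (`--set <action>,<validation>`, `-T` for maximum temperature, `-P` for maximum power, `--get`) follow the DCGM documentation as best I recall it; confirm them with `dcgmi policy --help` on your installed version. The group ID and thresholds are hypothetical values.

```python
def policy_set_cmd(group_id, max_temp_c=None, max_power_w=None, action=0, validation=0):
    """Build the argv that sets a violation policy on a DCGM group.

    action / validation of 0 mean "take no action, run no validation",
    i.e. the policy only reports violations.
    """
    if max_temp_c is None and max_power_w is None:
        raise ValueError("specify at least one condition (max_temp_c or max_power_w)")
    cmd = ["dcgmi", "policy", "-g", str(group_id), "--set", f"{action},{validation}"]
    if max_temp_c is not None:
        cmd += ["-T", str(max_temp_c)]   # assumed flag: maximum temperature
    if max_power_w is not None:
        cmd += ["-P", str(max_power_w)]  # assumed flag: maximum power draw
    return cmd


def policy_get_cmd(group_id):
    """Build the argv that prints the policy currently set on a group."""
    return ["dcgmi", "policy", "-g", str(group_id), "--get"]
```

For example, `policy_set_cmd(2, max_temp_c=85)` yields a command that flags any GPU in group 2 exceeding 85 °C; the list can be passed to `subprocess.run` once the flags have been verified.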