Category Archives: Administration

NVIDIA Datacenter Manager (DCGM) for More Effective GPU Management

Managing an HPC server can be a tricky job, and managing multiple servers is even more complex. Adding GPUs brings even more power, but also new levels of granularity. Luckily, there’s a powerful and effective tool available for managing multiple servers or … Continue reading

nvidia-smi: Control Your GPUs

This post was last updated on 2018-11-05. Most users know how to check the status of their CPUs, see how much system memory is free, or find out how much disk space is free. In contrast, keeping tabs on the … Continue reading

Monitoring Hard Drive and RAID Health

By default, you won’t find out that one of your hard drives has failed until the data is gone. Even if you are using a software or hardware RAID, it will only continue to function if you replace failed drives. … Continue reading

Managing a Linux Software RAID with MDADM

There are several advantages to assembling hard drives into a RAID: performance, redundancy and capacity. Microway workstations and servers are most commonly outfitted with software RAID to prevent a single drive failure from destroying your operating system installation. In most … Continue reading

Take Care When Updating Your Cluster

Although modern Linux distributions have made it very easy to keep your software packages up-to-date, there are some pitfalls you might encounter when managing your compute cluster. Cluster software packages are usually not managed from the same software repository as … Continue reading