Common Maintenance Tasks (Clusters)

Nate Conley · March 5, 2024

The following items should be completed to maintain the health of your Linux cluster. For servers and workstations, please see Common Maintenance Tasks (Workstations and Servers).

Back up irreplaceable data

Remember that RAID is not a replacement for backups. If your system is stolen, hacked or destroyed in a fire, your data will be gone forever. Automate this task or you will forget.

Compute clusters are built from a large group of computers, so there are many different places for data to hide. Make users aware of your backup policies and be certain they aren’t storing vital data on the compute nodes. Let them know which areas are scratch space (for temporary files) and which areas are regularly backed up and designed for user data.

Strongly consider keeping a backup image of the entire head node installation (including a copy of the compute node software image). Bare-metal recovery software is available if you’re not certain how to do this yourself.

As for the user data:

For many groups, a weekly or monthly cron job is fine. Write a script that calls rsync or tar to copy the files to a separate server, NAS or SAN, and place the script in /etc/cron.weekly/ or /etc/cron.monthly/.
Users with more complex requirements should look at AMANDA or Bacula.
Tape backup systems are still available for those who prefer them. Contact us.
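
As a rough sketch of the cron approach described above, the function below writes a dated tar archive of one directory to a backup location and then confirms that the archive can be read back. The function name backup_dir and the paths in the usage note are placeholders, not part of any standard tool; adapt them to your own layout.

```shell
#!/bin/bash
# Sketch only: backup_dir is a hypothetical helper, not a standard utility.
# Usage: backup_dir SOURCE_DIR DEST_DIR
# Writes DEST_DIR/<name>-<YYYY-MM-DD>.tar.gz and prints the archive path.
backup_dir() {
    local src="$1" dest="$2" archive
    archive="${dest}/$(basename "$src")-$(date +%Y-%m-%d).tar.gz"
    mkdir -p "$dest" || return 1
    # -C keeps paths inside the archive relative to the parent of SOURCE_DIR
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    # Listing the archive is a cheap check that it is readable
    tar -tzf "$archive" > /dev/null || return 1
    echo "$archive"
}
```

To use it from cron, end the script with a call such as backup_dir /home /mnt/backup (assuming /mnt/backup is where your separate server or NAS is mounted) and place the script in /etc/cron.weekly/.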

Verify the health of your storage

Drive sectors can go bad silently. Scheduling regular verifies will catch failing sectors early, before they lead to data loss. Automate them or you will forget.

Linux Software RAID (mdadm) arrays can easily be put into verify mode. Many distributions (Red Hat, CentOS, Ubuntu) come with their own utilities for this. To start a verify manually, run this line as root for each RAID array, replacing md# with the array's device name (for example, md0):
echo check > /sys/block/md#/md/sync_action
Monitor /proc/mdstat and the output of dmesg to follow the progress of each verify.
Hardware RAID controllers provide their own methods for automated verifies and alert notification. Reference the controller’s manual.
Enterprise and parallel storage systems typically provide their own management interfaces (separate from your cluster management software). Familiarize yourself with these interfaces and enable e-mail alerts.
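
For software RAID, the manual command above can be wrapped in a small loop so that every array gets checked. The sketch below is one way you might do it, not a replacement for your distribution's own utility: start_md_checks is a made-up name, and the optional argument exists only so the function can be pointed at a test directory instead of /sys.

```shell
#!/bin/bash
# Sketch only: start_md_checks is a hypothetical helper.
# Usage: start_md_checks [SYSFS_ROOT]   (SYSFS_ROOT defaults to /sys)
# Requests a "check" pass on every md array that is currently idle.
start_md_checks() {
    local root="${1:-/sys}" action dev state
    for action in "$root"/block/md*/md/sync_action; do
        [ -e "$action" ] || continue          # no md arrays matched the glob
        dev="${action#"$root"/block/}"        # strip the leading path...
        dev="${dev%%/*}"                      # ...and keep only "mdN"
        state="$(cat "$action")"
        if [ "$state" = "idle" ]; then
            echo check > "$action" && echo "$dev: check started"
        else
            # Leave arrays alone while they are resyncing, recovering, etc.
            echo "$dev: skipped (currently $state)"
        fi
    done
}
```

Run it as root, for example from a monthly cron job. Once a check has finished, the file /sys/block/md#/md/mismatch_cnt should report how many mismatched sectors were found.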

Monitor system alarms and system health

If Microway provided you with a preconfigured cluster, then we performed the software integration before the cluster arrived at your site. The cluster can monitor its own health (via MCMS™ or Bright Cluster Manager), but you should familiarize yourself with the user interface and double-check that e-mail alerts are being sent to the correct e-mail address.

Each system in the cluster also supports traditional monitoring and management features:

Preferred: learn how to use the IPMI capability for remote monitoring and management. You’ll spend a lot less time trekking to the datacenter.
Alternative: listen for system alarms and check for warning LEDs.
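
If the ipmitool utility is installed (an assumption; your vendor may supply a different IPMI client), its sensor listing can be filtered down to the readings that need attention. The sketch below assumes the usual three-column layout of ipmitool sdr list (name | reading | status), treats "ok" and "ns" (no reading) as healthy, and uses the made-up function name sdr_problems.

```shell
#!/bin/bash
# Sketch only: sdr_problems is a hypothetical filter, not part of ipmitool.
# Reads "name | reading | status" lines on stdin and prints those whose
# status is neither "ok" nor "ns". Returns 1 if any such line was found.
sdr_problems() {
    awk -F'|' '
        NF < 3 { next }                           # skip blank or malformed lines
        {
            status = $3
            gsub(/^[ \t]+|[ \t]+$/, "", status)   # trim surrounding whitespace
            if (status != "ok" && status != "ns") { print; bad = 1 }
        }
        END { exit (bad ? 1 : 0) }
    '
}
```

On a node itself this would be run as ipmitool sdr list | sdr_problems. For remote use, ipmitool accepts options such as -I lanplus -H <BMC address> -U <user> -f <password file> ahead of the sdr list command; the address, user and password file are site-specific values you would need to fill in.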

Don’t ignore alarms! If you put them off, you’ll soon find that something else has gone wrong and your cluster needs to be rebuilt from scratch.

Schedule and Test System Software Updates

Although modern Linux distributions have made it very easy to keep software packages up-to-date, there are some pitfalls an administrator might encounter when updating software on a compute cluster.

Cluster software packages are usually not managed from the same software repository as the standard Linux packages, so the updater may unknowingly break compatibility. In particular, upgrading or changing the Linux kernel on your cluster may require manual re-configuration, especially on systems with large/parallel storage, InfiniBand and/or GPU components. These systems usually require that kernel modules or other packages be recompiled against the new kernel. Test updates on a single system before making such changes on the entire cluster!
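
One check worth running on the test system after a kernel update is whether the add-on kernel modules the cluster depends on were actually rebuilt for the new kernel. The sketch below simply looks for module files under a kernel's module directory. The function name missing_modules and the module names in the usage note are illustrative assumptions, so substitute the modules your storage, InfiniBand and GPU software really uses.

```shell
#!/bin/bash
# Sketch only: missing_modules is a hypothetical helper.
# Usage: missing_modules MODULES_DIR MODULE...
# Prints "missing: NAME" for each module with no .ko file under MODULES_DIR
# and returns 1 if anything was missing.
# Caveats: drivers built into the kernel have no .ko file, and some module
# files use hyphens where the module name uses underscores.
missing_modules() {
    local dir="$1" status=0 m
    shift
    for m in "$@"; do
        # Matches plain and compressed modules (NAME.ko, NAME.ko.xz, ...)
        if [ -z "$(find "$dir" -name "${m}.ko*" -print 2>/dev/null | head -n 1)" ]; then
            echo "missing: $m"
            status=1
        fi
    done
    return "$status"
}
```

For example, missing_modules "/lib/modules/$(uname -r)" nvidia ib_core checks the running kernel for two commonly used module names. A non-zero return on the test system is a sign that something needs to be recompiled before the update goes out to the rest of the cluster.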

Please keep in mind that updating the software on your cluster may break existing functionality, so don’t update just for the sake of updating! Plan an update schedule and notify users in case there is downtime from unexpected snags.
