Monitoring Hard Drive and RAID Health

By default, you won’t find out that one of your hard drives has failed until the data is gone. Even if you are using a software or hardware RAID, the array only keeps protecting your data if you replace failed drives promptly. I have seen RAIDs run in degraded mode for months or years until additional drive failures ruined any chance of data recovery.

Drives and operating systems are designed to work around issues as best they can until absolute failure. However, that doesn’t mean that you can’t monitor the situation and receive an alert as soon as the first problem develops.

If you do not have a dedicated hardware RAID controller, there are two utilities to be configured and started: smartd and mdadm. The smartd daemon reads hard drive S.M.A.R.T. health data directly off the drives and sends alerts of any changes. Similarly, mdadm watches the health of your Linux software RAIDs for any problems.

If you are using a hardware RAID controller, then it manages some of these tasks. However, you must be sure to properly configure the automated alerts within the controller’s management interface – check the manual for full instructions. Additionally, you may be able to monitor hard drive health data if the controller supports it (3ware and ARECA cards are known to work) – see the smartd man page.

smartd

Here are the standard entries I use in /etc/smartd.conf:

/dev/sda -a -d ata -m eliot@example.com -H -l error -l selftest -M test -o on -S on -s (S/../../3/03|L/../15/./04)
/dev/sdb -a -d ata -m eliot@example.com -H -l error -l selftest -o on -S on -s (S/../../3/04|L/../15/./05)

These lines are fairly convoluted. In this example, they monitor drives /dev/sda and /dev/sdb by performing the following tasks:

  • E-mailing all alerts to eliot@example.com (-m)
  • Sending one test e-mail upon startup (-M test, on the sda entry only)
  • Watching for any critical failure warnings in the SMART data (-H)
  • Monitoring the drive error log and the results of hard drive self tests (-l error, -l selftest)
  • Enabling automatic offline testing of the drives (-o on)
  • Enabling attribute autosave on the drives (-S on)
  • Running a short self test on each drive once a week (Wednesdays at 3am for sda; 4am for sdb)
  • Running a long self test on each drive once a month (the 15th at 4am for sda; 5am for sdb)
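
The -s argument in the entries above is an extended regular expression that smartd compares against a string of the form T/MM/DD/d/HH (test type, month, day of month, day of week with 1 = Monday, and hour). The following is a minimal sketch for checking a schedule by hand before trusting it. The helper name schedule_matches is hypothetical, and the use of grep -x assumes the whole string has to match; smartd does its own matching internally.

```shell
#!/bin/sh
# Hypothetical helper: does a smartd -s schedule regex match a given
# T/MM/DD/d/HH string? (T = test type, d = day of week, 1 = Monday.)
schedule_matches() {
    # $1 = schedule regex, $2 = candidate string such as S/06/12/3/03
    printf '%s\n' "$2" | grep -Eqx -- "$1"
}

# Schedule copied from the /dev/sda entry.
sda_schedule='(S/../../3/03|L/../15/./04)'

# Short test: any Wednesday (d = 3) at 03:00.
schedule_matches "$sda_schedule" 'S/06/12/3/03' && echo "short test would run"
# Long test: the 15th of any month at 04:00, whatever the weekday.
schedule_matches "$sda_schedule" 'L/06/15/6/04' && echo "long test would run"
# A Thursday (d = 4) at 03:00 matches neither alternative.
schedule_matches "$sda_schedule" 'S/06/13/4/03' || echo "no test scheduled"
```

Staggering the hours between drives, as the two entries do, keeps both disks from running self tests at the same time.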

Once you have written the configuration file, you need to start the service:

/etc/init.d/smartd start
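
Once the daemon is running, it can be reassuring to spot-check a drive by hand with smartctl -H, which prints the same overall health assessment that smartd watches. The sketch below classifies that report. The helper name health_from_report is hypothetical, and the line it looks for is the wording smartctl prints for ATA drives; other device types may word the result differently, in which case the helper reports UNKNOWN.

```shell
#!/bin/sh
# Hypothetical helper: classify the text printed by `smartctl -H <device>`.
# Reads the report on stdin and prints PASSED, FAILED or UNKNOWN.
health_from_report() {
    awk '
        /overall-health self-assessment test result: PASSED/ { s = "PASSED" }
        /overall-health self-assessment test result: FAILED/ { s = "FAILED" }
        END { print (s == "" ? "UNKNOWN" : s) }
    '
}

# Typical use on a live system (needs root and smartmontools installed):
#   smartctl -H /dev/sda | health_from_report

# Demonstration on an illustrative report line:
printf 'SMART overall-health self-assessment test result: PASSED\n' | health_from_report
```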

To ensure the service starts at boot, you’ll need to add it to the boot sequence. The exact command depends upon your Linux distribution:

chkconfig --add smartd        (Red Hat, Fedora and SUSE)
rc-update add smartd default  (Gentoo)
update-rc.d smartd defaults   (Debian)

mdadm

To monitor Linux software RAIDs, you’ll need at least the following lines in /etc/mdadm.conf:

DEVICE /dev/sd[ab]1 /dev/sd[ab]5 /dev/sd[ab]6 /dev/sd[ab]7 /dev/sd[ab]8 /dev/sd[ab]10

ARRAY /dev/md1 devices=/dev/sda1,/dev/sdb1
ARRAY /dev/md5 devices=/dev/sda5,/dev/sdb5
ARRAY /dev/md6 devices=/dev/sda6,/dev/sdb6
ARRAY /dev/md7 devices=/dev/sda7,/dev/sdb7
ARRAY /dev/md8 devices=/dev/sda8,/dev/sdb8
ARRAY /dev/md10 devices=/dev/sda10,/dev/sdb10

MAILADDR eliot@example.com

Using this example, mdadm e-mails eliot@example.com as soon as it detects an event on any of the listed md devices, such as a failed drive or an array running in degraded mode.
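
Independently of the e-mail alerts, the kernel reports array state in /proc/mdstat, where a healthy two-drive mirror shows [UU] and a missing member appears as an underscore, e.g. [U_]. As a rough sketch of a manual check, the hypothetical helper degraded_arrays below reads mdstat-style text and prints the name of each array that has a missing member. The sample input is illustrative and not taken from a real system.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every md array whose status
# field (e.g. [U_]) shows a missing member. Reads mdstat text on stdin.
degraded_arrays() {
    awk '
        /^md[^ ]* :/ { name = $1 }          # remember the current array
        name != "" {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^\[[U_]+\]$/ && $i ~ /_/) {
                    print name
                    name = ""
                    break
                }
        }
    '
}

# Typical use on a live system:
#   degraded_arrays < /proc/mdstat

# Demonstration on illustrative input (md5 has lost one member):
degraded_arrays <<'EOF'
Personalities : [raid1]
md1 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md5 : active raid1 sda5[0]
      1048576 blocks [2/1] [U_]

unused devices: <none>
EOF
```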

Note that some newer versions of mdadm require that devices be identified by UUID (e.g. f4849d33:f8c1ce1c:ac28ac18:9d4741e7) rather than by raw device names. If this is the case, run mdadm --detail /dev/md1 for each RAID to look up its UUID, and give it on the ARRAY line as UUID=... in place of the devices=... list.
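
As a sketch of what a UUID-based entry looks like, the hypothetical helper array_line_from_detail below turns the UUID line from mdadm --detail output into an ARRAY line for /etc/mdadm.conf. It assumes the output contains a line of the form "UUID : <value>". On a live system, mdadm --detail --scan prints ready-made ARRAY lines for every running array and is usually the simpler route.

```shell
#!/bin/sh
# Hypothetical helper: build an ARRAY line from `mdadm --detail` output.
# $1 = md device name; stdin = output of `mdadm --detail <device>`.
array_line_from_detail() {
    awk -v dev="$1" '
        $1 == "UUID" && $2 == ":" { print "ARRAY " dev " UUID=" $3 }
    '
}

# Typical use on a live system (needs root):
#   mdadm --detail /dev/md1 | array_line_from_detail /dev/md1

# Demonstration with the example UUID from the text:
printf '           UUID : f4849d33:f8c1ce1c:ac28ac18:9d4741e7\n' \
    | array_line_from_detail /dev/md1
```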

Once you have written the configuration file, you need to start the service. Some distributions use the name mdadm and others use mdmonitor:

/etc/init.d/mdadm start

To ensure the service starts at boot, you’ll need to add it to the boot sequence. The exact command depends upon your Linux distribution:

chkconfig --add mdadm        (Red Hat, Fedora and SUSE)
rc-update add mdadm default  (Gentoo)
update-rc.d mdadm defaults   (Debian)

As always, be certain that you use your own e-mail address and the names of the actual hard drives and arrays in your system.

Eliot Eshelman

About Eliot Eshelman

My interests span from astrophysics to bacteriophages; high-performance computers to small spherical magnets. I've been an avid Linux geek (with a focus on HPC) for more than a decade. I work as Microway's Vice President of Strategic Accounts and HPC Initiatives.