Nagios check ZFS zpool status for READ WRITE CKSUM errors

Recently I’ve been creating Icinga (a fork of Nagios) health checks for various ZFS pools running on Solaris 11.2 storage servers. I found this great plugin, but it only alerted based on the space remaining within the zpool, which while useful was not enough. I was not able to find a good check that would alert if any of the READ, WRITE or CKSUM values in ‘zpool status’ changed from zero to anything else, indicating a problem, so I wrote my own health check.

Please note that my bash scripting is not fantastic. The script does work correctly, but there are probably a few ways it can be improved, such as only running ‘zpool status’ once rather than three times. If you have any suggestions please let me know in the comments.

zpool status

In case you are not familiar with the output of the ‘zpool status’ command I have provided a snippet below which will help in understanding how the script works.

root@solaris1:~# zpool status
  pool: zfspool1
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
        pool will no longer be accessible on older software versions.
  scan: scrub repaired 0 in 7h39m with 0 errors on Fri Mar  6 09:33:14 2015
config:

        NAME                       STATE     READ WRITE CKSUM
        zfspool1                 ONLINE       0     0     0
          raidz2-0                 ONLINE       0     0     0
            c0t5000CAA01C810754d0  ONLINE       0     0     0
            c0t5000CAA01CB02AE8d0  ONLINE       0     0     0
            c0t5000CAA01CB02CF0d0  ONLINE       0     0     0
            c0t5000CAA01CB02E5Cd0  ONLINE       0     0     0
            c0t5000CAA01CB0A81Cd0  ONLINE       0     0     0
            c0t5000CAA01CB0AC80d0  ONLINE       0     0     0
        logs
          c0t5000A7203004D2D2d0    ONLINE       0     0     0
        spares
          c0t5000CAA01BBD3D1Cd0    AVAIL   
          c0t5000CAA01BBCC2F8d0    AVAIL   
          c0t5000CAA01BBCC33Cd0    AVAIL   
          c0t5000CAA01BB25608d0    AVAIL   

Essentially if any of the READ, WRITE, or CKSUM values change from 0 then we want to know about it.

Requirements

Your ZFS server will need to have NRPE installed and configured so that it can be health checked by the Nagios server. I have a guide on how to compile and configure NRPE in Solaris here.

Create the script

On the ZFS server create a file containing the below script. In this instance our zpool_status.sh script is created in the /usr/local/nagios/libexec/ directory as this is where our other checks are found.

/usr/local/nagios/libexec/zpool_status.sh

#!/bin/bash
# ------------------------------------------------------------------
# www.rootusers.com - zpool_status.sh
# Run 'zpool status' and alert if READ, WRITE or CKSUM non 0
# ------------------------------------------------------------------

# Contents of READ/WRITE/CKSUM from "zpool status" are stored in these variables.
# Replace DISKNAME with a common string from your disks listed in "zpool status"
# The final grep drops values that are exactly 0. The pattern is anchored so
# that counts containing a zero, such as 10 or 105, are still reported.
read=$(zpool status | grep DISKNAME | grep -v AVAIL | awk '{print $3}' | grep -v '^0$' )
write=$(zpool status | grep DISKNAME | grep -v AVAIL | awk '{print $4}' | grep -v '^0$' )
cksum=$(zpool status | grep DISKNAME | grep -v AVAIL | awk '{print $5}' | grep -v '^0$' )

# If all variables are empty, that is all values are 0, then all is well.
# [ -z STRING ]	True if the length of "STRING" is zero.
if [ -z "$read" ] && [ -z "$write" ] && [ -z "$cksum" ]
  then
    echo "OK"
    exit 0

# Else if any variable is not empty, that is there is a non 0 value, generate a critical alert.
# [ -n STRING ] or [ STRING ]	True if the length of "STRING" is non-zero.
elif [ -n "$read" ] || [ -n "$write" ] || [ -n "$cksum" ]
  then
    echo "CRITICAL - Check zpool status for READ/WRITE/CKSUM"
    exit 2

# Any other output is unknown.
else
  echo "UNKNOWN"
  exit 3
fi

To make the script executable run a ‘chmod +x’ against it.

chmod +x /usr/local/nagios/libexec/zpool_status.sh

Script explanation

When the script executes it runs ‘zpool status’ three times and stores the READ, WRITE and CKSUM columns in the $read, $write and $cksum variables respectively. After ‘zpool status’ the first grep searches for the names of your disks (DISKNAME), as we only want to select these lines. This is one of the less than ideal parts of the script, as you need to manually edit it to contain a string common to all of the disks. For instance on my server I use “c0t”, as all of my disks in ‘zpool status’ start with c0t, but this may differ for you, so please change the “grep DISKNAME” as appropriate.

After this the results are piped to a second grep to disregard any spare disks in the AVAIL status. Next we pipe the results into ‘awk’ and print the 3rd, 4th or 5th column, which hold the READ, WRITE and CKSUM values respectively. This is where the final pipe comes in: it filters out the 0 values, so we only expect any output if READ, WRITE or CKSUM has changed from 0.
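One detail of that final filter is worth checking. An unanchored pattern such as ‘grep -v 0’ discards every line that contains the digit 0 anywhere, so an error count of 10 would be thrown away along with the healthy zeros. The short demonstration below uses made-up counter values (they are not from a real pool) to compare it with an anchored pattern.

```shell
#!/bin/bash
# Made-up counter values: one healthy column value and two error counts.
values=$(printf '0\n10\n3\n')

# Unanchored: removes every line containing a 0, including the count of 10.
unanchored=$(printf '%s\n' "$values" | grep -v 0)

# Anchored: removes only lines that are exactly "0".
anchored=$(printf '%s\n' "$values" | grep -v '^0$')

echo "unanchored pattern keeps:" $unanchored   # prints: unanchored pattern keeps: 3
echo "anchored pattern keeps:" $anchored       # prints: anchored pattern keeps: 10 3
```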

If statements are then used to set the exit status required for Nagios depending on the results of the ‘zpool status’ commands. An exit status of 0 means OK, 1 is warning, 2 is critical and 3 is unknown. In this instance we do not have any warning alerts, as we want to treat any detected issue as critical and investigate. However, you could modify the script to exit with a warning if, for example, READ/WRITE/CKSUM is under a specified number.
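As a rough illustration of that warning idea, the sketch below maps a single counter value to a Nagios status. The function name classify_count and the warn_limit argument are hypothetical and not part of the script above. It also assumes the counter is a plain integer; ‘zpool status’ may abbreviate large counts (for example 1.2K), and the sketch simply reports those as unknown rather than trying to parse them.

```shell
#!/bin/bash
# classify_count COUNT WARN_LIMIT  (hypothetical helper, for illustration only)
# Prints a Nagios status word and returns the matching exit status:
#   0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN
classify_count() {
  local count=$1 warn_limit=$2

  # Anything that is not a plain non-negative integer is treated as unknown.
  case $count in
    ''|*[!0-9]*) echo "UNKNOWN"; return 3 ;;
  esac

  if [ "$count" -eq 0 ]; then
    echo "OK"; return 0
  elif [ "$count" -lt "$warn_limit" ]; then
    echo "WARNING"; return 1
  else
    echo "CRITICAL"; return 2
  fi
}
```

For example, ‘classify_count 3 10’ would print WARNING and return 1, while ‘classify_count 25 10’ would print CRITICAL and return 2.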

OK (0) is returned if $read, $write and $cksum are all empty, meaning no non-zero values were found. && is used here because we only want to return OK if all 3 variables are empty.

Critical (2) is returned if any of $read, $write or $cksum is not empty, meaning a non-zero value was found. || is used here because we want to return a critical exit status if any one of the variables contains a non-zero value.
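One possible way to run ‘zpool status’ only once, and to avoid hard-coding a disk name, is to let awk read the whole device table in a single pass. The sketch below is an untested-on-Solaris alternative rather than the script used above; the function names find_zpool_errors and zpool_check are hypothetical. It assumes the device table starts at the NAME/STATE/READ/WRITE/CKSUM header and ends at the next blank line, as in the sample output earlier.

```shell
#!/bin/bash
# find_zpool_errors (hypothetical helper): reads 'zpool status' text on stdin
# and prints each device row whose READ, WRITE or CKSUM column is not 0.
find_zpool_errors() {
  awk '
    # The header row marks the start of a device table.
    $1 == "NAME" && $3 == "READ" { in_table = 1; next }
    # A blank line ends the table.
    NF == 0 { in_table = 0 }
    # Device rows have at least five fields; spares (AVAIL) have only two.
    in_table && NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0") {
      print $1, $3, $4, $5
    }
  '
}

# zpool_check (hypothetical helper): turns the result into a Nagios status.
zpool_check() {
  local errors
  errors=$(find_zpool_errors)
  if [ -z "$errors" ]; then
    echo "OK"
    return 0
  fi
  echo "CRITICAL - Check zpool status for READ/WRITE/CKSUM"
  echo "$errors"   # affected devices, shown as additional plugin output
  return 2
}

# Intended use inside the plugin (left commented out in this sketch):
# zpool status | zpool_check; exit $?
```

Because the filter compares each column against the exact string 0, pool and vdev rows such as zfspool1 and raidz2-0 are checked as well as the individual disks.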

Edit the NRPE configuration

In order for NRPE to be able to use this script we need to edit the nrpe.cfg file. In my Solaris installation this is found at /usr/local/nagios/etc/nrpe.cfg, however in Linux it may be found at /etc/nagios/nrpe.cfg.

Add the below line into the nrpe.cfg file. There will most likely be a section with commands specified already, so add it there to keep things tidy.

command[zpool_status]=/usr/local/nagios/libexec/zpool_status.sh

To apply the change we need to restart NRPE. In Solaris run the below command.

svcadm restart svc:/network/nrpe/tcp:default

This can also be done in Linux with ‘/etc/init.d/nrpe restart’.

Configure the monitoring server

On the Icinga server I added the below to /etc/icinga/objects/hosts/zfs-servers.cfg

define service {
  use generic-service
  hostgroup_name zfs-server
  service_description zpool status
  check_command zpool_status
}

The /etc/icinga/objects/commands.cfg file was updated with:

define command {
  command_name zpool_status 
  command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c zpool_status
}

Summary

This script can be configured as an NRPE based health check to report the ‘zpool status’ results of ZFS. If the READ, WRITE or CKSUM values of ‘zpool status’ change to anything that is not 0 then an alert will be triggered, prompting further investigation.
