Recently I’ve been creating Icinga (a fork of Nagios) health checks for various ZFS pools running on Solaris 11.2 storage servers. I found this great plugin, however it only alerted based on space remaining within the zpool, which while useful was not enough. I was not able to find a good check that would create an alert if any of the READ, WRITE or CKSUM values in ‘zpool status’ changed from zero to anything else, indicating a problem, so I wrote my own health check.
Please note that my bash scripting is not fantastic; there are probably a few ways the script can be improved upon, such as only running ‘zpool status’ once rather than three times, however it does work correctly. If you have any suggestions please let me know in the comments.
zpool status
In case you are not familiar with the output of the ‘zpool status’ command, I have provided a snippet below which will help in understanding how the script works.
root@solaris1:~# zpool status
  pool: zfspool1
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
        pool will no longer be accessible on older software versions.
  scan: scrub repaired 0 in 7h39m with 0 errors on Fri Mar  6 09:33:14 2015
config:

        NAME                       STATE     READ WRITE CKSUM
        zfspool1                   ONLINE       0     0     0
          raidz2-0                 ONLINE       0     0     0
            c0t5000CAA01C810754d0  ONLINE       0     0     0
            c0t5000CAA01CB02AE8d0  ONLINE       0     0     0
            c0t5000CAA01CB02CF0d0  ONLINE       0     0     0
            c0t5000CAA01CB02E5Cd0  ONLINE       0     0     0
            c0t5000CAA01CB0A81Cd0  ONLINE       0     0     0
            c0t5000CAA01CB0AC80d0  ONLINE       0     0     0
        logs
          c0t5000A7203004D2D2d0    ONLINE       0     0     0
        spares
          c0t5000CAA01BBD3D1Cd0    AVAIL
          c0t5000CAA01BBCC2F8d0    AVAIL
          c0t5000CAA01BBCC33Cd0    AVAIL
          c0t5000CAA01BB25608d0    AVAIL
Essentially if any of the READ, WRITE, or CKSUM values change from 0 then we want to know about it.
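As a quick illustration of where those values sit, the READ, WRITE and CKSUM counts are the 3rd, 4th and 5th whitespace-separated columns of each disk line, which is how the script later picks them out with ‘awk’. The sample line and variable names below are just for demonstration:

```shell
# One disk line from the 'zpool status' output above:
# columns are NAME, STATE, READ, WRITE, CKSUM
line="c0t5000CAA01C810754d0 ONLINE 0 0 0"
read_col=$(echo "$line" | awk '{print $3}')
write_col=$(echo "$line" | awk '{print $4}')
cksum_col=$(echo "$line" | awk '{print $5}')
echo "READ=$read_col WRITE=$write_col CKSUM=$cksum_col"
```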
Requirements
Your ZFS server will need to have NRPE installed and configured so that it can be health checked by the Nagios server. I have a guide on how to compile and configure NRPE in Solaris here.
Create the script
On the ZFS server create a file containing the below script. In this instance our zpool_status.sh script is created in the /usr/local/nagios/libexec/ directory as this is where our other checks are found.
/usr/local/nagios/libexec/zpool_status.sh
#!/bin/bash
# ------------------------------------------------------------------
# www.rootusers.com - zpool_status.sh
# Run 'zpool status' and alert if READ, WRITE or CKSUM non 0
# ------------------------------------------------------------------

# Contents of READ/WRITE/CKSUM from "zpool status" are stored in these variables.
# Replace DISKNAME with a common string from your disks listed in "zpool status".
# The final grep is anchored so that only an exact value of 0 is discarded,
# otherwise a count such as 10 would also be hidden.
read=$(zpool status | grep DISKNAME | grep -v AVAIL | awk '{print $3}' | grep -v '^0$')
write=$(zpool status | grep DISKNAME | grep -v AVAIL | awk '{print $4}' | grep -v '^0$')
cksum=$(zpool status | grep DISKNAME | grep -v AVAIL | awk '{print $5}' | grep -v '^0$')

# If all variables are empty, that is all values are 0, then all is well.
# [ -z STRING ] True if the length of "STRING" is zero.
if [ -z "$read" ] && [ -z "$write" ] && [ -z "$cksum" ]
then
    echo "OK"
    exit 0
# Else if any variable is not empty, that is there is a non 0 value, generate a critical alert.
# [ -n STRING ] or [ STRING ] True if the length of "STRING" is non-zero.
elif [ -n "$read" ] || [ -n "$write" ] || [ -n "$cksum" ]
then
    echo "CRITICAL - Check zpool status for READ/WRITE/CKSUM"
    exit 2
# Any other output is unknown.
else
    echo "UNKNOWN"
    exit 3
fi
To make the script executable run a ‘chmod +x’ against it.
chmod +x /usr/local/nagios/libexec/zpool_status.sh
Script explanation
When the script executes it runs ‘zpool status’ three times and stores the results in the $read, $write and $cksum variables. After ‘zpool status’, the first grep searches for the names of your disks (DISKNAME), as we only want to select these lines. This is one of the less than ideal parts of the script, as you need to manually edit it to contain a string common to all of the disks. For instance on my server I use “c0t”, as all of my disks in ‘zpool status’ start with c0t, but this may differ for you, so please change the “grep DISKNAME” as appropriate.
After this the results are piped through another grep to disregard any spare disks in the AVAIL state. Next we pipe the results into ‘awk’ and print the 3rd, 4th, and 5th columns; these are the READ, WRITE and CKSUM values, and we want to know if they are 0 or not. This is where the final pipe comes in: by grepping out zero values we only expect a result if READ, WRITE or CKSUM has changed from 0.
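The whole pipeline can be tried out against a captured slice of ‘zpool status’ output. The two sample lines below are illustrative only: one disk with a READ error count of 1, and one spare in the AVAIL state that should be filtered out.

```shell
# Sample slice of 'zpool status' output: one disk with a READ error, one spare
sample='c0t5000CAA01C810754d0  ONLINE  1  0  0
c0t5000CAA01BBD3D1Cd0  AVAIL'
# Same pipeline as the script: keep disk lines, drop spares,
# take the READ column, discard exact zeroes
read_errs=$(echo "$sample" | grep c0t | grep -v AVAIL | awk '{print $3}' | grep -v '^0$')
echo "Non-zero READ counts: $read_errs"
```

Because only non-zero counts survive the final grep, the variable is empty on a healthy pool and non-empty when there is a problem, which is exactly what the if statements test for.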
If statements are then used to trigger the exit status required for Nagios depending on the results of the ‘zpool status’ commands. An exit status of 0 means OK, 1 is warning, 2 is critical and 3 is unknown. In this instance we do not have any warning alerts, as we want to treat any detected issue as critical and investigate; however, you could modify the script to exit with a warning if, for example, READ/WRITE/CKSUM is under a specified number.
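A warning threshold could be bolted on along these lines. This is only a sketch: the WARN_LIMIT value and the hard-coded error count are illustrative assumptions, not part of the original script, and in practice the count would come from the zpool status pipeline.

```shell
# Hypothetical warning variant - WARN_LIMIT and read_errs are sample values,
# not taken from the original script
WARN_LIMIT=5
read_errs=3   # pretend this count came from the zpool status pipeline
if [ "$read_errs" -eq 0 ]; then
    result="OK"; exit_code=0
elif [ "$read_errs" -lt "$WARN_LIMIT" ]; then
    result="WARNING - READ errors below threshold"; exit_code=1
else
    result="CRITICAL - Check zpool status"; exit_code=2
fi
echo "$result (exit $exit_code)"
```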
The OK exit status of 0 is returned if $read, $write and $cksum are all empty, meaning all values were 0; && is used here because we only want to return OK if all 3 variables are 0.
The critical exit status of 2 is returned if the contents of $read or $write or $cksum are non zero; || is used here because we want to return a critical exit status if any of the variables is non 0.
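The [ -z ] and [ -n ] behaviour can be demonstrated with sample values. Here only $cksum holds a value, simulating a single non-zero CKSUM count, which is enough to trip the || branch:

```shell
# Simulated results: READ and WRITE were all 0 (empty), one CKSUM count was 2
read=""; write=""; cksum="2"
# [ -z ] is true for empty strings; && requires all three to be empty for OK
if [ -z "$read" ] && [ -z "$write" ] && [ -z "$cksum" ]; then
    state="OK"
# [ -n ] is true for non-empty strings; || fires if any one holds a value
elif [ -n "$read" ] || [ -n "$write" ] || [ -n "$cksum" ]; then
    state="CRITICAL"
fi
echo "$state"
```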
Edit the NRPE configuration
In order for NRPE to be able to use this script we need to edit the nrpe.cfg file. In my Solaris installation this is found at /usr/local/nagios/etc/nrpe.cfg; in Linux it may be found at /etc/nagios/nrpe.cfg.
Add the below line into the nrpe.cfg file; there will most likely be a section with commands specified already, so add it there to keep things tidy.
command[zpool_status]=/usr/local/nagios/libexec/zpool_status.sh
To apply the change we need to restart NRPE. In Solaris, run the below command.
svcadm restart svc:/network/nrpe/tcp:default
This can also be done in Linux with ‘/etc/init.d/nrpe restart’.
Configure the monitoring server
On the Icinga server I added the below to /etc/icinga/objects/hosts/zfs-servers.cfg
define service {
        use                     generic-service
        hostgroup_name          zfs-server
        service_description     zpool status
        check_command           zpool_status
}
The /etc/icinga/objects/commands.cfg file was updated with:
define command {
        command_name    zpool_status
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c zpool_status
}
Summary
This script can be configured as an NRPE based health check to report the ‘zpool status’ results of ZFS. If any of the READ, WRITE or CKSUM values in ‘zpool status’ change to anything other than 0, an alert will be triggered prompting further investigation.