
Taming Nagios

David Josephsen

In the past few years, Nagios has become the industry-standard open source systems monitoring tool. If you're using an open source app to monitor the availability, state, or utilization of your servers or network gear, chances are you're using Nagios to do it. To those who have worked with it, this is no surprise. The lightweight design of Nagios offloads the actual query logic into "plug-ins", which are easily created, modified, and re-purposed by sys admins. The lack of complex query logic leaves the Nagios daemon free to manage scheduling, notifications, and the UI. Nagios's "keep it simple" approach makes it straightforward to administer, network transparent, and amazingly flexible.

Two excellent articles by Syed Ali in previous editions of Sys Admin covered the installation and configuration of Nagios. In this article, I'll pick up where those articles left off and provide some creative solutions to problems commonly faced by sys admins working with Nagios to monitor the health and performance of systems.

Wrapping Around Plug-Ins

In my experience at work and in the forums, I've noticed that sys admins dealing with Nagios for the first time invariably ask two questions. The first is "Why can't I get any performance data?" The Nagios daemon has hooks for exporting performance data that it receives from the plug-ins to external programs. These hooks are usually used to provide data to graphing programs like RRDTool or MRTG. The problem is that the plug-ins themselves must provide the performance data for this to work, and the preponderance of plug-ins do not. Despite a very straightforward interface design, it seems that performance data remains an afterthought in the fast-paced world of Nagios plug-in development.

Thankfully, the missing support for performance data is not difficult for blue-collar sys admins to add. The Nagios daemon considers anything after a pipe character in the plug-in's normal output to be performance data and exports it accordingly.

For example, if the plug-in's output is "cpu:15%|15", then Nagios will place "cpu:15%" in the output pane of the Web GUI and export "15" to whatever applications are configured to receive performance data. So, on the forums, the common answer to this question is to hack the source of the plug-in in question, and echo its output twice. Since most plug-ins are written in C, this usually means changing this:

printf("%s",output);
to something like this:

printf("%s|%s",output,output);
and recompiling the plug-in.

But we're busy people, and we can do better than that. Listing 1 is a "generic plug-in wrapper", written in sh, which will add performance data support to any plug-in. It works by proxying the "real" plug-in and capturing its output and exit code. Then, it simply echoes the captured output twice and exits with the captured exit code. To use it, copy it to the server's plug-ins directory (usually /usr/local/nagios/libexec) as "check_wrapper_generic". Then, to provide performance data for your "check_mem" plug-in, you would do:

cd /usr/local/nagios/libexec

ln -s check_wrapper_generic check_mem_wrapper

Then, replace all instances of check_mem with check_mem_wrapper in your checkcommands.cfg file, and you're done. You now have a common, system-wide interface for selective support of performance data.
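
For a sense of what Listing 1 does, here is a minimal sketch along the same lines (the name-mangling convention and paths are my assumptions; the published listing is more careful about quoting and edge cases):

#!/bin/sh
# minimal sketch of a generic perfdata wrapper: derive the "real"
# plug-in's name from our own (check_mem_wrapper proxies check_mem),
# run it, then repeat its output after a pipe as performance data
PLUGINDIR=/usr/local/nagios/libexec
REAL=`basename "$0" | sed 's/_wrapper$//'`
OUTPUT=`"$PLUGINDIR/$REAL" "$@"`
STATUS=$?
echo "${OUTPUT}|${OUTPUT}"
exit $STATUS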

The second frequently asked question is "How do I create a service that checks more than one port?" Nagios comes with check_tcp and check_udp plug-ins that can be configured to check any single tcp or udp port. The problem is that some applications span multiple ports, and it feels kludgey to sys admins to configure multiple logical services for a single physical entity. It's not just aesthetics: a junior admin or manager could easily be confused by three different "Oracle" services in the Nagios UI. Does red in one of them mean that the Oracle listener is down?

Listing 2 is another plug-in "wrapper", written in bash (or any shell with getopts support), which will wrap around the existing check_(tcp|udp) plug-in to easily provide for an arbitrary number of space-separated port numbers in a single service definition. Called with -u, it wraps around check_udp. Called with -t, it wraps around check_tcp. This way you can have a single service instantiation that checks multiple ports. Other than multiple port numbers, and a switch for tcp or udp, its syntax is very similar to the normal check_tcp plug-in. For example, the following:

check_multi_tcp -H server1 -t -p "21 22"

would check ports 21 and 22 on server1. Call it with -h for a technical description of its syntax.
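
Here's a rough sketch of how such a wrapper might work (the published Listing 2 is more complete; the paths, the output format, and the naive "worst exit code wins" logic are my assumptions):

#!/bin/sh
# sketch of a multi-port wrapper around check_tcp/check_udp
LIBEXEC=/usr/local/nagios/libexec
PLUGIN="" HOST="" PORTS=""
while getopts "tuH:p:h" opt; do
  case $opt in
    t) PLUGIN="$LIBEXEC/check_tcp" ;;
    u) PLUGIN="$LIBEXEC/check_udp" ;;
    H) HOST="$OPTARG" ;;
    p) PORTS="$OPTARG" ;;
    h) echo "usage: $0 (-t|-u) -H host -p \"port [port ...]\""; exit 3 ;;
  esac
done

WORST=0; OUTPUT=""
for port in $PORTS; do
  "$PLUGIN" -H "$HOST" -p "$port" > /dev/null
  RC=$?
  # note: this naive max treats UNKNOWN(3) as worse than CRITICAL(2)
  [ $RC -gt $WORST ] && WORST=$RC
  OUTPUT="$OUTPUT $port:$RC"
done
echo "ports checked:$OUTPUT"
exit $WORST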

In Search of Humanity

Tools like NACE, which create config files based on automated network discovery, ease the headache of managing the config files, but some configuration will always be manual. Automatic discovery of systems is one thing, automatic discovery of human contacts and which services they are interested in receiving notifications for is quite another. Or is it?

Tying Nagios to list-manager software, like Majordomo or EZMLM, gives users the ability to simply tell us what they're interested in by "subscribing" to the Nagios services from which they want notifications. We configure Nagios to know about the lists and mail them instead of the contacts directly, thereby ending our contact configuration conundrums.

Getting this to work is a three-step process. Listing 3 is a shell script that, given a list of services in the form of "server service" on stdin, will create an EZMLM list for each one. There is a filesystem interface to the current Nagios system state in your Nagios var directory. Usually this is "/usr/local/nagios/var/status". So you can generate the input for Listing 3 with:

find /usr/local/nagios/var/status -type f | sed \
  -e 's/.*\/\([^\/]\+\)\/\([^\/]\+\)$/\1 \2/'

EZMLM lists consist of a lot of files, but some of these files are universal; they will be the same for every list. So, Listing 3 uses a template. Every list hard-links to the files in the template, thus saving five inodes per service. These "universal" files are headerremove, inhost, lock, outhost, and public. Check the EZMLM docs for the proper configuration of these files, and place them in the location expected by the script. Inhost and outhost need the FQDN of your Nagios box, and headerremove probably needs to contain at least "return-path" and "return-receipt-to", depending on your network specifics.
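
For the curious, the heart of the approach might look like this sketch (the paths are hypothetical, and a working EZMLM list needs more than these five files; the published listing handles the rest):

#!/bin/sh
# read "server service" pairs on stdin; for each, build a list
# directory, hard-linking the universal files from the template
TEMPLATE=/usr/local/nagios/lists/template
LISTBASE=/usr/local/nagios/lists

while read server service; do
  dir="$LISTBASE/${server}-${service}"
  [ -d "$dir" ] && continue          # list already exists
  mkdir -p "$dir"
  for f in headerremove inhost lock outhost public; do
    ln "$TEMPLATE/$f" "$dir/$f"      # hard links cost no new inodes
  done
done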

Once the lists are created, the second step is making Nagios use them. To begin, add a contact for a generic forwarder:

define contact{
        use                             default-template
        contact_name                    listforwarder
        alias                           Ezmlm relay script
        service_notification_commands   service-relay-to-list
        host_notification_commands      host-relay-to-list
        email                           nothing@nowhere.com
        }

Then add that contact to every configured "contactgroup".
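
For example (the group and member names here are illustrative):

define contactgroup{
        contactgroup_name       unix-admins
        alias                   Unix Administrators
        members                 jdoe,rsmith,listforwarder
        }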

Finally, add the service-relay-to-list and host-relay-to-list commands in your misccommands.cfg:

# for forwarding to EZMLM Lists
define command{
        command_name    host-relay-to-list
        command_line    /usr/bin/printf "%b" "Host '$HOSTALIAS$' \
                        is $HOSTSTATE$\nInfo: $HOSTOUTPUT$\nTime: \
                        $DATETIME$\n\n\nUser:$CONTACTNAME$" | \
                        /usr/local/nagios/sbin/listforwarder.sh \
                        "$HOSTNAME$" "host" "'echo \
                        $NOTIFICATIONTYPE$ | /usr/bin/cut \
                        -c '1-3' ': $HOSTNAME$ $HOSTSTATE$" \
                        2>&1 | logger -p mail.info
        }

# for forwarding to EZMLM Lists
define command{
        command_name    service-relay-to-list
        command_line    /usr/bin/printf "%b" "Service: \
                        $SERVICEDESC$\nHost: \
                        $HOSTNAME$\n$HOSTALIAS$\nAddress: \
                        $HOSTADDRESS$\nState: $SERVICESTATE$\nInfo: \
                        $SERVICEOUTPUT$\nDate: \
                        $DATETIME$\n\n\nUser:$CONTACTNAME$" | \
                        /usr/local/nagios/sbin/listforwarder.sh \
                        "$HOSTNAME$" "$SERVICEDESC$" "'echo \
                        $NOTIFICATIONTYPE$ | /usr/bin/cut -c \
                        '1-3'': $HOSTNAME$/$SERVICEDESC$ \
                        $SERVICESTATE$"  2>&1 | logger -p mail.info
        }
Listing 4 is the listforwarder.sh script referred to in the commands above. With this in place, people interested in getting notifications from the "CPU" service on "server1" can send an email to "server1-CPU-subscribe@your.nagios.box". The built-in contacts interface is not affected by this arrangement, so you can still manually maintain groups of contacts within Nagios where it makes sense to do so.
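
The relay itself can be surprisingly small. Here's a sketch of the idea (the published Listing 4 does more sanity checking, and the list domain is an assumption):

#!/bin/sh
# args: host, service (or the literal "host"), subject line;
# the notification body arrives on stdin and is mailed to the
# corresponding EZMLM list
HOST="$1"
SERVICE="$2"
SUBJECT="$3"
LISTDOMAIN=your.nagios.box
/usr/bin/mail -s "$SUBJECT" "${HOST}-${SERVICE}@${LISTDOMAIN}"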

Now that all the lists have been created and Nagios is configured to make use of them, the third and final step is to ensure new lists are created when new services and hosts are added going forward.

Failsafe Changes

Changing the configs is dangerous business, because errors in the config files bring down the running Nagios daemon, and it will stay down for as long as it takes you to correct the bugs. Arrangements like the mailing list tip above further complicate things. Admins simply won't remember to create new lists for new hosts and services. A simple shell script could satisfy that requirement by checking for changes to the configs and creating new mailing lists accordingly. In practice, however, it's not so easy. You need to either run the script from cron, risking a race condition if user requests beat the cron job, or proceduralize changes to the production Nagios daemon to make sure nobody forgets to run the script. So while we're creating procedures, it would be nice if we could also provide for some rollback functionality and error checking.

We do this with the help of CVS. Our production Nagios configs are checked into a CVS repository, and that repository is checked out in /etc/ on our production server. This lets admins work on the files offline, ensures that they don't step on each other by directly editing the live configs in production, and provides a revision history they can fall back on if there are problems. Once their changes are made and committed, our admins use the shell script in Listing 5 on the production server to actually change the running Nagios daemon. They're given sudo access to this script, and only this script, so they can't make changes any other way.

This shell script gets their changes from CVS, checks them to make sure there are no errors (with nagios -v), safely HUPs the running Nagios daemon, and then creates any necessary EZMLM lists by providing input to the script in Listing 3. If bugs are found in the configs, it dies without stopping Nagios. If unauthorized changes are made, they are overwritten. Providing all of this in a single command made the process so easy and transparent for our admins that many of them assumed Listing 5 was a program provided by the Nagios installation tarball and wondered what had happened to it when they moved to other Nagios implementations. Indeed, once you have a mechanism for failsafe changes, it seems obvious that one should be built in.
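
A sketch of such a script follows (the paths are the usual defaults, and the list-creation helper name is hypothetical; the published Listing 5 is more thorough):

#!/bin/sh
# set -e aborts the script -- before the HUP -- if any step fails,
# including the nagios -v sanity check
set -e
cd /etc/nagios
cvs -q update -C -dP               # fetch changes; clobber local edits
/usr/local/nagios/bin/nagios -v /etc/nagios/nagios.cfg
kill -HUP `cat /usr/local/nagios/var/nagios.lock`
# hand any new "server service" pairs to the list creator (Listing 3)
find /usr/local/nagios/var/status -type f | \
  sed -e 's/.*\/\([^\/]\+\)\/\([^\/]\+\)$/\1 \2/' | \
  /usr/local/nagios/sbin/create_lists.sh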

Getting What They Want

"I'd like notification A to be sent to my pager, during the day, and to my email at night, and notification B sent to my email, but only on weekends, and never on the 15th of the month, unless it's the second Wednesday in May." Sound familiar? Don't be upset, this is actually a good thing. You are experiencing the symptoms of a monitoring system that actually works. Complex configurations like this are at least possible with Nagios, if not easy. You'll need multiple instantiations of the same service for different contacts at different time periods.

We've had some luck outsourcing some of this conditional logic to special mail aliases. The condition for the email to be sent at night, for example, could be offloaded to a Qmail alias called ".qmail-nighttime-default" containing:

|condredirect $EXT2 /usr/local/bin/nightchecker

Nightchecker is a simple shell script that makes sure it's currently "night time" (as defined by our SLA) and exits accordingly:

#!/bin/sh
# exit 0 (redirect the mail) only if it's currently "night time",
# defined here as before 07:00 or after 19:59
myhour=`date '+%k'`
if [ "$myhour" -lt 7 ]
then
   exit 0
elif [ "$myhour" -gt 19 ]
then
   exit 0
else
   exit 99
fi

So, with this in place, you can have a single service instantiation that emails user-nighttime@your.nagios.box, and the mail will only make it if it's currently night time. The nice thing about these aliases is that you can stack them. So, given another called .qmail-weekends and yet another called .qmail-secondWedInMay, your service could email: user-nighttime-weekends-secondWedInMay@your.nagios.box. It won't replace timeperiods.cfg, but it has helped us out of some tight spots.
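
A hypothetical "weekendchecker" behind .qmail-weekends would look much the same as nightchecker:

#!/bin/sh
# exit 0 (redirect the mail) only on Saturday or Sunday
myday=`date '+%u'`                 # day of week: 1=Monday ... 7=Sunday
if [ "$myday" -ge 6 ]
then
   exit 0
else
   exit 99
fi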

Acknowledged

Nagios allows alert recipients to "acknowledge" problems, thereby stopping the recurring problem notifications while optionally providing a helpful comment to your fellow admins. I find that managers love the thought of this feature, but that it proves difficult to use in practice. It's just hard to take the time to log into a Web interface to acknowledge a problem while a production system is down.

We thought of a quicker way. Listing 6 is a shell script that enables alert recipients to acknowledge problems by replying to the notification email they receive. It works by way of the Nagios command file, so be careful with this one; it puts data controlled by unknown agents into your command fifo.

The script is designed to be given the email on stdin. It checks the message to make sure it's not a bounce and parses various pieces of information out of it, such as the hostname, service name, and an optional comment from the person who ack'd. You'll need to make sure that the return path of notifications is a routable address that dumps the reply into this script. You'll also want to add the "CONTACTNAME" macro to your notify commands in misccommands.cfg, prefixed with "User:". Here's what ours looks like:

define command{
        command_name    host-notify-by-muttpager-short
        command_line    /usr/bin/printf "%b" "Host '$HOSTALIAS$' \
                        is $HOSTSTATE$\nInfo: $HOSTOUTPUT$\nTime: \
                        $DATETIME$\n\n\nUser:$CONTACTNAME$" | \
                        /usr/bin/mutt -F /etc/nagios/muttrc.cfg \
                        -e 'set realname=""' -s "`echo \
                        $NOTIFICATIONTYPE$ | /usr/bin/cut -c \
                        1-3`: $HOSTNAME$ $HOSTSTATE$" \
                        $CONTACTPAGER$ 2>&1 | logger -p mail.info
        }
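
To give you the flavor of Listing 6, here's a minimal sketch (the published listing is far more defensive; the command-file path, the "PRO:" subject convention from the commands above, and the parsing are my assumptions):

#!/bin/sh
# read a reply on stdin; extract the host (and service, if any) from
# the subject; write an acknowledgement into the Nagios command fifo
CMDFILE=/usr/local/nagios/var/rw/nagios.cmd

MSG=`cat`

# ignore bounces
echo "$MSG" | grep -iq '^From:.*mailer-daemon' && exit 0

# pull "host" or "host/service" out of a "PRO: host/service STATE" subject
TARGET=`echo "$MSG" | grep '^Subject:' | head -1 | \
        sed 's/.*PRO: \([^ ]*\).*/\1/'`
HOST=`echo "$TARGET" | cut -d/ -f1`
SERVICE=`echo "$TARGET" | cut -d/ -f2`
NOW=`date +%s`

if [ "$HOST" = "$SERVICE" ]        # no slash means it was a host alert
then
  echo "[$NOW] ACKNOWLEDGE_HOST_PROBLEM;$HOST;1;1;1;mail-ack;acked by reply" >> $CMDFILE
else
  echo "[$NOW] ACKNOWLEDGE_SVC_PROBLEM;$HOST;$SERVICE;1;1;1;mail-ack;acked by reply" >> $CMDFILE
fi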
		
Capturing Business Logic

Management, for the most part, is interested in system health in the context of business processes. They don't care that the smtp process on Mail4 is down; they care about whether email is currently working in general. Red light or green light -- is "email" working right now? Capturing business logic is, in my opinion, the holy grail of network and systems monitoring, and the big commercial monitoring apps (again, in my opinion) do a woefully inadequate job at it. Nagios doesn't quite have the built-in hooks to capture the business processes around its host and service definitions either, but there are some things that can help you get close.

The concept of "email" as a business process is, in reality, a myriad of complex interactions between systems. To answer the "red light or green light" question, you need to aggregate the status of many services on numerous hosts into a singularly instantiated entity.

Check_cluster2, in the contrib directory, can do this for you to some degree. Check_cluster2 was written to report the overall status of a cluster by checking the status information of each individual host or service cluster element. It works by simply taking a definition of the "cluster" as a list of services and hosts, and exiting 0 if they're all up. It checks their status using the SERVICESTATEID and HOSTSTATEID macros. It's not a huge cognitive leap to simply consider "email" as a cluster of services and write the definition accordingly.
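
By way of illustration, here's what such a rollup can look like. The flags below come from the Nagios 2.0 cluster documentation referenced at the end of this article; verify them against your copy's -h output. The host and service names are illustrative:

# a command wrapping the cluster plug-in
define command{
        command_name    check_service_cluster
        command_line    /usr/local/nagios/libexec/check_cluster2 \
                        --service -l $ARG1$ -w $ARG2$ -c $ARG3$ \
                        -d $ARG4$
        }

# warn when one member service is non-OK; go critical when two are
define service{
        use                     generic-service
        host_name               nagioshost
        service_description     email
        check_command           check_service_cluster!"email"!1!2!$SERVICESTATEID:mail1:SMTP$,$SERVICESTATEID:mail2:SMTP$
        }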

That gets you 90% of the way there, but there are two big limitations to check_cluster2. The first is that you cannot define conditional logic. Either all the cluster elements are up, or they aren't, and this doesn't reflect the reality of business processes very well. We'd like to see a version with definition syntax similar to Lisp conditionals or LDAP search filters. Something like this:

(||(host1-smtp)(host2-smtp))

would reflect that either the smtp service on host1 or host2 could be up and "email" would still be up. The second limitation is that, since check_cluster2 uses internal Nagios macros, it must be used from within Nagios and won't allow itself to be scripted externally. I'd like to see a version that used the filesystem interface instead.

In reality, "email" can't be represented by a red or green light. Its health can only be measured by degree. But that won't stop management from asking you to give them the stop light. So, when you turn it red, you had better be able to explain why. NagVis, a PHP program from some fellows in Germany, can help you answer that question.

NagVis allows you to easily animate Visio-style diagrams with real-time information from Nagios; it can give you graphical "click-through" explanations of your aggregation decisions. Using animated GIFs, "lights" that correspond to hosts and services can be made to blink on and off. It's like catnip for managers, but at the price of yet another config file.

Combining check_cluster2 and NagVis gets us some rudimentary management views. Figure 1, for example, shows a simple "Business Logic" interface. Several important business functions are listed, along with their status (red, yellow, or green). If someone wanted to know what "Corporate Email" consisted of, they could click through it to the NagVis diagram depicted in Figure 2.

Each green dot on the NagVis diagram is configured to point to what we call a "rollup" service. It's actually a check_cluster2 service configured to watch every email-related service on its parent server. If any of those services goes down, so does the rollup service, and the NagVis diagram responds with a flashing red dot like the one next to the "deprecated" server in Figure 2.

Conclusion

Nagios is a great tool for monitoring systems and networks. I hope I've given you some ideas that will help you expand its use, manage its complexity, and most of all, make your life easier.

References

Check_cluster2 plug-in -- http://nagios.sourceforge.net/docs/2_0/clusters.html

CVS -- http://www.nongnu.org/cvs/

EZMLM -- http://www.ezmlm.org/

Majordomo -- http://www.greatcircle.com/majordomo/

MRTG -- http://people.ee.ethz.ch/~oetiker/webtools/mrtg/

Mutt -- http://www.mutt.org/

NACE -- http://www.adamsinfoserv.com/AISTWiki/bin/view/AIS/NACE

NagVis -- http://www.nagvis.org/

Qmail -- http://cr.yp.to/qmail.html

RRDTool -- http://people.ee.ethz.ch/~oetiker/webtools/rrdtool/

Syed, Ali. 2003. "Network Monitoring with Nagios", Sys Admin, October 2003 -- http://www.samag.com/documents/s=8892/sam0310c/sam0310c.htm

Syed, Ali. 2003. "Advanced Configuration of Nagios", Sys Admin, November 2003 -- http://www.samag.com/documents/s=8920/sam0311i/0311i.htm

David Josephsen is a Senior Systems Engineer for VHA Inc. in Irving, Texas. He seeks interesting problems and can be reached via email at: dave@homer.cymry.org.