Taming Nagios
David Josephsen
In the past few years, Nagios has become the industry standard
open source systems monitoring tool. If you're using an open source
app to monitor the availability, state, or utilization of your servers
or network gear, then chances are you are using Nagios to do it.
To those who have worked with it, this is no surprise. The lightweight
design of Nagios offloads the actual query logic into "plug-ins",
which are easily created, modified, and re-purposed by sys admins.
The lack of complex query logic leaves the Nagios daemon free to
manage scheduling and notifications and to handle the UI. Nagios's "keep
it simple" approach makes it straightforward to administer, network-transparent,
and amazingly flexible.
Two excellent articles by Syed Ali in previous issues of Sys
Admin covered the installation and configuration of Nagios.
In this article, I'll pick up where those articles left off and
provide some creative solutions to problems commonly faced by sys
admins working with Nagios to monitor the health and performance
of systems.
Wrapping Around Plug-Ins
In my experience at work and in the forums, I've noticed that
sys admins dealing with Nagios for the first time invariably ask
two questions. The first is "Why can't I get any performance data?"
The Nagios daemon has hooks for exporting performance data that
it receives from the plug-ins to external programs. These hooks
are usually used to provide data to graphing programs like RRDTool
or MRTG. The problem is that the plug-ins themselves must provide
the performance data for this to work, and most plug-ins do not.
Despite a very straightforward interface design,
it seems that performance data remains an afterthought in the fast-paced
world of Nagios plug-in development.
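(For reference, the daemon-side hooks amount to a couple of nagios.cfg
directives plus a command definition to receive the data. The consumer
below just appends to a log file for the sake of illustration; a real
setup would typically hand the data off to RRDTool or MRTG. The directive
names follow Nagios 2.x, and the command name and log path here are
arbitrary.)

# nagios.cfg: turn on performance data processing
process_performance_data=1
service_perfdata_command=process-service-perfdata

# misccommands.cfg: a trivial consumer that appends perfdata to a log
define command{
        command_name    process-service-perfdata
        command_line    /usr/bin/printf "%b" "$LASTSERVICECHECK$\t$HOSTNAME$\t$SERVICEDESC$\t$SERVICEPERFDATA$\n" >> /usr/local/nagios/var/service-perfdata.log
        }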
Thankfully, the missing support for performance data is not difficult
for blue collar sys admins to add. The Nagios daemon considers anything
after a pipe character in the plug-in's normal output to be performance
data and exports it accordingly.
For example, if the plug-in's output is "cpu:15%|15", then Nagios
will place "cpu:15%" in the output pane of the Web GUI and export
"15" to whatever applications are configured to receive performance
data. So, on the forums, the common answer to this question is to
hack the source of the plug-in in question, and echo its output
twice. Since most plug-ins are written in C, this usually means
changing this:
printf("%s",output);
to something like this:
printf("%s|%s",output,output);
and recompiling the plug-in.
But we're busy people, and we can do better than that. Listing
1 is a "generic plug-in wrapper", written in sh, which will add
performance data support to any plug-in. It works by proxying the
"real" plug-in and capturing its output and exit code. Then, it
simply echoes the captured output twice and exits with the captured
exit code. To use it, copy it to the server's plug-ins directory
(usually /usr/local/nagios/libexec) as "check_wrapper_generic".
Then, to provide performance data for your "check_mem" plug-in,
you would do:
cd /usr/local/nagios/libexec
ln -s check_wrapper_generic check_mem_wrapper
Then, replace all instances of check_mem with check_mem_wrapper
in your checkcommands.cfg file, and you're done. You now have a common,
system-wide interface for selective support of performance data.
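Listing 1 itself isn't reprinted here, but the idea is simple enough
to sketch. The version below assumes the symlink naming convention
described above, so the wrapper can derive the name of the "real"
plug-in from its own name; the actual listing may differ in detail:

#!/bin/sh
# Generic plug-in wrapper: run the "real" plug-in, capture its output
# and exit code, then echo the output twice (separated by a pipe) so
# Nagios exports the second copy as performance data.
# Invoked via a symlink named <real_plugin>_wrapper.

LIBEXEC=/usr/local/nagios/libexec
REAL=`basename $0 | sed -e 's/_wrapper$//'`

OUTPUT=`$LIBEXEC/$REAL "$@"`
CODE=$?

echo "${OUTPUT}|${OUTPUT}"
exit $CODE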
The second frequently asked question is "How do I create a service
that checks more than one port?" Nagios comes with check_tcp
and check_udp plug-ins that can be configured to check any
single tcp or udp port. The problem is that some applications span
multiple ports, and it feels kludgey to sys admins to configure
multiple logical services for a single physical entity. It's not
just aesthetics: a junior admin or manager could easily be confused
by three different "Oracle" services in the Nagios UI. Does red
in one mean that the Oracle listener is down?
Listing 2 is another plug-in "wrapper", written in bash (or any
shell with getopts support), which will wrap around the existing
check_(tcp|udp) plug-in to easily provide for an arbitrary
number of space-separated port numbers in a single service definition.
Called with -u, it wraps around check_udp. Called
with -t, it wraps around check_tcp. This way you can
have a single service instantiation that checks multiple ports.
Other than multiple port numbers, and a switch for tcp or udp, its
syntax is very similar to the normal check_tcp plug-in. For example,
the following:
check_multi_tcp -H server1 -t -p "21 22"
would check ports 21 and 22 on server1. Call it with -h for
a technical description of its syntax.
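Again, Listing 2 isn't reproduced here, but a bare-bones sketch of such
a wrapper (plug-in paths and the exact option handling are assumptions;
the real listing is more thorough) might look like this:

#!/bin/bash
# check_multi_tcp: wrap check_tcp/check_udp to test several ports in one
# service check.  Usage: check_multi_tcp -H host (-t|-u) -p "port1 port2 ..."

LIBEXEC=/usr/local/nagios/libexec
PLUGIN="" HOST="" PORTS=""

while getopts "H:p:tuh" opt; do
        case $opt in
                H) HOST=$OPTARG ;;
                p) PORTS=$OPTARG ;;
                t) PLUGIN=$LIBEXEC/check_tcp ;;
                u) PLUGIN=$LIBEXEC/check_udp ;;
                *) echo "Usage: $0 -H host (-t|-u) -p \"port [port ...]\""; exit 3 ;;
        esac
done
[ -z "$HOST" -o -z "$PLUGIN" -o -z "$PORTS" ] && { echo "UNKNOWN: bad arguments"; exit 3; }

STATUS=0 OUTPUT=""
for PORT in $PORTS; do
        OUT=`$PLUGIN -H $HOST -p $PORT`
        RC=$?
        # crude "worst state wins": good enough since 0=OK, 1=WARN, 2=CRIT
        [ $RC -gt $STATUS ] && STATUS=$RC
        OUTPUT="$OUTPUT $PORT:$OUT"
done

echo "$OUTPUT"
exit $STATUS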
In Search of Humanity
Tools like NACE, which create config files based on automated
network discovery, ease the headache of managing the config files,
but some configuration will always be manual. Automatic discovery
of systems is one thing, automatic discovery of human contacts and
which services they are interested in receiving notifications for
is quite another. Or is it?
Tying Nagios to list-manager software, like Majordomo or EZMLM,
gives users the ability to simply tell us what they're interested
in by "subscribing" to the Nagios services from which they want
notifications. We configure Nagios to know about the lists and mail
them instead of the contacts directly, thereby putting an end to our
contact-configuration conundrums.
Getting this to work is a three-step process. Listing 3 is a shell
script that, given a list of services in the form of "server service"
on stdin, will create an EZMLM list for each one. There is a filesystem
interface to the current Nagios system state in your Nagios var
directory. Usually this is "/usr/local/nagios/var/status". So you
can generate the input for Listing 3 with:
find /usr/local/nagios/var/status -type f | sed \
-e 's/.*\/\([^\/]\+\)\/\([^\/]\+\)$/\1 \2/'
EZMLM lists consist of a lot of files, but some of these files are
universal; they will be the same for every list. So, Listing 3 uses
a template. Every list hard-links to the files in the template, thus
saving four inodes per service. These "universal" files are headerremove,
inhost, lock, outhost, and public. Check the EZMLM docs for the proper
configuration of these files and place them in the location expected
by the script. Inhost and outhost need the FQDN of your Nagios box,
and headerremove probably needs to contain at least "return-path"
and "return-receipt-to", depending on your network specifics.
Once the lists are created, the second step is making Nagios use
them. To begin, add a contact for a generic forwarder:
define contact{
        use                             default-template
        contact_name                    listforwarder
        alias                           Ezmlm relay script
        service_notification_commands   service-relay-to-list
        host_notification_commands      host-relay-to-list
        email                           nothing@nowhere.com
        }
Then add that contact to every configured "contactgroup".
Finally, add the service-relay-to-list and host-relay-to-list
commands to your misccommands.cfg:
# for forwarding to EZMLM Lists
define command{
        command_name    host-relay-to-list
        command_line    /usr/bin/printf "%b" "Host '$HOSTALIAS$' \
                        is $HOSTSTATE$\nInfo: $HOSTOUTPUT$\nTime: \
                        $DATETIME$\n\n\nUser:$CONTACTNAME$" | \
                        /usr/local/nagios/sbin/listforwarder.sh \
                        "$HOSTNAME$" "host" "`echo \
                        $NOTIFICATIONTYPE$ | /usr/bin/cut \
                        -c '1-3'`: $HOSTNAME$ $HOSTSTATE$" \
                        2>&1 | logger -p mail.info
        }
# for forwarding to EZMLM Lists
define command{
        command_name    service-relay-to-list
        command_line    /usr/bin/printf "%b" "Service: \
                        $SERVICEDESC$\nHost: \
                        $HOSTNAME$\n$HOSTALIAS$\nAddress: \
                        $HOSTADDRESS$\nState: $SERVICESTATE$\nInfo: \
                        $SERVICEOUTPUT$\nDate: \
                        $DATETIME$\n\n\nUser:$CONTACTNAME$" | \
                        /usr/local/nagios/sbin/listforwarder.sh \
                        "$HOSTNAME$" "$SERVICEDESC$" "`echo \
                        $NOTIFICATIONTYPE$ | /usr/bin/cut -c \
                        '1-3'`: $HOSTNAME$/$SERVICEDESC$ \
                        $SERVICESTATE$" 2>&1 | logger -p mail.info
        }
Listing 4 is the listforwarder.sh script referred to in the commands
above. With this in place, people interested in getting notifications
from the "CPU" service on "server1" can send an email to "server1-CPU-subscribe@your.nagios.box".
The built-in contacts interface is not affected by this arrangement,
so you can still manually maintain groups of contacts within Nagios
where it makes sense to do so.
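Listing 4 isn't reprinted here either. At its core, it only has to
turn the host and service names it is handed into a list address and
inject the notification it reads on stdin. A bare-bones sketch (the
list domain and the qmail-inject path are assumptions) might be:

#!/bin/sh
# listforwarder.sh <host> <service|"host"> <subject>
# reads a notification body on stdin and mails it to the matching EZMLM list
HOST=$1
SERVICE=$2
SUBJECT=$3
DOMAIN=your.nagios.box                     # domain the lists live under (assumption)

LIST="$HOST-$SERVICE@$DOMAIN"

(
        echo "To: $LIST"
        echo "Subject: $SUBJECT"
        echo ""
        cat -
) | /var/qmail/bin/qmail-inject "$LIST"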
Now that all the lists have been created and Nagios is configured
to make use of them, the third and final step is to ensure new lists
are created when new services and hosts are added going forward.
Failsafe Changes
Changing the configs is dangerous business, because errors in
the config files will bring down the running Nagios daemon the next
time it reloads, and it will stay down for as long as it takes you
to correct the bugs. Arrangements
like the mailing list tip above further complicate things. Admins
simply won't remember to create new lists for new hosts and services.
A simple shell script could satisfy that requirement by checking
for changes to the configs and creating new mailing lists accordingly.
In practice, however, it's not so easy. You need to either run the
script from cron, risking a race condition if user requests beat
the cron job, or proceduralize changes to the production Nagios
daemon to make sure nobody forgets to run the script. So while we're
creating procedures, it would be nice if we could also provide for
some rollback functionality and error checking.
We do this with the help of CVS. Our production Nagios configs
are checked into a CVS repository, and that repository is checked
out in /etc/ on our production server. This enables admins to work
on the files offline, ensures that they don't step on each other
by directly editing the live configs in production, and provides
a revision history they can fall back on if there are problems.
Once their changes are made and committed, our admins use the shell
script in Listing 5 on the production server to actually change
the running Nagios daemon. They're given sudo access to this
script, and only this script, so that they can't make the change
any other way.
This shell script gets their changes from CVS, checks them to
make sure there are no errors (with nagios -v), safely HUPs
the running Nagios daemon, and then creates any necessary EZMLM
lists by providing input to the list-creation script in Listing 3. If bugs are
found in the configs, it dies without stopping Nagios. If unauthorized
changes are made, they are overwritten. Providing all of this in
a single command made the process so easy and transparent for our
admins that many of them assumed Listing 5 was a program provided
by the Nagios installation tarball and wondered what had happened
to it when they moved to other Nagios implementations. Indeed, once
you have a mechanism for failsafe changes, it seems obvious that
one should be built in.
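Listing 5 itself is fairly site-specific, but a stripped-down sketch
of the sequence described above (the paths, the lock-file location,
and the name of the list-creation script are assumptions; the real
listing is considerably more defensive) might be:

#!/bin/sh
# pull committed config changes, verify them, reload Nagios, create lists
NAGIOS=/usr/local/nagios
CONFDIR=/etc/nagios                        # CVS working copy of the configs (assumption)
MAKELISTS=$NAGIOS/sbin/makelists.sh        # the script from Listing 3 (name assumed)

cd $CONFDIR || exit 1

# fetch the committed changes, clobbering any unauthorized local edits
cvs -q update -C -d

# sanity-check the new configs; bail out without touching the daemon on error
$NAGIOS/bin/nagios -v $CONFDIR/nagios.cfg || {
        echo "config verification failed; the running daemon was not touched" >&2
        exit 1
}

# safely HUP the running daemon
kill -HUP `cat $NAGIOS/var/nagios.lock`

# create EZMLM lists for any services that don't have one yet
find $NAGIOS/var/status -type f | \
        sed -e 's/.*\/\([^\/]\+\)\/\([^\/]\+\)$/\1 \2/' | $MAKELISTS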
Getting What They Want
"I'd like notification A to be sent to my pager, during the day,
and to my email at night, and notification B sent to my email, but
only on weekends, and never on the 15th of the month, unless it's
the second Wednesday in May." Sound familiar? Don't be upset; this
is actually a good thing. You are experiencing the symptoms of a
monitoring system that actually works. Complex configurations like
this are at least possible with Nagios, if not easy. You'll need
multiple instantiations of the same service for different contacts
at different time periods.
We've had some luck outsourcing some of this conditional logic
to special mail aliases. The condition for the email to be sent
at night, for example, could be offloaded to a Qmail alias called
".qmail-nighttime-default" containing:
|condredirect $EXT2 /usr/local/bin/nightchecker
Nightchecker is a simple shell script that makes sure it's currently
"night time" (as defined by our SLA) and exits accordingly:
#!/bin/sh
# exit 0 if it's currently "night time" (before 07:00 or after 19:00)
myhour=`date '+%k'`
if [ $myhour -lt 7 ]
then
        exit 0
elif [ $myhour -gt 19 ]
then
        exit 0
else
        exit 99
fi
So, with this in place, you can have a single service instantiation
that emails user-nighttime@your.nagios.box, and the mail will only
make it if it's currently night time. The nice thing about these aliases
is that you can stack them. So, given another called .qmail-weekends
and yet another called .qmail-secondWedInMay, your service could email:
user-nighttime-weekends-secondWedInMay@your.nagios.box. It won't replace
timeperiods.cfg, but it has helped us out of some tight spots.
Acknowledged
Nagios allows alert recipients to "acknowledge" problems, thereby
stopping the recurring problem notifications while optionally providing
a helpful comment to your fellow admins. I find that managers love
the idea of this feature, but it can be difficult to put into
practice. It's just hard to take the time to log into a Web interface
to acknowledge a problem while a production system is down.
We thought of a quicker way. Listing 6 is a shell script that
enables alert recipients to acknowledge problems by replying to
the notification email they receive. It works by way of the Nagios
command file, so be careful with this one; it puts data controlled
by unknown agents into your command fifo.
The script is designed to be given the email on stdin. It checks
the message to make sure it's not a bounce and parses various pieces
of information out of it, such as the hostname, service name, and
an optional comment from the person who ack'd. You'll need to make
sure that the return path of notifications is a routable address
that dumps the reply into this script. You'll also want to add the
"CONTACTNAME" macro to your notify commands in misccommands.cfg,
prefixed with "User:". Here's what ours looks like:
define command{
        command_name    host-notify-by-muttpager-short
        command_line    /usr/bin/printf "%b" "Host '$HOSTALIAS$' \
                        is $HOSTSTATE$\nInfo: $HOSTOUTPUT$\nTime: \
                        $DATETIME$\n\n\nUser:$CONTACTNAME$" | \
                        /usr/bin/mutt -F /etc/nagios/muttrc.cfg \
                        -e 'set realname=""' -s "`echo \
                        $NOTIFICATIONTYPE$ | /usr/bin/cut -c \
                        '1-3'`: $HOSTNAME$ $HOSTSTATE$" \
                        $CONTACTPAGER$ 2>&1 | logger -p mail.info
        }
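Listing 6 isn't reprinted here, but the mechanics are simple enough to
sketch. The version below assumes the notification format shown above
(a subject ending in "host/service STATE" and a body line carrying
"User:<contactname>"), uses the default command-file location, and does
far less input validation than you'd want in production:

#!/bin/sh
# acknowledge a Nagios problem by replying to the notification email;
# expects the whole message (headers and body) on stdin
CMDFILE=/usr/local/nagios/var/rw/nagios.cmd   # default command fifo location

MSG=`cat -`

# ignore bounces
echo "$MSG" | grep -qi '^From:.*MAILER-DAEMON' && exit 0

# the original subject ended in "host/service STATE" (or "host STATE")
TARGET=`echo "$MSG" | grep -i '^Subject:' | head -1 | sed -e 's/^.*: //' | awk '{print $1}'`
HOST=`echo "$TARGET" | cut -d/ -f1`
SERVICE=`echo "$TARGET" | grep / | cut -d/ -f2`

# the notification body carried a "User:<contactname>" line
AUTHOR=`echo "$MSG" | grep 'User:' | head -1 | sed -e 's/.*User://'`

# use the first non-quoted, non-blank body line as the comment (crude)
COMMENT=`echo "$MSG" | sed -e '1,/^$/d' | grep -v '^>' | grep -v '^$' | head -1`

NOW=`date +%s`
if [ -n "$SERVICE" ]; then
        printf '[%s] ACKNOWLEDGE_SVC_PROBLEM;%s;%s;1;1;1;%s;%s\n' \
                "$NOW" "$HOST" "$SERVICE" "$AUTHOR" "$COMMENT" >> $CMDFILE
else
        printf '[%s] ACKNOWLEDGE_HOST_PROBLEM;%s;1;1;1;%s;%s\n' \
                "$NOW" "$HOST" "$AUTHOR" "$COMMENT" >> $CMDFILE
fi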
Capturing Business Logic
Management, for the most part, is interested in system health
in the context of business processes. They don't care that the smtp
process on Mail4 is down; they care about whether email is currently
working in general. Red light or green light -- is "email" working
right now? Capturing business logic is, in my opinion, the holy
grail of network and systems monitoring, and the big commercial
monitoring apps (again, in my opinion) do a woefully inadequate
job at it. Nagios doesn't quite have the built-in hooks to capture
the business processes around its host and service definitions either,
but there are some things that can help you get close.
The concept of "email" as a business process is, in reality, a
myriad of complex interactions between systems. To answer the "red
light or green light" question, you need to aggregate the status
of many services on numerous hosts into a singularly instantiated
entity.
Check_cluster2, in the contrib directory, can do this for you to
some degree. Check_cluster2 was written to report the overall status
of a cluster by checking the status information of each individual
host or service cluster element. It works by simply taking a definition
of the "cluster" as a list of services and hosts, and exiting 0
if they're all up. It checks their status using the SERVICESTATEID
and HOSTSTATEID macros. It's not a huge cognitive leap to simply
consider "email" as a cluster of services and write the definition
accordingly.
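The wiring for such an "email" cluster looks roughly like the following.
The option names are taken from the Nagios 2.0 clustering documentation,
and the command name, host names, and thresholds are invented, so treat
it as a sketch rather than a drop-in config:

# checkcommands.cfg: pass the cluster a label, warning/critical
# thresholds (number of non-OK members), and a list of state IDs
define command{
        command_name    check_service_cluster
        command_line    /usr/local/nagios/libexec/check_cluster2 --service -l $ARG1$ -w $ARG2$ -c $ARG3$ -d $ARG4$
        }

# services.cfg: the "email" rollup warns at one non-OK member and goes
# critical at two
define service{
        use                     generic-service
        host_name               nagios-server
        service_description     Email Rollup
        check_command           check_service_cluster!"Email"!1!2!$SERVICESTATEID:mail1:SMTP$,$SERVICESTATEID:mail2:SMTP$,$SERVICESTATEID:mail1:IMAP$
        }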
That gets you 90% of the way there, but there are two big limitations
to check_cluster2. The first is that you cannot define conditional
logic. Either all the cluster elements are up, or they aren't, and
this doesn't reflect the reality of business processes very well.
We'd like to see a version with definition syntax similar to Lisp
conditionals or LDAP search syntax. Something like this:
(||(host1-smtp)(host2-smtp))
would reflect that either the smtp service on host1 or host2 could
be up and "email" would still be up. The second limitation is that
since check_cluster2 uses internal Nagios macros, it must be used
from within Nagios and won't allow itself to be scripted externally.
I'd like to see a version that used the filesystem interface instead.
In reality, "email" can't be represented by a red or green light.
Its health can only be measured by degree. But that won't stop management
from asking you to give them the stoplight. So, when you turn it
red, you had better be able to explain why. Nagvis, a PHP program
from some fellows in Germany, can help you answer that question.
Nagvis allows you to easily animate Visio-style diagrams with
real-time information from Nagios; it can give you graphical "click-through"
explanations of your aggregation decisions. Using animated GIFs, "lights"
that correspond to hosts and services can be made to blink on and
off. It's like catnip for managers, but at the price of yet another
config file.
Combining check_cluster2 and Nagvis gets us some rudimentary management
views. For example, Figure 1 is an example of a simple "Business
Logic" interface. Several important business functions are listed,
along with their status (red, yellow, or green). If someone wanted
to know what "Corporate Email" consisted of, they could click through
it to the Nagvis diagram depicted in Figure 2.
Each green dot on the Nagvis diagram is configured to point to
what we call a "rollup" service. It's actually a check_cluster2
service configured to watch every "email-related" service on its
parent server. If any of these services goes down, so does the rollup
service, and the Nagvis diagram responds with a flashing red dot
like the one next to the "deprecated" server in Figure 2.
Conclusion
Nagios is a great tool for monitoring systems and networks. I
hope I've given you some ideas that will help you expand its use,
manage its complexity, and most of all, make your life easier.
References
Check_cluster2 plug-in -- http://nagios.sourceforge.net/docs/2_0/clusters.html
CVS -- http://www.nongnu.org/cvs/
EZMLM -- http://www.ezmlm.org/
Majordomo -- http://www.greatcircle.com/majordomo/
MRTG -- http://people.ee.ethz.ch/~oetiker/webtools/mrtg/
Mutt -- http://www.mutt.org/
NACE -- http://www.adamsinfoserv.com/AISTWiki/bin/view/AIS/NACE
Nagvis -- http://www.nagvis.org/
Qmail -- http://cr.yp.to/qmail.html
RRDTool -- http://people.ee.ethz.ch/~oetiker/webtools/rrdtool/
Ali, Syed. 2003. "Network Monitoring with Nagios", Sys Admin,
October 2003 -- http://www.samag.com/documents/s=8892/sam0310c/sam0310c.htm
Ali, Syed. 2003. "Advanced Configuration of Nagios", Sys Admin,
November 2003 -- http://www.samag.com/documents/s=8920/sam0311i/0311i.htm
David Josephsen is a Senior Systems Engineer for VHA Inc. in
Irving, Texas. He seeks interesting problems and can be reached
via email at: dave@homer.cymry.org.