Unix Monitoring Scripts
Damir Delija
It is impossible to do systems administration without monitoring
and alerting tools. Basically, these tools are scripts, and writing
such monitoring scripts is an ancient part of systems administration
that's often full of dangerous mistakes and misconceptions.
The traditional way of putting systems together is haphazard and erratic, and that same method is often followed when developing monitoring tools. It is really rare to find a system that's been properly planned and designed from the start. The usual approach when something goes wrong is just to patch the immediate problem, and strange results often follow from people making mistakes when they're in a hurry and under pressure.
Monitoring scripts are traditionally fired from root cron and
send results by email. These emails can accumulate over time, flooding
people with strange mails, creating problems on the monitored system,
and causing other unexpected situations. Such scenarios are often
unavoidable, because few enterprises can afford better measures
than firefighting. In this article, I will mention a few tips that
can be helpful when developing monitoring scripts, and I will provide
three sample scripts.
What is a Unix Monitoring Script?
A monitoring tool or script is part of system management and, to be really effective, must be part of an enterprise-wide effort, not a standalone tool. Its purpose is to detect problems and send alerts
or, rarely, to try to correct the problem. Basically, a monitoring/alerting
tool consists of four different parts:
1. Configuration -- Defines the environment and does initializations,
sets the defaults, etc.
2. Sensor -- Collects data from the system or fetches pre-stored
data.
3. Conditions -- Decides whether events are fired.
4. Actions -- Takes action if events are fired.
If these elements are simply bundled into a script without forethought, the script will be ineffective and hard to adapt. Good tools also include an abstraction layer that simplifies later modifications.
To begin, we have to set some values, do some sanity checks, and even determine whether monitoring is allowed at all. In some situations, it is good to be able to stop monitoring through a control file, to avoid false notifications during maintenance, for example. This is all done in the configuration part of the script.
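As a minimal sketch, assuming a control file kept under /log/etc (the locations, variable names, and defaults here are illustrative, not taken from the listings):

# --- configuration ---
SCRIPT=${0##*/}                    # script name, used in messages and file names
CTLFILE=/log/etc/${SCRIPT}.off     # assumed control-file location
MAILTO=root                        # default notification address

# an existing control file disables monitoring (maintenance, for example)
if [ -f "$CTLFILE" ]; then
    exit 0
fi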
The script collects values from the system -- from monitored processes or the environment. This data collection is done by the sensor part. This data can be the output of an external command
or can be fetched from previously stored values, such as the current
df output or previously stored df values (see Listing
1).
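As a rough sketch of such a df-based sensor step (the snapshot locations here are assumptions, not taken from Listing 1):

# --- sensor ---
NEWDATA=/log/tmp/df.current        # assumed snapshot locations
OLDDATA=/log/tmp/df.previous

df -k > "$NEWDATA"                 # current file system usage
# $OLDDATA, if present, still holds the values stored by the previous run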
The conditions part of the script defines the events that are
monitored. Each condition detects whether an event has happened
and whether this is the start or the end of the event (arming or
rearming). This process can compare current values to predefined
limits or to stored values, if we are interested in rates instead
of absolute values. Events can also be based on composite or calculated
values, such as "Average idle from sar for the last 5 minutes
is less than 10%" (see Listing 2).
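A rough sketch of such a condition, assuming a sar -u output where %idle is the last column (field positions vary between systems, so adjust as needed):

# --- condition ---
# average %idle over five 60-second samples
IDLE=`sar -u 60 5 | awk '/Average/ { print int($NF) }'`
IDLE=${IDLE:-100}                  # stay quiet if sar output was not parsed

EVENT=""                           # empty string means "no event"
if [ "$IDLE" -lt 10 ]; then
    EVENT="CPU_IDLE_LOW"           # non-empty string means the event fired
fi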
Results at this level are logical values, usually represented as some kind of empty/non-empty string so they are easy to manipulate later. The key is to have one point in the code where the status of the event is clearly defined, so branching can be done simply and easily.
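In shell, that branching then stays trivial (notify here is a hypothetical action function; a sketch of it follows in the actions discussion below):

# fire the actions only when the condition left a non-empty event name
if [ -n "$EVENT" ]; then
    notify "$EVENT" "idle=$IDLE%"
fi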
Actions consist of specific code that is executed in the context
of a detected event, such as storing new values, sending traps,
sending email, or performing some other automatically triggered
action. It is good to put these into functions or separate scripts,
since you can have similar actions for many events. Usually we want
to send email to someone or send a trap. It is almost always the
same code in all scripts, so keeping it separate is a good idea.
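A shared mail action might be sketched as follows; the subject-line format is an assumption that follows the rules discussed later in this article:

# --- action ---
# notify EVENT TEXT -- one mail routine shared by all events
notify () {
    SUBJECT="$1 `hostname` $SCRIPT"
    echo "$2" | mailx -s "$SUBJECT" "$MAILTO"
}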
It is important to add some state support. We are not just interested
in detecting limit violations; if that were the case, we would be
flooded with messages. Detecting state changes can reduce unwanted
messaging. When we define an event in which we are interested, we
actually want to know when the event happened and when it ended
-- that is, when the monitored values passed limits and when
they returned. We are not interested in full-time notification that
the event is still occurring. Thus, we need to know the change of event state and the value of the monitored variable.
State support is not necessary if there is some kind of console that can correlate notifications. In the simplest implementations, like a plain monitoring script, it is useful to avoid message flooding directly in the script itself.
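A minimal sketch of such state support, assuming a state file under /log/tmp; the script notifies only when the stored state and the current state differ:

# --- state support ---
STATEFILE=/log/tmp/${SCRIPT}.state
OLDSTATE=`cat "$STATEFILE" 2>/dev/null`

if [ "$EVENT" != "$OLDSTATE" ]; then
    # the event started or ended; notify once, then remember the new state
    notify "${EVENT:-CLEARED}" "state change, idle=$IDLE%"
    echo "$EVENT" > "$STATEFILE"
fi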
Each event must have a unique name and severity level. Usually,
three levels of severity are enough, but sometimes five levels are
used. It is best to start with a simple model such as:
Info -- Just information that something has happened
Warning -- Warning of a possibly dangerous situation
Fatal -- Critical situation
Common Problems and Mistakes
There are many possible problems -- most of them related to
efficient scripting and using system tools in the proper way. For
example, if we are scripting in ksh, the rules for good ksh scripting
should be followed. Also, since most of our scripts are fired from
cron, cron peculiarities in the environment must be taken into account,
too. See the Resources section for shell scripting references.
Because cron is the most frequently used engine for firing monitoring
scripts, it is good to understand its specific behavior. It is good
practice to set up a working directory and set ulimits to reduce
damage in case of unexpected core dumps and similar mischief. Since
cron implementations can vary, it is also good practice to check
man pages and configuration files to avoid any nasty surprises.
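In practice, a few defensive lines at the top of a cron-fired script go a long way (the directory and limits below are illustrative):

# cron provides a minimal environment, so be explicit about it
PATH=/usr/bin:/usr/sbin:/bin; export PATH
cd /log/tmp || exit 1              # known working directory for any core dumps
ulimit -c 0                        # forbid core files from this script
umask 077                          # keep root's work files private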
Essential System Administration (Æleen Frisch, O'Reilly
& Associates) is an excellent resource for cron and any other
system scripting.
Alert Notification
Notification is very important and can be counterproductive if
it is not well thought out. At the "information" level,
email notification is enough. However, email was not designed to
be reliable and, because of mail flooding, even simple monitors
can send hundreds of messages that can effectively block communications.
Sendmail is the most commonly used mail agent, but on heavily loaded machines it stops delivering mail once the load average passes its configured threshold. It can be very embarrassing (and even dangerous) to find a lot of queued emails about high load waiting to be sent until the load drops.
Message formatting for the notification and the groups of people
to be notified must also be considered. Who gets the notification
and how often can be overlooked in a critical situation. Using centrally
administered mail lists or internal newsgroups (or discussion groups)
is a good, flexible approach. These lists, however, must be organized
in some appropriate way, such as by machine, by service, or some
other logical breakdown of your system. It is also important to
set up a correct sender so recipients can confirm and reply to messages.
The format of the email message, especially the subject line,
is also important. If there are no rules, email filtering is impossible.
It is good to do some thinking before coding. Because the script
is informing us of an event, the email must be descriptive enough
to provide information about who, what, where, when, etc.
The name of the event must be in the subject line (and should be unique if possible). The name of the machine/service, relevant data, and other information, such as snapshots, can go into the message body. Be careful not to over-inform, because long emails are not useful and take time to read. It is also good to add the severity of the event to the subject line. A general rule is that the subject line must carry enough information for email filtering on the recipient side. Because such email often ends up on pager devices, length limitations and the frequency of alerting must be considered.
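As an illustration only (the field order is an assumed convention, not a standard), a subject line built by these rules might look like:

# SEVERITY EVENT host -- enough for filtering on the recipient side
SUBJECT="WARNING CPU_IDLE_LOW `hostname`"
echo "idle=$IDLE% at `date`" | mailx -s "$SUBJECT" "$MAILTO"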
Another mechanism, which I almost prefer to email notification,
is syslog. Even if you don't have email or other notification,
the data must be sent to syslog. The same rules for creating the
email subject line must be considered in creating a syslog entry
line. Unfortunately, there are no exact rules on how to organize
syslog for good event handling -- it depends on many choices,
but there are good books and articles on this subject and very effective
tools for syslog monitoring. Refer to the books in the Resources
section for more information.
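From a script, the standard logger utility creates such entries, for example:

# same who/what/where rules as for the mail subject line
logger -p daemon.warning -t "$SCRIPT" "CPU_IDLE_LOW idle=$IDLE% on `hostname`"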
Tools Available
Since we are doing system automation, we can use various tools
already available. Most monitoring scripts are done under pressure
in a typical firefighting situation. It is essential to use a tool
you are familiar with in order to get reliable and non-destructive
results as quickly as possible. Then, you should get a "real"
management tool deployed while you still have momentum and support
on the problem.
Any Unix shell is a good choice to start with, if you are familiar
with shell programming. Many people start with Korn or Bourne shell,
or in some situations with C programs. Perl, Python, or Tcl may
be better choices, but ksh is a good starting place; later it can
be easily recoded into Perl.
Perl is probably the best overall choice, but it is not installed
on all systems and requires experience to be useful. Python is becoming
more important, and Tcl is losing ground. Other excellent tools
like Expect and Scotty are still worth using.
Many factors must be taken into account, but some general rules
are:
1. Use the language you know best.
2. Use Tcl or Python for rapid prototyping and testing.
3. Use Perl for systems administration-oriented tasks.
4. Always think idiomatically in the chosen language (use lists and associative arrays as much as possible).
It is advisable to use scripts written for other already established
tools. There are repositories of various scripts and, with minimal
effort, it is possible to use concepts and code straight from there.
A good example is Big Brother, a widely used tool with many contributors;
there are many very useful scripts in its repository (see Resources).
Reusing Monitoring Scripts
Every monitoring script has a complex life cycle. Usually, it starts as a wrapper around a command whose result is important at the moment,
such as AIX temperature monitoring (Listing 3). Later, the script
may be expanded with some additional conditions (often to avoid
notification flooding), and then it may be used as a template for
other tools. At the end of its life cycle, it becomes forgotten
and sometimes after an innocent system change can even become dangerous.
What's good practice? Obviously, you should follow the rules of good scripting, be a defensive programmer as much as you can, and
always document changes. If you don't have an established practice
for this, look at the headers of existing system scripts. Usually, it's
a good idea to do documentation in the same way it's done for
the system. At minimum, keep versions, dates, changes, and the purpose
of the script in the header. This can be very useful later when
a long-forgotten script is found. The Shelldorado site (see Resources)
provides some excellent examples.
Temporary files are often used and must be handled with care because
of the possibility of file system filling, overwriting data, or
erasing important files. Try to avoid creating and removing temporary
files without very detailed controls because rm from root's
process can be very destructive. One temporary file, named after the script/event and kept in a well-defined place, is usually more than enough. Remember that variables can be used to store results directly; pipes can also substitute for temporary files in many cases, and they can reduce the system strain from heavy commands.
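When a temporary file really is needed, a careful pattern looks like this sketch (mktemp is not present on every older system, so treat its use as an assumption):

TMPFILE=`mktemp /log/tmp/${SCRIPT}.XXXXXX` || exit 1
trap 'rm -f "$TMPFILE"' 0 1 2 15   # remove the file on exit or common signals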
It's a good idea to have the state or status remembered so
we can determine whether an alert refers to the same event or a
new one. One simple method is to use files with stored previous
values. If events are abstracted, it is simple to store their values
in files and compare them later with Unix diff. Such a technique
is presented in the examples where diff is used to detect
whether values have changed.
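Using the df snapshots from the sensor sketch above, that comparison might look like:

# diff is silent when the stored and current values are equal;
# the very first run, with no stored file yet, just saves the snapshot
if [ -f "$OLDDATA" ]; then
    if diff "$OLDDATA" "$NEWDATA" > /dev/null; then
        :                          # no change, nothing to report
    else
        notify "DF_CHANGED" "`diff "$OLDDATA" "$NEWDATA"`"
    fi
fi
mv "$NEWDATA" "$OLDDATA"           # current values become the stored ones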
Sometimes the state can be defined as the last line up to where
the log was scanned, which is enough in log file monitoring. The
general idea is to check all lines after the last scanned line
to avoid event duplication or, in the case of log file rotation,
to redo scanning from the start. This can be done in the shell,
but a much better solution is a nice tool called logtail. It's written in C, works as a simple filter in any ordinary script, and makes an excellent standalone log file monitor.
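Option syntax differs slightly between logtail versions, but typical use as a filter looks something like this:

# print only the lines appended since the last run, then scan them
logtail /var/log/messages | grep -i 'error' | while read LINE; do
    notify "LOG_ERROR" "$LINE"
done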
Comments
It is mandatory to use comments in the configuration files; lines
starting with "#" are traditional and are simple to parse
out. Also, it is a good practice to put some default values into
the configuration (such as the DEF line in Listing 1, which is a
catch-all line). Such a catch-all means that your tool enforces
a closed policy -- "all that is not allowed is forbidden".
Thus, if a new file system is added and no rule for it is defined,
the default rule will be used. This is the best practice for a system with many changes.
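A sketch of such a configuration and its lookup, modeled on the description of Listing 1 (the file location, format, and variable names are assumptions; awk -v requires a POSIX awk or nawk):

CONFIG=/log/etc/df.conf            # assumed location
# config format, one rule per line:   file-system  limit(%)
#   /var   90
#   DEF    80

FSNAME=/var
LIMIT=`grep -v '^#' "$CONFIG" | awk -v fs="$FSNAME" '$1 == fs { print $2 }'`
[ -z "$LIMIT" ] && LIMIT=`grep -v '^#' "$CONFIG" | awk '$1 == "DEF" { print $2 }'`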
Sometimes it is good to have not only logged events but a whole
range of monitored values, usually to establish some baseline sets.
Such data often goes into /tmp, which is a mistake. If you have enough disk space, it is good practice to maintain a separate file system
for specific logs and results. The AIX Redbooks (see Resources)
present some useful techniques. Scripts can be in /usr/local or
any other place where root scripts are kept. Another question is
where to store configuration files and where to put temporaries,
as root scripts and data can be a security risk. Because this is
a root working area with various sensitive tools and data, it must be properly secured.
I often create a /log file system where such data and scripts
are kept. It usually grows into quite a complex structure as various
files end up there. It is good practice to have various subdirectories
like /log/bin for scripts and binaries, /log/tmp for scratch, and
various other directories. My practice is to store even sar files
in /log/sa, with generated reports, various configuration tracking
files, testing results, etc. Backups are also a good idea.
Conclusion
I hope this article will help you avoid some common problems with
monitoring scripts, which are part of any system/network management.
When a system starts to grow, any effort is useless without a systematic
approach and trouble-ticketing tools.
What we get from these monitoring scripts are event notifications.
These events are part of a bigger problem, so something capable
of handling the big picture is necessary. Basically, alert monitors
are of no use if you cannot detect and control the problem behind
them. The moral of this story is to be smart and try to plan ahead.
It is essential to be able to rely on monitoring and alerting tools.
When a system becomes huge and expensive, complex tools are needed,
but your old logs and trouble tickets contain priceless data, because
they show relevant information on system behavior, trends, working
parameters, and solutions to past problems.
Resources
There are many resources available on the Internet; I've
listed some books and links below.
AIX 5L has quite a novel approach to this problem -- the Resource
Monitoring and Control (RMC) function. It's capable of many
things described in this article, but there are still some teething
problems. If you have AIX in house, I suggest spending some time
with RMC. Most things you need are already there.
IBM Redbooks
A Practical Guide for Resource Monitoring and Control (RMC), SG24-6615-00
-- http://www.redbooks.ibm.com/redbooks/SG246615.html
Managing AIX Server Farms, SG24-6606-00 Redbook -- http://www.redbooks.ibm.com/redbooks/SG246606.html
Books
Frisch, Æleen. Essential System Administration, 3rd
Edition, August 2002. O'Reilly & Associates. ISBN: 0-596-00343-9.
Powers, Shelley, J. Peek, T. O'Reilly, and M. Loukides. Unix
Power Tools, 3rd Edition, October 2002. O'Reilly &
Associates. ISBN: 0-596-00330-7.
Blank-Edelman, David. Perl for System Administration, 1st
Edition, July 2000. O'Reilly & Associates. ISBN: 1-56592-609-9.
Links
Stokely Consulting -- http://www.stokely.com/unix.sysadm.resources/index.html
Big Brother Archive -- http://www.deadcat.net/browse.php
BigAdmin Scripts -- http://www.sun.com/bigadmin/scripts/
Shelldorado -- http://www.shelldorado.com
Damir Delija has been a Unix system engineer since 1991. He
received a Ph.D. in Electrical Engineering in 1998. His primary
job is systems administration, education, and other system-related
activities. |