
Unix Monitoring Scripts

Damir Delija

It is impossible to do systems administration without monitoring and alerting tools. At heart, these tools are scripts, and writing monitoring scripts is an old part of systems administration that is often full of dangerous mistakes and misconceptions.

The traditional way of putting systems together is ad hoc and erratic, and that same method is often followed when developing monitoring tools. It is rare to find a system that has been properly planned and designed from the start. The usual approach when something goes wrong is simply to patch the immediate problem, and strange results often follow when people make changes in a hurry and under pressure.

Monitoring scripts are traditionally fired from root's cron and send their results by email. These emails can accumulate over time, flooding people with cryptic messages, creating problems on the monitored system, and causing other unexpected situations. Such scenarios are often unavoidable, because few enterprises can afford anything better than firefighting. In this article, I will offer a few tips that can be helpful when developing monitoring scripts, and I will provide three sample scripts.

What is a Unix Monitoring Script?

A monitoring tool or script is part of system management and, to be really effective, must be part of an enterprise-wide effort rather than a standalone tool. Its purpose is to detect problems and send alerts or, more rarely, to try to correct the problem. Basically, a monitoring/alerting tool consists of four different parts:

1. Configuration -- Defines the environment, performs initialization, sets the defaults, etc.

2. Sensor -- Collects data from the system or fetches pre-stored data.

3. Conditions -- Decides whether events are fired.

4. Actions -- Takes action if events are fired.

If these elements are simply bundled into a script without thought, the script will be ineffective and hard to adapt. Good tools also include an abstraction layer that simplifies later modifications.

To begin, we have to set some values, do some sanity checks, and even determine whether monitoring is allowed at all. In some situations, such as maintenance windows, it is good to be able to stop monitoring through a control file to avoid false notifications. All of this is done in the configuration part of the script.
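A minimal sketch of this part might look like the following (the file names and the /log layout are only examples, not the article's listings):

    # --- configuration part: defaults, sanity checks, control file ---
    SCRIPT=${0##*/}                   # script name, reused for temp/state files
    CFG=/log/etc/$SCRIPT.cfg          # configuration file
    STATE=/log/tmp/$SCRIPT.state      # previously stored values
    MAILTO="unix-alerts"              # notification list

    # stop silently if monitoring is disabled, e.g., during maintenance
    [ -f /log/etc/$SCRIPT.off ] && exit 0

    # sanity check: bail out early if the configuration is missing
    [ -r "$CFG" ] || { echo "$SCRIPT: no config $CFG" >&2; exit 1; }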

The script collects values from the system -- from monitored processes or from the environment. This data collection is done by the sensor part. The data can be the output of an external command or can be fetched from previously stored values, such as the current df output or previously stored df values (see Listing 1).
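A sensor can be as simple as capturing command output into a file; this sketch continues the hypothetical example above (it is not Listing 1 itself):

    # --- sensor part: capture current df output, one line per file system ---
    CURRENT=/log/tmp/$SCRIPT.now
    df -k | sed 1d > $CURRENT         # sed 1d drops the header line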

The conditions part of the script defines the events that are monitored. Each condition detects whether an event has happened and whether this is the start or the end of the event (arming or rearming). This process can compare current values to predefined limits or to stored values, if we are interested in rates instead of absolute values. Events can also be based on composite or calculated values, such as "Average idle from sar for the last 5 minutes is less than 10%" (see Listing 2).

Results at this level are logical values, usually represented as some kind of empty/non-empty string so they can be manipulated easily later. The key is to have a point in the code where the status of the event is clearly defined, so branching can be done simply and easily.
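Continuing the sketch, a condition can reduce to an empty/non-empty string like this (the df column numbers are an assumption and vary between systems):

    # --- conditions part: the event value is an empty/non-empty string ---
    # list file systems above the limit; $5 = use%, $6 = mount point in
    # the common df -k layout -- adjust the fields for your system
    LIMIT=90
    EVENT_FSFULL=`awk -v limit=$LIMIT '$5 + 0 > limit { print $6 }' $CURRENT`

    if [ -n "$EVENT_FSFULL" ] ; then
        : # the event fired; branch to the actions part
    fi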

Actions consist of specific code that is executed in the context of a detected event, such as storing new values, sending traps, sending email, or performing some other automatically triggered action. It is good to put these into functions or separate scripts, since many events share similar actions. Usually we want to send email to someone or send a trap; it is almost always the same code in every script, so keeping it separate is a good idea.
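For example, a shared notification function might look like this (a sketch; the severity/event/host subject convention is discussed further below):

    # --- actions part: one notify function reused by all events ---
    notify () {
        # $1 = severity, $2 = event name, remaining args = message body
        _sev=$1 ; _event=$2 ; shift 2
        echo "$*" | mailx -s "$_sev $_event `hostname`" $MAILTO
    }

    [ -n "$EVENT_FSFULL" ] && \
        notify WARNING fs_full "file systems over ${LIMIT}%: $EVENT_FSFULL"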

It is important to add some state support. We are not interested merely in detecting limit violations; if that were the case, we would be flooded with messages. Detecting state changes reduces unwanted messaging. When we define an event of interest, we actually want to know when the event started and when it ended -- that is, when the monitored values passed their limits and when they returned. We do not need continuous notification that the event is still occurring. Thus, we need to track both changes of event state and the value of the monitored variable.

State support is not necessary if there is some kind of console that can correlate notifications. In the simplest implementations, such as a plain monitoring script, it is useful to avoid message flooding directly in the script itself.

Each event must have a unique name and severity level. Usually, three levels of severity are enough, but sometimes five levels are used. It is best to start with a simple model such as:

Info -- Just information that something has happened
Warning -- Warning of a possibly dangerous situation
Fatal -- Critical situation

Common Problems and Mistakes

There are many possible problems -- most of them related to efficient scripting and using system tools in the proper way. For example, if we are scripting in ksh, the rules for good ksh scripting should be followed. Also, since most of our scripts are fired from cron, cron's peculiar environment must be taken into account, too. See the Resources section for shell scripting references.

Because cron is the most frequently used engine for firing monitoring scripts, it is good to understand its specific behavior. It is good practice to set up a working directory and set ulimits to reduce damage in case of unexpected core dumps and similar mischief. Since cron implementations can vary, it is also good practice to check man pages and configuration files to avoid any nasty surprises. Essential System Administration (Æleen Frisch, O'Reilly & Associates) is an excellent resource for cron and any other system scripting.
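A typical defensive opening for a cron-driven script looks something like this (the crontab entry and the paths are examples):

    # crontab entry: run every 10 minutes, keep cron's own mail quiet
    # 0,10,20,30,40,50 * * * * /log/bin/fs_mon.sh > /dev/null 2>&1

    # inside the script: never trust cron's sparse environment
    PATH=/usr/bin:/usr/sbin:/bin ; export PATH
    cd /log/tmp || exit 1        # well-defined working directory
    ulimit -c 0                  # no core dumps if something goes wrong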

Alert Notification

Notification is very important and can be counterproductive if it is not well thought out. At the information level, email notification is enough. However, email was not designed to be reliable, and because of mail flooding, even simple monitors can send hundreds of messages that can effectively block communications. Sendmail is the most commonly used mail agent, but on heavily loaded machines it stops delivering mail (by default it queues messages once the load average passes a threshold). It can be very embarrassing, and even dangerous, to find a pile of queued emails about high load waiting to be sent until the load drops.

Message formatting for the notification and the groups of people to be notified must also be considered. Who gets a notification, and how often, is easy to overlook in a critical situation. Using centrally administered mailing lists or internal newsgroups (or discussion groups) is a good, flexible approach. These lists, however, must be organized in some appropriate way, such as by machine, by service, or by some other logical breakdown of your system. It is also important to set a correct sender address so recipients can confirm and reply to messages.

The format of the email message, especially the subject line, is also important. If there are no rules, email filtering is impossible, so it is good to do some thinking before coding. Because the script is informing us about an event, the email must be descriptive enough to answer who, what, where, and when.

The name of the event must be in the subject line (and unique, if possible). The name of the machine/service, relevant data, and other information, such as snapshots, can go into the message body. Be careful not to over-inform, because long emails are not useful and take time to deliver. It is also good to put the severity of the event in the subject line. A general rule is that the subject line must carry enough information for email filtering on the recipient side. Because such email often ends up on pager devices, length limitations and frequency of alerting must be considered.
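One possible convention, continuing the earlier sketch (the exact format is a local choice):

    # subject: severity, event name, host -- enough for mail filtering
    # body: who/what/where/when plus a short snapshot
    SUBJECT="WARNING fs_full `hostname`"
    {
        echo "host:  `hostname`"
        echo "time:  `date`"
        echo "event: file system over ${LIMIT}% full"
        df -k                    # short snapshot; nothing large
    } | mailx -s "$SUBJECT" $MAILTO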

Another mechanism, which I almost prefer to email notification, is syslog. Even if you have no email or other notification, the data should still be sent to syslog. The same rules used in creating the email subject line apply to creating a syslog entry. Unfortunately, there are no exact rules on how to organize syslog for good event handling -- it depends on many choices -- but there are good books and articles on this subject and very effective tools for syslog monitoring. Refer to the books in the Resources section for more information.
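Sending the same one-line summary to syslog is a single call to the standard logger command (the facility and priority are a local choice; local0.warning is just an example):

    logger -p local0.warning -t fs_mon \
        "WARNING fs_full `hostname`: $EVENT_FSFULL"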

Tools Available

Since we are automating the system, we can use various tools that are already available. Most monitoring scripts are written under pressure in a typical firefighting situation. It is essential to use a tool you are familiar with in order to get reliable, non-destructive results as quickly as possible. Then you should get a "real" management tool deployed while you still have momentum and support on the problem.

Any Unix shell is a good choice to start with if you are familiar with shell programming. Many people start with the Korn or Bourne shell or, in some situations, with C programs. Perl, Python, or Tcl may be better choices, but ksh is a good starting place; a ksh script can easily be recoded into Perl later.

Perl is probably the best overall choice, but it is not installed on all systems and requires experience to be useful. Python is becoming more important, and Tcl is losing ground. Other excellent tools like Expect and Scotty are still worth using.

Many factors must be taken into account, but some general rules are:

1. Use the language you know best.

2. Use Tcl or Python for rapid prototyping and testing.

3. Use Perl for systems administration-oriented tasks.

4. Always think in a language-dependent way (use lists and associative arrays as much as possible).

It is advisable to reuse scripts written for other, already established tools. There are repositories of various scripts and, with minimal effort, it is possible to use concepts and code straight from them. A good example is Big Brother, a widely used tool with many contributors; there are many very useful scripts in its repository (see Resources).

Reusing Monitoring Scripts

Every monitoring script has a complex life cycle. Usually it starts as a wrapper around a command whose result is important at the moment, such as AIX temperature monitoring (Listing 3). Later, the script may be expanded with additional conditions (often to avoid notification flooding), and then it may be used as a template for other tools. At the end of its life cycle, it is forgotten and, after an innocent system change, can even become dangerous.

What's good practice? Obviously, you should follow the rules of good scripting, be a defensive programmer as much as you can, and always document changes. If you don't have an established practice for this, look at the headers of the system's own scripts; it's usually a good idea to document in the same way the system does. At minimum, keep versions, dates, changes, and the purpose of the script in the header. This can be very useful later, when a long-forgotten script turns up. The Shelldorado site (see Resources) provides some excellent examples.

Temporary files are often used and must be handled with care because of the risks of filling a file system, overwriting data, or erasing important files. Avoid creating and removing temporary files without very tight controls, because an rm run from a root process can be very destructive. One temporary file, named after the script or event and kept in a well-defined place, is usually more than enough. Remember that variables can store results directly, and pipes can substitute for temporary files in many cases, reducing the strain that heavy commands place on the system.
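A sketch of careful temp file handling, continuing the example:

    # one temp file, named after the script, in a well-defined place,
    # removed exactly once on any exit -- never an unanchored rm as root
    TMP=/log/tmp/$SCRIPT.$$
    trap 'rm -f $TMP' 0 1 2 15    # clean up on exit, HUP, INT, TERM
    df -k > $TMP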

It's a good idea to remember state or status so we can determine whether an alert refers to the same event or a new one. One simple method is to use files with stored previous values. If events are abstracted, it is simple to store their values in files and later compare them with Unix diff. This technique is used in the examples, where diff detects whether values have changed; a sketch of the idea follows.
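On the first run the state file does not exist yet, so diff fails and the state is simply created:

    # compare current values with the stored state; diff exits 0 when
    # the files are identical, non-zero otherwise
    if diff $STATE $CURRENT > /dev/null 2>&1 ; then
        : # no change -- stay quiet, do not repeat the same alert
    else
        notify INFO state_change "values changed on `hostname`"
        cp $CURRENT $STATE        # remember the new state
    fi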

Sometimes the state can be defined as the last line up to which the log was scanned, which is enough for log file monitoring. The general idea is to check only the lines after the last scanned line, to avoid duplicated events, and to rescan from the start when the log file is rotated. This can be done in the shell, but a much better solution is a nice tool called logtail. It is written in C, works as a simple filter in any ordinary script, and is an excellent basis for a standalone log file monitor.
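The shell version of the idea looks roughly like this (the log name and grep pattern are examples; use tail +N instead of tail -n +N on older systems):

    # remember how many lines were scanned; next run, look only at new
    # lines, and rescan from the start if the log shrank (was rotated)
    LOG=/var/log/messages ; POS=/log/tmp/$SCRIPT.pos
    old=`cat $POS 2>/dev/null` ; old=${old:-0}
    new=`wc -l < $LOG`
    [ "$new" -lt "$old" ] && old=0                # rotation: start over
    tail -n +`expr $old + 1` $LOG | grep -i error # scan only new lines
    echo $new > $POS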

Comments

It is mandatory to use comments in configuration files; lines starting with "#" are traditional and simple to parse out. It is also good practice to put default values into the configuration (such as the DEF line in Listing 1, which is a catch-all). Such a catch-all means that your tool enforces a closed policy -- "all that is not allowed is forbidden". Thus, if a new file system is added and no rule is defined for it, the default rule is used. This is the best practice for systems that change frequently.
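A sketch of such a lookup with a DEF catch-all (the two-column format here is an assumption for illustration, not the actual Listing 1):

    # configuration: "name limit" pairs, "#" comments, DEF as catch-all:
    #   /var   80
    #   /home  95
    #   DEF    90
    limit_for () {
        grep -v '^#' $CFG | awk -v fs=$1 '
            $1 == fs    { print $2 ; found = 1 ; exit }
            $1 == "DEF" { def = $2 }
            END         { if (!found) print def }'
    }

    LIMIT=`limit_for /opt`        # no /opt rule, so the DEF value is used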

Sometimes it is good to log not only events but a whole range of monitored values, usually to establish a baseline. Such data often goes into /tmp, which is a mistake. If you have enough disk space, it is good practice to maintain a separate file system for such logs and results. The AIX Redbooks (see Resources) present some useful techniques. Scripts can live in /usr/local or any other place where root's scripts are kept. Another question is where to store configuration files and temporaries, because root's scripts and data can be a security risk. Since this is a root working area with various sensitive tools and data, it must be secured and kept safe.

I often create a /log file system where such data and scripts are kept. It usually grows into quite a complex structure as various files end up there. It is good practice to have subdirectories such as /log/bin for scripts and binaries and /log/tmp for scratch space. My practice is to store even sar files there, in /log/sa, along with generated reports, configuration-tracking files, testing results, and so on. Backups of this area are also a good idea.

Conclusion

I hope this article will help you avoid some common problems with monitoring scripts, which are part of any system or network management. Once a system starts to grow, any effort is useless without a systematic approach and trouble-ticketing tools.

What we get from these monitoring scripts are event notifications. Each event is part of a bigger picture, so something capable of handling that big picture is necessary. Alert monitors are of no use if you cannot detect and control the problem behind them. The moral of this story is to be smart and plan ahead. It is essential to be able to rely on monitoring and alerting tools. When a system becomes huge and expensive, complex tools are needed, but your old logs and trouble tickets contain priceless data, because they record system behavior, trends, working parameters, and solutions to past problems.

Resources

There are many resources available on the Internet; I've listed some books and links below.

AIX 5L has quite a novel approach to this problem -- the Resource Monitoring and Control (RMC) facility. It is capable of many of the things described in this article, although it still has some teething problems. If you have AIX in house, I suggest spending some time with RMC; most of what you need is already there.

IBM Redbooks

A Practical Guide for Resource Monitoring and Control (RMC), SG24-6615-00 -- http://www.redbooks.ibm.com/redbooks/SG246615.html

Managing AIX Server Farms, SG24-6606-00 -- http://www.redbooks.ibm.com/redbooks/SG246606.html

Books

Frisch, Æleen. Essential System Administration, 3rd Edition, August 2002. O'Reilly & Associates. ISBN: 0-596-00343-9.

Powers, Shelley, J. Peek, T. O'Reilly, and M. Loukides. Unix Power Tools, 3rd Edition, October 2002. O'Reilly & Associates. ISBN: 0-596-00330-7.

Blank-Edelman, David. Perl for System Administration, 1st Edition, July 2000. O'Reilly & Associates. ISBN: 1-56592-609-9.

Links

Stokely Consulting -- http://www.stokely.com/unix.sysadm.resources/index.html

Big Brother Archive -- http://www.deadcat.net/browse.php

BigAdmin Scripts -- http://www.sun.com/bigadmin/scripts/

Shelldorado -- http://www.shelldorado.com

Damir Delija has been a Unix system engineer since 1991. He received a Ph.D. in Electrical Engineering in 1998. His primary job is systems administration, education, and other system-related activities.