Techniques for Production-Grade Scripts

Brian Martin

Systems administrators have many responsibilities including maintaining and upgrading operating systems, supporting the development staff, serving as the help desk (fixing printers and supporting users), and more. We never have enough time, so we write scripts. We write scripts for ourselves and our users to turn complex tasks into simpler, faster ones. Some of these scripts run quickly, with someone right there watching the results. Others run unattended, often overnight or as the result of a crontab entry. Some scripts are fairly unimportant -- it's nice to have a pretty graph of network statistics, but hardly catastrophic if the script fails. Other scripts are vital, either to the operation of the system, or to the operation of the business.

In this article, I will focus on this last class of scripts -- what I call "production scripts". Production scripts have the following characteristics:

  • They support an important system or business function.
  • They often run unattended.
  • There may be many of them to track and maintain.
  • Most importantly, they must not fail without someone knowing about it.

I'll present some techniques I've developed over the years to create production scripts that are highly reliable and require minimal support. By bringing these characteristics to my production scripts, I reduce the time they require (both in their development and their ongoing operation), giving me more time for the rest of my job.

Improving Reliability

Scripts fail. We don't have the technology to build scripts to handle every possible event. Even the best backup script can't overcome a power failure or a dead hard drive. With a little effort, however, it is possible to eliminate some of the most common failures and improve our ability to detect the rest. Some of the ways to improve reliability include:

  • Controlling the execution environment
  • Checking return codes
  • Checking the output
  • Providing execution reports

Controlling the Environment

Scripts are often sensitive to their execution environment. The current path, the current directory, and the user-ID under which they're running can all play a critical role in a script's success or failure. Most of us have had the experience of developing and testing a script only to find that it runs fine manually but not when scheduled through cron. This is because cron jobs don't run through a login shell, so they don't execute /etc/profile or the user's .profile (or their equivalents).

Scripts inherit a different environment when running under cron, and many of the assumptions made when the script was written and tested become incorrect. The way to prevent these problems is to have the script explicitly set its path and other environment settings. If it matters which directory the script is in, the script should cd to it. If it matters which userID the script is running under, the script should check it. Assuring that a script is running with the proper environment is one of the easiest ways to improve reliability.
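
For example, the top of a script might pin down its environment with something like the following (the path, directory, and user name here are purely illustrative, not taken from my template):

export PATH=/usr/bin:/bin:/usr/sbin:/usr/local/bin   # explicit path; don't trust cron's
RunDir=/usr/local/data                               # the directory this script expects to work in

cd "$RunDir" || { echo "Cannot cd to $RunDir -- aborting." >&2; exit 1; }

# If the user ID matters, check it rather than assuming it.
if [[ $(id -un) != "appadmin" ]]; then
    echo "This script must run as appadmin -- aborting." >&2
    exit 1
fi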

Checking Return Codes

Checking return codes is probably the first thing that comes to mind to improve reliability, but it's surprising how often we don't do it. Our scripts will issue a cd command and assume it worked, only to discover later, for instance, that the target directory is an NFS mount that isn't always mounted.

I used to type the following at the command line:

cd /usr/local/data; tar -xf /remote/local/data.tar
One time the cd failed due to a mistyped directory name. Unfortunately, I was in /etc at the time. Since the cd failed, all of my data files got untarred into /etc instead. Now I type:

cd /usr/local/data && tar -xf /remote/local/data.tar
The "&&" assures that the tar doesn't execute if the cd fails. It's a simple method for checking the return code. A script should probably use a more complex statement to check the return code, issue an error message, and abort the script if necessary, but this demonstrates the need for always checking return codes. After all, if the first example created havoc when run by hand at the command line, imagine what it could have done in a script that might have issued 30 or 40 more commands based on the assumption that the cd and the tar did what they were expected to do.
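
In a script, I'd expand that check into something like the following (a sketch, not the template's actual code), so the failure gets reported and the script stops before it can do any further damage:

if ! cd /usr/local/data; then
    echo "ERROR: cd to /usr/local/data failed -- aborting." >&2
    exit 1
fi
if ! tar -xf /remote/local/data.tar; then
    echo "ERROR: tar of /remote/local/data.tar failed -- aborting." >&2
    exit 2
fi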

Checking the Output

Scripts should check their own output for anomalies. This is probably the most radical and most useful technique I have to offer. Checking return codes is good, but we can do more. Consider the following experience. One of my clients has several hundred customer service representatives. These CSRs receive phone calls from customers requesting service changes, and the requests are entered into a database. I was charged with writing a script to export those requests from the database and pass them on to another program for processing. It was vital that each request be processed only once, so after successfully exporting the data, it was decided that the script should delete the requests from the database.

I wrote my script to do this, carefully checking the return code from the export utility before deleting the data from the database. This worked fine for months. Some failures occurred from time to time, and the script properly avoided deleting the data, just as it should. Then one night, the file system to which it was exporting the data ran out of disk space. The export utility issued an appropriate error message, but then it exited with a normal return code due to an internal bug. My script checked the return code, found that it indicated a successful export, and deleted the data from the database. Effectively, all those customer requests were lost. We would have been better off if we'd just closed the office that day -- at least that way the customers would have known their change requests weren't processed.

Having a script monitor its own output doesn't solve every problem, but it's a great improvement over just checking return codes. Nearly every failure generates an error message, and we can use that to improve our error detection and reliability. As I'll show later, it's actually pretty easy to do, too.

Providing Execution Reports

Having a critical script detect its own failures might limit further damage (like filling /etc with inappropriate files or deleting unprocessed data), but it doesn't fix the original problem. When a script fails, someone needs to know about it. The production scripts I design provide facilities for emailing execution reports to designated addresses. I'll show how these come into play later, but these facilities are key to recovering from failures and keeping important tasks from falling through the cracks.

Minimizing Support Requirements

So far, I've been focusing on how to produce more reliable scripts. Our other objective, though, is to reduce support requirements. More specifically, what can be done to minimize the time needed to write and support production scripts? To do this, scripts need the following qualities:

  • Quick summary reports -- Eventually there may be a lot of execution reports to read, so it must be very easy to tell whether a script ran properly or not. Two to three seconds per execution report to determine whether a script was successful is a good goal, and one I achieve.
  • Detailed execution reports -- Detailed reports need to be available to help diagnose failures. If a production script has a problem, we need access to the details without having to run it again. In essence, there must be two execution reports -- a quick summary and another, more detailed report.
  • Failure alerts -- Some jobs are so urgent that failures must be dealt with immediately. Production scripts should have the built-in ability to provide additional notifications on failure (say, to your pager) beyond the normal execution report distribution channels.
  • Built-in testing facilities, diagnostics, and usage information -- Built-in testing facilities and diagnostics can reduce development time. Furthermore, once a production script begins to run reliably, it may be months or years before we need to run it again manually. By that time, details will have been forgotten, so built-in traces and usage panels will help us come up to speed again quickly.
  • Fast development -- Scripts need all the reliability and maintainability characteristics mentioned above, but with minimal development time.

The Cost

These are some pretty aggressive goals, and there's going to be a cost to accomplish them. The cost is, quite simply, more code. It's going to take a lot of code to do this, but it turns out not to be much of a burden because the extra code is similar from script to script, so it quickly becomes boilerplate. We know it's there. It's always there and always in the same place, so there's no point in reading it each time we look at the code. It is simply "background noise" to be ignored.

The Template

So, we're going to do all these great things, and we're going to do them quickly. How? The answer is "the template". The template is a fully functioning script that meets all the criteria above but does absolutely nothing. It sets up the environment, checks output for anomalies, provides summary and detailed execution reports, etc. It's an empty shell (no pun intended), with all the necessary startup and termination code, but no logic to do anything useful. When we want to write a new, production-grade script, we copy the template to a new name, insert the logic to fulfill our new objective, and begin testing.

Notice that by using a fully functioning template, we have, from the very first test, all the production-grade functionality we want in the end product. We have it running in a specific, controlled environment. We have it checking for output anomalies. Best of all, we have the built-in testing and diagnostic tools that will help speed our development.

The techniques outlined above are equally valuable in any production script, regardless of the scripting language. I've used these techniques in Korn shell scripts, in bash scripts, and in Perl. How these techniques are implemented may change from language to language, but the benefits are the same.

Next, I'm going to demonstrate an implementation of the template. I've picked bash for my examples, as it's widely known and familiar to many people. The concepts, however, apply to any scripting language.

The bash template actually consists of three files, which can be found on the Sys Admin Web site (http://www.sysadminmag.com/code/). The "template" file contains the actual script code. The "template.filter" file is an awk script used to detect anomalies in the output (like "file system full"). The "template.cfg" file is a place to store default options and other external parameters. Typically, it contains a line in the format "ALLJOBS: xxx", where "xxx" is replaced with any command-line options we desire.

Production scripts often have a number of command-line options or parameters that are relatively constant. Many times these options include multiple email addresses, long directory paths, etc. By placing these in the configuration file, we avoid the problems caused by command-line typos or forgotten options.

Template

Before getting into the details of the template code, let's look at part of its user interface. The template is a script, and as such it supports a series of command-line options. Any script built from the template automatically includes these features. This provides functionality and also a great deal of commonality between all the scripts based on the template. The command-line options the template provides are:

-m mailID: Mail -- Send a summary execution report to this email address. This option may be repeated to send the same report to multiple addresses.

-e mailID: Error -- Send a summary execution report to this email address, but only if errors are detected. This can be used to notify a pager of a critical failure or to copy a more experienced technician whenever failures occur.

-t: Test -- Document any commands that would make significant changes but don't really run them.

-v: Verbose -- Document any commands that would make significant changes and then run them.

-d: Debug -- Run with a diagnostic trace.

-h: Help -- Display a usage panel.

Let's take a quick tour around the template file. There are about 250 lines of code here, but I won't go into them in detail. Instead, like someone showing you around a new city, I'll point out the major sights, and then you can explore the ones that interest you.

In the very beginning are some comments about the nature of the script (cmdname - command description). These should be replaced with a brief description of the new script for documentation purposes.

Next is the Usage subroutine ("function", in bash terminology). This subroutine contains the built-in help panel, invoked when the script is executed with the -h option. It is placed at the top of the script for easy reference by someone editing the file. All the standard options and return codes are already documented, so all you need to do is update the one-line script description, and add additional information if you add options, parameters, or return codes.
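
The exact text lives in the downloadable template, but the shape of the function is roughly this (an illustrative sketch):

function Usage {
    cat <<EOF
Usage: ${0##*/} [-m mailID] [-e mailID] [-t] [-v] [-d] [-h]
  -m mailID   Send the summary execution report to mailID (repeatable).
  -e mailID   Send the summary report to mailID only if errors are detected.
  -t          Test: show dangerous commands without running them.
  -v          Verbose: show dangerous commands, then run them.
  -d          Debug: run with a diagnostic trace.
  -h          Help: display this panel.
EOF
}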

After Usage comes the Sample function. This is a model subroutine that can be duplicated any time a new subroutine is required. (Remember, we're trying to reduce development time -- making this code easy to copy makes setting up new subroutines quicker and less error prone.)

Included in Sample is logic to check whether the script is running in diagnostic mode. In bash and its predecessors, a trace is turned on using set -xv. In some implementations of Korn shell, and perhaps in some implementations of bash, turning on a diagnostic trace for the mainline code does not automatically turn it on for the subroutines. This logic makes sure that if the mainline code is running with a trace, the subroutines run with a trace also.
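
The idea looks roughly like this ($Debug is an assumed flag name set by the option parser, not necessarily the template's actual variable):

function Sample {
    [[ -n "$Debug" ]] && set -xv    # if the mainline is tracing (-d), trace here too

    # ... subroutine logic goes here ...
    return 0
}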

Next comes RunJob. This is where all the useful work gets done. You'll see the same trace logic as in Sample, followed by a couple of echo statements to document the program startup, and a comment to "Add main code here". This is where you should insert the code to do whatever your new script is supposed to do.

The last function is RunDangerousCmd. RunDangerousCmd is the essential component of our testing facilities. It requires a small amount of discipline to use, in that any time the script is going to run a command that changes anything important, it must invoke the command through RunDangerousCmd. The following is an example of the correct way to execute an rm command:

RunDangerousCmd "rm -f *.dat"
The benefit of using RunDangerousCmd is that it supports the -t (test) and -v (verbose) command-line options. If your script is coded properly (i.e., it uses RunDangerousCmd everywhere it should), running the script with -t will prevent it from executing any dangerous commands. Instead of executing them, RunDangerousCmd will simply display them, telling you what it would have done were it not in test mode. So, the above example in normal mode would delete any files ending in ".dat", but in test mode would merely generate a message saying:

Testing: rm -f *.dat
without deleting any files. This is a tremendous benefit when you've got a script that deletes or renames input files on each run, because you can run it over and over while testing, without having to recreate these input files. For sys admins who have separate development and production environments, test mode also lets you run a final test as part of moving the script into production, without actually doing anything to your production environment.

RunDangerousCmd also supports the -v (verbose) command-line flag. With -v specified, dangerous commands are executed normally, but immediately prior to doing so RunDangerousCmd displays a message documenting the command it is about to execute. This allows a script to run normally, but gives you better insight into exactly what it is doing and in which sequence it is doing it. (In case -t and -v are both specified, -t takes priority and no dangerous commands are executed.)
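
Stripped to its essentials, RunDangerousCmd works something like this (a sketch using assumed flag names rather than the template's exact code):

function RunDangerousCmd {
    if [[ -n "$TestMode" ]]; then                     # -t wins if both -t and -v are set
        echo "Testing: $1"                            # show the command, don't run it
    else
        [[ -n "$Verbose" ]] && echo "Executing: $1"   # -v: document the command first
        eval "$1"                                     # run it; eval preserves pipes and redirection
    fi
}
Because the command arrives as a single quoted string, eval is what allows pipelines and redirections inside it to behave just as they would at the command line.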

Finally, we reach the mainline code. Here we set up the environment, including such things as setting the path, setting the initial directory ($RunDir), declaring any valid command-line options ($ValidOptions), and initializing any temporary files. We also set any defaults for command-line options and process any options from the ".cfg" file and/or the actual command line.

Once the preparation is done, the mainline code calls RunJob to do the actual work, and any output is redirected into awk. awk will use the filter file to check all the output (more about that later). The output from RunJob will also be captured to a file (the detailed execution report), and the output from awk will be captured to another file (the summary execution report). As currently coded, the detailed execution report is stored in the same directory as the script, with a suffix of ".log", but you may want to tailor that to something more appropriate to your own environment.

Finally, the mainline code will determine whether any errors were detected, and deliver the summary execution report, and an error notification if necessary, via email.
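
In outline, that flow looks something like the following (the file names, variable names, and mail command here are illustrative assumptions; the template's details differ):

LogFile="${0}.log"                 # detailed execution report
SummaryFile=$(mktemp)              # summary execution report

RunJob 2>&1 | tee "$LogFile" | awk -f "${0}.filter" > "$SummaryFile"

if grep -q -- '->' "$SummaryFile"; then
    Status="ended with errors"
else
    Status="ended normally"
fi

for MailID in $MailList; do        # addresses collected from -m options
    mail -s "$(basename "$0") $Status" "$MailID" < "$SummaryFile"
done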

Once the script exits, a "trap" statement executed earlier will cause the temporary files to be deleted. Note that this happens for nearly every type of trappable exit, not just planned terminations. This is also one reason to use kill instead of kill -9 on a process if possible, as kill -9 is not trappable.
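
A typical arrangement looks like this (the temporary file name is an assumption):

TmpFile=$(mktemp)                  # temporary work file
trap 'rm -f "$TmpFile"' 0          # runs on any normal or trapped exit
trap 'exit 2' 1 2 13 15            # HUP, INT, PIPE, TERM: exit, which fires the exit trap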

The Output Filter

One key to error detection is checking the output for unexpected messages. The template.filter file receives all output messages from RunJob and classifies them into one of three categories:

1. Normal messages I want to see
2. Normal messages I don't want to see
3. Error messages (everything else)

In addition to categorizing each message, the output filter prepares the summary execution report. The first category of messages ("Normal messages I want to see") lists all the possible, normal messages that should appear in the summary report. This should be a fairly short list, as we want that report to be a fast read. The execution report is delivered by email, and my rule is that the entire summary report must fit on a single screen when I open the email. If I have to page down to see the end of the report, it will take too much time, or I may forget to do so and miss an important error message.

The second category lists normal messages I don't want to see. These may be progress messages (i.e., "1000 records processed", "2000 records processed", etc.) or other details that are interesting once in a while but don't usually relate to the overall success or failure of the job.

The tar command with the -v option, for example, will list every file it backs up. This is good data to have in the detail report in case someone asks about a specific file, but it's not something you want to see in the summary report.

The third category has no list at all. We never specify what qualifies as an error message because we can never create a complete list of all the possible ways something might fail. Instead, anything that isn't in one of the two "Normal" lists is automatically an error. In this way, the script errs on the side of flagging a normal message as an error, rather than the other way around.

It's common in the early stages of production use to need to adjust the filter several times. I often find that new messages will appear that are normal (often uninteresting), but that don't show up on every run. The tar command, for example, might report a "socket ignored" message because a new process is running elsewhere that created a new socket file. Updating the filter file as new "normal" messages are discovered only takes a moment or two, but it's vital to achieving our goal of having those summary reports be a fast read.

A Demonstration

When I give presentations on this topic, I always do a demonstration, so here's the template in action. As a consultant and contract sys admin, I often inherit backup scripts from others. Many times, the backup script looks something like this:

tar -cvf - --totals -C /home . | gzip -c > /backup/home.tar.gz
tar -cvf - --totals -C /etc . | gzip -c > /backup/etc.tar.gz
These statements each tar up a directory, compress the resulting data, and write it to a file. The --totals option says to report the total byte count on completion. The -C option says to temporarily change directories to the named directory, which can make some kinds of restores easier.

Scripts such as these are typically executed from cron. Usually the -v option isn't specified because it generates too much output. Without the -v, however, we have no way to check what was on the backup short of rescanning it. In the case where the backup has been made to tape, determining whether a file is on the tape can take an hour or more.

Let's see how the template helps us. First duplicate the template as follows:

cp template backup
cp template.filter backup.filter
cp template.cfg backup.cfg
Now we have a fully working script that does nothing. Next, insert code to run our backup as follows:

1. Find the comment in RunJob that says "Add main code here".

2. Remove the comment, and replace it with:

   RunDangerousCmd \
     "tar -cvf - --totals -C /home . | gzip -c > /backup/home.tar.gz"
   RunDangerousCmd \
     "tar -cvf - --totals -C /etc . | gzip -c > /backup/etc.tar.gz"
Since the tar/gzip commands overwrite files, they are "dangerous" commands and should be invoked via a call to RunDangerousCmd.

For simplicity, I've left out the logic to check the return codes, but this can be as easy as following each call with [[ $? -ne 0 ]] && echo gzip failed. Piped commands like those above are another reason to check output messages for anomalies in addition to checking return codes. In pipes, we only get a non-zero return code if the last command detects a failure. Other return codes, like the one from the tar command, are lost.
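
If you do want those intermediate return codes, bash's PIPESTATUS array holds the exit status of every command in the most recent pipeline. Here's a hedged sketch, shown as a bare pipeline (outside RunDangerousCmd) for clarity:

tar -cvf - --totals -C /home . | gzip -c > /backup/home.tar.gz
RCs=("${PIPESTATUS[@]}")           # copy both statuses before the next command overwrites them
[[ ${RCs[0]} -ne 0 ]] && echo "ERROR: tar exited with status ${RCs[0]}"
[[ ${RCs[1]} -ne 0 ]] && echo "ERROR: gzip exited with status ${RCs[1]}"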

Now we're ready to test. If we run it with the -t option (i.e., ./backup -t), we get the following:

# backup -t
   backup started Fri Mar 11 15:43:03 PST 2005
Command parameters: -t
   Testing: tar -cvf - --totals -C /home . | gzip -c > /backup/home.tar.gz
   Testing: tar -cvf - --totals -C /etc . | gzip -c > /backup/etc.tar.gz
   backup ended Fri Mar 11 15:43:03 PST 2005
This is not very enlightening, but it shows some of the benefits of using the template. Each job gets a starting and ending timestamp, and any command-line options are documented. Furthermore, if the commands to be executed had contained script variables, we'd be able to see the actual values the variables had resolved to.

Now, let's run it for real:

# backup
backup started Fri Mar 11 15:43:17 PST 2005
    Command parameters:
->    ./
->    ./brian/
->    ./brian/.bash_logout
   .
   .
   .
->    ./dhcpd.conf
       ->    Total bytes written : 67348480 (64MB, 4.3MB/s)
Unexpected messages ("->") detected in execution.
    backup ended with errors Friday Mar 11 15:45:44 PST 2005
There are several things to notice here. First, the script generated a lot of output (tens of thousands of lines, perhaps), most of it file names that we don't usually care to see. Second, all of these lines are prefixed with "->" arrows. Third, there's the note at the bottom about "Unexpected messages", and finally, the closing line says that the backup ended with errors. What's going on here?

Actually, the backup probably ran fine, although it's hard to tell. While we updated the script to run the necessary tar and gzip commands, we never updated the filter to identify the normal messages. Because everything that isn't a normal message is considered an error, the script flagged every line from tar as an error. The unexpected messages were prefixed with arrows (->) to draw attention to them, and the note at the bottom and the altered timestamp summarize that errors were detected.

So let's update the filter. As previously described, there are really two lists in the filter file -- one for normal messages we don't want to see, and one for normal messages we do want to see. Good candidates for the first list are all those file names. We'd like them logged in the detailed execution report in case we need to refer to them, but we don't want them in our summary reports. (Remember, we only want to take 2-3 seconds with the summary report to determine if everything went OK.) The template filter file includes the following:

#
# The following are normal messages we don't need to see
#
/^insert-message-text-here$/                || \
/^insert-message-text-here$/                || \
/^$/                                           \
{next}        # Uninteresting message.  Skip it.
We'll change this to the following:

#
# The following are normal messages we don't need to see
#
/^\.\//                            || \
/^insert-message-text-here$/       || \
/^$/                                  \
{next}        # Uninteresting message.  Skip it.
Basically, this tells the filter that any line beginning with ./ is a normal line that we don't want to see. Unfortunately for this example, both the period and the slash are special characters to awk, so they have to be prefixed with backslashes (i.e., \.\/ instead of just ./).

We've changed one line in the filter. We know we have more to do, but let's run it again just to see how far we've come:

# ./backup
backup started Fri Mar 11 15:45:12 PST 2005
Command parameters:
->    tar: ./brian/var/mydata.sock: socket ignored
->    Total bytes written: 937665920 (64MB, 4.1MB/s)
->    Total bytes written: 67665920 (64MB, 4.3MB/s)
        Unexpected messages ("->") detected in execution.
        backup ended with errors Friday Mar 11 15:47:42 PST 2005
That's a remarkable difference. We're down to seven lines of output, including one error we never saw before because it was lost amongst all the other extraneous messages. In fact, the "socket ignored" isn't really an error, since sockets aren't real files and so can never be backed up. We'll add that to our list of normal messages to ignore. A very similar list in the filter file holds normal messages we want to see, and we'll add the "Total bytes" message to that. After these two lines have been added to the filter, the output looks like this:

# ./backup
backup started Fri Mar 11 15:49:02 PST 2005
Command parameters:
Total bytes written: 937665920 (64MB, 4.1MB/s)
Total bytes written: 67665920 (64MB, 4.3MB/s)
        backup ended Friday Mar 11 15:51:32 PST 2005
That's a successful run and a good summary report. We can tell at a glance that it ran properly (no arrows, no "unexpected messages" text), and we can tell how much data was written. As mentioned previously, a detailed execution report has also been stored in the "backup.log" file. This file contains every message that came out of RunJob, including all file names that were backed up.

When we're ready to put this in production, we'll probably want to add a -m option (i.e., -m brian@example.com), which will send us an email containing the same summary information each time the script runs. This can be done as a command-line parameter in the crontab entry if that's how we're launching it, or perhaps more conveniently, in the backup.cfg file as follows:

ALLJOBS: -m brian@example.com
The template looks for the "ALLJOBS:" line and prepends any option found there to the beginning of the command-line options.
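
One way a script might implement that lookup is shown below (the parsing and file naming are my assumptions, not necessarily how the template actually does it):

ConfigFile="${0}.cfg"
if [[ -r "$ConfigFile" ]]; then
    CfgOptions=$(sed -n 's/^ALLJOBS:[[:space:]]*//p' "$ConfigFile")
    set -- $CfgOptions "$@"        # prepend config-file options to the command line
fi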

When I do this demonstration as part of my presentation, it usually takes about 10 minutes including all the discussion to go from our two-line script to this one with aggressive error checking, diagnostic and usage panels, summary email notification, detailed logging, and all the rest.

The Catch

Anything this easy has to have a catch. It does some very aggressive error checking, and it produces nice summary reports without a lot of extraneous data. But what if it never runs? Remember that one of the goals is that a production job must never "fall through the cracks". We must always know whether the job was successful. These techniques will catch many kinds of errors, but if the job never runs at all due to a system outage or other failure, we'll never get notified. We can start doing convoluted things like writing new scripts that check whether other scripts ran, but then we have to be sure that the new scripts ran, too.

Ultimately, the responsibility for these production scripts is ours, not our computers', and so my solution is to use a checklist like the one shown in Figure 1. This is a simple calendar showing which scripts are supposed to run on each day. As I go through my email of execution reports, I check off each one that ran, noting any that detected errors. Once I've finished with my review for the day, I make sure that every script that was supposed to run has a check mark in its box. Any scripts that fail to "report in" bear further investigation.

This checklist may seem somewhat tedious, but at 2-3 seconds apiece, it doesn't take a lot of time. Furthermore, we've made checking for errors so easy ("just look for the arrows") that it can be delegated to a junior person or to non-technical staff. They can maintain the daily check list and notify us when things fail. This responsibility is also often the first task I give to any new member of the technical staff. It takes minimal training, but quickly gets the individual familiar with our daily production job mix. When a failure occurs, I work with the new staff member to resolve it, giving them more experience with the most critical aspects of our systems.

Pulling it all Together

Let's see how scripts based on the template meet our needs for improving reliability and minimizing support time. I'll list our criteria again and see how the template addresses them.

  • Controlling the execution environment -- The template contains an explicit PATH statement and places us in an explicit directory. It also makes sure that temporary files are in a clean condition. Although not included in my sample, when appropriate I also insert code to verify the current user-ID. Though not mentioned above, the template also explicitly runs /etc/profile to make sure any system-wide settings are present.
  • Checking return codes -- The template checks the return codes on its own statements (such as its cd). You'll need to do the same for any code you add.
  • Checking the output -- The output filter checks all the output from RunJob, flags unexpected messages, and causes the script to exit with an error return code when errors are detected.
  • Providing execution reports/Quick summary reports/Detailed execution reports -- The -m option of the template will cause a summary execution report to be delivered to the designated individual(s). Appropriate entries in the output filter file will make sure these are suitably short. A detailed execution report is also stored as a file.
  • Failure alerts -- The -e option will cause a summary report to be sent to an additional address only on failure. This can be used to notify a pager or a more skilled technician whenever errors occur.
  • Built-in testing facilities, diagnostics, and usage panels -- The -t and -v options provide help in testing the code. The -d option provides a detailed trace of the entire script. The -h option presents a usage panel.
  • Fast to develop -- We built a backup script with all the above characteristics in fewer than 10 minutes. With a good template, the functionality of the resulting script has increased, and thanks to the availability of the built-in diagnostics and logging, the development time often decreases.

Conclusion

I've presented some techniques I use to increase the reliability of production scripts while reducing development and support time. Furthermore, I've shown how wrapping these techniques into a scripting template allows me to create scripts with a great deal of functionality in very little time. I've demonstrated taking a simple backup script, and with just a few minutes of editing, provided it with a controlled execution environment, aggressive error checking, summary and detailed execution reports, built-in testing and diagnostics, and more. Finally, I've discussed how to make sure critical jobs don't fall through the cracks. By improving script reliability and reducing support time, I've improved the quality and productivity of my work and freed up more of that precious time.

Brian Martin is the founder and chief consultant of Martin Consulting Services, Inc. in Portland, Oregon. His company specializes in systems administration and strategic planning services for companies of all sizes using a variety of open source and proprietary operating systems. You can contact him at info@martinconsulting.com, or by visiting http://www.martinconsulting.com.