Monitoring Solaris Cron Logs with check_cron
John Spurgeon and Ed Schaefer
Running jobs with cron is convenient. So convenient that we tend
to forget about it and assume that jobs are being run as planned.
But do we know for sure? Did that backup job really run like it
was supposed to? Are those pesky log files still being trimmed?
Probably -- but maybe not. In this article, we present a Korn shell
script called check_cron that can give even the most paranoid systems
administrator a little more peace of mind.
You probably know that the cron command starts a process that
executes commands at specified dates and times. However, did you
know that some versions of Unix can keep a log file listing the
commands that were executed and the times at which they ran? On
Solaris, this information is written to /var/cron/log.
The check_cron program (Listing 1) processes entries in /var/cron/log
for all jobs that recently terminated. The script prints a message
indicating whether the jobs appear to have completed successfully.
Optionally, check_cron can print detailed statistics about each
job and generate alerts when non-zero return codes are encountered.
Contents of /var/cron/log
The line CRONLOG=YES in /etc/default/cron turns on logging. With
logging turned on, the following information is written to /var/cron/log
each time a job executes:
- The command as it appears in the user's crontab file
- A user id
- A process id
- The date and time the job started and ended
A non-zero return code is written to the log file if there was
a problem executing the job (permission problems, path issues, etc.),
or if the command itself completed with a non-zero exit code.
To keep the log file from growing too large, the script /etc/cron.d/logchecker
is typically run as a root cron job. Because check_cron processes
the entire log file, performance is proportional to the size of
the log file.
Here is a sample log file entry:
> CMD: /usr/local/bin/sample.ss
> root 28170 c Wed Jan 21 13:49:00 2004
> CMD: /opt/monitor/bin/check_engine_up -a >> /var/log/mon/ck_log 2>&1
> monitor 28171 c Wed Jan 21 13:49:00 2004
< monitor 28171 c Wed Jan 21 13:49:23 2004
< root 28170 c Wed Jan 21 13:49:26 2004 rc=137
Lines beginning with a greater than sign (>) are written to the
file when the job starts. Lines beginning with a less than sign (<)
are written to the file when the job completes.
Lines 1-2 are appended to the file when the sample.ss program
is started by cron. Lines 3-4 are appended to the file when cron
starts check_engine_up. Line 5 is appended when the check_engine_up
job completes, and line 6 is appended when sample.ss completes.
Note how process ids 28170 and 28171 are used to tie together the
lines for sample.ss and check_engine_up, respectively. The string
"rc=137" at the end of the last line is an example of a non-zero
return code.
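As a quick illustration of reading these entries (sample data is inlined here; a real check would read /var/cron/log itself), completion lines begin with "<", and a trailing "rc=" field flags a job that did not exit 0:

```shell
# Sample log entry from above, inlined for illustration
log='> CMD: /usr/local/bin/sample.ss
> root 28170 c Wed Jan 21 13:49:00 2004
< root 28170 c Wed Jan 21 13:49:26 2004 rc=137'

# Keep only completion lines that carry a non-zero return code
printf '%s\n' "$log" | grep '^<' | grep 'rc='
```

This one-liner prints only the sample.ss completion line, since it is the only job that logged an rc= field.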
Program Options
The check_cron script supports the following options:
-v -- Check_cron prints a verbose report. For each job,
the report shows the user, the command, the time the job started,
the job's duration, and the return code.
-r -- Check_cron considers only the most recent invocation
of each job.
-h hours, -m minutes -- With these options, you
can specify a time interval other than the 60-minute default. Only
jobs that completed within the specified time window prior to the
current date/time are considered.
-u users -- Check_cron processes only jobs for the specified
user(s). When supplying a list of users, separate user names with
white space and enclose the list in quotes.
-a -- Check_cron calls create_alert_file to create an alert
when cron jobs generate non-zero return codes. This requires the
monitoring framework as described in "Managing Enterprise Alerts"
(Sys Admin, April 2004: http://www.samag.com/documents/s=9102/sam0404f/).
Example 1
check_cron -r -v -m 120 | tail -15
In this example, check_cron displays a verbose report that includes
information about the most recent invocation of each job that completed
within the past 120 minutes. Only the last 15 lines of the report
are displayed.
The output would look something like this:
user=keysync
command=/usr/local/bin/screen -m -d -S keysync_put \
/opt/keysync/bin/keysync_put > /var/log/cron/keysync/keysync_put 2>&1
start=11:46:00 17 Feb 2004 Tue
duration=0 sec
rc=0
user=root
command=/opt/admin/sbin/skulker >> /var/log/cron/root/skulker 2>&1
start=11:45:00 17 Feb 2004 Tue
duration=2 sec
rc=0
All cron jobs that completed in the past 2 hours returned 0.
Example 2
check_cron -h 3 -u "root monitor" -a
In this example, check_cron considers all jobs that completed in the
last 3 hours and were executed by the users root or monitor. A one-line
message is displayed indicating whether all of the jobs completed
with a return code of zero. If not, the create_alert_file program
is called to generate an alert.
The output would look something like this:
All cron jobs that completed in the past 3 hours returned 0
How check_cron Works
Set the Time Window
Check_cron establishes a limit, expressed in seconds, that determines
which jobs to process and which to ignore. By default, this limit,
or time window, is 3600 seconds.
The window can be changed using the -h or -m options
as described above.
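The option handling can be sketched along these lines (the variable names here are assumptions for illustration, not necessarily those in the published script):

```shell
# Sketch: default window of 3600 seconds, overridden by -h or -m
set_window() {
    limit=3600
    OPTIND=1
    while getopts "h:m:" opt "$@"; do
        case $opt in
            h) limit=$((OPTARG * 3600)) ;;   # hours to seconds
            m) limit=$((OPTARG * 60)) ;;     # minutes to seconds
        esac
    done
}

set_window -m 120
echo "window=$limit seconds"
```

Running it with -m 120, as in Example 1, yields a 7200-second window.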
Partition the Log File
Next, check_cron splits the contents of /var/cron/log into two
temporary files -- one containing lines written to the log file
when jobs start, and the other containing lines written to the log
file when jobs finish:
grep "^<" /var/cron/log | tail -r > $TEMP_FINISH
grep "^>" /var/cron/log | tail -r > $TEMP_START
Piping the output of the grep commands through tail -r reverses
the order of the output so the most recent jobs are at the top of
the file. If the -u option is used, the output is filtered
to include only jobs associated with the specified user(s).
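One way the -u filter might be implemented (the exact approach in the script may differ) is to turn the space-separated user list into an alternation pattern and keep only matching lines:

```shell
# Build "root|monitor" from the -u argument
users="root monitor"
pattern=$(echo "$users" | sed 's/ /|/g')

# Sample completion lines; only root and monitor should survive
printf '%s\n' \
    '< monitor 28171 c Wed Jan 21 13:49:23 2004' \
    '< oracle 28180 c Wed Jan 21 13:50:00 2004' \
    '< root 28170 c Wed Jan 21 13:49:26 2004 rc=137' |
    grep -E "^< ($pattern) "
```

The oracle line is filtered out; the root and monitor lines pass through.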
Identify Jobs within the Window
After the log file has been partitioned, the current date/time
is noted and the file containing information about completed jobs
is processed. A while loop reads each record from TEMP_FINISH, parsing
the record into components. Strings of the form YYYY/MM/DD HH:MM:SS
are passed to a script called datediff. The datediff program calculates
the number of seconds between the current time and the time the
job completed. If the number of seconds is larger than the time
limit, then check_cron ignores the information and the loop terminates.
(See the "Auxiliary Scripts" sidebar for a description of the datediff
and secs2dhms utilities).
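The shape of that loop can be illustrated as follows. This is a simplification, not the published code: plain epoch seconds stand in for the parsed log timestamps, and the age arithmetic stands in for the call to the datediff script.

```shell
limit=3600     # time window in seconds
now=1000000    # stand-in for the current time, as epoch seconds

# Finish times, most recent first (as produced by tail -r); stop at
# the first entry older than the window -- everything after is older still
printf '%s\n' 999950 999000 990000 |
while read finished; do
    age=$((now - finished))
    [ "$age" -gt "$limit" ] && break
    echo "in window: finished=$finished age=$age"
done
```

Because the file is sorted newest-first, breaking out of the loop at the first stale entry avoids scanning the rest of the file.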
Connect Entries in TEMP_FINISH to Entries in TEMP_START
For each job that completes within the time window, TEMP_START
is processed to obtain the job's starting statistics. If the -r
option is in effect, only the most recent invocation of a particular
job is processed.
Lines from TEMP_FINISH are connected to lines from TEMP_START
using the process id and user fields. Since process ids may be recycled,
this approach has the potential for error. In practice, we have
not found this to be a significant issue. Whether this is a problem
depends on the time window, how long it takes for process ids to
be recycled on the system, and how important it is that the output
of check_cron be completely accurate 100% of the time.
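The pairing step can be sketched like this, using the field layout from the sample log entry above (the helper logic is illustrative, not the published code):

```shell
# Start lines, as they would appear in TEMP_START
TEMP_START=/tmp/cc_start.$$
cat > "$TEMP_START" <<'EOF'
> CMD: /usr/local/bin/sample.ss
> root 28170 c Wed Jan 21 13:49:00 2004
> CMD: /opt/monitor/bin/check_engine_up
> monitor 28171 c Wed Jan 21 13:49:00 2004
EOF

# Pull the user and pid fields from a finish line...
finish='< root 28170 c Wed Jan 21 13:49:26 2004 rc=137'
user=$(echo "$finish" | awk '{print $2}')
pid=$(echo "$finish" | awk '{print $3}')

# ...and find the matching start line, plus the CMD line just before it
awk -v u="$user" -v p="$pid" '
    /^> CMD:/ { cmd = $0; next }               # remember the last CMD line
    $2 == u && $3 == p { print cmd; print }    # matching start line
' "$TEMP_START"
rm -f "$TEMP_START"
```

This prints the sample.ss CMD line and its start line, tied together by user root and pid 28170.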
Process the Data for Each Job
The duration of each job is determined by passing the date/times
that the job started and finished to datediff. The user, command,
start time, duration, and return code are written to another temporary
file (variable $command_file). The pathname of this temporary file
is a combination of the $TEMP_DIR string, ".check_cron", the message
digest of the cron job command, and the process id of the check_cron
process. The md5 command provides the message digest as a 32-digit
hexadecimal string.
The reason the message digest is embedded in the filename is so
that each command will map to a unique file name and multiple invocations
of the same command will map to the same file name. The check_cron
program uses the presence or absence of a particular temporary file
to determine whether an invocation of a job is the most recent one.
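The naming scheme looks roughly like this (variable names assumed; the article's script uses the Solaris md5 utility, while md5sum is the common Linux equivalent used here for illustration):

```shell
TEMP_DIR=/tmp
command='/usr/local/bin/sample.ss'

# Digest of the command text: unique per command, stable across
# invocations of the same command
digest=$(echo "$command" | md5sum | awk '{print $1}')

# Combine directory, tag, digest, and the check_cron pid
command_file="$TEMP_DIR/.check_cron.$digest.$$"
echo "$command_file"
```

Because the digest depends only on the command text, a second invocation of the same command within one check_cron run maps to a file that already exists, which is how the -r logic detects it.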
If the -v option is in effect, the contents of the temporary
file are written to standard output. If the job's return code is
non-zero and the -a option is in effect, the temporary file
is appended to the TEMP_ALERTS file.
Summarize the Results
After all jobs are processed, if the TEMP_ALERTS file exists,
then create_alert_file is called. Remember that the TEMP_ALERTS
file isn't created unless the -a option is used.
Finally, a one-line message is written to standard output indicating
whether all jobs completed with a return code of zero. The message
includes information about the time window used. Because seconds
can be a cumbersome unit to work with, the time window is converted
from seconds to days, hours, minutes, and seconds using the secs2dhms
script (Listing 3).
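A minimal stand-in for the secs2dhms idea (the published Listing 3 may format its output differently) is just repeated integer division:

```shell
# Break a count of seconds into days, hours, minutes, and seconds
secs2dhms() {
    s=$1
    printf '%d days %d hrs %d mins %d secs\n' \
        $((s / 86400)) $((s % 86400 / 3600)) $((s % 3600 / 60)) $((s % 60))
}

secs2dhms 3600   # prints: 0 days 1 hrs 0 mins 0 secs
```

For example, a window of 93784 seconds comes out as 1 day, 2 hours, 3 minutes, and 4 seconds.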
Conclusion
In the Unix System Administration Handbook (3rd edition),
the authors state, "The [cron] log file grows quickly and is rarely
useful; leave logging turned off unless you're debugging a specific
problem." Indeed, on many systems cron log files can grow very quickly,
but that's what log-trimming programs are for. The natural way to
solve the problem is to set up a cron job that periodically checks
the size of the file and trims it when it becomes too big.
To make sure that cron job runs as expected, you can now use check_cron
to monitor the log-trimming job. You can just set up a cron job
to run check_cron periodically and have it send you an alert whenever
a job fails. So, those cron log files aren't so useless after all.
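Hypothetical crontab entries for such a setup might look like this (the paths and schedule here are assumptions, not prescriptions):

```
# Trim the cron log nightly with the stock Solaris helper
0 2 * * * /etc/cron.d/logchecker
# Check the past hour's jobs every hour; -a raises an alert on failure
0 * * * * /usr/local/bin/check_cron -a >> /var/log/cron/check_cron.out 2>&1
```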
Now if we could only figure out how to monitor the check_cron cron
job.
References
Spurgeon, John and Ed Schaefer. "Managing Enterprise Alerts",
Sys Admin, April 2004 -- http://www.samag.com/documents/s=9102/sam0404f/
Douglass, Kyle and Ed Schaefer. "Date Arithmetic with the Shell",
Sys Admin, July 2003 -- http://www.samag.com/documents/s=8284/sam0307b/0307b.htm
A description of message digests and the md5 utility -- http://www.fourmilab.ch/md5/
Nemeth, Evi, Garth Snyder, et al. 2001. Unix System Administration
Handbook. Upper Saddle River, NJ: Prentice Hall PTR.
John Spurgeon is a software developer and systems administrator
for Intel's Factory Integrated Information Systems, FIIS, in Aloha,
Oregon. Outside of work, he enjoys turfgrass management, triathlons,
and spending time with his family.
Ed Schaefer is a frequent contributor to Sys Admin.
He is a software developer and DBA for Intel's Factory Integrated
Information Systems, FIIS, in Aloha, Oregon. Ed also hosts the monthly
Shell Corner column on UnixReview.com. He can be reached at: shellcorner@comcast.net.