Monitoring Solaris Cron Logs with check_cron
John Spurgeon and Ed Schaefer
Running jobs with cron is convenient. So convenient that we tend
to forget about it and assume that jobs are being run as planned.
But do we know for sure? Did that backup job really run like it
was supposed to? Are those pesky log files still being trimmed?
Probably -- but maybe not. In this article, we present a Korn shell
script called check_cron that can give even the most paranoid systems
administrator a little more peace of mind.
You probably know that the cron command starts a process that
executes commands at specified dates and times. However, did you
know that some versions of Unix can keep a log file listing the
commands that were executed and the times at which they ran? On
Solaris, this information is written to /var/cron/log.
The check_cron program (Listing 1) processes entries in /var/cron/log
for all jobs that recently terminated. The script prints a message
indicating whether the jobs appear to have completed successfully.
Optionally, check_cron can print detailed statistics about each
job and generate alerts when non-zero return codes are encountered.
Contents of /var/cron/log
The line CRONLOG=YES in /etc/default/cron turns on logging. With
logging turned on, the following information is written to /var/cron/log
each time a job executes:
- The command as it appears in the user's crontab file
- A user id
- A process id
- The date and time the job started and ended
A non-zero return code is written to the log file if there was
a problem executing the job (permission problems, path issues, etc.),
or if the command itself completed with a non-zero exit code.
To keep the log file from growing too large, the script /etc/cron.d/logchecker
is typically run as a root cron job. Because check_cron processes
the entire log file, performance is proportional to the size of
the log file.
Here is a sample log file entry:
> CMD: /usr/local/bin/sample.ss
> root 28170 c Wed Jan 21 13:49:00 2004
> CMD: /opt/monitor/bin/check_engine_up -a >> /var/log/mon/ck_log 2>&1
> monitor 28171 c Wed Jan 21 13:49:00 2004
< monitor 28171 c Wed Jan 21 13:49:23 2004
< root 28170 c Wed Jan 21 13:49:26 2004 rc=137
Lines beginning with a greater than sign (>) are written to the
file when the job starts. Lines beginning with a less than sign (<)
are written to the file when the job completes.
Lines 1-2 are appended to the file when the sample.ss program
is started by cron. Lines 3-4 are appended to the file when cron
starts check_engine_up. Line 5 is appended when the check_engine_up
job completes, and line 6 is appended when sample.ss completes.
Note how process ids 28170 and 28171 are used to tie together the
lines for sample.ss and check_engine_up, respectively. The string
"rc=137" at the end of the last line is an example of a non-zero
return code.
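As a quick illustration of reading these entries (sample data is inlined here; a real check would read /var/cron/log itself), completion lines begin with "<", and a trailing "rc=" field flags a job that did not exit 0:

```shell
# Sample log entry from above, inlined for illustration
log='> CMD: /usr/local/bin/sample.ss
> root 28170 c Wed Jan 21 13:49:00 2004
< root 28170 c Wed Jan 21 13:49:26 2004 rc=137'

# Keep only completion lines that carry a non-zero return code
printf '%s\n' "$log" | grep '^<' | grep 'rc='
```

This one-liner prints only the sample.ss completion line, since it is the only job that logged an rc= field.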
Program Options
The check_cron script supports the following options:
-v -- Check_cron prints a verbose report. For each job,
the report shows the user, the command, the time the job started,
the job's duration, and the return code.
-r -- Check_cron considers only the most recent invocation
of each job.
-h hours, -m minutes -- With these options, you
can specify a time interval other than the 60-minute default. Only
jobs that completed within the specified time window prior to the
current date/time are considered.
-u users -- Check_cron processes only jobs for the specified
user(s). When supplying a list of users, separate user names with
white space and enclose the list in quotes.
-a -- Check_cron calls create_alert_file to create an alert
when cron jobs generate non-zero return codes. This requires the
monitoring framework as described in "Managing Enterprise Alerts"
(Sys Admin, April 2004: http://www.samag.com/documents/s=9102/sam0404f/).
Example 1
check_cron -r -v -m 120 | tail -15
In this example, check_cron displays a verbose report that includes
information about the most recent invocation of each job that completed
within the past 120 minutes. Only the last 15 lines of the report
are displayed.
The output would look something like this:
user=keysync
command=/usr/local/bin/screen -m -d -S keysync_put \
/opt/keysync/bin/keysync_put > /var/log/cron/keysync/keysync_put 2>&1
start=11:46:00 17 Feb 2004 Tue
duration=0 sec
rc=0
user=root
command=/opt/admin/sbin/skulker >> /var/log/cron/root/skulker 2>&1
start=11:45:00 17 Feb 2004 Tue
duration=2 sec
rc=0
All cron jobs that completed in the past 2 hours returned 0.
Example 2
check_cron -h 3 -u "root monitor" -a
In this example, check_cron considers all jobs that completed in the
last 3 hours and were executed by the users root or monitor. A one-line
message is displayed indicating whether all of the jobs completed
with a return code of zero. If not, the create_alert_file program
is called to generate an alert.
The output would look something like this:
All cron jobs that completed in the past 3 hours returned 0
How check_cron Works
Set the Time Window
Check_cron establishes a limit, expressed in seconds, that determines
which jobs to process and which to ignore. By default, this limit,
or time window, is 3600 seconds.
The window can be changed using the -h or -m options
as described above.
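The option handling can be sketched along these lines (the variable names here are assumptions for illustration, not necessarily those in the published script):

```shell
# Sketch: default window of 3600 seconds, overridden by -h or -m
set_window() {
    limit=3600
    OPTIND=1
    while getopts "h:m:" opt "$@"; do
        case $opt in
            h) limit=$((OPTARG * 3600)) ;;   # hours to seconds
            m) limit=$((OPTARG * 60)) ;;     # minutes to seconds
        esac
    done
}

set_window -m 120
echo "window=$limit seconds"
```

Running it with -m 120, as in Example 1, yields a 7200-second window.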
Partition the Log File
Next, check_cron splits the contents of /var/cron/log into two
temporary files -- one containing lines written to the log file
when jobs start, and the other containing lines written to the log
file when jobs finish:
grep "^<" /var/cron/log | tail -r > $TEMP_FINISH
grep "^>" /var/cron/log | tail -r > $TEMP_START
Piping the output of the grep commands through tail -r reverses
the order of the output so the most recent jobs are at the top of
the file. If the -u option is used, the output is filtered
to include only jobs associated with the specified user(s).
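One way the -u filter might be implemented (the exact approach in the script may differ) is to turn the space-separated user list into an alternation pattern and keep only matching lines:

```shell
# Build "root|monitor" from the -u argument
users="root monitor"
pattern=$(echo "$users" | sed 's/ /|/g')

# Sample completion lines; only root and monitor should survive
printf '%s\n' \
    '< monitor 28171 c Wed Jan 21 13:49:23 2004' \
    '< oracle 28180 c Wed Jan 21 13:50:00 2004' \
    '< root 28170 c Wed Jan 21 13:49:26 2004 rc=137' |
    grep -E "^< ($pattern) "
```

The oracle line is filtered out; the root and monitor lines pass through.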
Identify Jobs within the Window
After the log file has been partitioned, the current date/time
is noted and the file containing information about completed jobs
is processed. A while loop reads each record from TEMP_FINISH, parsing
the record into components. Strings of the form YYYY/MM/DD HH:MM:SS
are passed to a script called datediff. The datediff program calculates
the number of seconds between the current time and the time the
job completed. If the number of seconds is larger than the time
limit, then check_cron ignores the information and the loop terminates.
(See the "Auxiliary Scripts" sidebar for a description of the datediff
and secs2dhms utilities).
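The shape of that loop can be illustrated as follows. This is a simplification, not the published code: plain epoch seconds stand in for the parsed log timestamps, and the age arithmetic stands in for the call to the datediff script.

```shell
limit=3600     # time window in seconds
now=1000000    # stand-in for the current time, as epoch seconds

# Finish times, most recent first (as produced by tail -r); stop at
# the first entry older than the window -- everything after is older still
printf '%s\n' 999950 999000 990000 |
while read finished; do
    age=$((now - finished))
    [ "$age" -gt "$limit" ] && break
    echo "in window: finished=$finished age=$age"
done
```

Because the file is sorted newest-first, breaking out of the loop at the first stale entry avoids scanning the rest of the file.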
Connect Entries in TEMP_FINISH to Entries in TEMP_START
For each job that completes within the time window, TEMP_START
is processed to obtain the job's starting statistics. If the -r
option is in effect, only the most recent invocation of a particular
job is processed.
Lines from TEMP_FINISH are connected to lines from TEMP_START
using the process id and user fields. Since process ids may be recycled,
this approach has the potential for error. In practice, we have
not found this to be a significant issue. Whether this is a problem
depends on the time window, how long it takes for process ids to
be recycled on the system, and how important it is that the output
of check_cron be completely accurate 100% of the time.
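The pairing step can be sketched like this, using the field layout from the sample log entry above (the helper logic is illustrative, not the published code):

```shell
# Start lines, as they would appear in TEMP_START
TEMP_START=/tmp/cc_start.$$
cat > "$TEMP_START" <<'EOF'
> CMD: /usr/local/bin/sample.ss
> root 28170 c Wed Jan 21 13:49:00 2004
> CMD: /opt/monitor/bin/check_engine_up
> monitor 28171 c Wed Jan 21 13:49:00 2004
EOF

# Pull the user and pid fields from a finish line...
finish='< root 28170 c Wed Jan 21 13:49:26 2004 rc=137'
user=$(echo "$finish" | awk '{print $2}')
pid=$(echo "$finish" | awk '{print $3}')

# ...and find the matching start line, plus the CMD line just before it
awk -v u="$user" -v p="$pid" '
    /^> CMD:/ { cmd = $0; next }               # remember the last CMD line
    $2 == u && $3 == p { print cmd; print }    # matching start line
' "$TEMP_START"
rm -f "$TEMP_START"
```

This prints the sample.ss CMD line and its start line, tied together by user root and pid 28170.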
Process the Data for Each Job
The duration of each job is determined by passing the date/times
that the job started and finished to datediff. The user, command,
start time, duration, and return code are written to another temporary
file (variable $command_file). The pathname of this temporary file
is a combination of the $TEMP_DIR string, ".check_cron", the message
digest of the cron job command, and the process id of the check_cron
process. The md5 command provides the message digest as a 32-digit
hexadecimal string.
The reason the message digest is embedded in the filename is so
that each command will map to a unique file name and multiple invocations
of the same command will map to the same file name. The check_cron
program uses the presence or absence of a particular temporary file
to determine whether an invocation of a job is the most recent one.
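The naming scheme looks roughly like this (variable names assumed; the article's script uses the Solaris md5 utility, while md5sum is the common Linux equivalent used here for illustration):

```shell
TEMP_DIR=/tmp
command='/usr/local/bin/sample.ss'

# Digest of the command text: unique per command, stable across
# invocations of the same command
digest=$(echo "$command" | md5sum | awk '{print $1}')

# Combine directory, tag, digest, and the check_cron pid
command_file="$TEMP_DIR/.check_cron.$digest.$$"
echo "$command_file"
```

Because the digest depends only on the command text, a second invocation of the same command within one check_cron run maps to a file that already exists, which is how the -r logic detects it.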
If the -v option is in effect, the contents of the temporary
file are written to standard output. If the job's return code is
non-zero and the -a option is in effect, the temporary file
is appended to the TEMP_ALERTS file.
Summarize the Results
After all jobs are processed, if the TEMP_ALERTS file exists,
then create_alert_file is called. Remember that the TEMP_ALERTS
file isn't created unless the -a option is used.
Finally, a one-line message is written to standard output indicating
whether all jobs completed with a return code of zero. The message
includes information about the time window used. Because seconds
can be a cumbersome unit to work with, the time window is converted
from seconds to days, hours, minutes, and seconds using the secs2dhms
script (Listing 3).
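A minimal stand-in for the secs2dhms idea (the published Listing 3 may format its output differently) is just repeated integer division:

```shell
# Break a count of seconds into days, hours, minutes, and seconds
secs2dhms() {
    s=$1
    printf '%d days %d hrs %d mins %d secs\n' \
        $((s / 86400)) $((s % 86400 / 3600)) $((s % 3600 / 60)) $((s % 60))
}

secs2dhms 3600   # prints: 0 days 1 hrs 0 mins 0 secs
```

For example, a window of 93784 seconds comes out as 1 day, 2 hours, 3 minutes, and 4 seconds.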
Conclusion
In the Unix System Administration Handbook (3rd edition),
the authors state, "The [cron] log file grows quickly and is rarely
useful; leave logging turned off unless you're debugging a specific
problem." Indeed, on many systems cron log files can grow very quickly,
but that's what log-trimming programs are for. The natural way to
solve the problem is to set up a cron job that periodically checks
the size of the file and trims it when it becomes too big.
To make sure that cron job runs as expected, you can now use check_cron
to monitor the log-trimming job. You can just set up a cron job
to run check_cron periodically and have it send you an alert whenever
a job fails. So, those cron log files aren't so useless after all.
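Hypothetical crontab entries for such a setup might look like this (the paths and schedule here are assumptions, not prescriptions):

```
# Trim the cron log nightly with the stock Solaris helper
0 2 * * * /etc/cron.d/logchecker
# Check the past hour's jobs every hour; -a raises an alert on failure
0 * * * * /usr/local/bin/check_cron -a >> /var/log/cron/check_cron.out 2>&1
```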
Now if we could only figure out how to monitor the check_cron cron
job.
References
Spurgeon, John and Ed Schaefer. "Managing Enterprise Alerts",
Sys Admin, April 2004 -- http://www.samag.com/documents/s=9102/sam0404f/
Douglass, Kyle and Ed Schaefer. "Date Arithmetic with the Shell",
Sys Admin, July 2003 -- http://www.samag.com/documents/s=8284/sam0307b/0307b.htm
A description of message digests and the md5 utility -- http://www.fourmilab.ch/md5/
Nemeth, Evi, Garth Snyder, et al. 2001. Unix System Administration
Handbook. Upper Saddle River, NJ: Prentice Hall PTR.
John Spurgeon is a software developer and systems administrator
for Intel's Factory Integrated Information Systems, FIIS, in Aloha,
Oregon. Outside of work, he enjoys turfgrass management, triathlons,
and spending time with his family.
Ed Schaefer is a frequent contributor to Sys Admin.
He is a software developer and DBA for Intel's Factory Integrated
Information Systems, FIIS, in Aloha, Oregon. Ed also hosts the monthly
Shell Corner column on UnixReview.com. He can be reached at: shellcorner@comcast.net.