Building a Web Log Rotation System
Reinhard Voglmaier
Webmasters want to know how Web servers spend their time, and
the Webmasters aren't the only ones who are curious. Customers (internal
or external) want to know how often their pages are visited and
how the visitors navigate within the sites. All this useful information
is kept in log files. Webmasters, therefore, love to keep log files
for as long as possible, but systems administrators have a different
view of these bulky files. Systems administrators want to minimize
the space the log files occupy and move the files to backup as soon
as possible.
The result is a system that calls for Webmasters to:
- Produce and publish statistics for users
- Analyze the log files for administrative purposes (e.g., to
track down performance questions)
- Analyze the log files for security concerns (e.g., to discover
attacks against the Web servers)
- Delete log files that are no longer needed, and archive the log
files that must be kept
In this article, I'll describe how to build a system for managing
log file rotation. I'll also discuss a free tool for analyzing log
files and give some hints on log file management. I will focus on
log files for the Apache Web server, but the logic applies also
to other types of log files, since the underlying requirements are
usually the same. The scripts described in this article are available
for download at http://www.sysadminmag.com/code/.
Rotating Log Files
The Web server produces one or more log files. Because log files
tend to grow, it is wise to compress the log files from time to
time and switch to fresh files. The old log files are archived, and
you can later produce statistics from the archived files. You
will also need a strategy for how to manage the archive. The story
may be somewhat more complicated, but the main components are shown
in Figure 1. Each of these components can further be broken into
smaller pieces.
The management of log files consists of a log file rotation mechanism
and a means for converting the file to a format suitable for subsequent
processing. The term "log file rotation" may be somewhat misleading
-- there's no real "rotation" of log files in the sense that you
rotate tapes in a tape drive, but the "rotation" metaphor is nevertheless
appropriate for the task of moving older data to an archive to make
room for newer data. At first glance, this log file rotation task
seems easy to achieve. You stop the Web server, move the log files
to a different place, and start the Web server again. If you have
only one Web server implementing only one logical site, the task
at hand seems rather simple.
I'll assume that the Web server produces one access log file reporting
the traffic and one error log file reporting the errors. The rotated
files get a date suffix in the format "ddmmyy" (day, month, and year).
The command to start or stop the Apache Web server is apachectl start
or apachectl stop.
The next few lines show this simple approach:
# date suffix in the format ddmmyy
today=$(date +%d%m%y)
apachectl stop
mv accesslog accesslog.${today}
mv errorlog errorlog.${today}
apachectl start
This solution works, but I will make some small changes before going
into production. Rather than interrupting the Web service abruptly,
I would prefer to let the server finish the requests that are already
in progress, so I restart the Apache Web server gracefully. This
requires only a slight modification to the lines above:
today=$(date +%d%m%y)
mv accesslog accesslog.${today}
mv errorlog errorlog.${today}
apachectl graceful
# wait for pending requests that still write to the renamed files
sleep 600
gzip accesslog.${today}
gzip errorlog.${today}
The sleep is necessary, since the Web server may still be using the
log file for a pending request.
This solution works fine if I have only one Web server without
virtual hosts. Life isn't so easy, however -- many installations
have more than one Web server. Each of those Web servers can provide
several virtual hosts, resulting in several log files per server.
We can generalize our script to handle the case of multiple Web
servers as follows:
today=$(date +%d%m%y)
foreach WebServer in <List of Webservers>
    foreach LogFile of <Logfiles of $WebServer>
        mv $LogFile $LogFile.${today}
    $WebServer restart graceful
    sleep 600
    foreach LogFile of <Logfiles of $WebServer>
        gzip $LogFile.${today}
In this solution, the procedure sleeps once for every Web server (see
line 6). If we instead cycle through the Web servers twice, renaming and
restarting in the first pass and compressing in the second, we need only
one sleep.
There is, however, a new problem with this approach. If I compile
statistics on a daily basis, I will want the log file to end at
"23:59:59" and the new log file to begin at "00:00:00". If I have
a number of Web servers and each server has a large number of log
files, the double loop "foreach server" and "foreach logfile" will
take some time. The graceful restart of the server also adds time.
The end result is that it will not be possible to ensure that all
the log files end at exactly the last second of the day.
The solution to this problem is very easy. I will rotate the files
later, for example, at 01:00 in the morning. The rotated log file will
therefore contain too much data (not only the data from the previous
day, but also some data from the current day):
today=$(date +%d%m%y)
yesterday=$(date -d yesterday +%d%m%y)   # GNU date; other systems differ
foreach WebServer in <List of Webservers>
    foreach LogFile of <Logfiles of $WebServer>
        mv $LogFile $LogFile.TMP
    $WebServer restart graceful
sleep 600
foreach WebServer in <List of Webservers>
    foreach LogFile of <Logfiles of $WebServer>
        # write to a scratch file first: cat must not overwrite its own input
        cat $LogFile.$yesterday $LogFile.TMP > $LogFile.NEW
        mv $LogFile.NEW $LogFile.TMP
        split($LogFile.TMP,
              $LogFile.$today,
              $LogFile.$yesterday)
        gzip $LogFile.$yesterday
Of course, this approach requires separate post-processing of the
log files to assemble the pieces into a single one-day backup. The
second loop contains the tricky part. The cat command prepends the
leftover log file from yesterday's run to the TMP file we produced
today. The TMP file now contains all the data from yesterday, in
addition to some data from today. We then split the data to create a
complete log for yesterday, compress the data from yesterday, and hold
the data from today for tomorrow's run. The split program could do
other things as well to ease later processing, for example resolving
IP addresses to host names or similar tasks. I will return to this
issue later.
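The split step could be sketched in shell roughly as follows. This is only a minimal sketch under stated assumptions, not the split program used in the article: the function name split_log and its arguments are hypothetical, and it assumes Common Log Format, where the fourth whitespace-separated field holds the timestamp in the form [dd/Mon/yyyy:hh:mm:ss.

```shell
#!/bin/sh
# split_log SRC DAY YESTERDAY_OUT TODAY_OUT   (hypothetical helper)
# DAY is the date of "yesterday" as it appears in the log, e.g. 10/Oct/2000.
# Lines stamped with DAY go to YESTERDAY_OUT, all other lines to TODAY_OUT.
split_log() {
    src=$1
    day=$2
    yesterday_out=$3
    today_out=$4
    # Create both outputs even if one of them receives no lines.
    : > "$yesterday_out"
    : > "$today_out"
    awk -v day="$day" -v yout="$yesterday_out" -v tout="$today_out" '
        {
            # Field 4 looks like [10/Oct/2000:13:55:36 ; keep dd/Mon/yyyy.
            if (substr($4, 2, 11) == day)
                print > yout
            else
                print > tout
        }' "$src"
}
```

A call such as split_log access.TMP 10/Oct/2000 access.yesterday access.today would then separate the two days. The sketch simply treats every line that does not carry yesterday's date as belonging to today, which is adequate only if the input holds no older data.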
Getting the Server Information
So far, I haven't explained where to get the information on the
servers. There are two approaches, each with advantages and disadvantages.
The first approach is to use a configuration file. The configuration
file contains a list of Web servers and the log files for each Web
server. On my site, I use the second approach, which relies on the
naming conventions of the installation directories. See Figure
2 for an idea of what I mean.
All Web servers are under the directory /opt/apache/http. All
configuration files are in /opt/apache/http/etc/<server>.
All log files are in /opt/apache/http/var/<server>/log. Finally,
the stop/start/restart scripts are in /opt/apache/http/sbin/<server>.
Cycling through the /opt/apache/http/sbin/ directory delivers
a list of all Web servers. In some cases, however, you may not want
to rotate the logs for all the available Web servers. On my site,
for example, I distinguish between active servers and inactive servers.
I keep this information in a simple file called apache.conf. A
script (let's call it conf_apache) simply echoes the names of all
active Web servers for use in the foreach loop. The following script
does the job. A server with a value of 1 is an active server, and a
value of 0 identifies an inactive server:
# Document Server Site 1
httpd_site1=1
# Document Server Site 2
httpd_site2=1
# Echo the name of every server whose flag is 1
if [ "$httpd_site1" -eq 1 ]
then
echo httpd_site1
fi
if [ "$httpd_site2" -eq 1 ]
then
echo httpd_site2
fi
You can use the same configuration script in the boot process or shutdown
process to start and stop the Web servers. The foreach line looks
like this:
foreach WebServer in $(conf_apache)
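With many servers, one if block per server becomes tedious. A more generic variant could read the flags directly from the configuration file. The following is a minimal sketch, assuming that apache.conf contains only comment lines and lines of the form name=0 or name=1; the function name active_servers is hypothetical and does not come from the article's scripts.

```shell
#!/bin/sh
# active_servers FILE   (hypothetical helper)
# Print the name of every server that FILE flags as active (name=1).
active_servers() {
    awk -F= '
        /^[ \t]*#/ { next }                    # skip comment lines
        NF == 2 && $2 + 0 == 1 { print $1 }    # keep names whose flag is 1
    ' "$1"
}
```

In a real shell script, the foreach line then becomes for WebServer in $(active_servers apache.conf), followed by do and done around the loop body.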
From the /opt/apache/http/etc/<server>/httpd.conf configuration
file, you obtain a list of all the log files "<server>" is using.
Finally, the command /opt/apache/http/sbin/apachectl
graceful restarts the server.
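One way to obtain the list of log files from httpd.conf is to look for the ErrorLog, CustomLog, and TransferLog directives. The sketch below is an illustration under stated assumptions rather than the script used on my site: the function name list_logs is hypothetical, paths containing spaces are not handled, and relative paths are printed as written (Apache resolves them against ServerRoot).

```shell
#!/bin/sh
# list_logs CONF   (hypothetical helper)
# Print the file named by every ErrorLog, CustomLog, and TransferLog
# directive in CONF, one per line, without duplicates.
list_logs() {
    awk '
        tolower($1) == "errorlog" || tolower($1) == "customlog" ||
        tolower($1) == "transferlog" {
            path = $2
            gsub(/"/, "", path)                # drop optional quotes
            if (substr(path, 1, 1) != "|")     # ignore piped loggers
                print path
        }' "$1" | sort -u
}
```

For example, list_logs /opt/apache/http/etc/<server>/httpd.conf would supply the inner foreach loop with the log files of one server. Directives inside virtual host sections are picked up as well, because the sketch ignores the nesting of the configuration.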
The Best of Both Worlds
In the preceding section, I mentioned another option for getting
the list of Web servers and their log files. This option
is simply writing a configuration file containing the server and
log file information. The configuration file has the disadvantage
of forcing you to maintain the same information in two locations.
You have to maintain the list in the configuration files of the
Web servers and also in the configuration file for the log rotation.
You can, however, combine the two methods to obtain a solution offering
the advantages of both.
I will use a modified script to produce the configuration file
for the log rotation. I'll create a configuration file in XML format
that contains the list of Web servers installed on my system, together
with a setting indicating whether the Web server is active. I also
have one XML file for every Web server. This server XML file contains
the information on where the log files are located and what the
rotation procedure should do with the log files.
If I add a new server or a virtual host, I update the XML configuration
file accordingly. Existing information is not overwritten. A Perl
script uses the XML files to produce two lists (a list of servers
and a list of log files) that the script needs for the log rotation.
This XML approach is the more flexible of the two solutions I
implemented on my site. It is still in the experimental phase and
will get some interesting new features in the next few months. The
main reason I am using XML is that it is human-readable. With
a little work, you can present the XML in a nice form for the browser.
Last, but not least, XML is a standard -- many existing libraries
can read and write XML files.
Listing 1 shows the XML configuration file that lists the servers
requiring log rotation. Listing 2 shows an example file for a Web
server. Listing 3 shows the example of a Tomcat application server,
and is intended to demonstrate that this approach applies to other
services as well as Web services.
As I mentioned, the configuration can also provide information
showing whether to disable or enable rotation of a particular log
file. You can also add instructions to stop or start the Web server,
and, at the end of the file, you can add information to indicate
whether to copy the log files onto a remote log server or process
the files for later reuse.
Processing the Log Files
I will briefly cover what to do with the log files after the log
file rotation. There are several programs for processing log files,
so I won't describe these programs in detail. The product I use
is Access Watch from Dave Maher (http://www.accesswatch.com/),
which is written in Perl. The license fee is low, and you can modify
it to fit your particular needs. It is made up of two parts -- the
first part scans the log file and constructs data structures used
for the production of the statistics. The second part produces the
statistics and their graphical representation. The graphical display
feature is particularly useful because it supports standard Web
elements.
Recall that the log rotation procedure splits the log file into
two parts. The first part is used for statistics; the second part,
the actual access data, is preserved for later processing. To process
the access data, you have to parse the log file. Once you parse
the data, you could produce the necessary data structures and keep
them in a database for further processing. I am working on such
a solution. The clean separation of the Perl code in Access Watch
lends itself to this database technique.
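To give an idea of what the parsing step involves, the following sketch counts the requests per page. It is not taken from Access Watch; the function name hits_per_page is hypothetical, and the sketch assumes Common Log Format, in which the seventh whitespace-separated field is the requested path.

```shell
#!/bin/sh
# hits_per_page LOG   (hypothetical helper)
# Print "count path" for every requested path in LOG, busiest page first.
hits_per_page() {
    awk '
        { hits[$7]++ }      # field 7 is the requested path
        END { for (page in hits) printf "%d %s\n", hits[page], page }
    ' "$1" | sort -rn
}
```

The same loop could just as well emit rows for a database import instead of a sorted list.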
Conclusion
Many services require the maintenance of log files. This maintenance
includes procedures to keep the log files small and to handle the
log files later. The handling of the log files includes the production
of statistics and the final storage on media, such as magnetic tapes
or DVD. In this article, I've illustrated a log file rotation solution
that I use. The configuration file format is XML. For every Web
server, I provide an XML file containing all the necessary information
to stop or start the server and to process the log files. You can
also use this solution for other services that need log file rotation.
I use it for a Tomcat application server. The scripts for the maintenance
of the configuration files can be downloaded from the Sys Admin
Web site (http://www.sysadminmag.com/code/).
Reinhard Voglmaier studied physics at the University of Munich
in Germany and graduated from Max Planck Institute for Astrophysics
and Extraterrestrial Physics in Munich. After working in the IT
department at the German University of the Army in the field of
computer architecture, he was employed as a Specialist for Automation
in Honeywell and then as a Unix Systems Specialist for performance
questions in database/network installations in Siemens Nixdorf.
Currently, he is responsible for LDAP Services at GlaxoSmithKline,
Italy. He can be reached at: rv33100@gsk.com.