Building a Web Log Rotation System

Reinhard Voglmaier

Webmasters want to know how Web servers spend their time, and the Webmasters aren't the only ones who are curious. Customers (internal or external) want to know how often their pages are visited and how the visitors navigate within the sites. All this useful information is kept in log files. Webmasters, therefore, love to keep log files for as long as possible, but systems administrators have a different view of these bulky files. Systems administrators want to minimize the space the log files occupy and move the files to backup as soon as possible.

The result is a system that calls for Webmasters to:

Produce and publish statistics for users
Analyze the log files for administrative purposes (e.g., to track down performance questions)
Analyze the log files for security concerns (e.g., to discover attacks against the Web servers)
Delete log files that aren't needed anymore, and archive log files you need to save

In this article, I'll describe how to build a system for managing log file rotation. I'll also discuss a free tool for analyzing log files and give some hints on log file management. I will focus on log files for the Apache Web server, but the logic applies also to other types of log files, since the underlying requirements are usually the same. The scripts described in this article are available for download at http://www.sysadminmag.com/code/.

Rotating Log Files

The Web server produces one or more log files. Because log files tend to grow, it is wise to compress the log files from time to time and switch to fresh files. The old log files get archived and, from this archived log file, you can later produce statistics. You will also need a strategy for how to manage the archive. The story may be somewhat more complicated, but the main components are shown in Figure 1. Each of these components can further be broken into smaller pieces.

The management of log files consists of a log file rotation mechanism and a means for converting the file to a format suitable for subsequent processing. The term "log file rotation" may be somewhat misleading -- there's no real "rotation" of log files in the sense that you rotate tapes in a tape drive, but the "rotation" metaphor is nevertheless appropriate for the task of moving older data to an archive to make room for newer data. At first glance, this log file rotation task seems easy to achieve. You close the Web server, move the log files to a different place, and start the Web server again. If you have only one Web server implementing only one logical site, the task at hand seems rather simple.

I'll assume that the Web server produces one access log file reporting the traffic and one error log file reporting the errors. "ddmmyy" represents the date in the format day, month, and year. The command to start or stop the Apache Web server is apachectl <stop/start>. The next few lines show this simple approach:

today='date -f ddmmyy'
apachctl stop
mv accesslog accesslog.${today}
mv errorlog errorlog.${today}
apachectl start

This solution is ready to work, however, I will apply some small changes before going into production. If I don't want to interrupt the Web services in such an abrupt way, but would instead prefer to let the server continue to service the ongoing requests, I can add code to restart the Apache Web server gracefully. To do this, I only need a slight modification to the above lines:

today='date -f ddmmyy'
mv accesslog accesslog. ${today}
mv errorlog errorlog. ${today}
apachectl graceful
sleep 600
gzip accesslog. ${today}
gzip errorlog. ${today}

The sleep is necessary, since the Web server may still be using the log file for a pending request.

This solution works fine if I have only one Web server without virtual hosts. Life isn't so easy, however -- many installations have more than one Web server. Each of those Web servers can provide several virtual hosts, resulting in several log files per server.

We can generalize our script to handle the case of multiple Web servers as follows:

today='date -f ddmmyy'
foreach WebServer in <List of Webserver>
    foreach LogFile of <Logfiles of $WebServers>
        mv $LogFile $LogFile.{today}

    $WebServer restart graceful
    sleep 600
    foreach LogFile of <Logfiles of $WebServers>
        gzip $LogFile.{today}

In this solution, the procedure continues to sleep (see line 6). If we cycle two times through the Web servers, we only need one sleep. There is, however, a new problem with this approach. If I compile statistics on a daily basis, I will want the log file to end at hour "23.59.59" and the new log file to begin at "00.00.00." If I have a number of Web servers and each server has a large number of log files, the double loop "foreach server" and "foreach logfile" will take some time. The graceful restart of the server also adds time. The end result is that it will not be possible to ensure that all the log files end at exactly the last second of the day.

The solution to this problem is very easy. I will rotate the files later, for example, at 01.00 in the morning. The log file will therefore have too much data (not only the data from the previous day, but also some data from the current day):

today='date -f ddmmyy'
yesterday='date -f ddmmyy -1'

foreach WebServer in <List of Webserver>
    foreach LogFile of <Logfiles of $WebServers>
        mv $LogFile $LogFile.TMP

    $WebServer restart graceful

sleep 600

foreach WebServer in <List of Webserver>
    foreach LogFile of <Logfiles of $WebServers>
        cat $LogFile.$yesterday $LogFile.TMP > $LogFile.TMP
        split($LogFile.TMP,
              $LogFile.$today,
              $LogFile.$yesterday)
        gzip $LogFile.{yesterday}

Of course, this approach requires separate post-processing of the log files to assemble the pieces into a single one-day backup. The second loop contains the tricky part. The cat command prepends the log file for yesterday with the TMP file we produced today. The TMP file now contains all the data from yesterday, in addition to data from today. We then split the data to create a complete log for yesterday, compress the data from yesterday, and hold the data from today for the run of tomorrow. The split program could do other things as well, to ease later elaboration, for example resolve IP numbers in names or similar tasks. Later on I will turn back to this issue.

Getting the Server Information

So far, I haven't explained where to get the information on the servers. There are two approaches, each with advantages and disadvantages. The first approach is to use a configuration file. The configuration file contains a list of Web servers and the log files for each Web server. On my site, I use a different approach. The alternative is to rely on the syntax in the installation directories. See Figure 2 for an idea of what I mean.

All Web servers are under the directory /opt/apache/http. All configuration files are in /opt/apache/http/etc/<server>. All log files are in /opt/apache/http/var/<server>/log. At the end, the stop/start/restart scripts are in /opt/apache/http/sbin/<server>.

Cycling through the /opt/apache/http/sbin/ directory delivers a list of all Web servers. In some cases, however, you may not want to rotate the logs for all the available Web servers. On my site, for example, I distinguish between active servers and inactive servers. I keep this information in a simple file called apache.conf. The script (let's call it conf_apache) simply echoes all active Web servers for use in the while loop. The following script does the job. A server with a 1 value is an active server, and a 0 setting identifies an inactive server:

#Document Server Site 1
httpd_site1=1
# Document Server Site 2
httpd_site2=1

if [ "$httpd_site1" -eq 1]
then
     echo $httpd_site1
fi

if [ "$httpd_site2" -eq 1]
then
     echo $httpd_site2
fi

You can use the same configuration script in the boot process or shutdown process to start and stop the Web servers. The foreach line looks like this:

foreach WebServer in 'conf_apache'

From the /opt/apache/http/etc/<server>/httpd.conf configuration file, you obtain a list of all the log files "<server>" is using. At the end of the file, the command /opt/apache/http/sbin/apachectl graceful restarts the server.

The Best of Both Worlds

In the preceding section, I mentioned another option for getting the list of Web servers and their configuration files. This option is simply writing a configuration file containing the server and log file information. The configuration file has the disadvantage of forcing you to maintain the same information in two locations. You have to maintain the list in the configuration files of the Web servers and also in the configuration file for the log rotation. You can, however, combine the two methods to obtain a solution offering the advantages of both.

I will use a modified script to produce the configuration file for the log rotation. I'll create a configuration file in XML format that contains the list of Web servers installed on my system, together with a setting indicating whether the Web server is active. I also have one XML file for every Web server. This server XML file contains the information on where the log files are located and what the rotation procedure should do with the log files.

If I add a new server or a virtual host, I update the XML configuration file accordingly. Existing information is not overwritten. A Perl script uses the XML files to produce two lists (a list of servers and a list of log files) that the script needs for the log rotation.

This XML approach is the more flexible of the two solutions I implemented on my site. It is still in the experimental phase and will get some interesting new features in the next few months. The main reason I am using XML is because it is human-readable. With a little work, you can present the XML in a nice form for the browser. Last, but not least, XML is standard -- many existing libraries can read and write XML files.

Listing 1 shows the XML configuration file that lists the servers requiring log rotation. Listing 2 shows an example file for a Web server. Listing 3 shows the example of a Tomcat application server, and is intended to demonstrate that this approach applies to other services as well as Web services.

As I mentioned, the configuration can also provide information showing whether to disable or enable rotation of a particular log file. You can also add instructions to stop or start the Web server, and, at the end of the file, you can add information to indicate whether to copy the log files onto a remote log server or process the files for later reuse.

Processing the Log Files

I will briefly cover what to do with the log files after the log file rotation. There are several programs for processing log files, so I won't describe these programs in detail. The product I use is Access Watch from Dave Maher (http://www.accesswatch.com/), which is written in Perl. The license fee is low, and you can modify it to fit your particular needs. It is made up of two parts -- the first part scans the log file and constructs data structures used for the production of the statistics. The second part produces the statistics and their graphical representation. The graphical display feature is particularly useful because it supports standard Web elements.

Recall that the log rotation procedure splits the log file into two parts. The first part is used for statistics; the second part, the actual access data, is preserved for later processing. To process the access data, you have to parse the log file. Once you parse the data, you could produce the necessary data structures and keep them in a database for further processing. I am working on such a solution. The clean separation of the Perl code in Access Watch lends itself to this database technique.

Conclusion

Many services require the maintenance of log files. This maintenance includes procedures to keep the log files small and to handle the log files later. The handling of the log files includes the production of statistics and the final storage on media, such as magnetic tapes or DVD. In this article, I've illustrated a log file rotation solution that I use. The configuration file format is XML. For every Web server, I provide an XML file containing all the necessary information to stop or start the server and to process the log files. You can also use this solution for other services that need log file rotation. I use it for a Tomcat application server. The scripts for the maintenance of the configuration files can be downloaded from the Sys Admin Web site (http://www.sysadminmag.com/code/).

Reinhard Voglmaier studied physics at the University of Munich in Germany and graduated from Max Planck Institute for Astrophysics and Extraterrestrial Physics in Munich. After working in the IT department at the German University of the Army in the field of computer architecture, he was employed as a Specialist for Automation in Honeywell and then as a Unix Systems Specialist for performance questions in database/network installations in Siemens Nixdorf. Currently, he is responsible of LDAP Services at GlaxoSmithKline, Italy. He can be reached at: rv33100@gsk.com.