Automating Web Reports with Analog

Isaac Sacolick

At my organization, we sell a number of Web applications to newspapers as an Application Service Provider (ASP). These products were developed over a six-year period, and some were acquired from other companies, so they were built and run on different software platforms and system architectures. We also had several traffic-reporting tools and procedures in place, each with its own strengths and its own problems. We were tasked with finding a consolidated traffic-reporting solution that would work with our different products but not require a massive re-engineering of the applications or the back-end tools.

We decided to keep things simple and pilot an approach on one of the company's products. We aimed to develop a fairly generic approach, loosely coupled to the application environment, so that integrating the other products later would be feasible. We knew that using Web logs would be the easiest approach, so we focused our early efforts on the procedures for managing the log files.

We chose the free logfile analysis program Analog, available at http://analog.cx, as the core of our reporting system. My company had experimented with two commercial traffic-analysis packages without much success. Outsourcing seemed like a quick and easy way to provide detailed traffic reports, but the costs were high, and it ultimately would not bring us closer to reports that tie traffic metrics to other product metrics. We had already used Analog for quick product performance reports, so we knew its capabilities and strengths.

Although there are other open-source traffic-reporting systems with better reporting features, Analog's simplicity made it attractive. At its core, it parses one or more log files, supports various types of filters, and produces a large number of report sections and output formats. A single configuration file defines all of the processing parameters, rules, and reporting features. One particularly important feature is its ability to build a cache of the raw reporting data. This lets you parse the log files once, create a cache output file, and then run Analog on the cache file(s) to generate reports. Cache files are smaller than the original log files, and processing them is significantly faster.
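The two-pass idea looks like this in Analog configuration terms (the directive names are Analog's own; the file paths are only illustrative):

# Pass 1: parse the raw logs once and write a compact cache file
LOGFILE       /logs/access_log.2003-05-01
CACHEOUTFILE  /cache/2003-05-01.cache
OUTFILE       none

# Pass 2: report from the cache instead of re-parsing the raw logs
LOGFILE       none
CACHEFILE     /cache/2003-05-01.cache
OUTFILE       /reports/2003-05-01.html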

Some other features that we liked about Analog are as follows:

  • It's incredibly fast, and we could run months' worth of data on relatively inexpensive hardware. The recovery and rerunning features of the commercial products we tried, by contrast, were slower and far more complex.
  • It has excellent documentation and an active user community, and it is on its fifth major version.
  • Analog has several output formats including a fielded data output. We knew we could use Analog to generate the reports, but when we were ready to join traffic data with other product metrics, we could use Analog to generate raw summary information. We could then import the summary data into our data warehouses and develop higher level reports.

Implementation

After choosing Analog as the log processing application, we defined the other parts of the system. Figure 1 shows a simplified view of the network and system architecture. Web servers and Web application servers log to their local disks, and their log rotation times are standardized. We have two core processing scripts. A log management script pulls the logs from each server and copies them to a redundant NAS; the backup server is then tasked with a job to back up the new log files. The data processing server runs the second script, a log processing script that automates processing each day's logs. At the end of this process, the daily batch reports are generated and stored back on the NAS. The Reporting Server is both an intranet and extranet server that allows internal users and customers to view reports. The Data Processing and Reporting Servers run identical Linux configurations, which provides failover capability if needed. The entire configuration runs on a backnet VLAN so that reporting traffic does not interfere with Web-based traffic.

The log management script is a relatively straightforward script that moves data files to the NAS and triggers the backup process. The log processing script manages the automated processing of the new logs and provides a simple interface for configuring and running batch reports. The script supports several independent workflows that can be chained together by running it with the appropriate parameters. Workflows supported include:

1. Ensure that all log files are present for a given day.

2. Create a directory tree that organizes the logs by date and by server.

3. Break each log up by virtual host and store these in a new directory structure.

4. Run Analog on a day's worth of logs from each virtual server and create the cached output.

5. Create a report covering one or more virtual hosts for a specific date range.

There are many ways to support steps 1 and 2, depending on your server architecture, naming conventions, and reporting needs. Step 3 was developed to optimize the process and to make it easy to give customers access to their log files; it may not be required when you manage a limited number of virtual hosts or when the Web servers log each virtual host to a separate file. Steps 4 and 5 show our approach to automating the nightly log processing and configuring Analog to deliver batch reports.

After step 3, the logs are stored in a directory structure that looks like:

/logs_by_vhost/VirtualHost/YYYY-MM-DD
Inside this directory are all log files for this virtual host on the specific day. We wanted to create an output cache repository that looked like:

/cache_by_vhost/VirtualHost/YYYY/MM/YYYY-MM-DD.cache
where the file "YYYY-MM-DD.cache" is the file Analog creates for the virtual host for data covering a specific day.
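The split in step 3 can be a simple one-pass filter. The sketch below is ours, not the article's script; it assumes each log line is prefixed with the virtual-host name (Apache's %v format token) and simplifies filehandle management:

#!/usr/bin/perl -w
use strict;

# Hypothetical sketch of step 3: split one day's combined access log
# into per-virtual-host files, assuming each line begins with the
# virtual-host name (Apache's %v format token).
my ($logfile, $date, $path) = @ARGV;   # e.g. access_log 2003-05-01 /logs_by_vhost
my %handles;                           # one output filehandle per vhost

open(my $log, "<", $logfile) or die "cannot open $logfile: $!";
while (my $line = <$log>) {
    my ($vhost, $rest) = split(/\s+/, $line, 2);
    next unless defined $rest;         # skip malformed lines
    unless (exists $handles{$vhost}) {
        my $dir = "$path/$vhost/$date";
        system("mkdir", "-p", $dir) == 0 or die "mkdir $dir failed";
        open(my $fh, ">>", "$dir/access_log")
            or die "cannot open $dir/access_log: $!";
        $handles{$vhost} = $fh;
    }
    print { $handles{$vhost} } $rest;
}
close($log);

A production version would also cap the number of simultaneously open filehandles and report malformed lines for quality control.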

To create a cache file, set up an Analog configuration file to produce a cached output. You can automate this by setting up a "template" Analog configuration file, then running a function in the script that does the following:

  • Makes a list of all virtual hosts to run on
  • For each virtual host:
      • Loads the template Analog configuration file into a string
      • Substitutes the runtime parameters for this run
      • Saves this "instance" version of the Analog configuration file
      • Runs Analog, which creates the cache file
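The placeholder syntax below is our own convention for illustration (the actual template ships with the article's code archive), but a template of this kind might look like:

# analog.template -- each %KEY% token is replaced at runtime by the
# script; the directive names are Analog's.
LOGFILE      %LOGFILE%
FROM         %FROM%
TO           %TO%
CACHEOUTFILE %CACHEOUTFILE%
VHOSTINCLUDE %VHOSTINCLUDE%
HOSTNAME     "%HOSTNAME%"
HOSTURL      %HOSTURL%
OUTFILE      %OUTFILE%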

Listing 1 shows how to create the instance version of the configuration file. The function takes three arguments: the input template filename, the output instance filename, and a hash of values to substitute. When creating the cache, we set the following variables:

my %config_substitutes = (
    LOGFILE         => "$PATH/logs_by_vhost/$vhost/$date/*",
    FROM            => "$date",
    TO              => "$date",
    CACHEOUTFILE    => "$PATH/cache_by_vhost/$vhost/$year/$month/$date.cache",
    VHOSTINCLUDE    => "$vhost",
    HOSTNAME        => "$hostname",
    HOSTURL         => "http://$vhost",
    OUTFILE         => "none",
);
If we have two strings, $infile holding the input Analog template filename and $outfile the output instance configuration filename, the routine can be called as:

createAnalogConfigFile(\$infile, \$outfile, \%config_substitutes);
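The full routine appears in Listing 1 in the article's code archive; a minimal sketch, assuming the %KEY% token convention shown above, might read:

# Minimal sketch of Listing 1 (the %KEY% token convention is ours)
sub createAnalogConfigFile {
    my ($infile, $outfile, $subs) = @_;   # two scalar refs, one hash ref

    # Slurp the template configuration file into a string
    open(my $in, "<", $$infile) or die "cannot read $$infile: $!";
    my $config = do { local $/; <$in> };
    close($in);

    # Replace each %KEY% token with its runtime value
    foreach my $key (keys %$subs) {
        $config =~ s/%$key%/$subs->{$key}/g;
    }

    # Write out the instance version of the configuration file
    open(my $out, ">", $$outfile) or die "cannot write $$outfile: $!";
    print $out $config;
    close($out);
}

Analog can then be pointed at the instance file on its command line, for example with system("analog", "+g$outfile"), since Analog's +g option names a configuration file to read.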
The key parts of this substitution are the Analog configuration parameters that depend on the virtual host or the date. You may find other parameters that need to be modified, but the process should be similar. After creating the instance configuration file, running Analog against it generates the cache file. We run this process each evening, after the log files are archived, which minimizes later processing and leaves the environment ready to run batch and ad hoc reports over arbitrary date ranges.

Running Reports

Generating reports follows a process similar to creating the cache. The program takes the virtual host, a "from" date, and a "to" date as parameters. We then create a config_substitutes hash that starts with something like:

my %config_substitutes = (
    LOGFILE         => "none",
    FROM            => "$from",
    TO              => "$to",
    CACHEOUTFILE    => "none",
    VHOSTINCLUDE    => "$vhost",
    HOSTNAME        => "$hostname",
    HOSTURL         => "http://$vhost",
    OUTPUT          => "HTML",
    OUTFILE         => "$PATH/report_by_customer/$vhost/$vhost-$from-$to.html",
);
Other parameters turn specific report sections on or off. The only remaining step is to tell Analog which cache files to load for this report, using the CACHEFILE parameter. To run on a date range, a CACHEFILE directive is needed for every cache file that covers the virtual host (or group of virtual hosts) and a date within the range. Listing 2 shows how to create the CACHEFILE parameter string: from the $from and $to dates, it builds an array of every date between the two, then creates one CACHEFILE parameter for each date in the array. This string can then be added to the hash as:

$config_substitutes{CACHEFILE} =
  createCacheFileStr("$PATH/cache_by_vhost", $vhost, $from, $to);
Analog then runs and creates an output report (in this example, an HTML report), which is stored in a report output directory "report_by_customer/$vhost".
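Listing 2 holds the real routine; a minimal sketch, assuming YYYY-MM-DD dates and the CPAN Date::Calc module for the date arithmetic, might look like:

use Date::Calc qw(Delta_Days Add_Delta_Days);

# Hypothetical sketch of Listing 2: emit one CACHEFILE directive for
# every day between $from and $to, inclusive.
sub createCacheFileStr {
    my ($base, $vhost, $from, $to) = @_;
    my @from = split(/-/, $from);        # (year, month, day)
    my @to   = split(/-/, $to);
    my $str  = "";
    foreach my $offset (0 .. Delta_Days(@from, @to)) {
        my ($y, $m, $d) = Add_Delta_Days(@from, $offset);
        $str .= sprintf("CACHEFILE %s/%s/%04d/%02d/%04d-%02d-%02d.cache\n",
                        $base, $vhost, $y, $m, $y, $m, $d);
    }
    return $str;
}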

Viewing the Reports

Because the reports are archived in directories by virtual host, we configured our Reporting Server with the Apache Web server. Internal users and customers can log in, browse the appropriate reporting directories, and select and review reports. We also wrote the reporting script to run as both a command-line application and a CGI, so internal users can enter reporting parameters into a Web form, submit it, and generate reports on the fly.
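One common way to support both invocation styles (this sketch is ours, and runReport() stands in for the real report driver) is to branch on the CGI environment:

use strict;
use CGI;

# Hypothetical sketch: run the same reporting code from the command
# line or as a CGI, depending on how we were invoked.
my ($vhost, $from, $to);
if ($ENV{GATEWAY_INTERFACE}) {            # invoked as a CGI
    my $q = CGI->new;
    ($vhost, $from, $to) =
        ($q->param('vhost'), $q->param('from'), $q->param('to'));
    print $q->header('text/html');
} else {                                  # invoked from the command line
    ($vhost, $from, $to) = @ARGV;
}
runReport($vhost, $from, $to);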

Conclusions

From here, there are many features that can be added to this basic framework. Here are some examples:

  • If you rotate servers frequently, consider building a database that tracks this activity. You can then enhance the quality control steps by checking the database and ensuring that the proper log files were rotated and are available before generating the cache.
  • If you have many different batch reports to run nightly, consider building a database that lists the reports, duration, frequency, and configuration for the reports.
  • If you want to join traffic-reporting metrics with other product metrics, use Analog's COMPUTER output format and load the data into your database.

Also, consider exploring the many helper applications listed on the Analog Web site:

http://analog.cx/helpers/index.html
Whatever solution you implement, keep in mind the basic principles: ensure a quality log management process, automate that process, and engage all parties in prioritizing the reporting features and management utilities.

Acknowledgements

Special thanks to several individuals who contributed to this work, including Bill Castagne, Wei Guo, Nanda Mounasamy, Kevin Phillips, Wayne Regencia, Oneill Stanleigh, Viktoria Stern, and Yvette Vacca.

Isaac Sacolick is the Chief Technology Officer of PowerOne Media Inc, a leading Application Service Provider for newspapers, and also serves as an advisor to several technology startup companies. His professional interests and expertise are in publishing systems, application networks, data warehousing, process automation, and artificial intelligence.