Cover V13, i07

Article

jul2004.tar

Scanning Apache Logs with PHP

Russell J.T. Dyer

On my Web site, I have a few key Web pages located in a directory requiring user authentication. For some documents, though, I want to know when they are accessed and who accessed them. For instance, I might put a business proposal in a private directory and send emails to several prospects asking them to read my proposal. So I can learn whether the document was viewed and which prospects viewed it, a PHP script scans my Apache access log regularly and sends an email to my cell phone telling me if it discovers a match. In this installment of my series on PHP, I will describe this PHP script to explain a little PHP programming and to give you some ideas on how PHP might be used for systems administration and log monitoring.

Getting Started

This PHP script won't be run through a browser, but will be executed by cron. The entry in crontab looks like this on my Linux server:

0,15,30,45 * * * * root /sbin/ck-apache-log.php
The opening four numbers separated by commas are the minute settings. The first asterisk that follows is a wildcard that means "every hour". So on the hour, as well as 15, 30, and 45 minutes after the hour, the specified script will be run by the root user. The other three asterisks represent days of the month, months of the year, and days of the week, respectively. That means this script will run every day.

The opening code for the script ck-apache-log.php follows. Because this script will not generate a Web page display, we need to give the opening sha-bang (#!) along with the path for php (which may be different on your server), and the -q option to prevent PHP from involving the Web server:

#!/usr/bin/php -q

<?php

    $dir = '/var/log/httpd';
    $log = "$dir/access_log";
    $ck_log = "$dir/php-ck-log.txt";
    $ck_list = "$dir/php-ck-list.txt";
    $ck_iplist = "$dir/php-ck-iplist.txt";

    $from = 'root@dyerhouse.com';
    $to = 'russell@dyerhouse.com';

    $access = array();
    $ips = array();
The first variable $dir contains the path to where our logs and other files are to be stored. The variable $log provides the name of the Apache log that PHP will be scanning. The variable $ck_log contains the name of the log in which PHP will record information on any user accesses that are found in the Apache log. The variable $ck_list names the data text file that will have a list of files for which we want our PHP script to search the Apache log. As for $ck_iplist, it contains a list of IP addresses with which we are familiar. This will be used to give a better display in the email messages that will be sent to us. The next pair of lines in our script establishes the variables that contain the email address of the server and the address of the person whom PHP will email. The last pair sets up two arrays, which will be filled with data later.

Pages to Watch

Setting aside the PHP script for a moment, let's set up a text file that will contain a list of pages for which PHP is to search. This will be a simple text file that we'll create with a text editor (like vi) in the /var/log directory and name php-ck-list.txt:

Business Proposal|business-proposal.html
Sales Plan|sales-plan.html
It's just a simple data text file with each record on a separate line. Each record only contains two fields of data separated by a vertical bar: the first is the name of the document and the second field is the file name. For simplicity, I've only listed two documents. Getting back to our PHP script, let's look at the next section of code in which PHP will read this text file and retrieve its data:

$PAGES = fopen("$ck_list", 'r')
         or die("Could not open page listing file.");

$line = rtrim(fgets($PAGES, 4096));

while(!feof($PAGES))
{
   list($name,$file_name) = split('\|', $line, 2);
   $pages[$name] = $file_name;
     
   $line = rtrim(fgets($PAGES, 4096));
}
fclose($PAGES);
In the first line of code above, we're establishing the file handle $PAGES that will be used to read each line of text from the data file that we just discussed. We're using the fopen() function to open the file specified in its first argument and in read mode, per the second argument given. If unsuccessful in opening the file, per the or component wrapped onto the next line, the script will die and display the error message we've provided. The next line uses the fgets() function to get the first line of data (i.e., 4k of data). Before storing that data temporarily in the variable $line, we use the rtrim() function to trim off the right-most character, the line feed.

PHP will now loop through each line of the data text file as long as we're not at the end of the file. This is accomplished with the feof() function coupled with the exclamation point as a negator. In the first line of the loop statement, we use the split() function to split out the Web page's name and it's file name, based on the vertical bar that separates them. The list() function will capture those values and store them temporarily in the variables named. Next, PHP stores these values in an associative array for retrieval later in the script. We end the while statement by retrieving another line from the file before starting the process over. When we reach the end of the text file, we use fclose() to close the file and to drop the file handle.

Familiar Hosts

The next task is to get a list of IP addresses with which we are already familiar. These are the IP addresses for the people that we emailed asking to look at our documents. Without this list, PHP would only be able to provide us with the IP address of the host that accessed the documents. The text file php-ck-iplist.txt is set up like php-ck-list.txt. The only difference is the data content -- each record will contain a field containing the expected user's name and then a vertical bar and then their server's IP address. Below is the section of code that will extract those records for use in the script:

$LIST = fopen("$ck_iplist", 'r')
        or die("Could not open host listing file.");

$line = rtrim(fgets($LIST, 4096));

while(!feof($LIST))
{
   list($host,$ip) = split('\|', $line, 2);
   $hosts[$ip] = $host;

   $line = rtrim(fgets($LIST, 4096));
}
fclose($LIST);
This section of code works like the previous section. The only differences are the names of the variables and the like. We're storing the results here in another associative array called $hosts. Incidentally, it would be a little more of an involved script, but we could use the host command to look up the host information on any IP address with which we're not familiar.

Scanning Apache

Now that PHP knows the file names for which it's searching and knows who should be hitting them, we can have PHP search the Apache log:

foreach($pages as $page_name => $file_name) {

   $LOG = fopen("$log", 'r')
          or die("Could not open the Apache log.");

   $line = rtrim(fgets($LOG, 4096));

   while(!feof($LOG))
   {
      if(ereg($file_name, $line)){
         preg_match('/(\d*.\d*.\d*.\d*) \- (\w*)
                    \[(\d*)\/(\w*)\/(\d{4})/', $line, $matches);

         $ip_addr = $matches[1];
         $htuser = $matches[2];
         $day = $matches[3];
         $month = $matches[4];
         $year = $matches[5];

         if(!isset($htuser)){
            $htuser = 'Anonymous';
         }

         if(!in_array("$ip_addr", $ips) && $ip_addr){

            $access["$ip_addr"] =
               "'$name' was accessed on $month $day,
               $year by $htuser from $hosts[$ip_addr]
               $ip_addr.\n\n";

            array_push($ips, "$ip_addr");
         }
         
      }
      $line = rtrim(fgets($LOG, 4096));
   }
   fclose($LOG);
PHP closed out the data text file containing the list of Web pages earlier, but it still has that information stored in the associative array $pages. To retrieve that information, we're using a foreach statement above. It will go through each data pair and extract the page name and the file name, placing them temporarily in the variables $page_name and $file_name, respectively. PHP will hang onto that information for use in this section of code in which it will search the Apache log and for the next section in which it will check its log to make sure that it didn't already inform us of each user access.

We start off this foreach statement block by opening the Apache log and grabbing one line of text as we did in the previous sections. We then start a while statement in which PHP will examine the line of data and if it contains information on a user accessing one of our documents, PHP will capture the data so as to prepare to email us at the end of the script.

The first line of the while statement block uses the ereg() function to determine whether the file name is contained in the line of text retrieved. If it isn't, it will skip the statement block contained in the if statement and get another line of data from the Apache log. If the line inspected does contain the file name, PHP will use the preg_match() function to pick apart the data needed for the email message.

In this preg_match() function, we're using Perl-like pattern matching. The second argument contains the string from which we're extracting data. The first argument shows the patterns contained within two forward-slashes. We're looking for a pattern like this:

12.1.1.100 - russell - [15/June/2004
Patterns within parenthesis are captured and placed in the array $matches, which is named in the third argument. To capture digits, we use \d, and to capture letters, \w. An asterisk indicates zero or more of the character type that precedes it. Everything else in the pattern shown above equals the actual characters that PHP should find. So that PHP won't be confused by the forward slashes in the pattern, they are escaped with a back slash, meaning PHP actually should look for a forward slash in the string.

To divide the results of preg_match() among the variables in the lines that follow it, we use a simple sequential array data access method: $matches[n]. Following the variables, PHP checks whether there is a value in the variable $htuser using the isset() function. We'll always have a user name with a password-protected directory, but we include this feature in case we want to add public files to our list. If no user name is given when the file was accessed, the variable won't be set, so PHP will set it to Anonymous.

Before storing the information we just retrieved in a temporary associative array (i.e., $access) and moving onto the next line in the Apache log, PHP checks whether we've already stored it in $access (described in the next paragraph). We use the function in_array() to see whether it is in the array. The search parameter -- or rather the IP address -- is given as the first argument; the array ($ips), which contains a list of IP addresses in which PHP matched during the running of the script, is given as the second argument.

If this is a first time during this script that this IP address was found to have accessed this document, then we'll add a line of text that says the document for which PHP searched was accessed by the specified user from the IP address found, along with a name for the host if known. That text is stored in the associative array called $access, which will be keyed on the IP address. PHP then records the IP address in the array $ips using the array_push() function. It will use this array to make sure that it doesn't email us twice on the same address in the same message. The rest of the code section above closes out like the previous sections.

Keeping a Log

Let's move on to the next section of code in which PHP will record its findings in its log and get the email message ready. Keep in mind, though, that the last squiggly bracket of this section is from the foreach statement from the previous section of code. That is to say, PHP is still processing its search of one document. Once it has recorded its findings to its log and saved the information in a string for mailing, it will search the log again for another document:

   $RECORD = fopen("$ck_log", 'a')
             or die("Could not open PHP log.");

   foreach ($access as $ip_add => $acc_info) {
         $parameter = "$page_name|$file_name|$ip_add";
         $results = shell_exec("grep -cs '$parameter' $php_log");

         if($results == 0) {
            fputs($RECORD, "$parameter\n");

            $message = $message . "$acc_info";
         }
   }
   fclose($RECORD);
}
For the file handle in this section, we are opening the file for appending, hence the "a" in the second argument of the fopen()function. To loop through each message temporarily stored in the associative array $access, we'll use foreach again. To do this, we'll reconstruct the pattern in which we saved the data in the log php-ck-log.txt and save that pattern in the variable $pattern. To keep it simple, we will next execute a system command grep using the exec() function. We're issuing the -c option of grep to count the number occurrences of the pattern in the log. With the -s option, we suppress any error messages from grep. In the next line of code, we use an if statement to check whether the result of the grep is zero, meaning there are no entries matching the pattern. If there are no entries, the first line of the block for the if statement contains an fputs() function, which will write to the log file the pattern that was given a few lines of code above this line.

The final line of this if statement stores the information in the variable $acc_info in a new variable called simply $message. Actually, this line appends $acc_info to the end of whatever is already in the variable $message, which will be empty the first run through. Once PHP is finished looping through the list of Web pages for which it is to search, it closes out its log file and the foreach statement of earlier.

Wrapping It Up

The script is now ready to send us an email, so let's look at the final section of code:

    if($message) {
       mail($to, "Web Log", $message, $from);
    }

    exit();

?>
We use an if statement to see whether the variable $message contains anything. If it does, PHP uses the mail() function to mail us the contents. The variable containing the email address in which to send the message is given in the first argument of the mail() function. The second argument is the text that will go in the subject line of the message. The next argument is the message, and the last argument is the address from which it comes.

Conclusion

This script could be written a little tighter, but it gives you some examples of PHP in action and an example of a PHP script that isn't run through the usual Web interface. The language is fairly straightforward so that it may be easily maintained and improved upon by various levels of programmers. It's also a cool use of PHP -- it can be pretty impressive when you're sitting at a coffee house talking to a friend and you receive an email on your cell phone saying someone has just looked at a particular document of yours.

Russell Dyer is a Perl programmer, a MySQL developer, and a Web designer living and working on a consulting basis in New Orleans. He is also an adjunct instructor at a technical college where he teaches Linux and other open source software. He can be reached at: russell@dyerhouse.com.