Scanning Apache Logs with PHP
Russell J.T. Dyer
On my Web site, I have a few key Web pages located in a directory
requiring user authentication. For some documents, though, I want
to know when they are accessed and who accessed them. For instance,
I might put a business proposal in a private directory and send
emails to several prospects asking them to read my proposal. So
I can learn whether the document was viewed and which prospects
viewed it, a PHP script scans my Apache access log regularly and
sends an email to my cell phone telling me if it discovers a match.
In this installment of my series on PHP, I will describe this PHP
script to explain a little PHP programming and to give you some
ideas on how PHP might be used for systems administration and log
monitoring.
Getting Started
This PHP script won't be run through a browser, but will be executed
by cron. The entry in the system crontab, which includes a field for
the user, looks like this on my Linux server:
0,15,30,45 * * * * root /sbin/ck-apache-log.php
The opening four numbers separated by commas are the minute settings.
The first asterisk that follows is a wildcard that means "every hour".
So on the hour, as well as 15, 30, and 45 minutes after the hour,
the specified script will be run by the root user. The other
three asterisks represent days of the month, months of the year, and
days of the week, respectively. That means this script will run every
day.
The opening code for the script ck-apache-log.php follows.
Because this script will not generate a Web page display, we need
to give the opening sha-bang (#!) along with the path for
php (which may be different on your server), and the -q
option, which keeps the CGI version of PHP from printing HTTP headers:
#!/usr/bin/php -q
<?php
$dir = '/var/log/httpd';
$log = "$dir/access_log";
$ck_log = "$dir/php-ck-log.txt";
$ck_list = "$dir/php-ck-list.txt";
$ck_iplist = "$dir/php-ck-iplist.txt";
$from = 'root@dyerhouse.com';
$to = 'russell@dyerhouse.com';
$access = array();
$ips = array();
The first variable $dir contains the path to where our logs
and other files are to be stored. The variable $log provides
the name of the Apache log that PHP will be scanning. The variable
$ck_log contains the name of the log in which PHP will record
information on any user accesses that are found in the Apache log.
The variable $ck_list names the data text file that will have
a list of files for which we want our PHP script to search the Apache
log. As for $ck_iplist, it contains a list of IP addresses
with which we are familiar. This will be used to give a better display
in the email messages that will be sent to us. The next pair of lines
in our script establishes the variables that contain the email address
of the server and the address of the person whom PHP will email. The
last pair sets up two arrays, which will be filled with data later.
Pages to Watch
Setting aside the PHP script for a moment, let's set up a text
file that will contain a list of pages for which PHP is to search.
This will be a simple text file that we'll create with a text editor
(like vi) in the /var/log/httpd directory (the path stored in $dir)
and name php-ck-list.txt:
Business Proposal|business-proposal.html
Sales Plan|sales-plan.html
It's just a simple data text file with each record on a separate line.
Each record only contains two fields of data separated by a vertical
bar: the first is the name of the document and the second field is
the file name. For simplicity, I've only listed two documents. Getting
back to our PHP script, let's look at the next section of code in
which PHP will read this text file and retrieve its data:
$PAGES = fopen("$ck_list", 'r')
    or die("Could not open page listing file.");
$line = rtrim(fgets($PAGES, 4096));
while(!feof($PAGES))
{
    // explode() stands in for split(), which was removed in PHP 7
    list($name,$file_name) = explode('|', $line, 2);
    $pages[$name] = $file_name;
    $line = rtrim(fgets($PAGES, 4096));
}
fclose($PAGES);
In the first line of code above, we're establishing the file handle
$PAGES that will be used to read each line of text from the
data file that we just discussed. We're using the fopen() function
to open the file specified in its first argument and in read mode,
per the second argument given. If unsuccessful in opening the file,
per the or component wrapped onto the next line, the script
will die and display the error message we've provided. The next line
uses the fgets() function to get the first line of data, reading
at most 4k of it. Before storing that data temporarily in the variable
$line, we use the rtrim() function to trim off any trailing
white space, including the line feed.
PHP will now loop through each line of the data text file as long
as we're not at the end of the file. This is accomplished with the
feof() function coupled with the exclamation point as a negator.
In the first line of the loop statement, we use the split()
function (or explode() on PHP 7 and later, where split() is no
longer available) to split out the Web page's name and its file name,
based on the vertical bar that separates them. The list() function
will capture those values and store them temporarily in the variables
named. Next, PHP stores these values in an associative array for
retrieval later in the script. We end the while statement
by retrieving another line from the file before starting the process
over. When we reach the end of the text file, we use fclose()
to close the file and to drop the file handle.
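
The reading loop described above can also be written more compactly.
What follows is only a sketch, not part of the script in this article:
the function name read_pairs() is made up for illustration. It reads
a two-field data file into an associative array keyed on the first
field, skips blank or malformed lines, and lets the return value of
fgets() end the loop.

```php
<?php
// Hypothetical helper (not part of the original script): reads a
// "first|second" data file into an array keyed on the first field.
function read_pairs($path)
{
    $pairs = array();
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return $pairs;   // the caller decides how to handle a missing file
    }
    // fgets() returns false at end of file, so the last line is
    // processed even when it has no trailing line feed.
    while (($line = fgets($handle, 4096)) !== false) {
        $line = rtrim($line);
        if (strpos($line, '|') === false) {
            continue;    // skip blank or malformed lines
        }
        list($first, $second) = explode('|', $line, 2);
        $pairs[$first] = $second;
    }
    fclose($handle);
    return $pairs;
}
```

With a helper like this, the list of pages could be loaded with a
single call such as $pages = read_pairs($ck_list).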
Familiar Hosts
The next task is to get a list of IP addresses with which we are
already familiar. These are the IP addresses for the people whom
we emailed, asking them to look at our documents. Without this list, PHP
would only be able to provide us with the IP address of the host
that accessed the documents. The text file php-ck-iplist.txt
is set up like php-ck-list.txt. The only difference is the
data content -- each record will contain a field containing the
expected user's name and then a vertical bar and then their server's
IP address. Below is the section of code that will extract those
records for use in the script:
$LIST = fopen("$ck_iplist", 'r')
    or die("Could not open host listing file.");
$line = rtrim(fgets($LIST, 4096));
while(!feof($LIST))
{
    // explode() stands in for split(), which was removed in PHP 7
    list($host,$ip) = explode('|', $line, 2);
    $hosts[$ip] = $host;
    $line = rtrim(fgets($LIST, 4096));
}
fclose($LIST);
This section of code works like the previous section. The only differences
are the names of the variables and the like. We're storing the results
here in another associative array called $hosts. Incidentally,
it would be a little more of an involved script, but we could use
the host command to look up the host information on any IP
address with which we're not familiar.
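
As a rough sketch of that idea, the lookup could be done from within
PHP. Instead of calling the host command, this example swaps in PHP's
built-in gethostbyaddr() function. The function name host_label(),
the $resolver parameter, and the fallback text are all assumptions
made for illustration; they are not part of the script in this article.

```php
<?php
// Hypothetical helper: returns a display name for an IP address.
// $hosts is the familiar-hosts array keyed on IP address; $resolver
// is any callable that takes an IP address (gethostbyaddr() by default).
function host_label($ip, array $hosts, $resolver = 'gethostbyaddr')
{
    if (isset($hosts[$ip])) {
        return $hosts[$ip];
    }
    $name = call_user_func($resolver, $ip);
    // gethostbyaddr() returns the address unchanged when the lookup
    // fails and false when the address is malformed.
    if ($name === false || $name === $ip) {
        return 'unknown host';
    }
    return $name;
}
```

Passing the resolver as a parameter keeps the function easy to try
out without a network connection; in the script itself, the default
would do.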
Scanning Apache
Now that PHP knows the file names for which it's searching and
knows who should be hitting them, we can have PHP search the Apache
log:
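// Collects the text of the email across all documents
$message = '';
foreach($pages as $page_name => $file_name) {
    // Start each document with empty lists so matches don't carry over
    $access = array();
    $ips = array();
    $LOG = fopen("$log", 'r')
        or die("Could not open the Apache log.");
    $line = rtrim(fgets($LOG, 4096));
    while(!feof($LOG))
    {
        // strpos() stands in for ereg(), which was removed in PHP 7
        if(strpos($line, $file_name) !== false){
            // (\S*) also captures the hyphen Apache logs for a missing user
            preg_match('/(\d*\.\d*\.\d*\.\d*) \- (\S*) \[(\d*)\/(\w*)\/(\d{4})/',
                $line, $matches);
            $ip_addr = $matches[1];
            $htuser = $matches[2];
            $day = $matches[3];
            $month = $matches[4];
            $year = $matches[5];
            if(!isset($htuser) || $htuser == '' || $htuser == '-'){
                $htuser = 'Anonymous';
            }
            if(!in_array("$ip_addr", $ips) && $ip_addr){
                // Unfamiliar addresses have no entry in $hosts
                $host = isset($hosts[$ip_addr]) ? $hosts[$ip_addr] : '';
                $access["$ip_addr"] =
                    "'$page_name' was accessed on $month $day, "
                    . "$year by $htuser from $host $ip_addr.\n\n";
                array_push($ips, "$ip_addr");
            }
        }
        $line = rtrim(fgets($LOG, 4096));
    }
    fclose($LOG);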
PHP closed out the data text file containing the list of Web pages
earlier, but it still has that information stored in the associative
array $pages. To retrieve that information, we're using a foreach
statement above. It will go through each data pair and extract the
page name and the file name, placing them temporarily in the variables
$page_name and $file_name, respectively. PHP will hang
onto that information for use in this section of code in which it
will search the Apache log and for the next section in which it will
check its log to make sure that it didn't already inform us of each
user access.
We start off this foreach statement block by opening the
Apache log and grabbing one line of text as we did in the previous
sections. We then start a while statement in which PHP will
examine the line of data and if it contains information on a user
accessing one of our documents, PHP will capture the data so as
to prepare to email us at the end of the script.
The first line of the while statement block uses the ereg()
function (or strpos() on PHP 7 and later, where ereg() is no longer
available) to determine whether the file name is contained in the
line of text retrieved. If it isn't, PHP will skip the statement
block contained in the if statement and get another line
of data from the Apache log. If the line inspected does contain
the file name, PHP will use the preg_match() function to
pick apart the data needed for the email message.
In this preg_match() function, we're using Perl-like pattern
matching. The second argument contains the string from which we're
extracting data. The first argument shows the patterns contained
within two forward-slashes. We're looking for a pattern like this:
12.1.1.100 - russell [15/Jun/2004
Patterns within parentheses are captured and placed in the array $matches,
which is named in the third argument. To capture digits, we use \d,
and to capture word characters (letters, digits, and the underscore),
\w. An asterisk indicates zero or more of the character type that
precedes it. Everything else in the pattern
shown above equals the actual characters that PHP should find. So
that PHP won't be confused by the forward slashes in the pattern,
they are escaped with a back slash, meaning PHP actually should look
for a forward slash in the string.
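
To see the captures in isolation, here is a small sketch that applies
a similar pattern to a single line. It is not the code from the
script: the function name parse_access_line() and the sample log line
in the usage note are made up, and the pattern assumes an IPv4 address
and Apache's common log format.

```php
<?php
// Hypothetical helper: pulls the fields the script needs out of one
// access-log line. Returns null when the line does not match.
function parse_access_line($line)
{
    // Fields: address, identity (skipped), user, then [day/month/year
    $pattern = '/^(\d+\.\d+\.\d+\.\d+) \S+ (\S+) \[(\d+)\/(\w+)\/(\d{4})/';
    if (!preg_match($pattern, $line, $matches)) {
        return null;
    }
    return array(
        'ip'    => $matches[1],
        'user'  => $matches[2],
        'day'   => $matches[3],
        'month' => $matches[4],
        'year'  => $matches[5],
    );
}
```

Given an invented line that begins
12.1.1.100 - russell [15/Jun/2004:10:12:01 -0500], the function
returns the address, the user name russell, and the three parts of
the date. Checking the return value of preg_match() before reading
$matches avoids touching array elements that were never filled in.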
To divide the results of preg_match() among the variables
in the lines that follow it, we use a simple sequential array data
access method: $matches[n]. Following the variables, PHP
checks whether there is a value in the variable $htuser using
the isset() function. We'll always have a user name with
a password-protected directory, but we include this feature in case
we want to add public files to our list. For a request made without
authentication, Apache logs a hyphen in place of the user name, so
no usable name is captured and PHP will set $htuser to Anonymous.
Before storing the information we just retrieved in a temporary
associative array (i.e., $access) and moving on to the next
line in the Apache log, PHP checks whether we've already stored
an entry for this IP address. We use the
function in_array() to see whether it is in the array. The
search parameter -- or rather the IP address -- is given as the
first argument; the array ($ips), which contains a list of the
IP addresses that PHP has already matched,
is given as the second argument.
If this is the first time during this script that this IP address
was found to have accessed this document, then we'll add a line
of text that says the document for which PHP searched was accessed
by the specified user from the IP address found, along with a name
for the host if known. That text is stored in the associative array
called $access, which will be keyed on the IP address. PHP
then records the IP address in the array $ips using the array_push()
function. It will use this array to make sure that it doesn't email
us twice on the same address in the same message. The rest of the
code section above closes out like the previous sections.
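
Because $access is already keyed on the IP address, the same
"seen before?" test can be made against its keys. The following is
a sketch of that alternative rather than the method used in the
script; the function name first_hit_per_ip() and the shape of its
input are assumptions made for illustration.

```php
<?php
// Hypothetical helper: keeps the first message seen for each IP address.
// $hits is a list of array(ip, message) pairs in log order.
function first_hit_per_ip(array $hits)
{
    $access = array();
    foreach ($hits as $hit) {
        list($ip, $message) = $hit;
        // isset() on the key answers "seen before?" without a second array
        if ($ip !== '' && !isset($access[$ip])) {
            $access[$ip] = $message;
        }
    }
    return $access;
}
```

The result has the same shape as $access in the script: one message
per IP address, with later hits from the same address ignored.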
Keeping a Log
Let's move on to the next section of code in which PHP will record
its findings in its log and get the email message ready. Keep in
mind, though, that the last squiggly bracket of this section is
from the foreach statement from the previous section of code.
That is to say, PHP is still processing its search of one document.
Once it has recorded its findings to its log and saved the information
in a string for mailing, it will search the log again for another
document:
    $RECORD = fopen("$ck_log", 'a')
        or die("Could not open PHP log.");
    foreach ($access as $ip_add => $acc_info) {
        $parameter = "$page_name|$file_name|$ip_add";
        // Count the lines in our own log that already hold this record
        $results = shell_exec("grep -cs '$parameter' $ck_log");
        if($results == 0) {
            fputs($RECORD, "$parameter\n");
            $message = $message . "$acc_info";
        }
    }
    fclose($RECORD);
}
For the file handle in this section, we are opening the file for appending,
hence the "a" in the second argument of the fopen() function.
To loop through each message temporarily stored in the associative
array $access, we'll use foreach again. To do this,
we'll reconstruct the pattern in which we saved the data in the log
php-ck-log.txt and save that pattern in the variable $parameter.
To keep it simple, we will next execute the system command grep
using the shell_exec() function. We're issuing the -c option
of grep to count the number of lines in the log that match the
pattern. With the -s option, we suppress any error messages from
grep. In the next line of code, we use an if statement
to check whether the result of the grep is zero, meaning there
are no entries matching the pattern. If there are no entries, the
first line of the block for the if statement contains an fputs()
function, which will write to the log file the pattern that was given
a few lines of code above this line.
The final line of this if statement stores the information
in the variable $acc_info in a new variable called simply
$message. Actually, this line appends $acc_info to
the end of whatever is already in the variable $message,
which will be empty the first run through. Once PHP is finished
looping through the entries in $access, it closes out its log file.
The final squiggly bracket then closes the foreach statement of
earlier, and PHP moves on to the next Web page in its list.
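
The check against the script's own log can also be made without
calling grep. Since grep treats each dot in the pattern as a wildcard
and will match a record that is only part of a longer line, comparing
whole lines is a little stricter. This is a sketch under those
assumptions, not the approach used in this article, and the function
name already_recorded() is made up for illustration.

```php
<?php
// Hypothetical helper: true when $record already appears as a complete
// line in the script's own log. A missing log counts as "not recorded".
function already_recorded($record, $log_path)
{
    if (!is_readable($log_path)) {
        return false;
    }
    $lines = file($log_path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return false;
    }
    // Strict comparison against each full line of the log
    return in_array($record, $lines, true);
}
```

In the loop above, the test if($results == 0) would then become
if(!already_recorded($parameter, $ck_log)). Reading the whole log
into memory is fine for a small file like this one, but it would not
suit a very large log.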
Wrapping It Up
The script is now ready to send us an email, so let's look at
the final section of code:
if(!empty($message)) {
    mail($to, "Web Log", $message, "From: $from");
}
exit();
?>
We use an if statement to see whether the variable $message
contains anything. If it does, PHP uses the mail() function
to mail us the contents. The variable containing the email address
to which the message is sent is given in the first argument of the
mail() function. The second argument is the text that will
go in the subject line of the message. The next argument is the message,
and the last argument supplies additional headers, here a From
header giving the address from which the message comes.
Conclusion
This script could be written a little tighter, but it gives you
some examples of PHP in action and an example of a PHP script that
isn't run through the usual Web interface. The language is fairly
straightforward so that it may be easily maintained and improved
upon by various levels of programmers. It's also a cool use of PHP
-- it can be pretty impressive when you're sitting at a coffee house
talking to a friend and you receive an email on your cell phone
saying someone has just looked at a particular document of yours.
Russell Dyer is a Perl programmer, a MySQL developer, and a
Web designer living and working on a consulting basis in New Orleans.
He is also an adjunct instructor at a technical college where he
teaches Linux and other open source software. He can be reached
at: russell@dyerhouse.com.