jun2005.tar

Spam Graphing and Logging for SpamAssassin Rule Optimization

James Mikusi

During my tenure as a systems administrator, I've noticed that admins fall into two disparate groups based on how they approach a problem. The first group aggressively works toward a solution and closure to the problem, trying any potential change that might make the fix. The other group works more methodically, making calculated adjustments and reversible changes. I've come to appreciate both groups, especially the former when it's important to just "get the job done", but getting a grip on spam requires the more deterministic approach. Counting and graphing your spam, for example, can help you see just how big your problem might be and how best to attack it.

This article details how to gather statistics on mail that is filtered through SpamAssassin and how to plot those numbers with MRTG. This project began when I decided to learn exactly how much spam I received in a given period; it grew when I found some oddities in the SpamAssassin rules that matched most frequently. I should add that when I began this project I had already invested considerable time tuning SpamAssassin's Bayesian database. In my opinion, this remains one of the strongest defenses against spam on a per-user basis, because what is spam to you is not necessarily spam to your neighbor. Thus, teaching SpamAssassin to recognize what's spam to you is important.

On that note, you should also be aware that the implementation described here is designed for a single user. The scripts could easily be edited for use at the domain level. However, the objective here is to tune SpamAssassin, which is difficult to do when you have to make global assumptions about what hundreds of users might agree is spam. The methods described increase the effectiveness of filtering by finding out which rules are triggered most often. This is done by counting incoming spam and graphing the numbers.

This article's features have two direct dependencies, SpamAssassin and MRTG, both of which depend on Perl. Both packages are included with almost every Linux distribution, so their installation will not be covered here. The projects' Web sites (see the References) contain thorough documentation as well. A potential third dependency is procmail, but your favorite local delivery agent can be used instead to filter incoming mail through SpamAssassin. I like procmail and will describe how I used it.

Getting the Statistics

The first step in implementing this spam control suite is having your incoming mail filtered through SpamAssassin before delivery. This is where I use procmail. The following line at the start of your .procmailrc file in your home directory will pipe mail through SpamAssassin:

:0fw
| /usr/bin/spamc

This recipe depends on having the spamd daemon running, which I highly recommend for efficiency. If, for any reason, running the daemon doesn't suit you, mail can alternatively be piped to /usr/bin/spamassassin, but this setup will spawn a separate perl/spamassassin process for each message. My home mail server runs fetchmail to get 10 messages per call, which would bring this machine, a PII-350, to its knees if called in the latter manner.

This setup alone performs SpamAssassin's default actions: it tags your mail headers and prepends SpamAssassin's default "*****SPAM*****" tag to the subject line of spam. While these tags are useful to end users, the utilities in this article depend on two mail headers: X-Spam-Flag, which is set to YES on spam, and X-Spam-Status, which contains a Yes/No spam assertion and SpamAssassin's score based on its scoring rules. We'll make use of these features by asking procmail to do a few more things with our mail.
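As an illustration of reading these headers, here is a minimal shell sketch. It assumes the verdict and score appear on an X-Spam-Status line of the form "Yes, hits=12.3 required=5.0 ..." (SpamAssassin 2.x) or "Yes, score=12.3 ..." (3.x), possibly folded across lines. The function name spam_status is hypothetical and is not part of the article's listings.

```shell
#!/bin/sh
# spam_status: read one mail message on stdin and print "verdict score"
# taken from its X-Spam-Status header. Returns non-zero if the header is absent.
spam_status() {
    awk '
        /^$/ { exit }                                   # blank line ends the headers
        /^X-Spam-Status:/ { hdr = $0; infold = 1; next }
        infold && /^[ \t]/ { hdr = hdr $0; next }       # folded continuation line
        { infold = 0 }
        END {
            if (hdr == "") exit 1
            verdict = hdr
            sub(/^X-Spam-Status:[ \t]*/, "", verdict)   # drop the header name
            sub(/,.*/, "", verdict)                     # keep only Yes or No
            score = "?"
            if (match(hdr, /(hits|score)=-?[0-9.]+/)) {
                score = substr(hdr, RSTART, RLENGTH)
                sub(/^[a-z]+=/, "", score)              # keep only the number
            }
            print verdict, score
        }
    '
}
```

Something like spam_status < message.eml could be used to spot-check a tagged message by hand before trusting the counters.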

Although it might seem odd, we're going to filter the mail through SpamAssassin a second time. This time, the custom script featured in this article uses the Perl module Mail::SpamAssassin::NoMailAudit, which avoids the full overhead of SpamAssassin. The next release of this project will likely eliminate this duality, so check for updates. The following should appear next in .procmailrc:

:0c
| .spamassassin/bin/spamassassin_stats.pl

Note that the following procmail recipe was an early implementation of this tool and worked quite well, but its responsibilities were later snarfed into the above script for the sake of consolidation. It maintains two counter files, delivers spam to a spam mbox file, and lets non-spam mail be delivered as usual:

:0
* ^X-Spam-Flag:.*YES
{
            # deliver to spam mbox file AND incr spam counter file
            :0 c
            | echo -n . >> .spamassassin/count.spam
            :0
            mail/spam
}

:0c
| echo -n . >> .spamassassin/count.ham

The spamassassin_stats.pl script uses the ~/.spamassassin/stats directory to keep its counts (see Listing 1 at http://www.sysadminmag.com). There are two files, named counts.spam and counts.ham, which tally their respective mail types. Additionally, there are two files to keep track of SpamAssassin scores (scores.spam and scores.ham) and two directories (named "spam" and "ham"). These directories hold some interesting statistics -- a file named for each SpamAssassin rule matched, with its size being the count of matches. Thus, a simple ls -lS | head in the spam/ or ham/ subdirectories quickly shows the most common characteristics of your spam. This feature alone may suit some admins who just want to quickly see some numbers, but the graphing done by MRTG really adds some nice documentation of spam abuse. Another quick option is to point your Web server to this stats directory (assuming directory listings are permitted). Apache's directory listings have linked column headers that sort by that column; use these to sort your stats.
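The file-size-as-counter idea can also be read back with a few lines of shell. The following is a minimal sketch, assuming only the layout described above (one file per rule, one byte per match); the function name top_rules is hypothetical.

```shell
#!/bin/sh
# top_rules DIR [N]: print "count rule" for the N most frequently matched
# rules in DIR, where each file is named for a rule and its size in bytes
# is the number of matches.
top_rules() {
    dir=$1
    n=${2:-10}
    for f in "$dir"/*; do
        [ -f "$f" ] || continue                 # skip subdirectories and empty globs
        printf '%s %s\n' "$(wc -c < "$f" | tr -d ' ')" "${f##*/}"
    done | sort -k1,1nr -k2,2 | head -n "$n"    # largest count first
}
```

For example, top_rules ~/.spamassassin/stats/spam 5 would list the five rules that fire most often on spam.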

Graphing the Stats with MRTG

As with the common use of MRTG, the mrtg binary is run periodically via cron, in this case about every 20 to 30 minutes, but we'll be using a custom config file named spamstats.cfg (see Listing 2 at http://www.sysadminmag.com). The path to this file will be the only required argument to mrtg in your cron entry:

7,37 * * * * /usr/bin/mrtg $HOME/mrtg/spam/spamstats.cfg

Depending on your influx of mail, it might be beneficial to reduce this frequency to dramatize your graphs.

The spamstats.cfg file can be extended to create as many graphs as you need, but the file used here just graphs incoming spam counts and the percentage of mail that is spam. The reality of these graphs may be surprising. I was shocked and disappointed to discover that I get more than 90% spam!

If you're familiar with MRTG, you probably know it can quickly be configured to graph port traffic from your routers or switches, as it was designed to do. However, it can also be extended to graph almost anything. By default, MRTG queries a router via SNMP; when a target is an external script instead, MRTG expects four lines of output, of which the first two are the values it graphs (normally the counts of inbound and outbound bytes), and the second two are an uptime string and a name for the target (normally the sysUptime and sysName MIB entries). The first two values are completely arbitrary and can be used to represent anything. The scripts called via spamstats.cfg do just this. They get the numbers from file sizes in the stats directory tree and return them to MRTG -- almost too easy.
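To make the four-line exchange concrete, here is a minimal shell sketch of such scripts. It is not the article's Listing 2 tooling: the function names are hypothetical, and it assumes counts.spam and counts.ham hold one byte per message, as the rule files do.

```shell
#!/bin/sh
# count_of FILE: size of a counter file in bytes, or 0 if it does not exist.
count_of() {
    if [ -f "$1" ]; then wc -c < "$1" | tr -d ' '; else echo 0; fi
}

# emit VALUE1 VALUE2 NAME: print the four lines MRTG reads from a script:
# two values to graph, an uptime string, and a name for the target.
emit() {
    echo "$1"
    echo "$2"
    uptime 2>/dev/null || echo "unknown"
    echo "$3"
}

# mrtg_counts STATS_DIR: spam count as the first value, ham count as the second.
mrtg_counts() {
    emit "$(count_of "$1/counts.spam")" "$(count_of "$1/counts.ham")" "spam and ham counts"
}

# mrtg_percent STATS_DIR: whole-number percentage of all mail that is spam.
mrtg_percent() {
    spam=$(count_of "$1/counts.spam")
    ham=$(count_of "$1/counts.ham")
    total=$((spam + ham))
    pct=0
    if [ "$total" -gt 0 ]; then pct=$((100 * spam / total)); fi
    emit "$pct" "$pct" "percent spam"
}
```

A small wrapper script that ends with mrtg_counts "$HOME/.spamassassin/stats" could then be named in backticks on a Target line of the MRTG config.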

The initial versions of these scripts also carried the overhead of keeping track of the counter files and clearing them periodically, but as it turns out, MRTG takes care of maintaining a database and has features to reset counters. Whether you're using RRDtool (the Round Robin Database tool, a preferred logging mechanism for MRTG) or MRTG's default text log format, MRTG does all the work of keeping track of historical data. This is done by consolidating new data into historical averages.

From the perspective of MRTG, this is all that's needed to create the Yearly, Monthly, and Weekly graphs. If more detailed historical data is desired, it can easily be maintained by a few edits to these scripts. However, the counter files do need to be reset periodically. The ThreshMaxI and ThreshProgI MRTG configuration options let us set a counter threshold and a program to run when that threshold is crossed, respectively; ThreshMaxO and ThreshProgO do the same for the second value. Just as your switch's counters reset when they hit the ceiling of a 32-bit register, ours will reset too. We'll set the magic number to 1024 because a default ext2 filesystem makes use of a 4K block size. This is the number to which we'll configure ThreshMaxI and ThreshMaxO to respond.
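A reset program for ThreshProgI can be very small. The sketch below is hypothetical and assumes the counts.spam and counts.ham files described earlier. As I read the MRTG documentation, the threshold program is passed the target name, the threshold, and the current value as arguments, which a wrapper script around this function could simply ignore.

```shell
#!/bin/sh
# reset_counters STATS_DIR: truncate the spam and ham counter files in place,
# so the next message starts the count again from zero. Missing files are
# left alone rather than created.
reset_counters() {
    for f in "$1/counts.spam" "$1/counts.ham"; do
        if [ -f "$f" ]; then
            : > "$f"
        fi
    done
    return 0
}
```

A message counted at the instant of truncation may be lost; for trend graphs that is probably an acceptable trade for not needing any locking.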

To finish the presentation, we'll use indexmaker, a Perl script that's part of the MRTG distribution. We can feed this script the spamstats.cfg MRTG config file as an argument, and it will generate the HTML for an index.html file containing a list of all the monitored objects in tabular format with the five-minute averages graphs. This provides a quick overview of the current status. Clicking on any graph will take you to that monitored object's full page with the Weekly, Monthly, and Yearly graphs.

Tuning SpamAssassin for Better Filtering

Now that we can "see" our spam from a higher perspective, SpamAssassin can be tuned for better filtering. Site-wide settings, including overrides of the default scores that SpamAssassin gives to rules, are configured in /etc/mail/spamassassin/local.cf. When I first began filtering my mail with these scripts, I was surprised to see how many mails scored higher than the Bayesian 90th percentile. By increasing the weight of frequent culprits in my .spamassassin/user_prefs file, I also increased the number of mails matched above the 90th percentile. Likewise, if you find you never get any non-spam mail hitting above the 30th Bayesian percentile, you can comfortably set the Bayesian watermark to 70 instead of the default of 99. Here are some of my .spamassassin/user_prefs:

# score adjustments
score DATE_IN_FUTURE_03_06              5.0
score INVALID_DATE                      3.5
score DOMAIN_SUBJECT                    2.5

# trigger and bayesian learning thresholds
required_hits                           3.5
auto_learn_threshold_spam               7

The roots of this project began with filtering my personal mail, and I have been continually tempted to try these utilities at the server level (I haven't yet). However, it seems most anti-spam whitepapers emphasize the point that Bayesian filtering is strongest per user. Although I would expect the graphing to be helpful at the server level, I would also anticipate that one small change to benefit one user's spam problem might create false positives for another.

Conclusion

If you've been using MRTG to track router traffic, you'll likely agree as to the convenience of seeing this information graphically. Many sys admins are already overtaxed with responsibilities, thus the more utilities we have to see what our system is doing, the better. And, while most of us pride ourselves in being able to find almost any system stat from the command line, it's undeniably helpful to have graphical tools.

An extended hope of mine is that this suite of scripts can help legislation catch up with the spam epidemic. Although spam provides a lot of job security to sys admins, I think we would all prefer to see it disappear so we could work on bigger and better things. I hope these graphs can be used to show management and politicians how badly some of us are plagued by spam and thereby losing productivity. Managers and politicians may be more receptive to statistical complaints, graphs, and pie charts than other forms of information.

References

Author Notes -- http://www.i-kong.com/projects/spamstats

MRTG -- http://people.ee.ethz.ch/~oetiker/webtools/mrtg/

Perl -- http://www.perl.org

Procmail -- http://www.procmail.org/

SpamAssassin -- http://spamassassin.apache.org/

James works as a Linux/OpenSource consultant for small/medium businesses. He enjoys the "Do It Yourself" aspect of open source software and the never disappointing ability to find some piece of software to fit almost every desire.