Spam Graphing and Logging for SpamAssassin Rule Optimization
James Mikusi
During my tenure as a systems administrator, I've noticed that
admins fall into two disparate groups based on how they approach
a problem. The first group aggressively works toward a solution
and closure to the problem, trying any potential change that might
make the fix. The other group works more methodically, making calculated
adjustments and reversible changes. I've come to appreciate both
groups, especially the former when it's important to just "get the
job done", but getting a grip on spam requires the more deterministic
approach. Counting and graphing your spam, for example, can help
you see just how big your problem might be and how best to attack
it.
This article details how to gather statistics on mail that is
filtered through SpamAssassin and how to plot those numbers with
MRTG. This project began when I decided to learn exactly how much
spam I received in a given period; it grew when I found some oddities
in the SpamAssassin rules that matched most frequently. I should
add that when I began this project I had already invested considerable
time tuning SpamAssassin's Bayesian database. In my opinion, this
remains one of the strongest defenses against spam on a per-user
basis, because what is spam to you is not necessarily spam to your
neighbor. Thus, teaching SpamAssassin to recognize what's spam to
you is important.
On that note, you should also be aware that the implementation
described is designed for a single user. The scripts could easily
be edited for use at the domain level. However, the objective here
is to tune SpamAssassin, which is difficult to do when you have to make
global assumptions about what hundreds of users might concur is
spam. The methods described increase the effectiveness of Bayes
filtering by finding out which rules are triggered most often. This
is done by counting incoming spam and graphing the numbers.
The tools in this article have two direct dependencies --
SpamAssassin and MRTG, both of which depend on Perl. Both packages are
included with almost every Linux distribution, so their installation
will not be covered here. The projects' Web sites (see the References)
contain thorough documentation as well. A potential third dependency
is procmail, but your favorite local mail agent can be used instead
to filter incoming mail through SpamAssassin. I like procmail and
will describe how I used it.
Getting the Statistics
The first step in implementing this spam control suite is having
your incoming mail filtered through SpamAssassin before delivery.
This is where I use procmail. The following line at the start of
your .procmailrc file in your home directory will pipe mail through
SpamAssassin:
:0fw
| /usr/bin/spamc
This use depends on having the spamd daemon running, which I highly
recommend for efficiency. If, for any reason, running the daemon doesn't
suit you, mail can alternatively be piped to /usr/bin/spamassassin,
but this setup will spawn a separate perl/spamassassin process for
each mail. My home mail server runs fetchmail to get 10 mails per
call, which would bring this machine, a PII-350, to its knees if
SpamAssassin were called in the latter manner.
This setup alone will perform SpamAssassin's default actions: it tags
your mail headers and prepends SpamAssassin's default "*****SPAM*****"
to the subject line of spam. While these tags are useful to end users,
the utilities of this article depend on two of the added headers:
X-Spam-Flag, which is set to YES on spam, and X-Spam-Status, which
carries a Yes/No spam assertion along with SpamAssassin's score
based on its scoring rules. We'll make use of these features by
asking procmail to do a few more things with our mail.
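To make the header handling concrete, here is a minimal Python sketch of how a script might read SpamAssassin's verdict from a tagged message. It is not the article's Perl script; the function name spam_verdict is hypothetical, and it assumes the usual X-Spam-Flag and X-Spam-Status layout (a "hits=" or "score=" value and a comma-separated "tests=" list).

```python
import email
import re

def spam_verdict(raw_message):
    """Return (is_spam, score, rules) read from SpamAssassin's headers."""
    msg = email.message_from_string(raw_message)
    # X-Spam-Flag is normally present (as "YES") only on mail judged spam.
    is_spam = msg.get("X-Spam-Flag", "").strip().upper() == "YES"
    # X-Spam-Status may be folded over several lines; collapse the whitespace.
    status = " ".join(msg.get("X-Spam-Status", "").split())
    # Older releases write "hits=", newer ones "score=".
    m = re.search(r"(?:hits|score)=(-?\d+(?:\.\d+)?)", status)
    score = float(m.group(1)) if m else None
    # The rule list runs from "tests=" up to the next "key=" field or the end.
    t = re.search(r"tests=(.*?)(?:\s+\w+=|$)", status)
    names = t.group(1).split(",") if t else []
    rules = [n.strip() for n in names
             if n.strip() and n.strip().lower() != "none"]
    return is_spam, score, rules
```

A script fed by procmail could call spam_verdict(sys.stdin.read()) and pass the result on to whatever does the counting.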
Although it might seem odd, we're going to filter the mail through
SpamAssassin a second time. This time, however, the custom script this
article features uses the Perl module Mail::SpamAssassin::NoMailAudit,
which avoids the full overhead of SpamAssassin. The next
release of this project will likely eliminate this duality, so check
for updates. The following should appear next in .procmailrc:
:0c
| .spamassassin/bin/spamassassin_stats.pl
Also, note that the following procmail recipe was an early implementation
of this tool and worked quite well, but its responsibilities were later
folded into the above script for the sake of consolidation. It
maintains two counter files, delivers spam to a spam mbox
file, and delivers non-spam mail as usual:
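:0
* ^X-Spam-Flag:.*YES
{
  # deliver to spam mbox file AND incr spam counter file
  :0 c
  | echo -n . >> .spamassassin/count.spam
  # trailing colon: lock the mbox while delivering
  :0:
  mail/spam
}
# anything that gets this far is non-spam: incr ham counter file
:0 c
| echo -n . >> .spamassassin/count.ham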
The spamassassin_stats.pl script uses the ~/.spamassassin/stats directory
to keep its counts (see Listing 1 at http://www.sysadminmag.com).
There are two files, named counts.spam and counts.ham, which tally
their respective mail types. Additionally, there are two files to
keep track of SpamAssassin scores (scores.spam and scores.ham) and
two directories (named "spam" and "ham"). These directories hold some
interesting statistics -- one file named for each SpamAssassin rule
matched, with the file's size being the count of matches. Thus, a simple ls
-lS | head in the spam/ or ham/ subdirectories can quickly show the
most common characteristics of your spam. This feature alone may suit
some admins who just want to quickly see some numbers, but the graphing
done by MRTG really adds some nice documentation of spam abuse. Another
quick option is to point your Web server to this stats directory (assuming
directory listings are permitted). Apache's directory listings have linked
column headers; click the size column header to sort your stats by count.
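The file-size-as-counter idea can be sketched in a few lines of Python. This is an illustration only, not the Perl code from Listing 1; the function names bump, record, and top_rules are hypothetical, and the layout (counts.spam, counts.ham, and spam/ and ham/ rule directories) follows the description above.

```python
import os

def bump(path):
    """Append one byte so the file's size doubles as a counter."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a") as f:
        f.write(".")

def record(stats_dir, is_spam, rules):
    """Count one message and each rule it matched."""
    kind = "spam" if is_spam else "ham"
    bump(os.path.join(stats_dir, "counts." + kind))
    for rule in rules:
        bump(os.path.join(stats_dir, kind, rule))

def top_rules(stats_dir, kind, limit=10):
    """Rough equivalent of `ls -lS | head`: rules sorted by file size."""
    rule_dir = os.path.join(stats_dir, kind)
    if not os.path.isdir(rule_dir):
        return []
    sizes = [(os.path.getsize(os.path.join(rule_dir, name)), name)
             for name in os.listdir(rule_dir)]
    # Largest count first; ties broken alphabetically.
    sizes.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, size) for size, name in sizes[:limit]]
```

Appending a byte is cheap and needs no parsing to read back, which is presumably why the scheme works well with tools like ls and a Web server's directory listing.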
Graphing the Stats with MRTG
As with the common use of MRTG, the mrtg binary should be run
periodically via cron (about every 20 to 30 minutes here), but we'll be using a custom
config file named .spamassassin/stats/mrtg/spamstats.cfg (see Listing
2 at http://www.sysadminmag.com). The path to this file will be the only required
argument to mrtg in your cron entry:
7,37 * * * * /usr/bin/mrtg $HOME/.spamassassin/stats/mrtg/spamstats.cfg
Depending on your influx of mail, it might be beneficial to reduce
this frequency to dramatize your graphs.
The spamstats.cfg file can be extended to create as many graphs
as you need, but the file used here just graphs incoming spam counts
and the percentage of mail that is spam. The reality of these graphs
may be surprising. I was shocked and disappointed to discover that
I get more than 90% spam!
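The percentage graph boils down to one line of arithmetic on the two counters. A minimal sketch, with the hypothetical function name spam_percentage:

```python
def spam_percentage(spam_count, ham_count):
    """Share of all counted mail that was spam, as a percentage."""
    total = spam_count + ham_count
    if total == 0:
        # No mail counted yet; avoid dividing by zero.
        return 0.0
    return 100.0 * spam_count / total
```

For example, 9 spam messages and 1 legitimate message give 90.0.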
If you're familiar with MRTG, you probably know it can quickly
be configured to graph port traffic from your routers or switches,
as it was designed to do. However, it can also be extended to graph
almost anything. Instead of querying a router, MRTG can run an external
script, from which it expects four lines in return: the first two take
the place of the counts of inbound and outbound bytes, and the last two
take the place of the sysUptime and sysName
MIB entries. The first two values are completely arbitrary and can
be used to represent anything. The scripts called via spamstats.cfg
do just this. They get the numbers via file size in the stats directory
tree and return them to MRTG -- almost too easy.
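As an illustration of the four-line format, here is a small Python sketch of such an external script. It is not one of the scripts called by the article's spamstats.cfg; the function names and the "n/a" uptime placeholder are assumptions, and it reads the counts.spam and counts.ham files described earlier.

```python
import os

def counter(path):
    """A counter file's value is simply its size in bytes."""
    return os.path.getsize(path) if os.path.exists(path) else 0

def mrtg_lines(stats_dir, name="spam counts"):
    """Build the four lines MRTG reads from an external script."""
    spam = counter(os.path.join(stats_dir, "counts.spam"))
    ham = counter(os.path.join(stats_dir, "counts.ham"))
    # Line 1: first value, line 2: second value,
    # line 3: uptime string (unused here), line 4: target name.
    return [str(spam), str(ham), "n/a", name]

if __name__ == "__main__":
    print("\n".join(mrtg_lines(os.path.expanduser("~/.spamassassin/stats"))))
```

In an MRTG config file, a script like this is named in backticks on the Target line in place of a router address.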
The initial versions of these scripts also carried the overhead
of keeping track of the counter files and clearing them periodically,
but as it turns out, MRTG takes care of maintaining a database and
has features to reset counters. Whether you're using RRD (Round
Robin Database, a preferred logging mechanism for MRTG) or MRTG's
default text database scheme, MRTG does all the work of keeping
track of historical data. This is done by integrating new data into
historical averages.
From the perspective of MRTG, this is all that's needed to create
the Yearly, Monthly, and Weekly graphs. If more detailed historical
data is desired, it can easily be maintained by a few edits to these
scripts. However, the counter files do need to be periodically reset.
The ThreshMaxI and ThreshProgI MRTG configuration options (and their
ThreshMaxO and ThreshProgO counterparts for the second value) let us
set a counter threshold and a program to reset the values, respectively.
Just as your switch's counter registers reset when they hit the
ceiling of a 32-bit register, our counter files will reset when they
hit a ceiling of our choosing. We'll set the magic
number to 1024 because a default ext2 filesystem makes use of a
4K block size, which keeps each counter file comfortably within a
single block. This is the number to which we'll configure ThreshMaxI
and ThreshMaxO to respond.
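A reset program of the kind ThreshProgI would invoke can be very small. The following Python sketch is an assumption about how one might look, not the author's program: the function name reset_over_limit is hypothetical, it ignores any arguments MRTG passes, and it simply truncates whichever counter file has grown past the limit.

```python
import os

def reset_over_limit(stats_dir, limit=1024,
                     names=("counts.spam", "counts.ham")):
    """Truncate counter files larger than limit; return what was reset."""
    reset = {}
    for name in names:
        path = os.path.join(stats_dir, name)
        if not os.path.exists(path):
            continue
        size = os.path.getsize(path)
        if size > limit:
            # Opening for write truncates the file to zero bytes.
            with open(path, "w"):
                pass
            reset[name] = size
    return reset

if __name__ == "__main__":
    reset_over_limit(os.path.expanduser("~/.spamassassin/stats"))
```

Because the next reading will be lower than the last, MRTG sees something much like a counter wrapping around.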
To finish the presentation, we'll use indexmaker, a Perl script
that's part of the MRTG distribution. We can feed this script the
spamstats.cfg MRTG config file as an argument, and it'll generate
appropriate HTML for an index.html file containing a list of all
the monitored objects in tabular format with the five-minute average
graphs. This provides a quick overview of the current status. Clicking
on any graph will take you to that monitored object's full page
with the Weekly, Monthly, and Yearly graphs.
Tuning SpamAssassin for Better Filtering
Now that we can "see" our spam from a higher perspective, SpamAssassin
can be tuned for better filtering. The default scores that SpamAssassin
gives to rules ship with its stock rule files (typically under
/usr/share/spamassassin); site-wide overrides are configured in
/etc/mail/spamassassin/local.cf, and per-user overrides in
.spamassassin/user_prefs.
When I first began filtering my mail with these scripts, I was surprised
to see how many mails scored higher than the Bayesian 90th percentile.
By increasing the weight of frequent culprits in my .spamassassin/user_prefs
file, I also increased the number of mails matched above the 90th
percentile. Likewise, if you find you never get any non-spam mail
hitting above the 30th Bayesian percentile, you can comfortably
set the Bayesian watermark to 70 instead of the default of 99. Here
are some of my .spamassassin/user_prefs settings:
# score adjustments
score DATE_IN_FUTURE_03_06 5.0
score INVALID_DATE 3.5
score DOMAIN_SUBJECT 2.5
# trigger and bayesian learning thresholds
required_hits 3.5
auto_learn_threshold_spam 7
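One way to pick the rules worth a score adjustment is to compare the per-rule counters in the spam and ham directories. The Python sketch below is an illustration under that assumption; the function names and the default thresholds are hypothetical, and it only suggests candidates rather than choosing scores.

```python
import os

def hits(rule_dir):
    """Map each rule name to its hit count (the size of its file)."""
    if not os.path.isdir(rule_dir):
        return {}
    return {name: os.path.getsize(os.path.join(rule_dir, name))
            for name in os.listdir(rule_dir)}

def score_candidates(stats_dir, min_spam_hits=5, max_ham_hits=0):
    """Rules that fire often on spam and rarely (or never) on ham."""
    spam = hits(os.path.join(stats_dir, "spam"))
    ham = hits(os.path.join(stats_dir, "ham"))
    picks = [(rule, count, ham.get(rule, 0))
             for rule, count in spam.items()
             if count >= min_spam_hits and ham.get(rule, 0) <= max_ham_hits]
    # Most frequent spam rules first; ties broken alphabetically.
    return sorted(picks, key=lambda p: (-p[1], p[0]))
```

A rule that shows up here is a reasonable candidate for a higher score line in user_prefs; a rule that also hits legitimate mail is better left alone.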
The roots of this project began with filtering my personal mail, and
I have been continually tempted to try these utilities at the server
level (I haven't yet). However, it seems most anti-spam whitepapers
emphasize the point that Bayesian filtering is strongest per user.
Although I would expect the graphing to be helpful at the server level,
I would also anticipate that one small change to benefit one user's
spam problem might create false positives for another.
Conclusion
If you've been using MRTG to track router traffic, you'll likely
agree as to the convenience of seeing this information graphically.
Many sys admins are already overtaxed with responsibilities, thus
the more utilities we have to see what our system is doing, the
better. And, while most of us pride ourselves on being able to find
almost any system stat from the command line, it's undeniably helpful
to have graphical tools.
An extended hope of mine is that this suite of scripts can help
legislation catch up with the spam epidemic. Although spam provides
a lot of job security to sys admins, I think we would all prefer
to see it disappear so we could work on bigger and better things.
I hope these graphs can be used to show management and politicians
how badly some of us are plagued by spam and thereby losing productivity.
Managers and politicians may be more receptive to statistical complaints,
graphs, and pie charts than other forms of information.
References
Author Notes -- http://www.i-kong.com/projects/spamstats
MRTG -- http://people.ee.ethz.ch/~oetiker/webtools/mrtg/
Perl -- http://www.perl.org
Procmail -- http://www.procmail.org/
SpamAssassin -- http://spamassassin.apache.org/
James works as a Linux/OpenSource consultant for small/medium
businesses. He enjoys the "Do It Yourself" aspect of open source
software and the never disappointing ability to find some piece
of software to fit almost every desire.