Cleaning Up Large Mailing Lists: Removing Bad Addresses
Jeff Bennett
Supporting corporate Web sites, especially retail ones, often
includes administering mail servers that perform regular mailings
to large customer mailing lists. Managers are generally motivated
to increase the size of these lists by any means necessary, as this
is a good way to increase the customer base. In theory, the more
email addresses you mail to, the more customers you have. In practice,
this often means "encouraging" the people who visit your Web site
to register before they can continue onto more desirable site functions
-- a practice not well loved by all end users.
The Problem
While the marketing department of any firm may be pleased to have
accumulated a mailing list of 500,000 email addresses for their
weekly mailing, this joy is not always shared by the systems administrator.
Besides the minor burden of efficiently managing a huge send every
week, the main headache is the fact that many of the addresses collected
will be bogus. When an annoyed Web surfer inputs "bobsyeruncle@nodomain.all"
as his email address in the registration process, the effect of
one bad address added to the mailing list is less than minimal.
But what if 10,000 people do that each week?
There's is no limit to the imagination that goes into creating
bogus email addresses; however, the amusement wears off quickly
when they start clogging up your system. Besides lengthening the
duration of your sends, the bounces come back by the thousands,
and the stern emails start to arrive from other postmasters, both
automated and human, informing you that rules are being broken.
Even if you are emailing only to solicited customers, you may find
yourself on spam lists and blacklists if you are sending thousands
of phantom messages that do nothing more than take up bandwidth
and machine resources as they are processed and passed back and
forth.
The responsibility of the sys admin in this situation is to clean
up the mailing list periodically. While the marketing team may not
be thrilled to hear that there are 80,000 or so bad addresses in
the current mailing list, they should recognize the necessity of
cleaning them out. You can soften the blow a little by running your
cleanup script more frequently, depending on the rate at which the
master mailing list increases in size, and the rate at which your
send performance degrades.
The Solution
The criteria for weeding out "bad" addresses should not be taken
lightly. There are many reasons for a message to fail in reaching
its destination, and many of them are not indicative of a nonexistent
or inactive email address. It may be quick and easy to resort to
sendmail's maillog to try to determine which addresses are bad,
but not enough information is presented there to make an intelligent
decision. Sendmail itself has guidelines for managing bounces (see
Costales, Sendmail, 3rd Ed., pp. 516-517), but when it comes
to customized extraction to discover bad addresses, you are on your
own. I have found that an efficient and minimally complex way to
prune a mailing list is via a script that crawls through the bounces
and extracts the bad addresses, so that they can be removed from
your company's customer list.
The best way to ensure that you are purging addresses from a list
with some certainty that they are actually "bad" is to use the SMTP
or Enhanced SMTP (ESMTP) error codes (RFC 1893), and these exist
only in the mail header of the bounced mail. Of the many different
ESMTP codes that exist, these are the ones I deemed to represent
a bad (i.e., nonexistent) address:
5.0.0 Service unavailable (this is equal to an SMTP 554
protocol error, recipient address rejected)
5.1.1 User unknown
5.1.2 Host unknown
5.1.3 Domain not allowed
5.1.6 Destination address (user) unknown
5.1.8 User unknown
The criteria for an address cleanup may vary depending on the size
and type of send being done. For example, I thought about purging
addresses for which I got the 5.2.1 "mailbox disabled" or 5.2.2 "mailbox
is full" errors for three consecutive weeks, but this would require
a completely separate process and a new script -- I put this on my
to-do list as a future enhancement of this project.
Having determined my desired result (the purging of addresses
that caused mail to bounce with the aforementioned error codes),
I faced the task of searching the mail file of the recipient of
the bounces for two things:
1. The pieces of mail with the error codes in question, and
2. The original destination address that caused this error.
A bounced piece of email consists of the original mail in its
entirety (original header and all), with a new header attached during
the return process. It is this new header that provides the reason
for the return along with a few other pieces of information. A simple
mail header for a successful sent message looks something like this:
Received: from mailhost.relaydomain.com (mailhost.relaydomain.com
[192.192.192.x]) by mailhost.immense-isp.com (8.8.5/8.7.2) with
ESMTP id AAA34567 for <yourbuddy@immense-isp.com>; Tue, 18 Sep
2004 14:39:24 -0800 (PST)
Received: from mailhost.yourdomain.com (mailhost.yourdomain.edu
[124.124.124.x]) by mailhost.relaydomain.com (8.8.5) id BBB123;
Tue, Sep 18 2004 14:36:17 -0800 (PST)
From: you@yourdomain.com
To: yourbuddy@immense-isp.com
Date: Tue, Sep 18 2004 14:36:14 PST
Message-Id: <you033456712345-00000123@mailhost.yourdomain.com >
Subject: Lunch today?
MIME-Version: 1.0
Content-Type: multipart/report;
The header of a bounced piece of mail is considerably more complex
and lengthy, and so much of it is irrelevant that including a hundred-line
example here would not add any clarity to our task. Suffice to say
that as each host handles a message (and it may be passed around a
bit before it finds its way back to you), it adds header info. Included
in the superfluous information may be content comments, mail program
information, and automatically inserted comments from the postmaster
of any of the hosts. Often there will be multiple clues as to why
this piece of mail has returned to your mailbox, but only one line
is necessary, and this is not duplicated by any host other than the
one that rejects the mail:
Status: 5.0.0
Now that I've gotten to the crux of the matter, it should just be
a simple grep command to find the "Status: " line and we're
all set, right? Close, but there are still a couple of things to do.
Fortunately, the aforementioned status line always comes with a couple
of preceding lines attached:
Final-Recipient: bobsyeruncle@nodomain.net
Action: failed
Status: 5.0.0
Now I have the information I need to clean up my mailing list: the
error code and the address. From here, it's a relatively simple task
to locate this error code and then crawl up a couple of lines to get
to the address line, and extract it to a file for each of my error
codes. I put these into six separate files for ease of retrieval in
case I am asked for further information at a later date.
Listing 1 contains the script, which is split into two parts.
The first part does the extraction from the mailfile. The second
part of the script removes extraneous information from my six output
files so that they include only the addresses in a list form. When
initially extracted, the lines in the tmp files are in one of two
formats as they came in the mail header, for example, this:
Final-Recipient: rfc822; maniac@netcourrier.com
or this:
Final-Recipient: maniac@netcourrier.com
Either way, I need to get rid of everything but the address, which
I do by reversing the order of the fields and dropping off everything
after the first. One Perl shift() gets this done once I split()
and reverse() the line. Here is a before-subroutine and after-subroutine
look at the extracted lists.
Before:
Final-Recipient: rfc822; taniac@netconuver.com
Final-Recipient: jb@newnet.org
Final-Recipient: rfc822; funyguy@severe.net
Final-Recipient: joeblow@garbage.com
Final-Recipient: rfc822; jeneral@fred.net
Final-Recipient: rfc822; johnnygoode@comman.doh
Final-Recipient: rfc822; zerbil@todd.ca
Final-Recipient: goboy@betyeruncle.biz
Final-Recipient: rfc822; saliva@spit.net
Final-Recipient: rfc822; tenspot@sawbuck.com
Final-Recipient: rfc822; anybody@hetmail.com
Final-Recipient: rfc822; yermom@yerhouse.co.uk
Final-Recipient: geffen@deuhland.bel
After:
taniac@netconuver.com
jb@newnet.org
funyguy@severe.net
joeblow@garbage.com
jeneral@fred.net
johnnygoode@comman.doh
zerbil@todd.ca
goboy@betyeruncle.biz
saliva@spit.net
tenspot@sawbuck.com
anybody@hetmail.com
yermom@yerhouse.co.uk
geffen@deuhland.bel
Armed with the six lists produced by this script (I would generally
add a date stamp to the title of each so they are not overwritten),
I have the information the mail list administrator needs to purge
the master list. While this could be run as a weekly or bi-weekly
cron, I've found such frequency to be unnecessary, and I run it manually
every two months. How best to present the information to the person
or team that manages the master mailing list depends on the situation
-- for me, it's a simple html report that I generate and email with
a second script. For others, the task of managing the master list
may fall to the sys admin in his or her role as postmaster.
References
Blank-Edelman, David. 2000. Perl for System Administration.
Sebastopol, CA: O'Reilly & Associates.
Costales, Bryan with Allman, Eric. 2003. Sendmail, 3rd
Ed. Sebastopol, CA: O'Reilly & Associates.
Jeff Bennett has been a systems administrator for six years,
focusing mainly on Web and retail systems running Solaris, AIX,
and Linux. Currently working on a consulting basis in Toronto, he
also occasionally teaches Unix administration (Solaris certification
track) at several Toronto technical institutions. He can be reached
at selvasys@rogers.com. |