
Cleaning Up Large Mailing Lists: Removing Bad Addresses

Jeff Bennett

Supporting corporate Web sites, especially retail ones, often includes administering mail servers that perform regular mailings to large customer mailing lists. Managers are generally motivated to increase the size of these lists by any means necessary, as this is seen as a good way to increase the customer base. In theory, the more email addresses you mail to, the more customers you have. In practice, this often means "encouraging" the people who visit your Web site to register before they can continue on to more desirable site functions -- a practice not well loved by all end users.

The Problem

While the marketing department of any firm may be pleased to have accumulated a mailing list of 500,000 email addresses for their weekly mailing, this joy is not always shared by the systems administrator. Besides the minor burden of efficiently managing a huge send every week, the main headache is the fact that many of the addresses collected will be bogus. When an annoyed Web surfer inputs "bobsyeruncle@nodomain.all" as his email address in the registration process, the effect of one bad address added to the mailing list is negligible. But what if 10,000 people do that each week?

There is no limit to the imagination that goes into creating bogus email addresses; however, the amusement wears off quickly when they start clogging up your system. Besides lengthening the duration of your sends, the bounces come back by the thousands, and the stern emails start to arrive from other postmasters, both automated and human, informing you that rules are being broken. Even if you are emailing only to solicited customers, you may find yourself on spam lists and blacklists if you are sending thousands of phantom messages that do nothing more than take up bandwidth and machine resources as they are processed and passed back and forth.

The responsibility of the sys admin in this situation is to clean up the mailing list periodically. While the marketing team may not be thrilled to hear that there are 80,000 or so bad addresses in the current mailing list, they should recognize the necessity of cleaning them out. You can soften the blow a little by running your cleanup script more frequently, depending on the rate at which the master mailing list increases in size, and the rate at which your send performance degrades.

The Solution

The criteria for weeding out "bad" addresses should not be taken lightly. There are many reasons for a message to fail in reaching its destination, and many of them are not indicative of a nonexistent or inactive email address. It may be quick and easy to resort to sendmail's maillog to try to determine which addresses are bad, but not enough information is presented there to make an intelligent decision. Sendmail itself has guidelines for managing bounces (see Costales, Sendmail, 3rd Ed., pp. 516-517), but when it comes to customized extraction to discover bad addresses, you are on your own. I have found that an efficient and minimally complex way to prune a mailing list is via a script that crawls through the bounces and extracts the bad addresses, so that they can be removed from your company's customer list.

The best way to ensure that you are purging addresses from a list with some certainty that they are actually "bad" is to use the SMTP or Enhanced SMTP (ESMTP) error codes (the enhanced status codes are defined in RFC 1893, since superseded by RFC 3463), and these exist only in the mail header of the bounced mail. Of the many different ESMTP codes that exist, these are the ones I deemed to represent a bad (i.e., nonexistent) address:

5.0.0 Service unavailable (this is equal to an SMTP 554 
  protocol error, recipient address rejected)
5.1.1 User unknown
5.1.2 Host unknown
5.1.3 Domain not allowed
5.1.6 Destination address (user) unknown
5.1.8 User unknown
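
As a rough illustration only (written in Python rather than the Perl of Listing 1), the codes above can be kept in a lookup table so that the rest of a cleanup script can ask whether a given status marks an address for removal. The names BAD_STATUS and is_bad_status are hypothetical, and the descriptions are copied from the list above rather than from the RFC text.

```python
# Status codes this article treats as "bad address".
# Descriptions are the ones used in the list above, not the RFC wording.
BAD_STATUS = {
    "5.0.0": "Service unavailable",
    "5.1.1": "User unknown",
    "5.1.2": "Host unknown",
    "5.1.3": "Domain not allowed",
    "5.1.6": "Destination address (user) unknown",
    "5.1.8": "User unknown",
}


def is_bad_status(code):
    """Return True if the status code marks the address for removal."""
    return code.strip() in BAD_STATUS


print(is_bad_status("5.1.1"))  # True
print(is_bad_status("5.2.2"))  # False (mailbox full is not in the list)
```

Keeping the criteria in one table makes it easy to adjust them for a different type of send without touching the extraction logic.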

The criteria for an address cleanup may vary depending on the size and type of send being done. For example, I thought about purging addresses for which I got the 5.2.1 "mailbox disabled" or 5.2.2 "mailbox is full" errors for three consecutive weeks, but this would require a completely separate process and a new script -- I put this on my to-do list as a future enhancement of this project.
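
One possible shape for that future enhancement, sketched in Python under the assumption that the addresses bouncing with 5.2.1 or 5.2.2 have already been collected into one set per weekly send. The function name persistent_soft_bounces and the sample addresses are hypothetical.

```python
def persistent_soft_bounces(weekly_bounces, weeks=3):
    """Return addresses that bounced in each of the last `weeks` sends.

    weekly_bounces is a list of collections of addresses, oldest first,
    one per weekly send.
    """
    if len(weekly_bounces) < weeks:
        return set()  # not enough history yet to judge
    recent = [set(week) for week in weekly_bounces[-weeks:]]
    return set.intersection(*recent)


history = [
    {"a@example.com", "b@example.com"},
    {"a@example.com"},
    {"a@example.com", "c@example.com"},
]
print(sorted(persistent_soft_bounces(history)))  # ['a@example.com']
```

An address that recovers for even one week drops out of the result, which matches the "three consecutive weeks" rule described above.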

Having determined my desired result (the purging of addresses that caused mail to bounce with the aforementioned error codes), I faced the task of searching the mail file of the recipient of the bounces for two things:

1. The pieces of mail with the error codes in question, and

2. The original destination address that caused this error.

A bounced piece of email consists of the original mail in its entirety (original header and all), with a new header attached during the return process. It is this new header that provides the reason for the return along with a few other pieces of information. A simple mail header for a successfully sent message looks something like this:

Received: from mailhost.relaydomain.com (mailhost.relaydomain.com 
  [192.192.192.x]) by mailhost.immense-isp.com (8.8.5/8.7.2) with 
  ESMTP id AAA34567 for <yourbuddy@immense-isp.com>; Tue, 18 Sep 
  2004 14:39:24 -0800 (PST)
Received: from mailhost.yourdomain.com (mailhost.yourdomain.com 
  [124.124.124.x]) by mailhost.relaydomain.com (8.8.5) id BBB123; 
  Tue, 18 Sep 2004 14:36:17 -0800 (PST)
From: you@yourdomain.com
To: yourbuddy@immense-isp.com
Date: Tue, 18 Sep 2004 14:36:14 -0800 (PST)
Message-Id: <you033456712345-00000123@mailhost.yourdomain.com>
Subject: Lunch today?
MIME-Version: 1.0
Content-Type: multipart/report;

The header of a bounced piece of mail is considerably more complex and lengthy, and so much of it is irrelevant that including a hundred-line example here would not add any clarity to our task. Suffice it to say that as each host handles a message (and it may be passed around a bit before it finds its way back to you), it adds header information. Included in the superfluous information may be content comments, mail program information, and automatically inserted comments from the postmaster of any of the hosts. Often there will be multiple clues as to why this piece of mail has returned to your mailbox, but only one line is necessary, and this is not duplicated by any host other than the one that rejects the mail:

Status: 5.0.0

Now that I've gotten to the crux of the matter, it should just be a simple grep command to find the "Status: " line and we're all set, right? Close, but there are still a couple of things to do. Fortunately, the aforementioned status line always comes with a couple of preceding lines attached:

Final-Recipient: bobsyeruncle@nodomain.net
Action: failed
Status: 5.0.0

Now I have the information I need to clean up my mailing list: the error code and the address. From here, it's a relatively simple task to locate this error code and then crawl up a couple of lines to get to the address line, and extract it to a file for each of my error codes. I put these into six separate files for ease of retrieval in case I am asked for further information at a later date.
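
The "find the status line, then crawl up to the address" step can be sketched as follows. This is a Python illustration under stated assumptions, not the Perl script in Listing 1: the function name extract_bounces, the lookback parameter, and the choice to group results in a dictionary (instead of six files) are all hypothetical, and field names are matched with the exact capitalization shown above.

```python
from collections import defaultdict

# The six codes this article treats as a bad address.
BAD_CODES = {"5.0.0", "5.1.1", "5.1.2", "5.1.3", "5.1.6", "5.1.8"}


def extract_bounces(lines, lookback=3):
    """Map each bad status code to the Final-Recipient lines found above it."""
    found = defaultdict(list)
    for i, line in enumerate(lines):
        if not line.startswith("Status:"):
            continue
        parts = line.split(":", 1)[1].split()
        if not parts or parts[0] not in BAD_CODES:
            continue
        # Crawl up a few lines to find the address that caused the error.
        for prev in reversed(lines[max(0, i - lookback):i]):
            if prev.startswith("Final-Recipient:"):
                found[parts[0]].append(prev.strip())
                break
    return dict(found)


sample = [
    "Final-Recipient: bobsyeruncle@nodomain.net",
    "Action: failed",
    "Status: 5.0.0",
]
print(extract_bounces(sample))
# {'5.0.0': ['Final-Recipient: bobsyeruncle@nodomain.net']}
```

In practice the lines would come from reading the bounce recipient's mail file; each dictionary entry then corresponds to one of the per-code output files.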

Listing 1 contains the script, which is split into two parts. The first part does the extraction from the mail file. The second part of the script removes extraneous information from my six output files so that they include only the addresses in a list form. When initially extracted, the lines in the temporary files are in one of two formats as they came in the mail header, for example, this:

Final-Recipient: rfc822; maniac@netcourrier.com

or this:

Final-Recipient: maniac@netcourrier.com

Either way, I need to get rid of everything but the address, which I do by reversing the order of the fields and dropping off everything after the first. One Perl shift() gets this done once I split() and reverse() the line. Here is a before-subroutine and after-subroutine look at the extracted lists.

Before:

Final-Recipient: rfc822; taniac@netconuver.com
Final-Recipient: jb@newnet.org
Final-Recipient: rfc822; funyguy@severe.net
Final-Recipient: joeblow@garbage.com
Final-Recipient: rfc822; jeneral@fred.net
Final-Recipient: rfc822; johnnygoode@comman.doh
Final-Recipient: rfc822; zerbil@todd.ca
Final-Recipient: goboy@betyeruncle.biz
Final-Recipient: rfc822; saliva@spit.net
Final-Recipient: rfc822; tenspot@sawbuck.com
Final-Recipient: rfc822; anybody@hetmail.com
Final-Recipient: rfc822; yermom@yerhouse.co.uk
Final-Recipient: geffen@deuhland.bel

After:

taniac@netconuver.com
jb@newnet.org
funyguy@severe.net
joeblow@garbage.com
jeneral@fred.net
johnnygoode@comman.doh
zerbil@todd.ca
goboy@betyeruncle.biz
saliva@spit.net
tenspot@sawbuck.com
anybody@hetmail.com
yermom@yerhouse.co.uk
geffen@deuhland.bel
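
The same split, reverse, and take-the-first-field idea can be sketched in Python. This is an illustration, not a transcription of Listing 1; the function names address_only and clean_list are hypothetical, and the sketch assumes the address is always the last whitespace-separated field on the line, as in both formats shown above.

```python
def address_only(line):
    """Return the bare address from a Final-Recipient line."""
    fields = line.split()   # split on whitespace
    fields.reverse()        # the address is now the first field
    return fields[0]        # keep it and drop everything after it


def clean_list(lines):
    """Apply address_only to every non-blank line."""
    return [address_only(line) for line in lines if line.strip()]


before = [
    "Final-Recipient: rfc822; taniac@netconuver.com",
    "Final-Recipient: jb@newnet.org",
]
print(clean_list(before))
# ['taniac@netconuver.com', 'jb@newnet.org']
```

Because only the last field is kept, the presence or absence of the "rfc822;" token makes no difference to the result.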

Armed with the six lists produced by this script (I would generally add a date stamp to the title of each so they are not overwritten), I have the information the mail list administrator needs to purge the master list. While this could be run as a weekly or bi-weekly cron job, I've found such frequency to be unnecessary, and I run it manually every two months. How best to present the information to the person or team that manages the master mailing list depends on the situation -- for me, it's a simple HTML report that I generate and email with a second script. For others, the task of managing the master list may fall to the sys admin in his or her role as postmaster.
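
For a sys admin who does end up purging the master list, the last two steps might look something like the Python sketch below. Everything here is an assumption made for illustration: the file naming scheme in stamped_name, the function purge, the sample addresses, and the decision to compare addresses case-insensitively.

```python
from datetime import date


def stamped_name(code, day=None):
    """Build a date-stamped file name for one error code (hypothetical scheme)."""
    day = day or date.today()
    return f"bad-{code}-{day.isoformat()}.txt"


def purge(master, bad):
    """Return the master list without the bad addresses.

    Assumption: addresses are compared case-insensitively.
    """
    bad_lower = {addr.strip().lower() for addr in bad}
    return [addr for addr in master if addr.strip().lower() not in bad_lower]


print(stamped_name("5.1.1", date(2005, 6, 1)))  # bad-5.1.1-2005-06-01.txt

master = ["jb@newnet.org", "real.customer@example.com", "Joeblow@garbage.com"]
bad = ["jb@newnet.org", "joeblow@garbage.com"]
print(purge(master, bad))  # ['real.customer@example.com']
```

Writing the purged list to a new file, rather than editing the master list in place, leaves a copy to fall back on if the criteria turn out to be too aggressive.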

References

Blank-Edelman, David. 2000. Perl for System Administration. Sebastopol, CA: O'Reilly & Associates.

Costales, Bryan with Allman, Eric. 2003. Sendmail, 3rd Ed. Sebastopol, CA: O'Reilly & Associates.

Jeff Bennett has been a systems administrator for six years, focusing mainly on Web and retail systems running Solaris, AIX, and Linux. Currently working on a consulting basis in Toronto, he also occasionally teaches Unix administration (Solaris certification track) at several Toronto technical institutions. He can be reached at selvasys@rogers.com.