Readership Sensitive Expiration Control for Usenet News
Yufan Hu
News administrators have always dealt with the problem of having less disk space than they would like to have for Usenet News. During the past several years, however, that problem has become much worse. The typical solutions include carrying fewer news groups or using a shorter expiration time for the groups carried. These solutions treat all groups more or less equally without considering how the groups are read. This article explores the deficiency of the traditional expiration mechanism and suggests an improved solution based on the readership of the news groups.
News Expiration
Usenet news articles are continuously fed from other news servers and stored on local disks dedicated to this purpose. This disk space is usually called the news spool. Because the spool is finite, articles can be kept only for a limited period before they are expired, meaning removed from the spool and discarded. Every news server has a mechanism for such expiration. For INN, the most popular news server on UNIX, a tool called expire does this job.
When expire starts, it checks its configuration file expire.ctl to see how the expiration job is to be done. A line like this:
*:A:7:30:60
tells expire that every article should be kept in the news spool for at least 7 days, and that an article should stay for 30 days if it does not have an "Expires:" header indicating the date it expires. If the article has such a header, it will be kept until that date, but no more than 60 days from the date it arrived in the news spool. The general format of such control rules is as follows:
pattern:A:minimum:default:purge
where "pattern" specifies a wild card pattern to match news groups. The "minimum" value is the minimum number of days that any article whose group name matches "pattern" is kept in the news spool. The "default" value is the number of days the article stays if it has no "Expires:" header indicating its expiration time. If the article has an "Expires:" header, it will be expired shortly after the indicated expiration date, but no later than "purge" days after its arrival at the server.
There may be more than one such rule in the expire.ctl file. Each rule starts with a wild card pattern. This pattern is used to match group names. If a group name matches the pattern, then the minimum, default, and purge values from the rule are used to determine whether the article should be expired when the expire utility runs. expire scans through the whole expire.ctl file from the beginning to the end. The last matching rule is used for determining the minimum, default, and purge values. Thus, general control rules can be placed at the beginning of the file, and fine adjustments can be placed toward the end of the expire.ctl file. For example:
*:A:7:30:60
comp.*:A:14:60:90
determines that articles in all groups in the "comp" hierarchy will stay for at least 14 days, for 60 days by default, and be purged after 90 days. Anything else will stay for at least 7 days, for 30 days by default, and be purged after 60 days. Any group that matches comp.* also matches *. But since the control rule for comp.* appears later, it is used to determine the expiration time for articles in the "comp" hierarchy.
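The last-match-wins lookup described above can be sketched in a few lines of Python. This is only an illustration, not part of expire or actgroups: the function names are hypothetical, Python's fnmatch stands in for INN's own wild card matching, and only rules with integer day values are handled.

```python
from fnmatch import fnmatchcase

def parse_rule(line):
    """Split 'pattern:A:minimum:default:purge' into a pattern and three day counts."""
    pattern, _flag, minimum, default, purge = line.strip().split(":")
    return pattern, (int(minimum), int(default), int(purge))

def effective_rule(group, lines):
    """Return (minimum, default, purge) from the last rule matching the group."""
    result = None
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, values = parse_rule(line)
        # fnmatchcase is an assumption standing in for INN's pattern matcher
        if fnmatchcase(group, pattern):
            result = values  # a later match overrides an earlier one
    return result

rules = ["*:A:7:30:60", "comp.*:A:14:60:90"]
print(effective_rule("comp.lang.perl", rules))  # (14, 60, 90)
print(effective_rule("rec.arts.books", rules))  # (7, 30, 60)
```

Scanning every rule and remembering the most recent match mirrors the way general rules at the top of the file are overridden by specific ones further down.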
expire.ctl Pitfalls
With today's news volume, it is very difficult for many sites to keep everything as long as 30 days. Daily news volume can be as big as 20 GB, so 30 days' worth can require as much as 600 GB. News administrators commonly use two approaches to cope with this problem. The first is to reduce the number of news groups the server carries. The server then rejects all articles in groups it does not carry, saving the disk space needed to store them. The problem with this approach is that some groups of interest to our readers may also be excluded. The second approach is to try to carry as many groups as possible but keep the articles on the server for a shorter time, so that there is more space to accept new articles. The problem with this approach is that, if the expiration time is too short, our users must be vigilant about reading the news or risk missing important postings.
Collecting Readership Data
So, how can user readership information be collected? Some UNIX administrators have tried scanning the .newsrc file in each user's home directory for subscribed groups. Although such information reflects some of the readership on these UNIX systems, it is not accurate, and collecting it probably intrudes on users' privacy. A user can subscribe to a group but then forget about it. This method is also not very practical nowadays, because many users read news from personal computers rather than from UNIX systems. These users may not even have a UNIX account, and even if they have one, they may leave no trace of their subscribed groups on the UNIX system. Because modern news servers are heavily loaded, common practice now rules out reading news directly on the news server itself. News is now commonly accessed through NNTP.
On an INN news server, there are usually some syslog entries recording the activities of the news server. These activities include users' readership information. If a user has accessed one news group and read some articles, an entry like this:
Jan 28 14:20:48 news nnrpd[8096]: reader.foo.com group news.software.nntp 1
will be recorded at the news.notice syslog priority. Proper configuration of syslogd allows this information to be saved in a file, usually called news.notice by administrators. A program can periodically scan this information to find out which groups are currently read by users.
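As a rough illustration of such a scan, the following Python sketch extracts group names from nnrpd "group" entries. It assumes each entry sits on a single line in the form shown above; the regular expression and the function names are illustrative assumptions rather than anything taken from INN or actgroups.

```python
import re

# host, the literal word "group", the group name, and a trailing number,
# following the sample nnrpd entry shown in the article
GROUP_ENTRY = re.compile(r"nnrpd\[\d+\]:\s+(\S+)\s+group\s+(\S+)\s+(\d+)\s*$")

def group_from_line(line):
    """Return the group name from an nnrpd 'group' entry, or None."""
    match = GROUP_ENTRY.search(line)
    return match.group(2) if match else None

def groups_read(lines):
    """Collect the distinct group names found in an iterable of log lines."""
    found = set()
    for line in lines:
        group = group_from_line(line)
        if group is not None:
            found.add(group)
    return found

sample = [
    "Jan 28 14:20:48 news nnrpd[8096]: reader.foo.com group news.software.nntp 1",
    # a non-group line, made up for illustration, that must be ignored
    "Jan 28 14:20:49 news nnrpd[8096]: reader.foo.com connect",
]
print(groups_read(sample))  # {'news.software.nntp'}
```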
The news.notice file is usually recycled daily by a utility called news.daily. Thus the readership information collected covers at most one day. This information can be kept in a database to represent the readership over a longer period of time. The dynamic nature of the readership can be captured by adding new group information to the database each time the news.notice file is scanned and removing old entries that are not detected again within a predefined period of time. To track the time a group was last read, both the group name and the time it was last detected in the news.notice file are recorded in the database.
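The add-and-prune bookkeeping can be sketched as follows. This is a minimal Python illustration using an in-memory dictionary in place of a real database; the function name, the max_age_days parameter, and the use of seconds since the epoch as timestamps are all assumptions.

```python
DAY = 24 * 60 * 60  # seconds in a day

def update_lastread(lastread, seen_groups, now, max_age_days):
    """Record 'now' for every group just seen, then drop stale entries."""
    for group in seen_groups:
        lastread[group] = now
    cutoff = now - max_age_days * DAY
    # Build the list first so the dictionary is not modified while iterating
    stale = [group for group, when in lastread.items() if when < cutoff]
    for group in stale:
        del lastread[group]
    return lastread

now = 100 * DAY
db = {"alt.forgotten": 10 * DAY, "comp.lang.c": 95 * DAY}
update_lastread(db, {"news.software.nntp"}, now, max_age_days=30)
print(sorted(db))  # ['comp.lang.c', 'news.software.nntp']
```

Storing only the most recent detection time per group keeps the database small: one entry per group that has been read recently, regardless of how often it was read.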
Adjusting expire.ctl
After the readership information is collected and stored in the database, it can be used to adjust the rules in expire.ctl so that the groups found in the database are kept longer, while those not found in the database (or last seen longer ago than a predefined length of time) are expired sooner. For example, if we find news.software.nntp in the database, we know that this group is currently read by some of our users, so articles in it should stay longer. The following expire.ctl:
*:A:0:1:1
#adjusted based on readership
news.software.nntp:A:30:30:30
will keep articles in news.software.nntp for up to 30 days while removing everything else once it is one day old.
Automatic Readership Information Collection and expire.ctl Adjustment Using actgroups
The readership information collection and expire.ctl file adjustment can be done automatically by a small Perl script called actgroups, shown in Listing 1. The script is named after "active groups" because it is interested only in those groups that are actively read by local readers. These groups are genuinely active, whereas many groups listed in INN's active file may not be. The first thing actgroups does is scan the news server log file, usually news.notice, to find out the readership information. The scan_log() function does this job. It scans through the news.notice file, finding lines similar to:
Jan 28 14:20:48 news nnrpd[8096]: reader.foo.com group news.software.nntp 1
to find out the group name, news.software.nntp.
Any group detected by scan_log() is used as a key in an associative array. The value of the element is the time when actgroups started. This array is mapped to a DBM file so that its values persist across runs of actgroups. The array thus records the time a group was last detected in the news.notice file. This information roughly represents the readership of the group during the interval between two recyclings of news.notice.
After readership information is collected in the "%LASTREAD" database, actgroups must decide how to adjust the expire.ctl file to create different expire times for different groups. News administrators can control the decisions made by expire and actgroups by putting control rules in the expire.ctl file. The default control rules, as specified above and in expire.ctl(5), control the groups that are not read. actgroups introduces another set of rules, with similar syntax and semantics, to control the adjustment of expire time for the active groups, that is, the groups currently being read. The format of such adjustment control rules is as follows:
#% pattern:A:minimum:default:maximum
As you can see, except for the leading token #%, everything else is in exactly the same format as an ordinary control rule. The leading token is needed so the expire command will treat them as comments and ignore them.
actgroups will scan expire.ctl, copy every line that is not generated by itself, and establish a list of control rules according to the adjustment control rules the news administrator has specified. This job is done by function scan_expire_ctl().
It scans through the expire.ctl file, and copies everything before the line:
### Actively Accessed Groups Start ###
This line and everything after it are generated automatically by actgroups, so the old copy can be ignored during the scan.
While scanning the expire.ctl file, it looks for the adjustment control rules. For each one it finds, the pattern and the rest of the rule are separated and saved in the %rules associative array for later use. Each pattern is also saved in the @patterns array to preserve the order of the rules. This preserves the semantics of the expire.ctl control rules, so that rules appearing later in the file take precedence over rules appearing earlier. Thus, the administrator can specify general adjustment rules first, followed by more specific ones.
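The scan just described might look roughly like this in Python. It is a sketch of the idea only, not a translation of the Perl scan_expire_ctl() from the listing; the return shape and the handling of a repeated pattern (moved to the end so the later rule keeps precedence) are assumptions.

```python
MARKER = "### Actively Accessed Groups Start ###"
PREFIX = "#%"

def scan_expire_ctl(lines):
    """Return (lines to keep, ordered patterns, pattern -> rest of rule)."""
    kept, patterns, rules = [], [], {}
    for line in lines:
        line = line.rstrip("\n")
        if line.strip() == MARKER:
            break  # the marker and everything after it are regenerated
        kept.append(line)
        if line.startswith(PREFIX):
            pattern, rest = line[len(PREFIX):].strip().split(":", 1)
            if pattern in rules:
                patterns.remove(pattern)  # assumption: a repeat moves to the end
            patterns.append(pattern)
            rules[pattern] = rest
    return kept, patterns, rules

ctl = [
    "*:A:0:1:1",
    "#% *:A:7:14:30",
    "#% comp.*:A:14:30:60",
    MARKER,
    "comp.lang.perl:A:14:30:60",
]
kept, patterns, rules = scan_expire_ctl(ctl)
print(patterns)        # ['*', 'comp.*']
print(rules["comp.*"])  # A:14:30:60
```

Keeping the patterns in a separate ordered list, alongside the lookup table, is what lets a later pass apply the same last-match-wins rule that expire itself uses.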
Now actgroups has the readership information and the adjustment rules. It is ready to generate adjustment lines for groups that it considers active. The function adjust_expire_ctl() does this job.
It iterates over the %LASTREAD database to find an adjustment rule for each key in the %LASTREAD associative array, which corresponds to a group name detected by actgroups earlier. If a match is found and the time recorded in the database is not more than $purge days old, then an adjustment control line is emitted for the group. All of the expire values are replaced by the ones specified in the corresponding adjustment control rule.
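A Python sketch of this step follows. It is not the author's adjust_expire_ctl(); it assumes the ordered pattern list and pattern-to-rule mapping described earlier, timestamps in seconds, Python's fnmatch in place of INN's pattern matching, and that the last field of an adjustment rule serves as the $purge threshold.

```python
from fnmatch import fnmatchcase

DAY = 24 * 60 * 60  # seconds in a day
MARKER = "### Actively Accessed Groups Start ###"

def adjust_expire_ctl(lastread, patterns, rules, now):
    """Build the generated section: the marker plus one line per active group."""
    out = [MARKER]
    for group in sorted(lastread):
        matched = None
        for pattern in patterns:
            if fnmatchcase(group, pattern):
                matched = pattern  # the last matching rule wins
        if matched is None:
            continue  # no adjustment rule covers this group
        purge_days = int(rules[matched].split(":")[-1])
        if now - lastread[group] <= purge_days * DAY:
            out.append(f"{group}:{rules[matched]}")
    return out

now = 100 * DAY
lastread = {
    "news.software.nntp": now - 2 * DAY,
    "alt.old": now - 40 * DAY,
    "comp.lang.perl": now,
}
patterns = ["*", "comp.*"]
rules = {"*": "A:7:14:30", "comp.*": "A:14:30:60"}
for line in adjust_expire_ctl(lastread, patterns, rules, now):
    print(line)
```

With these sample values, alt.old is skipped because it was last read 40 days ago and its matching rule has a 30-day threshold, while the other two groups each receive a line carrying the values of their last matching adjustment rule.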
Finally, the old expire.ctl file is backed up, and the newly generated file is renamed to expire.ctl, ready to be used by expire.
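One way to carry out this final step is sketched below in Python. The ".new" and ".bak" suffixes and the function name are assumptions made for illustration; the article does not say how actgroups names its temporary or backup files.

```python
import os
import shutil

def install_expire_ctl(path, new_lines):
    """Write the new file beside the old one, back up the old, then swap."""
    tmp = path + ".new"  # hypothetical temporary name
    with open(tmp, "w") as handle:
        handle.write("\n".join(new_lines) + "\n")
    if os.path.exists(path):
        shutil.copy2(path, path + ".bak")  # hypothetical backup name
    os.replace(tmp, path)  # rename the new file over the old one
```

Writing the complete file first and renaming it afterward means expire never sees a half-written expire.ctl, even if the script is interrupted.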
The installation and configuration of the script are very simple. The script can be copied into any location and started at any time. Only three Perl variables need to be set according to each news server's configuration. They are the location of the %LASTREAD database (which is usually in the same place as the history file), the location of news.notice, and the location of expire.ctl. To have actgroups adjust expire time for active groups, set corresponding adjustment rules in the expire.ctl file.
actgroups needs only to be run before expire runs. The best fit may be to run it together with news.daily. To do so, we can pass the script's full path as a parameter in the news.daily cron job entry, as in:
30 3 * * * /usr/local/news/bin/news.daily delayrm /usr/local/news/local/actgroups
news.daily will call actgroups before it starts the expiration process.
Issues
actgroups adjusts the expiration time of a news group based on whether or not the group is read. It requires at least one article to be read by one user to activate the group. If a group was previously marked as inactive and then activated by a user, it will be impossible for the user to search articles older than the default time that was defined in expire.ctl at the time the group was activated. Only after activation will the articles in the group be kept longer. After the group is activated, at least one article must be read in the time defined by purge value; otherwise, the group will be marked as inactive again.
The standard nnrpd server will not send a log entry to syslog if the reader only browses the group without actually reading an article. A simple patch to the nnrpd source can solve this problem. The patch can be found at:
http://www.regentec.com/~yufan/actgroups/actgroups.html
Conclusion
actgroups is a useful tool for managing news servers that provide news reading service directly to users. It collects user readership information and adjusts the expiration time for groups that are actively read by users. This tool can help provide maximum news service within limited disk space. It can also help postpone the need to invest in additional disk space for your news spool.
actgroups may not be very effective for servers providing news feed service to other servers. In this case, if the default expiration time is set too short, the articles may be expired before they are delivered to the other news servers.
actgroups has been used with good results on all the INN servers I have administered, including the newly released INN 2.0 with the traditional storage format. The current version is the result of suggestions from many people, and their comments are acknowledged. The original assumption that only a small set of groups is actively read in any period of time has been confirmed by the number of groups recorded in the %LASTREAD database. Using the script, I have been able to keep a good feed for the "big 8" news groups, excluding binary groups, on a server with only 4 GB of disk space.
About the Author
Dr. Yufan Hu started his system administration and software development career using UNIX in 1983 on a PDP-11/23. Since then, UNIX system and network administration have been part of his research and development career. He is currently in charge of networking and system-related activities for Regent Electronics Corp. He can be reached at: yufan@recmail.com.