Checking Your Bookmarks
Randal L. Schwartz
Like most people, I've bookmarked about a third of the known Internet
by now. Of course, sites go away, and URLs become invalid, so some
of my lesser-used bookmarks are pointing off into 404-land.
Some browsers have an option to periodically revalidate bookmarks.
My favorite browser lacks such a feature, but it does include
the ability to export an HTML file of all the bookmarks and reimport
a similar file in a way that can be easily merged back into my existing
bookmark setup. So, I thought I'd take a whack at a Perl-based bookmark
validator, especially one that worked in parallel so that I could
get through my bookmark list fairly quickly. The result is in Listing
1, below.
Lines 1 through 3 declare the program as a Perl program and turn
on the compiler restrictions and warnings as good programming practice.
Lines 5 through 7 pull in three modules that are found in the
CPAN. The HTML::Parser module enables my program to cleanly
parse HTML with all its intricacies. The LWP::Parallel::UserAgent
module provides a means to fetch many Web pages at once. And finally,
HTTP::Request::Common sets up an HTTP::Request object
so that I can fetch it with the user agent.
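In sketch form, that preamble amounts to something like this (the exact lines in Listing 1 may differ slightly):

#!/usr/bin/perl
use strict;
use warnings;                    # compiler restrictions and warnings

use HTML::Parser;                # parse HTML cleanly
use LWP::Parallel::UserAgent;    # fetch many pages at once
use HTTP::Request::Common;       # build HTTP::Request objects (GET)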
Lines 9 and 10 set up the user interface for this program. I can
use the program as a filter:
./this_program <Bookmarks.html >NewBookmarks.html
or as an in-place editor:
./this_program Bookmarks.html
As an in-place editor, the Bookmarks.html file will be renamed
to Bookmarks.html~ (with an appended tilde), and the new version
will appear at the original name.
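One plausible way to get that dual behavior (a sketch, not necessarily the listing's exact lines) is to enable Perl's in-place editing only when filenames are given:

$^I = "~" if @ARGV;   # filenames given: edit in place, keeping a ~ backup
                      # no filenames: read STDIN, write STDOUT as a filter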
Lines 11 to 19 edit each file (usually just one) in turn, or the
standard input as one file. Line 12 slurps the entire file into
$_. Two passes are performed over the HTML text -- the first
pass in line 14 finds the existing links, and the second pass in
line 18 edits the HTML with additional DEAD - text for links
that were found broken. In between, we'll check the validity of
the discovered URLs, in line 16. This is our entire top-level code,
using named subroutines to clearly delineate the various phases
and couplings of this program. I find it helpful to break down a
program in this way.
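In outline, that top level comes out roughly like this (validate_links and rewrite_html are named below; find_links is my stand-in name for the link-gathering routine):

while (defined($_ = do { local $/; <> })) {   # slurp each file (or STDIN) whole
    my $urls = find_links($_);       # pass 1: collect the href values
    validate_links($urls);           # check each URL, marking LIVE or DEAD
    rewrite_html($_, $urls);         # pass 2: reprint, flagging dead links
}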
Let's look at how the links are found, in the subroutine beginning
in line 21. First, we'll accept the input parameter in line 22.
Second, we'll create a staging variable for the return value in
line 24.
Lines 26 to 34 create an HTML::Parser object. Creating
a parser object is an art form, because there are so many buttons
and dials and levers on the instantiation and later reconfiguration
of the parser. My usual trick is to find a similar example and then
modify it until it does what I want.
In this case, we want to be notified of all start tags, so we'll
define a start handler (line 28) consisting of an anonymous subroutine
(lines 29 to 32) and a description of the parameters that will be
sent to the subroutine (line 33). We're asking for the tagname
(like "a") and the attribute hash as the only two parameters. We
extract these parameters in line 30.
Line 31 ignores all a tags that don't have an href
attribute, which skips over local anchors and anything else more
bizarre. Line 32 creates an element in the hash with the key being
the same as the URL. The value is unimportant at this point, although
we check whether the value is DEAD later, so that would be
a bad value for an initialization.
Once the parser is created, we'll tell it to parse a string and
then finish up in lines 36 and 37. When start tags are seen, the
requested callback is invoked, populating the %urls hash
at the appropriate time. At the end of the input string, we'll return
a reference to that populated hash so that the caller has some data
to manipulate.
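Pulled together, that link-gathering subroutine might look roughly like this (find_links is an assumed name, and the details may differ from Listing 1):

sub find_links {
    my $html = shift;               # the HTML text to scan
    my %urls;                       # staging hash for the return value
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $attr) = @_;                  # tagname and attribute hash
            return unless $tag eq "a" and exists $attr->{href};
            $urls{ $attr->{href} } = 1;             # any value but "DEAD" will do
        }, "tagname, attr" ],
    );
    $p->parse($html);
    $p->eof;
    return \%urls;                  # give the caller a reference to the hash
}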
The validate_links routine (beginning in line 42) is really
the heart of this program, because we'll now take the list of URLs
(the keys of the hash in line 43) and verify that they are still
dot-com, not dot-bomb.
Line 45 creates the parallel user agent object. This object is
a virtual browser with the ability to fetch multiple URLs at once
(default 5). The max_size value says that we don't need to
see anything past the first byte of the response, so we can stop
when the first "chunk" of text has been read from the remote server.
(This is actually a feature of LWP::UserAgent, from which
LWP::Parallel::UserAgent inherits.)
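That setup might look like this; the max_size value of 1 is my guess at what "first byte" translates to:

my $pua = LWP::Parallel::UserAgent->new;   # fetches several URLs at a time
$pua->max_size(1);   # stop reading a response once the first chunk arrives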
Lines 47 to 49 set up the list of URLs that the user agent will
fetch once activated. We'll just grab the keys (efficiently) from
the hash referenced by $urls and call the register
method of the user agent with an HTTP::Request object that
GETs the corresponding URL.
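In sketch form:

for my $url (keys %$urls) {
    $pua->register(GET $url);    # queue a GET request for this bookmark
}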
Line 51 is where our program will spend most of the "real" time.
The wait method call tells the user agent to do its job,
waiting at most 30 seconds for each connection and response. The
result of the wait method is a hashref whose values are LWP::Parallel::UserAgent::Entry
objects representing the result of attempting to fetch each page.
Calling request on these objects (as in line 52) gives us
the original request, while the response method (as in line
53) gives us the corresponding response. We fetch the original URL
and its success status into a couple of variables, and then update
the hash referenced by $urls with a LIVE/DEAD code in line
54, also logging each result to STDERR for information purposes.
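Roughly, that wait-and-record step looks like this sketch:

my $entries = $pua->wait(30);           # block until done; 30 seconds per connection
for my $entry (values %$entries) {
    my $url = $entry->request->uri;     # the URL from the original request
    my $ok  = $entry->response->is_success;
    $urls->{$url} = $ok ? "LIVE" : "DEAD";
    warn "$url is $urls->{$url}\n";     # log each result to STDERR
}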
Once we have a hash mapping each URL to a LIVE/DEAD code, it's
time to patch up the original file, marking all dead links with
a prefix of DEAD -, using the rewrite_html routine
beginning in line 60.
Lines 61 and 62 capture the incoming parameters: the original
HTML text and the reference to the hash mapping each URL to its
status.
Line 64 sets up a $dead flag. If we see a start tag that
begins a link to a dead page, we'll set that flag true, and then
update the first following text to include our DEAD - prefix,
resetting the variable as needed.
Lines 66 to 87 set up a new HTML::Parser object. This one
is a bit more complex than the previous one, because we have to
watch for link start tags and the text of links, and copy everything
else through.
As before, a start handler is enabled, starting in line 68. Because
we're now echoing the input text, we'll ask for the original text
as one of the parameters, which is printed in line 74.
Lines 71 to 73 determine whether the current tag is indeed a dead
link. If so, line 72 sets $dead to 1.
Line 76 defines a text handler, called as the parser recognizes
the text of the HTML document. If we see some text, and our $dead
flag is set, we'll prefix the existing text with DEAD - and
reset the $dead flag. If the text already begins with the DEAD -
prefix, we'll leave it alone, so that we don't keep adding the prefix
each time the program is run. The original or altered text is then printed
in line 83.
Lines 85 and 86 define a "default" handler, called for everything
else that isn't a start tag or text content, such as end tags, comments,
processing instructions, and so on. Here, we're just passing through
everything we don't otherwise care about.
Lines 89 and 90 cause the incoming HTML to be parsed, resulting
in the majority of the text being passed unmodified to the default
output handle, except for the dead links, which will have been appropriately
altered.
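Assembled from the pieces just described, the rewriting pass might look roughly like this sketch:

sub rewrite_html {
    my ($html, $urls) = @_;    # original text, plus the URL-to-status hashref
    my $dead = 0;              # true while we're inside a link to a dead page
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $attr, $text) = @_;
            $dead = 1 if $tag eq "a" and exists $attr->{href}
                and ($urls->{ $attr->{href} } || "") eq "DEAD";
            print $text;                        # echo the original tag text
        }, "tagname, attr, text" ],
        text_h => [ sub {
            my $text = shift;
            if ($dead) {
                $text = "DEAD - $text" unless $text =~ /^DEAD - /;
                $dead = 0;
            }
            print $text;
        }, "text" ],
        default_h => [ sub { print shift }, "text" ],  # pass everything else through
    );
    $p->parse($html);
    $p->eof;
}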
And that's all there is! I save the current bookmarks into a file,
run the program, wait until it completes, and then I re-import the
modified HTML file as my new bookmarks. And now my bookmarks are
all fresh and shiny new. Until next time, enjoy!
Randal L. Schwartz is a two-decade veteran of the software
industry -- skilled in software design, system administration, security,
technical writing, and training. He has coauthored the "must-have"
standards: Programming Perl, Learning Perl, Learning
Perl for Win32 Systems, and Effective Perl Programming.
He's also a frequent contributor to the Perl newsgroups, and has
moderated comp.lang.perl.announce since its inception. Since 1985,
Randal has owned and operated Stonehenge Consulting Services, Inc.