Checking Your Bookmarks
Randal L. Schwartz
Like most people, I've bookmarked about a third of the known Internet
by now. Of course, sites go away, and URLs become invalid, so some
of my lesser-used bookmarks are pointing off into 404-land.
Some browsers have an option to periodically revalidate bookmarks.
My favorite browser lacks such a feature, but it does include
the ability to export an HTML file of all the bookmarks and reimport
a similar file in a way that can be easily merged back into my existing
bookmark setup. So, I thought I'd take a whack at a Perl-based bookmark
validator, especially one that worked in parallel so that I could
get through my bookmark list fairly quickly. The result is in Listing
1, below.
Lines 1 through 3 declare the program as a Perl program and turn
on the compiler restrictions and warnings as good programming practice.
Lines 5 through 7 pull in three modules that are found in the
CPAN. The HTML::Parser module enables my program to cleanly
parse HTML with all its intricacies. The LWP::Parallel::UserAgent
module provides a means to fetch many Web pages at once. And finally,
HTTP::Request::Common sets up an HTTP::Request object
so that I can fetch it with the user agent.
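In sketch form, that preamble amounts to something like this (the exact lines in Listing 1 may differ slightly):

#!/usr/bin/perl
use strict;
use warnings;                    # compiler restrictions and warnings

use HTML::Parser;                # parse HTML cleanly
use LWP::Parallel::UserAgent;    # fetch many pages at once
use HTTP::Request::Common;       # build HTTP::Request objects (GET)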
Lines 9 and 10 set up the user interface for this program. I can
use the program as a filter:
./this_program <Bookmarks.html >NewBookmarks.html
or as an in-place editor:
./this_program Bookmarks.html
As an in-place editor, the Bookmarks.html file will be renamed
to Bookmarks.html~ (with an appended tilde), and the new version
will appear at the original name.
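One plausible way to get that dual behavior (a sketch, not necessarily the listing's exact lines) is to enable Perl's in-place editing only when filenames are given:

$^I = "~" if @ARGV;   # filenames given: edit in place, keeping a ~ backup
                      # no filenames: read STDIN, write STDOUT as a filter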
Lines 11 to 19 edit each file (usually just one) in turn, or the
standard input as one file. Line 12 slurps the entire file into
$_. Two passes are performed over the HTML text -- the first
pass in line 14 finds the existing links, and the second pass in
line 18 edits the HTML with additional DEAD - text for links
that were found broken. In between, we'll check the validity of
the discovered URLs, in line 16. This is our entire top-level code,
using named subroutines to clearly delineate the various phases
and couplings of this program. I find it helpful to break down a
program in this way.
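In outline, that top level comes out roughly like this (validate_links and rewrite_html are named below; find_links is my stand-in name for the link-gathering routine):

while (defined($_ = do { local $/; <> })) {   # slurp each file (or STDIN) whole
    my $urls = find_links($_);       # pass 1: collect the href values
    validate_links($urls);           # check each URL, marking LIVE or DEAD
    rewrite_html($_, $urls);         # pass 2: reprint, flagging dead links
}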
Let's look at how the links are found, in the subroutine beginning
in line 21. First, we'll accept the input parameter in line 22.
Second, we'll create a staging variable for the return value in
line 24.
Lines 26 to 34 create an HTML::Parser object. Creating
a parser object is an art form, because there are so many buttons
and dials and levers on the instantiation and later reconfiguration
of the parser. My usual trick is to find a similar example and then
modify it until it does what I want.
In this case, we want to be notified of all start tags, so we'll
define a start handler (line 28) consisting of an anonymous subroutine
(lines 29 to 32) and a description of the parameters that will be
sent to the subroutine (line 33). We're asking for the tagname
(like "a") and the attribute hash as the only two parameters. We
extract these parameters in line 30.
Line 31 ignores all a tags that don't have an href
attribute, which skips over local anchors and anything else more
bizarre. Line 32 creates an element in the hash with the key being
the same as the URL. The value is unimportant at this point, although
we check whether the value is DEAD later, so that would be
a bad value for an initialization.
Once the parser is created, we'll tell it to parse a string and
then finish up in lines 36 and 37. When start tags are seen, the
requested callback is invoked, populating the %urls hash
at the appropriate time. At the end of the input string, we'll return
a reference to that populated hash so that the caller has some data
to manipulate.
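Pulled together, that link-gathering subroutine might look roughly like this (find_links is an assumed name, and the details may differ from Listing 1):

sub find_links {
    my $html = shift;               # the HTML text to scan
    my %urls;                       # staging hash for the return value
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $attr) = @_;                  # tagname and attribute hash
            return unless $tag eq "a" and exists $attr->{href};
            $urls{ $attr->{href} } = 1;             # any value but "DEAD" will do
        }, "tagname, attr" ],
    );
    $p->parse($html);
    $p->eof;
    return \%urls;                  # give the caller a reference to the hash
}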
The validate_links routine (beginning in line 42) is really
the heart of this program, because we'll now take the list of URLs
(the keys of the hash in line 43) and verify that they are still
dot-com, not dot-bomb.
Line 45 creates the parallel user agent object. This object is
a virtual browser with the ability to fetch multiple URLs at once
(default 5). The max_size value says that we don't need to
see anything past the first byte of the response, so we can stop
when the first "chunk" of text has been read from the remote server.
(This is actually a feature of LWP::UserAgent, from which
LWP::Parallel::UserAgent inherits.)
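That setup might look like this; the max_size value of 1 is my guess at what "first byte" translates to:

my $pua = LWP::Parallel::UserAgent->new;   # fetches several URLs at a time
$pua->max_size(1);   # stop reading a response once the first chunk arrives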
Lines 47 to 49 set up the list of URLs that the user agent will
fetch once activated. We'll just grab the keys (efficiently) from
the hash referenced by $urls and call the register
method of the user agent with an HTTP::Request object that
GETs the corresponding URL.
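In sketch form:

for my $url (keys %$urls) {
    $pua->register(GET $url);    # queue a GET request for this bookmark
}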
Line 51 is where our program will spend most of the "real" time.
The wait method call tells the user agent to do its job,
waiting at most 30 seconds for each connection and response. The
result of the wait method is a hashref whose values are LWP::Parallel::UserAgent::Entry
objects representing the result of attempting to fetch each page.
Calling request on these objects (as in line 52) gives us
the original request, while the response method (as in line
53) gives us the corresponding response. We fetch the original URL
and its success status into a couple of variables, and then update
the hash referenced by $urls with a LIVE/DEAD code in line
54, also logging each result to STDERR for information purposes.
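Roughly, that wait-and-record step looks like this sketch:

my $entries = $pua->wait(30);           # block until done; 30 seconds per connection
for my $entry (values %$entries) {
    my $url = $entry->request->uri;     # the URL from the original request
    my $ok  = $entry->response->is_success;
    $urls->{$url} = $ok ? "LIVE" : "DEAD";
    warn "$url is $urls->{$url}\n";     # log each result to STDERR
}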
Once we have a hash mapping each URL to a LIVE/DEAD code, it's
time to patch up the original file, marking all dead links with
a prefix of DEAD -, using the rewrite_html routine
beginning in line 60.
Lines 61 and 62 capture the incoming parameters: the original
HTML text and the reference to the hash mapping each URL to its
status.
Line 64 sets up a $dead flag. If we see a start tag that
begins a link to a dead page, we'll set that flag true, and then
update the first following text to include our DEAD - prefix,
resetting the variable as needed.
Lines 66 to 87 set up a new HTML::Parser object. This one
is a bit more complex than the previous one, because we have to
watch for link start tags and the text of links, and copy everything
else through.
As before, a start handler is enabled, starting in line 68. Because
we're now echoing the input text, we'll ask for the original text
as one of the parameters, which is printed in line 74.
Lines 71 to 73 determine whether the current tag is indeed a dead
link. If so, line 72 sets $dead to 1.
Line 76 defines a text handler, called as the parser recognizes
the text of the HTML document. If we see some text, and our $dead
flag is set, we'll prefix the existing text with DEAD - and
reset the $dead flag. If the text already begins with the DEAD -
prefix, we'll leave it alone, so that we don't keep adding the prefix
each time the program is run. The original or altered text is then printed
in line 83.
Lines 85 and 86 define a "default" handler, called for everything
else that isn't a start tag or text content, such as end tags, comments,
processing instructions, and so on. Here, we're just passing through
everything we don't otherwise care about.
Lines 89 and 90 cause the incoming HTML to be parsed, resulting
in the majority of the text being passed unmodified to the default
output handle, except for the dead links, which will have been appropriately
altered.
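Assembled from the pieces just described, the rewriting pass might look roughly like this sketch:

sub rewrite_html {
    my ($html, $urls) = @_;    # original text, plus the URL-to-status hashref
    my $dead = 0;              # true while we're inside a link to a dead page
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $attr, $text) = @_;
            $dead = 1 if $tag eq "a" and exists $attr->{href}
                and ($urls->{ $attr->{href} } || "") eq "DEAD";
            print $text;                        # echo the original tag text
        }, "tagname, attr, text" ],
        text_h => [ sub {
            my $text = shift;
            if ($dead) {
                $text = "DEAD - $text" unless $text =~ /^DEAD - /;
                $dead = 0;
            }
            print $text;
        }, "text" ],
        default_h => [ sub { print shift }, "text" ],  # pass everything else through
    );
    $p->parse($html);
    $p->eof;
}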
And that's all there is! I save the current bookmarks into a file,
run the program, wait until it completes, and then I re-import the
modified HTML file as my new bookmarks. And now my bookmarks are
all fresh and shiny new. Until next time, enjoy!
Randal L. Schwartz is a two-decade veteran of the software
industry -- skilled in software design, system administration, security,
technical writing, and training. He has coauthored the "must-have"
standards: Programming Perl, Learning Perl, Learning
Perl for Win32 Systems, and Effective Perl Programming.
He's also a frequent contributor to the Perl newsgroups, and has
moderated comp.lang.perl.announce since its inception. Since 1985,
Randal has owned and operated Stonehenge Consulting Services, Inc.