Cover V12, I01

The Duct Tape of the Internet

Randal L. Schwartz

When you're a Perl programmer, you never fret about those little ugly tasks that creep up. Perl can deal with file wrangling, text manipulation, and process management in a way unequaled by any other single language, whether open source or proprietary.

For example, in this column, I'll take a simple file and text-wrangling task and show how I solved it with Perl. I was a systems administrator for many years, and I'd say that this task is representative of those niggling little things that I faced, typically daily, in the course of my job.

Nearly all Perl modules contain embedded documentation, called "POD" (described by perldoc perlpod). When I install a module from the Comprehensive Perl Archive Network (the "CPAN": see http://www.cpan.org for further information), the module is usually installed into a place where my Perl binary can find it (along Perl's @INC path). By default, the installation process also creates an nroff -man page, so that the man command can display a nicely formatted version (presuming you extend your MANPATH or equivalent). Thus, for most modules, you can say either perldoc Some::Module (to convert the embedded POD into text) or man Some::Module (to display the preprocessed man page).

However, the server that runs http://www.stonehenge.com runs OpenBSD (mostly so I can sleep at night knowing that security is a key priority of the OpenBSD developers). The default Perl installation of OpenBSD is configured in such a way that the man pages are not generated for non-core Perl modules. I'm expected to type perldoc Some::Module to get the documentation for the module, instead of the more familiar man Some::Module; however, I can use man for the core modules. Because I found this rather confusing, I faced two alternatives:

1. I could hack the core installation of Perl so that it would install man pages, thereby risking breakage if the Perl installation were upgraded during a minor or major release.

2. I could write a simple tool to take all the embedded POD and generate man pages into my private area.

I decided to write a simple tool, mostly because I'm opposed to touching anything in the core distribution, since I have no idea if someone at OpenBSD headquarters is likely to change things out from under me.

And a simple tool it is, although it's about 80 lines of Perl code. So, looking at a few lines at a time, here's what I wrote, in about the order that I created the lines. To begin, I started with my normal header:

#!/usr/bin/perl -w
use strict;
$|++;
With these three lines, I've turned on warnings, enabled the common compiler restrictions (undeclared variables, soft references, and barewords are all disabled), and turned off the buffering for STDOUT.

Next, I put in a few configuration lines that I might change, based on where I'm running the program:

## BEGIN configuration

my $MAN3DIR = "/home/merlyn/man/man3";
my $MAN3EXT = "3p";

## END configuration
Here I've defined a location below my home directory where I've placed other personal manpages, and an extension for the specific Perl module pages. Traditionally, Perl modules have the 3p extension and are placed in section 3 of the UNIX manual. I've added /home/merlyn/man to my MANPATH, so the man command finds this directory just fine.

use Pod::Man;
use File::Find;
use Config;
Following that, I bring in the three modules (all in the Perl core distribution) that I'll need to wander through the installed directories and find the POD files. The Pod::Man module can convert POD into manpages. The File::Find module recurses through subdirectories. The Config module provides a hash interface to the configuration parameters for the installed Perl. In fact, the next two lines use that hash to locate two specific directories:

my $SITELIB = $Config{sitelib};
my $SITEARCH = $Config{sitearch};
The value for $SITELIB gives the path in which local Perl modules are installed. $SITEARCH provides a similar path for architecture-specific modules -- those which contain binary files resulting from compiling C (or other languages). Generally, the $SITEARCH directory will be within the $SITELIB directory, and this program presumes that.
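
Because the program presumes that the architecture-specific directory sits inside the general one, it can be worth checking that assumption on a given installation. The following is a minimal sketch, not part of the original program; the helper name is_within is hypothetical, and it compares path strings only (no symlinks are resolved).

```perl
use strict;
use warnings;
use Config;

# True if $child is the same directory as $parent or lies somewhere below it.
# This is a plain string comparison on the paths; symlinks are not resolved.
sub is_within {
    my ($child, $parent) = @_;
    return $child eq $parent || index($child, "$parent/") == 0;
}

# Show what this Perl was configured with, and whether the assumption holds.
printf "sitelib:  %s\nsitearch: %s\n", $Config{sitelib}, $Config{sitearch};
print "sitearch is ",
      (is_within($Config{sitearch}, $Config{sitelib}) ? "" : "NOT "),
      "inside sitelib\n";
```

If the last line reports NOT, a single find starting at the sitelib directory would miss the architecture-specific modules, and both directories would need to be searched.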

Next, I'll create a Pod::Man object configured for the task:

my $podmanparser = Pod::Man->new(section => $MAN3EXT);
The section value gives the name appearing in the page header banner, mostly cosmetic, but nice to get right.

Now comes the task of finding the existing POD documentation. So, after a few tries, I came up with the following loop with File::Find:

my %pods;
find sub {
  return unless /\.p(m|od)$/;
  my $package = $File::Find::name;
  for ($package) {
    s{^\Q$SITEARCH/}{}
      or s{^\Q$SITELIB/}{}
        or die "Cannot remove $SITEARCH or $SITELIB from $File::Find::name\n";
    s/\.p(m|od)$//
      or die "What happened to the ext in $package?\n";
    s{/}{::}g;
  }
  push @{$pods{$package}}, $File::Find::name;
}, $SITELIB;
There's a lot going on here, and it's best to work from the outside in. The find subroutine has been imported from File::Find and is presented with a subroutine reference (here, an anonymous subroutine) and a starting path, $SITELIB. The find routine starts at the top directory, recursing down, calling the subroutine for each found entry (even ones in which we're not interested). The line:

return unless /\.p(m|od)$/;
rejects the filenames that are neither Perl modules nor Perl POD files by looking at $_, which contains the basename (no directory part) of the file or directory being examined. The next few lines extract the package name for the filename into $package. It takes the full path from $File::Find::name, then removes either the $SITEARCH or $SITELIB prefix from the path. If neither of these succeeds, then something has gone terribly wrong, so it will abort.

Next, these lines:

s/\.p(m|od)$//
  or die "What happened to the ext in $package?\n";
s{/}{::}g;
turn the remainder of the name into a module name, by stripping off the extension and replacing the slashes with double-colon package delimiters. Finally, the subroutine adds this file name to an arrayref contained within the %pods hash, indexed by the package name. Why a list? Because many modules have a separate POD file, so we'll see both Some/Module.pm and Some/Module.pod. We'll later sort out which of these to use for the manpage, but we'll record them all for now.

When find has finished walking the tree, we have a hash %pods, keyed by package name, with each entry comprising a list of one or more files that may contain the documentation for that module.
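
The path-to-package step can be tried on its own, without walking a real directory tree. This is a sketch under stated assumptions: the function name path_to_package is hypothetical, the example paths are invented, and unlike the program above it returns nothing on a mismatch rather than calling die.

```perl
use strict;
use warnings;

# Turn an installed file path into a package name. Returns undef (or an
# empty list) if the path is under neither directory or has no .pm/.pod
# extension.
sub path_to_package {
    my ($path, $sitearch, $sitelib) = @_;
    my $package = $path;
    # Try the deeper (arch-specific) prefix first, then the general one.
    $package =~ s{^\Q$sitearch\E/}{}
        or $package =~ s{^\Q$sitelib\E/}{}
        or return;
    $package =~ s/\.p(?:m|od)$// or return;
    $package =~ s{/}{::}g;
    return $package;
}

# Hypothetical directories, for illustration only.
my $lib  = "/usr/local/lib/site_perl";
my $arch = "$lib/i386-openbsd";
print path_to_package("$lib/Some/Module.pm", $arch, $lib), "\n";  # Some::Module
print path_to_package("$arch/Some/Module.pod", $arch, $lib), "\n"; # Some::Module
```

Both calls yield the same package name, which is why the hash stores a list of files per package rather than a single file.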

When I showed this program to one of my friends, my friend commented (only after I had toiled over this part), "Why didn't you just use Pod::Find?" Ah, yes. If I'd only known, I could have reduced this part of the program to a few lines of code. I'll have to file that away for use in a future program. The lesson here is "always check the CPAN first, because any interesting task has likely already been solved".

The next step is to wander through the hash and do whatever it takes to update the manpages if needed. I'll start with a loop like this:

POD: for my $pod (sort keys %pods) {
  my @files = @{$pods{$pod}};
  ... more code here ...
}
I had to name the loop because we'll see a point later where I want to execute a next against this loop even though I'm in a nested loop. So, $pod contains a package name, and @files contains one or more source files for that package. Next, we need to figure out which one of many source files is needed if there's more than one:

if (@files > 1) {    # more than one?  must sort
  @files = sort {
    ## primary: prefer arch-specific over non-arch-specific
    to_boolean($b =~ m{^\Q$SITEARCH}) <=>
      to_boolean($a =~ m{^\Q$SITEARCH})
        ## secondary: prefer .pod to .pm
        or to_boolean($b =~ /\.pod$/) <=> to_boolean($a =~ /\.pod$/);
  } @files;
}
my $file = shift @files;    # first one is always best now
Again, a lot of stuff going on here. If there's more than one file, we'll sort it, preferring architecture-specific files over generic files, and .pod files over .pm files. The first entry in the list after sorting (or the only entry in the list if there was only one to start with) is now the most likely candidate for our manpage.

The to_boolean routine maps any false value to 0 and any true value to 1, so that the <=> comparisons sort nicely:

sub to_boolean {
  $_[0] ? 1 : 0;
}
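
To see the two-level preference in action, the comparison can be wrapped in a small function and fed a few candidate paths. This is only a sketch: the name best_first and the sample paths are hypothetical, and the directory is passed as an argument instead of being read from $SITEARCH.

```perl
use strict;
use warnings;

sub to_boolean { $_[0] ? 1 : 0 }

# Order candidate files best-first: arch-specific before generic,
# then .pod before .pm within each group.
sub best_first {
    my ($sitearch, @files) = @_;
    return sort {
        to_boolean($b =~ m{^\Q$sitearch\E/}) <=> to_boolean($a =~ m{^\Q$sitearch\E/})
            or
        to_boolean($b =~ /\.pod$/) <=> to_boolean($a =~ /\.pod$/)
    } @files;
}

# Hypothetical directories, for illustration only.
my $lib  = "/usr/local/lib/site_perl";
my $arch = "$lib/i386-openbsd";
print "$_\n" for best_first($arch, "$lib/Foo.pm", "$lib/Foo.pod", "$arch/Foo.pm");
```

The arch-specific Foo.pm comes out first, followed by the generic Foo.pod and then the generic Foo.pm. Putting $b on the left of each <=> is what makes the "true" items sort to the front.
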
Next, we'll figure out the name of the manfile, and determine whether we have any work to do:

my $manfile = "$MAN3DIR/$pod.$MAN3EXT";
next if
  -e $manfile and
    -M $manfile < -M $file;    # skip if exists and newer
If the manpage file exists, and is newer than our source file, we've got nothing to do, so we continue to the next entry.
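
The freshness test relies on -M returning a file's age in days, so the smaller number belongs to the newer file. A minimal sketch of the same check as a reusable function follows; the name needs_update is hypothetical, and it assumes the source file exists.

```perl
use strict;
use warnings;

# True if $target is missing or is not newer than $source.
# -M gives the age in days (relative to program start), so a smaller
# value means a more recently modified file.
sub needs_update {
    my ($source, $target) = @_;
    return 1 unless -e $target;
    my $target_age = -M $target;
    my $source_age = -M $source;
    return $target_age >= $source_age ? 1 : 0;
}
```

With this in place, the skip condition in the loop would read next unless needs_update($file, $manfile). Files with identical modification times are treated as stale, matching the strict < in the test above.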

At this point, we have a source file (either POD or Perl file), which has not yet been updated into a manpage. However, the file may still contain no POD directives. We need to look for some POD in the file. The easiest way is to look for =head at the beginning of a line. This isn't entirely accurate, but it's the same rule that the perldoc command uses, so I figure it's close enough. And that code came out like this (after a few tries):

open IN, $file
  or warn("Cannot open $file: $!, skipping\n"), next POD;
while (<IN>) {
  if (/^=head/) {    # POD sign!
    print "pod2man $file $manfile\n";
    not -e $manfile or unlink $manfile
      or warn("Cannot remove $manfile: $!\n");
    open OUT, ">$manfile"
      or warn("Cannot create $manfile: $!\n"), next POD;
    seek IN, 0, 0;
    $podmanparser->parse_from_filehandle(\*IN, \*OUT);
    close OUT;
    next POD;
  }
}
The meat is in the middle: once we've determined we have a decent POD file, we seek the file back to the beginning, and then call parse_from_filehandle to generate the manpage.
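
The detect-then-convert step can also be written with lexical filehandles and Pod::Man's parse_from_file method, which takes file names rather than open handles. This is a sketch of an alternative, not the code used above; the function names has_pod and pod_to_man are hypothetical.

```perl
use strict;
use warnings;
use Pod::Man;

# True if any line of $file starts with "=head" (the same rough test as above).
sub has_pod {
    my ($file) = @_;
    open my $in, '<', $file or return 0;
    while (my $line = <$in>) {
        return 1 if $line =~ /^=head/;
    }
    return 0;
}

# Convert $file into a manpage at $manfile for the given section.
# Returns 1 if a non-empty page was written, 0 if the file had no POD.
sub pod_to_man {
    my ($file, $manfile, $section) = @_;
    return 0 unless has_pod($file);
    my $parser = Pod::Man->new(section => $section);
    $parser->parse_from_file($file, $manfile);
    my $size = -s $manfile;
    return $size ? 1 : 0;
}
```

Because the file is reopened by name for the conversion, no seek back to the beginning is needed, at the cost of opening each documented file twice.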

So, any time I suspect that there's been a new module added to my local install, I can run this program, and my local manpage collection is updated, with minimal effort.

A simple task, simply executed by Perl, yet it solves a real problem: it lets me get at Perl's documentation with either perldoc or man, working around a vendor limitation. Most of those "gotta get it done now with no time to do it" systems administration tasks seem to be about this large, and as you can see, Perl fits the task nicely. So, until next time, enjoy!

Randal L. Schwartz is a two-decade veteran of the software industry -- skilled in software design, system administration, security, technical writing, and training. He has coauthored the "must-have" standards: Programming Perl, Learning Perl, Learning Perl for Win32 Systems, and Effective Perl Programming. He's also a frequent contributor to the Perl newsgroups, and has moderated comp.lang.perl.announce since its inception. Since 1985, Randal has owned and operated Stonehenge Consulting Services, Inc.