Commonly Used Parsing Techniques

Randal L. Schwartz

I write a lot of Perl code that contains a significant amount of text parsing -- to find the interesting bits of the text (data reduction) or to break the text into pieces (lexical analysis). In this column, I'll show how I use regular expressions to accomplish those tasks.

The simplest style of matching and parsing tests a regular expression only for its truth value, most commonly against the value in $_. For example, a line of the output of the Unix who command looks like this:

merlyn     tty42  Dec 7 19:41

which gives the username, login terminal, and the date and time of login. If I want to count how many times I've logged in, I can use:

my $count = 0;
for (`who`) { # one line is in $_
  if (/^merlyn\b/) { # line begins with merlyn
    $count++; # count it
  }
}
print "merlyn is logged in $count times\n";

The regular expression returns true if merlyn is found at the beginning of the line, otherwise false. Although a simple true/false return value is useful, more often I'll want to know something about the match. In this case, the match variables come in handy:

if (/^(merlyn|root)\b/) { # it matches
  $admins{$1}++;
}

Here, the match is looking for a line beginning with either merlyn or root, and after a successful match that word ends up in $1. If I want to capture the location as well, I can grab a bit more:

if (/^(merlyn|root)\s+(\S+)/) {
  $admins{$1} .= "$2\n";
}

If I have a successful match, the username is in $1, and the terminal is in $2. Each terminal is accumulated into a newline-terminated string for the corresponding user.
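To try this without a live who command, a minimal sketch might feed the same match a few made-up lines. The sample data below, including the user fred and the tty names, is invented for illustration and is not real who output:

```perl
use strict;
use warnings;

# Hypothetical sample lines standing in for real who output.
my @lines = (
  "merlyn     tty42  Dec 7 19:41\n",
  "root       tty01  Dec 7 08:15\n",
  "fred       tty07  Dec 7 09:30\n",
  "merlyn     tty43  Dec 7 20:02\n",
);

my %admins;
for (@lines) { # one line is in $_
  if (/^(merlyn|root)\s+(\S+)/) {
    $admins{$1} .= "$2\n"; # $1 is the user, $2 is the terminal
  }
}

# Print each admin followed by the terminals collected for that user.
for my $user (sort keys %admins) {
  print "$user:\n$admins{$user}";
}
```

The line for fred matches neither alternative, so it is silently skipped and never appears in %admins.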

At this point, the regex is messy enough that I probably should add some comments for the poor chap who has to maintain my code when I'm gone. Luckily, I can pop into extended-regex mode and include some insignificant whitespace:

if (m{
  ^          # beginning of line
  (          # start first capture
    merlyn   # literal merlyn
    |        # or
    root     # literal root
  )          # end first capture
  \s+        # some whitespace
  (          # start second capture
    \S+      # one or more non-whitespace
  )          # end second capture
}x) {
  $admins{$1} .= "$2\n";
}

Now my illustrative text is alongside the parts being referenced, which I hope will clarify exactly what's happening.

Rather than trying to remember what I meant for $1 and $2, I can name them at the time of the match:

if (my ($user, $tty) = /^(merlyn|root)\s+(\S+)/) {
  $admins{$user} .= "$tty\n";
}

The values of $1, $2, and so on are returned as a list from a successful match, and I can assign those directly to easy-to-remember variable names. If the match fails, an empty list is returned, causing the variables to become undef. More importantly, a list assignment in a scalar context (such as the true/false test of the if) yields the number of elements on its right-hand side, so a failed match yields zero and the if fails as well. I can use this to create a multi-way attempt to match text:

if (my ($user, $tty) = /^(merlyn|root)\s+(\S+)/) {
  $admins{$user} .= "$tty\n";
} elsif (($user, $tty) = /^(\S+)\s+(\S+)/) { # variables from the if are still in scope
  $users{$user} .= "$tty\n";
} else {
  warn "I don't understand $_";
}

If the first match succeeds, I have the named variables set appropriately, scoped to the if/elsif/else statement rather than leaking into the surrounding code. If that fails, Perl tries the next (more general) match, executing the appropriate code if that works. If the more general match had appeared before the specific match, the specific match would never have been triggered, so it's important to get those in the right order. If both matches fail, I get a nice error message so I know I've failed to account for some kind of line in the input.
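The ordering can be exercised with a short sketch like the following. The sample lines are hypothetical, and the else branch collects unparsed lines in an array (an assumption made here so the result is easy to inspect) instead of calling warn. The elsif reuses the variables declared in the if condition, which stay in scope for the whole statement:

```perl
use strict;
use warnings;

# Hypothetical sample lines; the last one is deliberately malformed.
my @lines = (
  "merlyn     tty42  Dec 7 19:41\n",
  "fred       tty07  Dec 7 09:30\n",
  "garbage\n",
);

my (%admins, %users, @unparsed);
for (@lines) {
  if (my ($user, $tty) = /^(merlyn|root)\s+(\S+)/) {
    $admins{$user} .= "$tty\n";   # specific pattern is tried first
  } elsif (($user, $tty) = /^(\S+)\s+(\S+)/) {
    $users{$user} .= "$tty\n";    # general pattern is tried second
  } else {
    push @unparsed, $_;           # neither pattern matched
  }
}

print "admins: @{[ sort keys %admins ]}\n";
print "users: @{[ sort keys %users ]}\n";
print "unparsed lines: ", scalar(@unparsed), "\n";
```

Swapping the two patterns would put merlyn into %users, because the general pattern also matches the admin lines.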

The output line of who has three fields that are separated by some amount of whitespace, although the third field (the date and time of login) also contains embedded whitespace. If I wanted to grab the entire line into its three parts, I'd have to use a full regular expression:

for (`who`) {
  my ($user, $tty, $time) = m{
    ^       # beginning of line
    (\S+)   # user, first capture
    \s+     # whitespace gap
    (\S+)   # tty, second capture
    \s+     # whitespace gap
    (.*)    # date/time of login, third capture
    $       # end of line
  }x or die "Weird who line: $_";
  ...; # rest of loop
}

Here, the (.*) grabs the entire remainder of the line, including the embedded whitespace. Note that I'm careful to bail out of the program on unusual lines. A gentler approach would have been to just issue a warn and then go on to the next line. Always consider what happens when text does not match.
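The gentler approach might look something like this sketch, which warns and moves on instead of dying. The sample lines and the @logins array are assumptions made for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample lines; the second one has no tty or time field.
my @lines = (
  "merlyn     tty42  Dec 7 19:41\n",
  "bogus\n",
);

my @logins;
for (@lines) {
  my ($user, $tty, $time) = m{
    ^ (\S+)     # user
    \s+ (\S+)   # tty
    \s+ (.*)    # date/time of login
    $
  }x or do {
    warn "Skipping weird who line: $_"; # complain, but keep going
    next;                               # on to the next line
  };
  push @logins, [$user, $tty, $time];
}

print "parsed ", scalar(@logins), " login(s)\n";
```

Because do BLOCK is not a loop, the next inside it applies to the enclosing for loop.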

If the line I'm parsing doesn't have mixed delimiters like that last example, but rather something consistent, like whitespace always delimiting non-whitespace data, then I choose the simpler split operator:

my @words = split /\s+/, $line;

In this case, the regular expression provides the description of the parts I throw away, leading one of my friends to call that the "deliminator". Here, I'm using \s+, not a simple \s, so that a run of consecutive whitespace is considered one big delimiter, rather than a series of delimiters around empty fields. For an example, let's look at the differences between these two entries:

my @x = split /:/,  "ab:cd:::ef:gh";
my @y = split /:+/, "ab:cd:::ef:gh";

In the first case, I get two empty fields between cd and ef, because Perl found three delimiters there. In the second case, Perl found one big fat delimiter between cd and ef, so there is no empty field present.
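A quick way to see the difference is to print each field in brackets so the empty ones become visible. This is only a small demonstration of the two statements above:

```perl
use strict;
use warnings;

my @x = split /:/,  "ab:cd:::ef:gh";
my @y = split /:+/, "ab:cd:::ef:gh";

# Bracket each field so that empty fields show up as [].
print join(" ", map { "[$_]" } @x), "\n"; # [ab] [cd] [] [] [ef] [gh]
print join(" ", map { "[$_]" } @y), "\n"; # [ab] [cd] [ef] [gh]
```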

Sometimes, it's easier to talk about what to keep instead of what to throw away. For that, I grab a //g match in a list context:

my @words = "What did he say?" =~ /([A-Za-z']+)/g;

Here, the regular expression is dragged through a string, picking up all matches, ignoring the gaps between that don't match. While I could have written this with split with some work, it gets harder to do so for the more general case.
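For this particular sentence, the two approaches can be compared side by side. The split pattern below is one possible way to describe the gaps and is an assumption, not something the //g version requires:

```perl
use strict;
use warnings;

my $text = "What did he say?";

# Keep what matches: every run of letters or apostrophes.
my @kept = $text =~ /([A-Za-z']+)/g;

# Throw away what separates: every run of anything else.
my @split = split /[^A-Za-z']+/, $text;

print "kept:  @kept\n";  # kept:  What did he say
print "split: @split\n"; # split: What did he say
```

If the string began with punctuation, the split version would return a leading empty field that would need to be filtered out, while the //g version would not.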

Another common task is to take a string and break it down into its individual parts for further analysis. For example, suppose I wanted to pull apart that previous sentence as a series of four words followed by a punctuation mark, noting each category of thing as I go along. In other words, I want to end up with an array of arrayrefs that looks like:

   my @sentence = (
     [word => "What"],
     [word => "did"],
     [word => "he"],
     [word => "say"],
     [punct => "?"],
   );

There are two main ways to do this. Back in the old days, I did this by destructively nibbling away at the string:

$_ = "What did he say?";
my @sentence;
while (length $_) {
  if (s/^([A-Za-z']+)//) {
    push @sentence, [word => $1];
  } elsif (s/^([,.?!])//) {
    push @sentence, [punct => $1];
  } elsif (s/^\s+//) {
    # ignore whitespace
  } else {
    die "I cannot parse the remainder of $_";
  }
}

As each "thing" of interest is noted, it's removed from the front of the string. Again, if I can't figure out what's there, the loop aborts.

The downside to this approach is that I'm destroying the string as I am parsing it. While that's fine for small strings (presuming I'm working on a copy and not the original), it's a real pain for larger strings for Perl to be continually removing bits and pieces from the front.
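One way to make sure the nibbling happens on a copy is to wrap the loop in a subroutine, as in this sketch. The name tokenize_by_nibbling is hypothetical:

```perl
use strict;
use warnings;

# tokenize_by_nibbling is a hypothetical helper name for illustration.
sub tokenize_by_nibbling {
  my ($text) = @_; # a copy, so the caller's string is untouched
  my @tokens;
  while (length $text) {
    if ($text =~ s/^([A-Za-z']+)//) {
      push @tokens, [word => $1];
    } elsif ($text =~ s/^([,.?!])//) {
      push @tokens, [punct => $1];
    } elsif ($text =~ s/^\s+//) {
      # ignore whitespace
    } else {
      die "I cannot parse the remainder of $text";
    }
  }
  return @tokens;
}

my $original = "What did he say?";
my @sentence = tokenize_by_nibbling($original);
print "$_->[0]: $_->[1]\n" for @sentence;
print "original is still: $original\n";
```

The copy protects the caller's data, but every substitution still shortens the copy from the front, so the cost for large strings remains.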

With modern Perl implementations, I can instead use the pos() value of the string as a virtual pointer, anchoring each new match there and inch-worming along in a similar way. Compare:

$_ = "What did he say?";
my @sentence;
until (/\G$/gc) { # until pos at end of string
  if (/\G([A-Za-z']+)/gc) {
    push @sentence, [word => $1];
  } elsif (/\G([,.?!])/gc) {
    push @sentence, [punct => $1];
  } elsif (/\G\s+/gc) {
    # ignore whitespace
  } else {
    die "I cannot parse the remainder of ", /\G(.*)/gc;
  }
}

On each of these matches, the pos() pointer for $_ is advanced if the match succeeds, or left alone if the match fails (thanks to the /c option). The pos() pointer provides the anchor for the start of each match as well, acting like the beginning of string did for the nibbling version. This is a very speedy technique and a good one to keep in mind.
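To watch the pointer move, a sketch like this one records pos() before and after each successful match. The @trace array and the shortened input string are assumptions added for illustration:

```perl
use strict;
use warnings;

my $text = "he say?";
my @trace;

until ($text =~ /\G$/gc) { # until pos at end of string
  my $start = pos($text) || 0; # pos is undef before the first success
  if ($text =~ /\G([A-Za-z']+)/gc) {
    push @trace, "word '$1' from $start to " . pos($text);
  } elsif ($text =~ /\G([,.?!])/gc) {
    push @trace, "punct '$1' from $start to " . pos($text);
  } elsif ($text =~ /\G\s+/gc) {
    push @trace, "whitespace from $start to " . pos($text);
  } else {
    die "I cannot parse at offset $start";
  }
}

print "$_\n" for @trace;
# word 'he' from 0 to 2
# whitespace from 2 to 3
# word 'say' from 3 to 6
# punct '?' from 6 to 7
```

The string itself is never modified; only the position attached to it changes.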

I hope you enjoyed this little journey into commonly used parsing techniques. Until next time, enjoy!

Randal L. Schwartz is a two-decade veteran of the software industry -- skilled in software design, system administration, security, technical writing, and training. He has coauthored the "must-have" standards: Programming Perl, Learning Perl, Learning Perl for Win32 Systems, and Effective Perl Programming. He's also a frequent contributor to the Perl newsgroups, and has moderated comp.lang.perl.announce since its inception. Since 1985, Randal has owned and operated Stonehenge Consulting Services, Inc.