Commonly
Used Parsing Techniques
Randal L. Schwartz
I write a lot of Perl code that contains a significant amount
of text parsing -- to find the interesting bits of the text
(data reduction) or to break the text into pieces (lexical analysis).
In this column, I'll show how I use regular expressions to
accomplish those tasks.
The simplest style of matching and parsing is the regular expression
being tested simply for its truth value, most commonly used against
the value in $_. For example, a line of the output of the
Unix who command looks like this:
merlyn tty42 Dec 7 19:41
which gives the username, login terminal, and the date and time of
login. If I want to count how many times I've logged in, I can
use:
my $count;
for ('who') { # one line is in $_
if (/^merlyn\b/) { # line begins with merlyn
$count++; # count it
}
}
print "merlyn is logged in $count times\n";
The regular expression returns true if merlyn is found at the
beginning of the line, otherwise false. Although a simple true/false
value return is nice and useful, more often I'll want to know
something about the match. In this case, the match variables come
in handy:
if (/^(merlyn|root)\b/) { # it matches
$admins{$1}++;
}
Here, the match is looking for a line beginning with either merlyn
or root, and that word ends up in $1, because of a valid
match. If I want to capture the location as well, I can grab a bit
more:
if (/^(merlyn|root)\s+(\S+)/ {
$admins{$1} .= "$2\n";
}
If I have a successful match, the username is in $1, and the
terminal is in $2. Each terminal is accumulated into a newline-terminated
string for the corresponding user.
At this point, the regex is messy enough that I probably should
add some comments for the poor chap who has to maintain my code
when I'm gone. Luckily, I can pop in to extended-regex mode
and include some insignificant whitespace:
if (m{
^ # beginning of line
( # start $1
merlyn # merlyn
| # or
root # root
) # end $1
\s+ # some whitespace
( # start $2
\S+ # one or more non-whitespace
) # end $2
}x) {
$admins{$1}++;
}
Now my illustrative text is alongside the parts being referenced,
which I hope will clarify exactly what's happening.
Rather than trying to remember what I meant for $1 and
$2, I can name them at the time of the match:
if (my ($user, $tty) = /^(merlyn|root)\s+(\S+)/) {
$admins{$user} .= "$tty\n";
}
Each of the $1, $2 (and so on) variables are returned
as a list from a successful match, and I can assign those directly
to easy-to-remember variable names. If the match fails, an empty list
is returned, causing the variables to become undef. More importantly,
a failed match means I have a zero-length list-assignment that, when
viewed in a scalar context (as in when I am looking for a true/false
value for the if), means the if also fails. I can use
this to create a multi-way attempt to match text:
if (my ($user, $tty) = /^(merlyn|root)\s+(\S+)/) {
$admins{$user} .= "$tty\n";
} elsif (my ($user, $tty) = /^(\S+)\s+(\S+)/) {
$users{$user} .= "$tty\n";
} else {
warn "I don't understand $_";
}
If the first match succeeds, I have the named variables set appropriately,
localized to that branch of the match. If that fails, Perl tries the
next (more general) match, executing the appropriate code if that
works. If the more general match had appeared before the specific
match, the specific match would never have been triggered, so it's
important to get those in the right order. If both matches fail, I
get a nice error message so I know I've failed to account for
some kind of line in the input.
The output line of who has three fields that are separated
by some amount of whitespace, although the third field (the date
and time of login) also contains embedded whitespace. If I wanted
to grab the entire line into its three parts, I'd have to use
a full regular expression:
for ('who') {
my ($user, $tty, $time) = m{
^ # beginning of line
(\S+) # user as $1
\s+ # whitespace gap
(\S+) # tty as $2
\s+ # whitespace gap
(.*) # date/time of login as $3
$ # end of line
}x or die "Weird who line: $!";
...; # rest of loop
}
Here, the (.*) grabs the entire remainder of the line, including
the embedded whitespace. Note that I'm careful to bail out of
the program on unusual lines. A gentler approach would have been to
just issue a warn and then go on to the next line. Always consider
what happens when text does not match.
If the line I'm parsing isn't "mixed delimiters"
like that last example, but rather something consistent, like whitespace
always delimiting non-whitespaced data, then I choose the simpler
split operator:
my @words = split /\s+/, $line;
In this case, the regular expression provides the description of the
parts I throw away, leading one of my friends to call that the "deliminator".
Here, I'm using \s+, not a simple \s, so that a
run of consecutive whitespace is considered one big delimiter, rather
than a series of delimiters around empty fields. For an example, let's
look at the differences between these two entries:
my @x = split /:/, "ab:cd:::ef:gh";
my @y = split /:+/, "ab:cd:::ef:gh";
In the first case, I get two empty fields between cd and ef,
because Perl found three delimiters there. In the second case, Perl
found one big fat delimiter between cd and ef, so there
is no empty field present.
Sometimes, it's easier to talk about what to keep instead
of what to throw away. For that, I grab a //g match in a
list context:
my @words = "What did he say?" =~ /([A-Za-z']+)/g;
Here, the regular expression is dragged through a string, picking
up all matches, ignoring the gaps between that don't match. While
I could have written this with split with some work, it gets
harder to do so for the more general case.
Another common task is to take a string and break it down into
its individual parts for further analysis. For example, suppose
I wanted to pull apart that previous sentence as a series of four
words followed by a punctuation mark, noting each category of thing
as I go along. In other words, I want to end up with an array of
arrayrefs that looks like:
my @sentence = (
[word => "What"],
[word => "did"],
[word => "he"],
[word => "say"],
[punct => "?"],
);
There are two main ways to do this. Back in the old days, I did this
by destructively nibbling away at the string:
$_ = "What did he say?";
my @sentence;
while (length $_) {
if (s/^([A-Za-z']+)//) {
push @sentence, [word => $1];
} elsif (s/^([,.?!])//) {
push @sentence, [punct => $1];
} elsif (s/^\s+//) {
# ignore whitespace
} else {
die "I cannot parse the remainder of $_";
}
}
As each "thing" of interest is noted, it's removed
from the front of the string. Again, if I can't figure out what's
there, the loop aborts.
The downside to this approach is that I'm destroying the
string as I am parsing it. While that's fine for small strings
(presuming I'm working on a copy and not the original), it's
a real pain for larger strings for Perl to be continually removing
bits and pieces from the front.
With modern Perl implementations, I can instead use the pos()
value of the string as a virtual pointer, anchoring each new match
there and inch-worming along in a similar way. Compare:
$_ = "What did he say?";
my @sentence;
until (/\G$/gc) { # until pos at end of string
if (/\G([A-Za-z']+)/gc) {
push @sentence, [word => $1];
} elsif (/\G([,.?!])/gc) {
push @sentence, [punct => $1];
} elsif (/\G\s+/gc) {
# ignore whitespace
} else {
die "I cannot parse the remainder of ", /\G(.*)/gc;
}
}
On each of these matches, the pos() pointer for $_ is
advanced if the match succeeds, or left alone if the match fails (thanks
to the /c option). The pos() pointer provides the anchor
for the start of each match as well, acting like the beginning of
string did for the nibbling version. This is a very speedy technique
and a good one to keep in mind.
I hope you enjoyed this little journey into commonly used parsing
techniques. Until next time, enjoy!
Randal L. Schwartz is a two-decade veteran of the software
industry -- skilled in software design, system administration,
security, technical writing, and training. He has coauthored the
"must-have" standards: Programming Perl, Learning
Perl, Learning Perl for Win32 Systems, and Effective
Perl Programming. He's also a frequent contributor to the
Perl newsgroups, and has moderated comp.lang.perl.announce since
its inception. Since 1985, Randal has owned and operated Stonehenge
Consulting Services, Inc. |