Compiling Regular Expressions
Randal L. Schwartz
Perl's regular expressions set Perl apart from most other scripting languages. Some features (like the positive and negative lookahead, and the optional lazy evaluation of multipliers) make matching troublesome strings trivial in Perl. This power has not gone unnoticed by the open software community -- a package called PCRE available on the 'Net (use your favorite search engine) claims to implement "a regular expression matching engine that's compatible with Perl 5.004's syntax".
Just like the strings they are matching, regular expressions can come from many sources. The most common source is inside the program:
@ARGV = qw(/usr/dict/words);
while (<>) {
print if /foo/ || /bar/;
}
This little guy ran in about a quarter of a CPU second for me, and generated a nice list of words that contain foo and bar. Notice that I wrote /foo/ and /bar/ as separate regular expressions, instead of the seemingly identical /foo|bar/. Why did I do that? Experience. As reported by the following program:
@ARGV = qw(/usr/dict/words);
@words = <>;
use Benchmark;
timethese (10 => {
'expression or' =>
'@x = grep /foo/ || /bar/, @words',
'regex or' =>
'@x = grep /foo|bar/, @words',
});
We get the following output from Benchmark:
Benchmark: timing 10 iterations of expression or, regex or...
expression or: 1 wallclock secs ( 0.97 usr + 0.00 sys = 0.97 CPU)
regex or: 3 wallclock secs ( 2.87 usr + 0.00 sys = 2.87 CPU)
This shows that for this example using the regular expression | operator was more than twice as slow. There are certain optimizations that kick in when the text to be matched is a constant string that cannot be done with something more complex, or so says the online documentation. The exact list of optimized things varies from release to release of Perl, so Benchmark is your friend.
Often, the regular expression will come from a computed string, such as the value of a Web form field or a command-line parameter or the result of a prompted query. The Perl syntax lets us interpolate a variable into a regular expression, permitting the regular expression to be created at runtime:
@ARGV = qw(/usr/dict/words);
@words = <>;
$regex1 = "foo";
@result = grep /$regex1/, @words;
In order to be useful, a regular expression first must be "compiled". This is often an expensive step compared to the actual matching of the string to the regular expression. So, usually, this compilation happens while the Perl program itself is parsed and compiled, before the execution of the program. However, when part of a regular expression is a variable (such as the example above), Perl can't do that, and defers the regular expression compilation until execution time.
While this provides a powerful option (creating a regular expression at runtime, it comes with a performance penalty if used incorrectly). Let's test this out:
@ARGV = qw(/usr/dict/words);
@words = <>;
push @words, @words;
push @words, @words;
use Benchmark;
$foo = '[f][o][o]';
$bar = '[b][a][r]';
timethese (5 => {
'constant' =>
'@x = grep /[f][o][o]/ || /[b][a][r]/, @words',
'one variable' =>
'@x = grep /$foo/ || /[b][a][r]/, @words',
'two variables' =>
'@x = grep /$foo/ || /$bar/, @words',
});
Here's the result:
Benchmark: timing 5 iterations of constant, one variable, two
variables...
constant: 3 wallclock secs ( 2.86 usr + 0.00 sys = 2.86 CPU)
one variable: 4 wallclock secs ( 3.49 usr + 0.00 sys = 3.49 CPU)
two variables: 4 wallclock secs ( 4.11 usr + 0.00 sys = 4.11 CPU)
Notice that we're paying a penalty for the recompilation of the regular expression. (I made the regular expression a little more complicated and used a little more data to make it obvious.)
Why is this? Well, on each match between an element of @words and the regular expression defined with a variable, Perl has to recompile the regular expression over, and over, and over again.
One way to fix this is to use the "once-only" modifier on a regular expression. By appending a o suffix to the regular expression match operator, any deferred compilation will be executed only once per program invocation. Let's see if this helps:
@ARGV = qw(/usr/dict/words);
@words = <>;
push @words, @words;
push @words, @words;
use Benchmark;
$foo = '[f][o][o]';
$bar = '[b][a][r]';
timethese (5 => {
'constant' =>
'@x = grep /[f][o][o]/ || /[b][a][r]/, @words',
'two variables' =>
'@x = grep /$foo/ || /$bar/, @words',
'two variables - opt' =>
'@x = grep /$foo/o || /$bar/o, @words',
});
When we see the results, we confirm that it has helped significantly:
Benchmark: timing 5 iterations of constant, two variables, two
variables - opt...
constant: 3 wallclock secs ( 2.86 usr + 0.01 sys = 2.87 CPU)
two variables: 4 wallclock secs ( 4.15 usr + 0.01 sys = 4.16 CPU)
two variables - opt: 3 wallclock secs ( 2.98 usr + 0.00 sys =
2.98 CPU)
Yes, now we're getting close to the original compiled-in regular expression speed. But there's a downside to using this once-only flag -- we can't change the regular expression after it has been used once.
So, code like this is fine:
$var = param('search_for');
@results = grep /$var/o, @input;
But code like this is very broken and hard to track down:
@ARGV = qw(/usr/dict/words);
@words = <>;
for $item (qw(foo bar)) {
@results = grep /$item/o, @words;
print @results. " words match $item\n";
}
This prints:
43 words match foo
43 words match bar
instead of the proper answer:
43 words match foo
131 words match bar
which I got after removing the o suffix. That's because the foo string got memorized into the regular expression compilation, even after $item changed for the second iteration.
So, what are we to do? How do we get the speed of a once-compiled regular expression but still be able to loop through many search patterns?
One way is to use an anonymous subroutine that is compiled once for each change in the pattern variable. That'd look like this:
@ARGV = qw(/usr/dict/words);
@words = <>;
for $item (qw(foo bar)) {
$match = eval 'sub { $_[0] =~ /$item/o }';
@results = grep $match->($_), @words;
print @results. " words match $item\n";
}
which again prints the right values. What's happening here is a bit weird -- we're still using $item as the pattern, but because each iteration of the loop recompiles the anonymous subroutine (referenced by $match), the once-only flag effectively resets.
Of course, we're throwing away the result of the compilation on the next iteration of the loop, but at least we're not recompiling for each item in @words.
You can even make a subroutine that makes these subroutines:
sub make_match {
my $item = shift;
eval 'sub { $_[0] =~ /$item/o }';
$match_foo = make_match "foo";
$match_bar = make_match "bar";
@foo_words = grep $match_foo->($_), @words;
@bar_words = grep $match_bar->($_), @words;
Here, the reference to $item in the anonymous subroutine generates a "closure", which remembers the value of $item independently as long as the subroutine is alive.
And then, there's Perl 5.005 to the rescue! The qr// quoting operator was introduced in this latest version of Perl. The purpose of the operator is to provide a way to compile regular expressions and pass the compiled values around in variables, rather than the original source strings.
Let's fix that search again:
@ARGV = qw(/usr/dict/words);
@words = <>;
for $item (qw(foo bar)) {
$compiled = qr/$item/;
@results = grep /$compiled/, @words;
print @results. " words match $item\n";
}
Ahh yes, the right answer again. And Perl compiles the regular expression once, rather than each time the string is seen. We could even compile them all at once:
@patterns = map { qr/$_/ } qw(foo bar);
and then use @patterns interpolated as we would have used the original strings.
I hope you've enjoyed this little compilation about compiling regular expressions. Until next time, enjoy.
About the Author
Randal L. Schwartz is a two-decade veteran of the software industry -- skilled in software design, system administration, security, technical writing, and training. He has co-authored Programming Perl, Learning Perl , Learning Perl for Win32 Systems, and Effective Perl Programming, as well as writing regular columns for WebTechniques, Performance Computing, Sys Admin, and Linux magazines. Since 1985, Randal has owned and operated Stonehenge Consulting Services, Inc. Randal can be reached for comment at: merlyn@stonehenge.com and welcomes questions on Perl and other related topics. |