How to Match Common Things
Randal L. Schwartz
Regular expressions can be handy, when used correctly, to distinguish
things of interest from among the strings in which they are hiding
and to reject strings that don't belong. These typical uses
for text manipulation and input validation result in a lot of common
regular expressions to solve these frequent tasks. However, I often
see mistakes in selecting and applying a regular expression, so
let's take a look at some of the more common mistakes. As I
go through the examples, I'll presume the string to be validated
is in $_ just to keep the examples simple, and I'll
also use the slash delimiters (except where otherwise noted) for
the regular expressions.
For example, one frequent check is to determine whether a string
contains a positive integer. If I weren't thinking properly,
I might start with something like /[0-9]+/ to say "one
or more digits". I can simplify this to /\d+/, but that's
still wrong, because the match isn't anchored. This
means that the regular expression will match as long as the string
contains the regular expression, including things like "abc123de".
Oops.
So, the next step is to add anchors. Locking the regular expression
down to both the beginning and ending of the string typically looks
like /^\d+$/. However, this is still wrong, even though I
frequently see this solution. The problem is that $ can match
either before or after a final newline in the string, so this regular
expression can match "123\n" as well as "123". Again,
oops!
Luckily, modern Perl versions provide the \z anchor, which
really does mean "end of string" always. So, the proper
answer is /^\d+\z/. Or is it? Although deprecated, the $*
variable controls the matching of ^ and $ to permit
internal newline matches as well as end-of-string matches. If that
variable is set, the string "foobar\n123" will also match
our new regular expression. Oops again. So the proper answer is
/\A\d+\z/, which says, "beginning of string" followed
by "one or more digits" followed by "end of string".
Precisely, and accurately. Finally!
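To make the progression concrete, here is a small sketch that runs each candidate pattern against a few test strings. The sub name accepted_by and the hash keys are made up for illustration. Because modern Perls no longer support $*, the sketch assumes the /m modifier as a stand-in, since it changes ^ in the same way.

```perl
use strict;
use warnings;

# The candidate patterns discussed above, from least to most strict.
# Assumption: /m stands in for a set $* variable, which modern Perls
# no longer support; both let ^ match after an internal newline.
my %re = (
    unanchored => qr/\d+/,
    dollar     => qr/^\d+$/,
    multiline  => qr/^\d+\z/m,
    strict     => qr/\A\d+\z/,
);

# Returns the names (in sorted order) of the patterns that accept the string.
sub accepted_by {
    my ($string) = @_;
    my @names = grep { $string =~ $re{$_} } sort keys %re;
    return @names;
}

for my $string ("123", "abc123de", "123\n", "foobar\n123") {
    (my $shown = $string) =~ s/\n/\\n/g;    # make newlines visible
    printf "%-14s accepted by: %s\n",
        qq{"$shown"}, join(", ", accepted_by($string));
}
```

Only "123" gets past the strict pattern; each of the other strings slips through at least one of the looser ones.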
Now, at what point in this list of progressive regular expressions
were you surprised? If not until the end, good for you. But I hope
you can see that regular expressions are a bit trickier than they
seem.
As an alternative to all those special characters, I might just
consider using a negative match against /\D/: that is, the
string is fine as long as it doesn't contain a non-digit. But
that's not precisely equivalent to the anchored match. See if you
can figure out the one string that matches neither /\D/ nor
/\A\d+\z/ before reading on.
That's right, the empty string! Again, you need to decide
exactly what you want to match and how you want to match it. Regular
expressions are powerful, but as I recently heard in a movie, "with
great power comes great responsibility".
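The difference between the two tests is easy to check. In this minimal sketch the sub names has_no_nondigit and is_all_digits are hypothetical, chosen only to label the two approaches.

```perl
use strict;
use warnings;

# Two ways to ask "is this all digits?". They disagree on one input.
sub has_no_nondigit { my ($s) = @_; return $s !~ /\D/      ? 1 : 0 }
sub is_all_digits   { my ($s) = @_; return $s =~ /\A\d+\z/ ? 1 : 0 }

for my $s ("123", "12a", "") {
    printf "%-6s no-nondigit=%d all-digits=%d\n",
        qq{"$s"}, has_no_nondigit($s), is_all_digits($s);
}
```

The empty string passes the negative test but fails the anchored one, so the negative test needs an extra length check if empty input should be rejected.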
I don't think more than a few weeks go by before I see someone
attempt to match or validate "an email address" by using
an incorrect regular expression. Most people who are trying to validate
an email address apparently have never heard of the RFCs, such as
RFC 822, which has defined the standard Internet email address since
1982 (superseding RFC 733 before that).
Because they base the email address on only what they've
seen, they write broken regular expressions such as: /^\w+\@[\w.]+$/.
The attempt here is to match word characters (alphanumerics and
underscore) for both the local part (what we often call the
user name), and the hostnames (to the right of the at sign).
Just starting with the hostname mistake, this excludes the hyphen
(which is a valid hostname character) and includes the underscore
(which isn't). Oops.
But even if you got that part right, through careful examination
of hostnames, paying attention to the two-character, top-level domains
for countries, and the three-, four-, and now more-character, top-level
abstract domains, the big failure here is the left side.
RFC 822 is very liberal about what is accepted for the local part.
Basically, to the left of the at sign, we see in the RFC that
the definition of "local-part" is one or more period-connected
"words", and that a word is either an atom or a quoted
string, and that an atom is anything that doesn't contain
whitespace, control characters, or one of the special characters
(those matching /[()<>\@,;:\\".\[\]]/).
Wait? Does this mean that:
Randal.L.Schwartz@stonehenge.com
is a valid email address? Yes! And that's already not matched
by our previous regex. But even more, it means that my friend Eli-The-Bearded,
who uses:
*@qz.to
as his email address is also using a valid email address!
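Here is a quick sketch of how a word-characters-only pattern fares. The pattern in $naive is written in the spirit of the broken one quoted earlier, not copied from any real program, and every address after the first two is made up for illustration.

```perl
use strict;
use warnings;

# A word-characters-only pattern in the spirit of the broken one above.
my $naive = qr/^\w+\@[\w.]+$/;

sub naive_accepts {
    my ($address) = @_;
    return $address =~ $naive ? 1 : 0;
}

# The first two addresses come from the text; the rest are hypothetical.
my @addresses = (
    'Randal.L.Schwartz@stonehenge.com',
    '*@qz.to',
    'fred@example.com',
    'fred@my-host.example.com',     # hyphen is legal in a hostname
    'fred@bad_host.example.com',    # underscore is not
);

for my $address (@addresses) {
    printf "%-34s %s\n", $address,
        naive_accepts($address) ? "accepted" : "rejected";
}
```

The pattern rejects both valid addresses from the text and the hyphenated hostname, yet accepts the hostname containing an underscore.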
Now, if you showed the first address to someone who wrote that
first regular expression, they might quickly "patch up"
that pattern to match periods as well as \w+. But that wouldn't
be sufficient to match Eli's address. And it wouldn't
work on addresses like:
gateway."[foo]@bar//35"@relay.machine.oldcompany.com
where the local part contains quoted parts that need to be properly
passed over when looking for at signs and so on. To do that, you'd
need to create something that mimics the RFC definition (local part
is a series of period-connected words, and each word is either a non-special
string or a quoted string).
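A rough sketch of what mimicking that definition might look like follows. The variable names and the sub split_address are hypothetical. The sketch simplifies in several ways: it does not restrict atoms to ASCII, it does not validate the domain at all, and it ignores comments entirely.

```perl
use strict;
use warnings;

# Atom: a run of characters that are not whitespace, controls, or specials.
my $atom   = qr/[^\s()<>\@,;:\\".\[\]\x00-\x1F\x7F]+/;
# Quoted string: double quotes around ordinary text or backslash pairs.
my $quoted = qr/"(?:[^"\\\r]|\\.)*"/s;
# Word: either form. Local part: words connected by periods.
my $word   = qr/(?:$atom|$quoted)/;
my $local  = qr/$word(?:\.$word)*/;

# Returns (local part, everything after the at sign), or an empty list
# if the address does not begin with a well-formed local part.
sub split_address {
    my ($address) = @_;
    return $address =~ /\A($local)\@(.+)\z/s ? ($1, $2) : ();
}

for my $address ('*@qz.to',
                 'gateway."[foo]@bar//35"@relay.machine.oldcompany.com') {
    my ($left, $right) = split_address($address);
    print "local part: $left\n  the rest: $right\n";
}
```

Because the quoted string is consumed as a unit, the at sign inside the quotes is passed over and the split happens at the right one.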
But that still wouldn't solve the last problem. RFC 822 permits
comments in the email address, enclosed in balanced parentheses.
The example given in the RFC is:
Muhammed.(I am the greatest) Ali @(the)Vegas.WBA
That is, the address is actually Muhammed.Ali@Vegas.WBA, but
the parenthesized parts are legally part of the email address, although
ignored.
Well, that still doesn't look hard, because comments are
permitted only between tokens. But the biting part of the specification
is the word "balanced". If the parentheses can be nested,
there's no way to get a normal regular expression to match
it! (Yes, recent Perl versions have some extra tricks to get Perl
code to execute during the matching of a regular expression, which
would help us solve this, but let's rule that out for now.)
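For the curious, here is a sketch of one such extra trick: recursive patterns, which this sketch assumes are available (Perl 5.10 or later). The sub name strip_comments and the group name comment are made up. The sketch replaces each balanced comment with a single space, and it deliberately ignores the complication that parentheses inside a quoted string are not comments.

```perl
use 5.010;
use strict;
use warnings;

# Replace each balanced, possibly nested, parenthesized comment with a
# single space. The named group "comment" calls itself via (?&comment)
# so that nested parentheses are matched as a unit.
# Simplification: parentheses inside quoted strings are not special-cased.
sub strip_comments {
    my ($address) = @_;
    $address =~ s/(?<comment>\((?:[^()\\]|\\.|(?&comment))*\))/ /gs;
    return $address;
}

print strip_comments('Muhammed.(I am the greatest) Ali @(the)Vegas.WBA'), "\n";
print strip_comments('fred(nested (comment) here)@example.com'), "\n";
```

Unbalanced parentheses simply fail to match, so such a string comes back unchanged.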
Even if we pre-process these comments and replace them with a
single whitespace, the resulting regular expression (shown at http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html,
for example) is more than 6000 characters long. Not something you're
going to cut and paste into each program, but thankfully we don't
need to do that.
Instead, we can simply pull in Email::Address or Email::Valid
from the CPAN. These modules encapsulate the rules for an RFC 822
valid email address. That's the way to get it right.
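As a minimal sketch of the module route, the following wraps Email::Valid's address method, which returns the address when it is acceptable and undef when it is not. The wrapper name looks_valid is hypothetical, and the sketch assumes the module may not be installed, in which case the wrapper returns undef rather than dying.

```perl
use strict;
use warnings;

# Returns 1 (valid), 0 (invalid), or undef if Email::Valid is unavailable.
sub looks_valid {
    my ($address) = @_;
    # Load the module at run time so the sketch still runs without it.
    return undef unless eval { require Email::Valid; 1 };
    return defined(Email::Valid->address($address)) ? 1 : 0;
}

for my $address ('fred@example.com', 'not an address') {
    my $verdict = looks_valid($address);
    printf "%-18s %s\n", $address,
          !defined $verdict ? "Email::Valid not installed"
        : $verdict          ? "valid"
        :                     "invalid";
}
```

The point of the design is that the program carries no email pattern of its own; the rules live in the module, where they can be fixed once for everybody.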
Other useful regular expressions have been rolled into one module,
Regexp::Common. For example, to match all HTTP URIs in a
string, we can say:
use Regexp::Common qw(URI);
while (/($RE{URI}{HTTP})/g) {
    print "the string contains the URI $1\n";
}
Again, we don't have to spend time staring at specifications;
someone has done the work for us.
So, I hope I've scared you enough that you won't go
inventing regular expressions on your own without looking around
a bit for someone else who has gone ahead of you on the problem.
Look at the CPAN first, learn to read every part of a regular expression,
and ask around to see if your solution makes sense. Until next time,
enjoy!
Randal L. Schwartz is a two-decade veteran of the software
industry -- skilled in software design, system administration,
security, technical writing, and training. He has coauthored the
"must-have" standards: Programming Perl, Learning
Perl, Learning Perl for Win32 Systems, and Effective
Perl Programming. He's also a frequent contributor to the
Perl newsgroups, and has moderated comp.lang.perl.announce since
its inception. Since 1985, Randal has owned and operated Stonehenge
Consulting Services, Inc.