How to Match Common Things

Randal L. Schwartz

Regular expressions can be handy, when used correctly, to distinguish things of interest from among the strings in which they are hiding and to reject strings that don't belong. These typical uses for text manipulation and input validation result in a lot of common regular expressions to solve these frequent tasks. However, I often see mistakes in selecting and applying a regular expression, so let's take a look at some of the more common mistakes. As I go through the examples, I'll presume the string to be validated is in $_ just to keep the examples simple, and I'll also use the slash delimiters (except where otherwise noted) for the regular expressions.

For example, one frequent check is to determine whether a string contains a positive integer. If I weren't thinking properly, I might start with something like /[0-9]+/ to say "one or more digits". I can simplify this to /\d+/, but that's still wrong, because the match isn't anchored. This means that the regular expression will match as long as the string contains the regular expression, including things like "abc123de". Oops.

So, the next step is to add anchors. Locking the regular expression down to both the beginning and ending of the string typically looks like /^\d+$/. However, this is still wrong, even though I frequently see this solution. The problem is that $ can match either before or after a final newline in the string, so this regular expression can match "123\n" as well as "123". Again, oops!

Luckily, modern Perl versions provide the \z anchor, which really does mean "end of string" always. So, the proper answer is /^\d+\z/. Or is it? Although deprecated, the $* variable controls the matching of ^ and $ to permit internal newline matches as well as end-of-string matches. If that variable is set, the string "foobar\n123" will also match our new regular expression. Oops again. So the proper answer is /\A\d+\z/, which says, "beginning of string" followed by "one or more digits" followed by "end of string". Precisely, and accurately. Finally!

Now, at what point in this list of progressive regular expressions were you surprised? If not until the end, good for you. But I hope you can see that regular expressions are a bit trickier than they seem.

As an alternative to all those special characters, I might just consider using a negative match against /\D/: that is, the string is fine as long as it doesn't contain a non-digit. But that's not precisely the opposite. See if you can figure out the one string that matches neither /\D/ nor /\A\d+\z before reading on.

That's right, the empty string! Again, you need to decide exactly what you want to match and how you want to match it. Regular expressions are powerful, but as I recently heard in a movie, "with great power comes great responsibility".

I don't think more than a few weeks go by before I see someone attempt to match or validate "an email address" by using an incorrect regular expression. Most people who are trying to validate an email address apparently have never heard of the RFCs, such as RFC 822, which has defined the standard Internet email address since 1982 (invalidating RFC 733 before that).

Because they base the email address on only what they've seen, they write broken regular expressions such as: /^\w+\@[\w+.]$/. The attempt here is to match word characters (alphanumerics and underscore) for both the local part (what we often call the user name), and the hostnames (to the right of the at sign). Just starting with the hostname mistake, this excludes the hyphen (which is a valid hostname character) and includes the underscore (which isn't). Oops.

But even if you got that part right, through careful examination of hostnames, paying attention to the two-character, top-level domains for countries, and the three-, four-, and now more-character, top-level abstract domains, the big failure here is the left side.

RFC 822 is very liberal about what is accepted for the local part. Basically, to the left of the equals sign, we see in the RFC that the definition of "local-part" is one or more period-connected "words", and that a word is either an atom or a quoted string, and that an atom is everything that doesn't contain whitespace or one of the special characters (matching /<>\@,;:\\"\.\[\]/).

Wait? Does this mean that:

Randal.L.Schwartz@stonehenge.comm

is a valid email address? Yes! And that's already not matched by our previous regex. But even more, it means that my friend Eli-The-Bearded, who uses:

*@qz.to

as his email address is also using a valid email address!

Now, if you showed the first address to someone who wrote that first regular expression, they might quickly "patch up" that pattern to match periods as well as \w+. But that wouldn't be sufficient to match Eli's address. And it wouldn't work on addresses like:

gateway."[foo]@bar//35"@relay.machine.oldcompany.com

where the local part contains quoted parts that need to be properly passed over when looking for at signs and so on. To do that, you'd need to create something that mimics the RFC definition (local part is a series of period-connected words, and each word is either a non-special string or a quoted string).

But that still wouldn't solve the last problem. RFC822 permits comments in the email address, enclosed in balanced parentheses. The example given in the RFC is:

Muhammed.(I am  the greatest) Ali @(the)Vegas.WBA

That is, the address is actually Muhammed.Ali@Vegas.WBA, but the parenthesized parts are legally part of the email address, although ignored.

Well, that still doesn't look hard, because comments are permitted only between tokens. But the biting part of the specification is the word "balanced". If the parentheses can be nested, there's no way to get a normal regular expression to match it! (Yes, recent Perl versions have some extra tricks to get Perl code to execute during the matching of a regular expression, which would help us solve this, but let's rule that out for now.)

Even if we pre-process these comments and replace them with a single whitespace, the resulting regular expression (shown at http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html, for example) is more than 6000 characters long. Not something you're going to cut and paste into each program, but thankfully we don't need to do that.

Instead, we can simply pull in Email::Address or Email::Valid from the CPAN. These modules encapsulate the rules for an RFC822 valid email address. That's the way to get it right.

Other useful regular expressions have been rolled into one module, Regexp::Common. For example, to match all HTTP URIs in a string, we can say:

use Regexp::Common qw(URI);

while (/($RE{URI}{HTTP})/g) {
  print "the string contains the URI $1\n";
}

Again, we don't have to spend time staring at specifications; someone has done the work for us.

So, I hope I've scared you enough that you won't go inventing regular expressions on your own without looking around a bit for someone else who has gone ahead of you on the problem. Look at the CPAN first, learn to read every part of a regular expression, and ask around to see if your solution makes sense. Until next time, enjoy!

Randal L. Schwartz is a two-decade veteran of the software industry -- skilled in software design, system administration, security, technical writing, and training. He has coauthored the "must-have" standards: Programming Perl, Learning Perl, Learning Perl for Win32 Systems, and Effective Perl Programming. He's also a frequent contributor to the Perl newsgroups, and has moderated comp.lang.perl.announce since its inception. Since 1985, Randal has owned and operated Stonehenge Consulting Services, Inc.