Data Validation with Regular Expressions

One of the best ways to get a better feel for how regular expressions work is to see some examples and to try them for ourselves. In this section, we use some of the data validation tasks in our web applications to demonstrate how we can use regular expressions to perform these tasks.

Validating Usernames

When a new user is creating an account with our system, we might require the user to create a new username. We often place a number of restrictions on this username, such as that it must consist entirely of alphanumeric characters (that is, ASCII letters and numbers only), must not contain punctuation or whitespace, and must be between 8 and 50 characters long.

The regular expression for this turns out to be pretty simple: [[:alnum:]]{8,50}. If we want to relax our restrictions a little bit and allow a few other characters in usernames, namely spaces, underscores (_), and dashes (-), we can change the regular expression to [[:alnum:] _-]{8,50}.

We can then use the ereg function to actually make sure that a username conforms to this pattern, as follows:

<?php

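  // ereg takes the pattern first and the string to test second. The ^ and $
  // anchors require the whole username to match, not just a part of it.
  $valid = ereg('^[[:alnum:] _-]{8,50}$', $_POST['user_name']);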
  if (!$valid)
  {
    // show error saying username is invalid ...
  }

?>
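
The ereg functions were deprecated in PHP 5.3 and removed in PHP 7.0, so on newer installations the same check has to be written with preg_match, which accepts the same bracketed POSIX classes. The following is only a minimal sketch of that approach; the function name is_valid_username is hypothetical and does not come from the text above.

```php
<?php

// Hypothetical helper (an assumption for illustration): returns true when the
// whole string is 8 to 50 letters, digits, spaces, underscores, or dashes.
function is_valid_username($name)
{
    // ^ and $ anchor the match to the entire string; the D modifier keeps
    // $ from also matching just before a trailing newline.
    return preg_match('/^[[:alnum:] _-]{8,50}$/D', $name) === 1;
}

var_dump(is_valid_username('happy_bunny-42'));  // bool(true)
var_dump(is_valid_username('short'));           // bool(false)
var_dump(is_valid_username('bad!username'));    // bool(false)
```

Comparing the result with 1 matters because preg_match returns 0 for no match and false on error.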

Matching Phone Numbers

A slightly more interesting example would be to match U.S. and Canadian telephone numbers. In their most basic form, these are a sequence of seven digits, usually split into groups of three and four by a character such as a space, a dash (-), or a dot (.). A regular expression for this would be as follows:

[0-9]{3,3}[-. ]?[0-9]{4,4}

This simple expression says match exactly three digits ([0-9]{3,3}), followed optionally by a single dash, period, or space ([-. ]?), and then match exactly four more digits ([0-9]{4,4}).

Adding the area code, which is itself a three-digit number, is a bit more interesting. It can be wrapped in parentheses, or left unwrapped and separated from the other digits by a space, dash, or dot, so our regular expression begins to get more complicated. The new portion of the expression to match the area code will look like this:

\(?[0-9]{3,3}\)?[-. ]?

Because the ( and ) characters are used by regular expressions, we have to escape them with the backslash (\) to use them as characters we want to match. Our complete regular expression thus far would be this:

\(?[0-9]{3,3}\)?[-. ]?[0-9]{3,3}[-. ]?[0-9]{4,4}

If you look closely at the preceding expression, however, you should see that in addition to correctly matching strings such as (###)###-####, it also matches strings such as (###)-###-####, which might not be what we want. To improve this, we could use some grouping:

(\(?[0-9]{3,3}\)?|[0-9]{3,3}[-. ]?)[0-9]{3,3}[-. ]?[0-9]{4,4}

The new area code portion of the expression

(\(?[0-9]{3,3}\)?|[0-9]{3,3}[-. ]?)

consists of the same two parts it did before, but now they are in a group (denoted by the unescaped ( and )), and the | character indicates that only one of the two can occur.
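
To see the difference between the two versions concretely, a sketch such as the following runs both expressions against a few sample strings. It assumes preg_match rather than ereg, anchors each pattern with ^ and $ so the whole input must match, and uses a hypothetical function name.

```php
<?php

// Hypothetical helper: reports whether a phone number matches the ungrouped
// and the grouped expression. Both patterns are anchored (an assumption;
// the expressions in the text are shown without anchors).
function compare_area_code_patterns($number)
{
    $ungrouped = '/^\(?[0-9]{3,3}\)?[-. ]?[0-9]{3,3}[-. ]?[0-9]{4,4}$/D';
    $grouped   = '/^(\(?[0-9]{3,3}\)?|[0-9]{3,3}[-. ]?)'
               . '[0-9]{3,3}[-. ]?[0-9]{4,4}$/D';

    return array(
        'ungrouped' => preg_match($ungrouped, $number) === 1,
        'grouped'   => preg_match($grouped, $number) === 1,
    );
}

foreach (array('(555)555-1234', '(555)-555-1234', '555-555-1234') as $number) {
    $result = compare_area_code_patterns($number);
    echo $number,
         ' ungrouped: ', $result['ungrouped'] ? 'match' : 'no match',
         ', grouped: ',  $result['grouped'] ? 'match' : 'no match', "\n";
}
```

Only the second sample, (555)-555-1234, is treated differently: the ungrouped expression accepts it and the grouped one does not.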

Our regular expression now refuses to accept strings such as (###)-###-####. Upon some reflection, however, we might decide that we do not care what format the user enters the phone number in, as long as it contains the 10 digits we need. This would relieve the user completely from having to worry about the format, but it probably would make us do a bit more work to extract these digits later on. A regular expression for this might be as follows:

.*[0-9]{3,3}.*[0-9]{3,3}.*[0-9]{4,4}

As mentioned in previous sections, this might not be the most efficient regular expression: each ".*" sequence is greedy, so the regular expression engine may have to backtrack repeatedly before it finds a match. For infrequent form validation, however, it should not stress our servers significantly.
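
The extra work of extracting the digits afterward could look something like the sketch below. It is an illustration under stated assumptions rather than a definitive implementation: the function name is hypothetical, and it insists on exactly 10 digits, which is stricter than the expression above (that expression also accepts input containing additional digits).

```php
<?php

// Hypothetical helper: strips every non-digit character and returns the
// remaining 10 digits as a string, or false if there are not exactly 10.
function extract_phone_digits($input)
{
    $digits = preg_replace('/[^0-9]/', '', $input);

    return (strlen($digits) === 10) ? $digits : false;
}

var_dump(extract_phone_digits('(555) 555-1234'));  // string(10) "5555551234"
var_dump(extract_phone_digits('555-1234'));        // bool(false)
```

Storing the normalized digits lets us display the number later in whatever format we prefer.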

Matching Postal Codes

U.S. postal codes (ZIP codes) are rather straightforward to validate with regular expressions. They are a sequence of five digits followed optionally by what is called the "plus 4," which is a dash character followed by four more digits. A regular expression for this is as follows:

[0-9]{5,5}([- ]?[0-9]{4,4})?

The first part of this regular expression, [0-9]{5,5}, is rather straightforward, but the second part, ([- ]?[0-9]{4,4})?, might seem a little less so. In effect, we have grouped the entire "plus 4" sequence with parentheses and qualified those with a ? character, saying they can optionally not exist, or exist once and only once. Inside that, we have said that this group optionally starts with either a dash or space (we are very forgiving) with [- ]?, and then we have said that there must be four more digits with [0-9]{4,4}.
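
As a quick check of this expression, the following sketch wraps it in a small function and tries a handful of inputs. It assumes preg_match, anchors the pattern so the whole string must match, and uses a hypothetical function name.

```php
<?php

// Hypothetical helper: true when the whole string is a five-digit ZIP code,
// optionally followed by a dash or space and the four "plus 4" digits.
function is_valid_zip($zip)
{
    return preg_match('/^[0-9]{5,5}([- ]?[0-9]{4,4})?$/D', $zip) === 1;
}

foreach (array('98101', '98101-1234', '98101 1234', '9810', '98101-12') as $zip) {
    echo $zip, ' => ', is_valid_zip($zip) ? 'valid' : 'invalid', "\n";
}
```

Because the separator is optional, nine digits run together (such as 981011234) are also accepted.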

Canadian postal codes are similarly straightforward to validate. They are always of the format X#X #X#, where # represents a digit and X a letter from the English alphabet. A regular expression for this would be as follows:

[A-Za-z][0-9][A-Za-z][[:space:]]*[0-9][A-Za-z][0-9]

We have been a little forgiving and let the user put any number of whitespace characters (including none) between the two blocks of three.

If we wanted to do a bit more research, however, we would realize that not all letters are valid in Canadian postal codes. For the first letter, in fact, only the letters in [ABCEGHJKLMNPRSTVXY] are valid. We could rewrite our regular expression as follows:

[ABCEGHJKLMNPRSTVXYabceghjklmnprstvxy][0-9][A-Za-z]
  [[:space:]]*[0-9][A-Za-z][0-9]

(We have split the above regular expression onto two lines for formatting purposes only.)
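
A sketch of this check in code might look as follows. It assumes preg_match, anchors the pattern to the whole string, and uses a hypothetical function name. The whitespace class is written in its bracketed form, [[:space:]], because a POSIX class name is only recognized inside a bracket expression.

```php
<?php

// Hypothetical helper: true when the whole string looks like a Canadian
// postal code whose first letter is one of the letters allowed there.
function is_valid_canadian_postal_code($code)
{
    // The pattern is joined from two pieces for readability only.
    $pattern = '/^[ABCEGHJKLMNPRSTVXYabceghjklmnprstvxy][0-9][A-Za-z]'
             . '[[:space:]]*[0-9][A-Za-z][0-9]$/D';

    return preg_match($pattern, $code) === 1;
}

foreach (array('K1A 0B6', 'k1a0b6', 'W1A 0B6', 'K1A-0B6') as $code) {
    echo $code, ' => ',
         is_valid_canadian_postal_code($code) ? 'valid' : 'invalid', "\n";
}
```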

Matching E-Mail Addresses

A much more complicated example comes when we consider matching e-mail addresses. These come in a number of formats, some of which are extremely complicated. We will want to write a regular expression to verify at least the most common formats.

An e-mail address consists of three basic parts: the username, the @ symbol, and the domain name with which that username is associated:

username@domainname

The username, in its basic form, can consist of ASCII alphanumeric characters, underscores, periods, and dashes, describable by the regular expression: [[:alnum:]._-]+. In more complicated formats, it can be any sequence of characters enclosed in double quotes, and it can even include backslashes to escape seemingly invalid characters such as spaces and other backslashes.

We will limit ourselves to the most basic scenario for this sample and invite readers to look at the documentation in RFC 3696 (http://www.ietf.org/rfc/rfc3696.txt) and RFC 2822 (http://www.ietf.org/rfc/rfc2822.txt) for complete details of all possible e-mail address formats.

The domain name is a series of alphanumeric words, separated by periods. There cannot be a period before the first word or after the last word. In addition to alphanumeric characters, the words can contain the dash character. The last word in the domain name, such as com, edu, org, jp, or biz, will not contain a dash. Our regular expression for this might be as follows:

[[:alnum:]-]+\.([[:alnum:]-]+\.)*[[:alnum:]]+

The optional block in the middle, along with the * (which is the same as the {0,} quantifier), lets us insert arbitrary numbers of subdomains and associated dot characters into our domain name. The preceding regular expression correctly matches domains such as these:

example.org
www.example.org
shoes.example.org
pumps.shoes.example.org
my.little.furry.happy-bunny.is.cute.example.org
some.example.bizness
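
The domain expression can be tried on its own before it is combined with anything else. The sketch below assumes preg_match, anchors the pattern so the whole string must be a domain name, and uses a hypothetical function name.

```php
<?php

// Hypothetical helper: true when the whole string is a series of
// alphanumeric-or-dash words separated by periods, ending in a word
// that contains no dash.
function is_valid_domain($domain)
{
    $pattern = '/^[[:alnum:]-]+\.([[:alnum:]-]+\.)*[[:alnum:]]+$/D';

    return preg_match($pattern, $domain) === 1;
}

$domains = array(
    'example.org',
    'my.little.furry.happy-bunny.is.cute.example.org',
    '.example.org',   // period before the first word
    'example.',       // period after the last word
    'example.c-m',    // dash in the last word
);

foreach ($domains as $domain) {
    echo $domain, ' => ', is_valid_domain($domain) ? 'valid' : 'invalid', "\n";
}
```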

So, with all of these pieces, we now have a complete regular expression to look for a well-formed (syntactically, at least) e-mail address:

[[:alnum:]._-]+@[[:alnum:]-]+\.([[:alnum:]-]+\.)*[[:alnum:]]+
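
One way to keep the pieces separate in code is to build the full pattern from named parts, as in the following sketch. It assumes preg_match, anchors the result to the whole string, and uses hypothetical names throughout; like the expression above, it covers only the most basic address format.

```php
<?php

// Hypothetical helper: true when the string is a basic, syntactically
// well-formed e-mail address of the form username@domainname.
function is_well_formed_email($address)
{
    // Each subproblem gets its own piece of the pattern.
    $username = '[[:alnum:]._-]+';
    $domain   = '[[:alnum:]-]+\.([[:alnum:]-]+\.)*[[:alnum:]]+';

    $pattern = '/^' . $username . '@' . $domain . '$/D';

    return preg_match($pattern, $address) === 1;
}

$addresses = array(
    'user@example.org',
    'first.last@mail.example.org',
    'user@example',            // domain has no period
    'user name@example.org',   // space in the username
);

foreach ($addresses as $address) {
    echo $address, ' => ',
         is_well_formed_email($address) ? 'well formed' : 'rejected', "\n";
}
```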

You are encouraged to try other regular expressions on your own to match things you see on a regular basis, such as URLs, credit card numbers, or license plate numbers in your home area. A key tip to help you with this is to break your regular expressions into subproblems, and solve all of those, before putting them together into one larger expression. If you try to solve the entire problem from the start, a small error is likely to sink your entire expression and be much more difficult to find.
