15.3. Regexp Special Characters

The metacharacters +, *, ?, and { } control how many times the preceding expression may match, ( ) lets you create subpatterns, and $ and ^ anchor the match to a position. + means "Match one or more of the previous expression," * means "Match zero or more of the previous expression," and ? means "Match zero or one of the previous expression." For example:

    preg_match("/[A-Za-z ]*/", $string);
    // matches "", "a", "aaaa", "The sun has got his hat on", etc. Because * also
    // accepts zero characters and the pattern is not anchored, it finds an (empty)
    // match in any string

    preg_match("/-?[0-9]+/", $string);
    // matches 1, 100, 324343995, and also -1, -234011, etc. The "-?" means
    // "match zero or one minus symbols"

This next regexp uses two character classes, the first required and the second optional. As mentioned before, $ is a regexp symbol in its own right, so here we precede it with a backslash, which works as an escape character and turns the $ into a standard character rather than a regexp symbol. Note the single quotes around the pattern: inside a double-quoted PHP string, \$ is consumed by PHP itself, and the regexp engine would receive a bare $. After the dollar sign, we match precisely one symbol from the range A-Z, a-z, and underscore, then zero or more symbols from the range A-Z, a-z, underscore, and 0-9. If you're able to parse this in your head, you will see that this regexp matches PHP variable names:

    preg_match('/\$[A-Za-z_][A-Za-z_0-9]*/', $string);
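The quoting style matters for this pattern, and a quick experiment makes the difference visible. The following is a minimal sketch, assuming a made-up sample string in $code; it runs the same regexp once in single quotes and once in double quotes, and uses the optional third argument of preg_match() to capture the matched text.

```php
<?php
// Hypothetical sample input, chosen only for illustration.
$code = 'echo $user_name . $x2;';

// Single quotes: the backslash reaches the regexp engine, so \$ is a literal dollar sign.
$single = preg_match('/\$[A-Za-z_][A-Za-z_0-9]*/', $code, $matches);

// Double quotes: PHP turns \$ into a bare $, which the regexp engine reads as an
// end-of-string anchor, so nothing can follow it and the pattern cannot match.
$double = preg_match("/\$[A-Za-z_][A-Za-z_0-9]*/", $code);

var_dump($single);      // int(1)
var_dump($matches[0]);  // string(10) "$user_name"
var_dump($double);      // int(0)
```

Only the first variable is reported because preg_match() stops at the first match it finds.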

Table 15-3 shows a list of regular expressions using +, *, and ?, and whether or not a match is made.

Table 15-3. Regular expressions using +, *, and ?

Regexp: preg_match("/[A-Z]+/", "123")
Result: False

Regexp: preg_match("/[A-Z][A-Z0-9]+/i", "A123")
Result: True

Regexp: preg_match("/[0-9]?[A-Z]+/", "10GreenBottles")
Result: True; matches "0G"

Regexp: preg_match("/[0-9]?[A-Z0-9]*/i", "10GreenBottles")
Result: True

Regexp: preg_match("/[A-Z]?[A-Z]?[A-Z]*/", "")
Result: True; zero or one match, then zero or one match, then zero or more means that an empty string matches
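To see which part of the subject each pattern in Table 15-3 actually consumes, you can pass a third argument to preg_match() and inspect element 0 of the resulting array. This is a small sketch rather than anything from the original text; the $cases and $results names are made up for the illustration.

```php
<?php
// Each entry is [pattern, subject], copied from Table 15-3.
$cases = [
    ['/[A-Z]+/', '123'],
    ['/[A-Z][A-Z0-9]+/i', 'A123'],
    ['/[0-9]?[A-Z]+/', '10GreenBottles'],
    ['/[0-9]?[A-Z0-9]*/i', '10GreenBottles'],
    ['/[A-Z]?[A-Z]?[A-Z]*/', ''],
];

$results = [];
foreach ($cases as [$pattern, $subject]) {
    // preg_match() returns 1 on a match and 0 otherwise; $m[0] holds the matched text.
    $found = preg_match($pattern, $subject, $m);
    $results[] = ($found === 1) ? $m[0] : null;

    printf(
        "%-24s %-18s %s\n",
        $pattern,
        var_export($subject, true),
        ($found === 1) ? var_export($m[0], true) : 'no match'
    );
}
```

The third row reports '0G' and the fourth reports the whole subject, which shows how the optional leading digit is given up or kept depending on what the rest of the pattern needs.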


Opening braces { and closing braces } can be used to define specific repeat counts in three different ways. First, {n}, where n is a whole number, will match exactly n instances of the previous expression. Second, {n,} will match a minimum of n instances of the previous expression. Third, {m,n} will match a minimum of m instances and a maximum of n instances of the previous expression. Note that there are no spaces inside the braces.

Table 15-4 shows a list of regular expressions using braces, and whether or not a match is made.

Table 15-4. Regular expressions using braces

Regexp: preg_match("/[A-Z]{3}/", "FuZ")
Result: False; the regexp needs three consecutive uppercase letters

Regexp: preg_match("/[A-Z]{3}/i", "FuZ")
Result: True; same as above, but case-insensitive this time

Regexp: preg_match("/[0-9]{3}-[0-9]{4}/", "555-1234")
Result: True; precisely three digits, a dash, then precisely four digits. This will match local U.S. telephone numbers, for example

Regexp: preg_match("/[a-z]+[0-9]?[a-z]{1}/", "aaa1")
Result: True; the match must end with one lowercase letter, so the regexp matches "aaa" and leaves the trailing "1" unmatched

Regexp: preg_match("/[A-Z]{1,}99/", "99")
Result: False; the match must start with at least one uppercase letter

Regexp: preg_match("/[A-Z]{1,5}99/", "FINGERS99")
Result: True; "S99", "RS99", "ERS99", "GERS99", and "NGERS99" all fit the criteria, and preg_match() reports the leftmost one, "NGERS99"

Regexp: preg_match("/[A-Z]{1,5}[0-9]{2}/i", "adams42")
Result: True
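The three brace forms can be compared side by side by capturing what each one matches. The sketch below is illustrative only; the sample subjects "1234" and "abc12345" and the variable names are assumptions, while the last pattern comes from Table 15-4.

```php
<?php
// {3}: exactly three digits. The subject has four, so the first three are matched.
preg_match('/[0-9]{3}/', '1234', $exact);

// {2,}: two or more digits, taking as many as are available.
preg_match('/[0-9]{2,}/', 'abc12345', $atLeast);

// {1,5}: between one and five uppercase letters, followed by 99.
preg_match('/[A-Z]{1,5}99/', 'FINGERS99', $range);

var_dump($exact[0]);    // string(3) "123"
var_dump($atLeast[0]);  // string(5) "12345"
var_dump($range[0]);    // string(7) "NGERS99"
```

In the last case the engine tries each starting position from the left; "F" and "I" cannot reach the "99" within five letters, so the first successful match starts at "N".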


Parentheses inside regular expressions allow you to define subpatterns that should be matched individually. The most common use for these is to specify groups of alternatives for matches, allowing you to match very specific criteria. For example, "the (cat|car) sat on the (mat|drive)" would match "the cat sat on the mat", "the car sat on the mat", "the cat sat on the drive", and "the car sat on the drive". You can use as many alternatives as you want, so "the (car|cat|bat|bull|wool|white paint) sat on the (mat|drive)" could match many sentences.
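Subpatterns are also reported individually when you pass a third argument to preg_match(). The following minimal sketch reuses the cat and car pattern from the paragraph above; the variable names and the "dog" subject are assumptions made for the example.

```php
<?php
$pattern = '/the (cat|car) sat on the (mat|drive)/';

// $m[0] is the whole match; $m[1] and $m[2] are the two subpatterns, in order.
$found = preg_match($pattern, 'the car sat on the drive', $m);

var_dump($found);  // int(1)
var_dump($m[1]);   // string(3) "car"
var_dump($m[2]);   // string(5) "drive"

// "dog" is not one of the listed alternatives, so there is no match.
$missing = preg_match($pattern, 'the dog sat on the mat');

var_dump($missing);  // int(0)
```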

Table 15-5 shows a list of regular expressions using parentheses, and whether or not a match is made.

Table 15-5. Regular expressions using parentheses

Regexp: preg_match("/(Linux|Mac OS X)/", "Linux")
Result: True

Regexp: preg_match("/(Linux|Mac OS X){2}/", "Mac OS XLinux")
Result: True

Regexp: preg_match("/(Linux|Mac OS X){2}/", "Mac OS X Linux")
Result: False; there's a space in there, which is not part of the regexp

Regexp: preg_match("/contra(diction|vention)/", "contravention")
Result: True

Regexp: preg_match("/Windows ([0-9][0-9]+|Me|XP)/", "Windows 2000")
Result: True; the pattern matches Windows 95, 98, 2000, 2003, Me, and XP

Regexp: preg_match("/Windows (([0-9][0-9]+|Me|XP)|Codename (Whistler|Longhorn))/", "Windows Codename Whistler")
Result: True; uses nested subpatterns to match the Windows versions above, and also the codenames Whistler and Longhorn


Finally, we have the dollar $ and caret ^ symbols, which mean "end of line" and "start of line," respectively. Consider the following string:

    $multitest = "This is\na long test\nto see whether\nthe dollar\nSymbol\nand the\n"
        . "caret symbol\nwork as planned";

As you know, \n means "new line," so that is a string containing the following text:

This is
a long test
to see whether
the dollar
Symbol
and the
caret symbol
work as planned

To match against the individual lines of a multiline string, we need the m modifier, which goes after the final slash. Without m, our multiline string is treated as a single line, with "This" at the start and "planned" at the end. By adding m to the regexp, we're asking PHP to let ^ and $ match at the start and end of each line, wherever a newline (\n) character occurs. Each of these code snippets finds a match, so preg_match() returns 1, which PHP treats as true:

    preg_match("/is$/m", $multitest);
    // returns true if 'is' is at the end of a line

    preg_match("/the$/m", $multitest);
    // returns true if 'the' is at the end of a line

    preg_match("/^the/m", $multitest);
    // returns true if 'the' is at the start of a line

    preg_match("/^Symbol/m", $multitest);
    // returns true if 'Symbol' is at the start of a line

    preg_match("/^[A-Z][a-z]{1,}/m", $multitest);
    // returns true if there's a capital and one or more lowercase letters at line start
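The effect of m is easiest to see by running the same pattern with and without it. The sketch below repeats the $multitest string so that it runs on its own; the use of preg_match_all() to collect the last word of every line is an addition for illustration, not something taken from the text above.

```php
<?php
$multitest = "This is\na long test\nto see whether\nthe dollar\nSymbol\nand the\n"
    . "caret symbol\nwork as planned";

// Without m, ^ only matches at the very start of the string, which begins with "This".
$withoutM = preg_match('/^the/', $multitest);

// With m, ^ also matches immediately after each newline.
$withM = preg_match('/^the/m', $multitest);

var_dump($withoutM);  // int(0)
var_dump($withM);     // int(1)

// Collect the last word of every line: one or more word characters, then a line end.
preg_match_all('/\w+$/m', $multitest, $lastWords);
print_r($lastWords[0]);
```

The final call yields one entry per line: is, test, whether, dollar, Symbol, the, symbol, planned.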

As explained, without the m modifier, the ^ and $ metacharacters only match the start and end of the entire string. With m, ^ and $ match the start and end of each line. If you want to match the start and end of the whole string when m is enabled, you should use \A and \z, like this:

    preg_match("/\AThis/m", $multitest);
    // returns true if the string starts with "This" (true)

    preg_match("/symbol\z/m", $multitest);
    // returns true if the string ends with "symbol" (false)
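A side-by-side comparison shows how \A and \z differ from ^ and $ once m is switched on. This is a minimal sketch, assuming a hypothetical two-line sample string in $text; the variable names are made up for the example.

```php
<?php
// Hypothetical two-line sample, chosen only for illustration.
$text = "first line\nsecond line";

// ^ with m matches after the newline, so "second" counts as a line start.
$caret = preg_match('/^second/m', $text);

// \A ignores m and only matches at the very start of the string.
$stringStart = preg_match('/\Asecond/m', $text);

// $ with m matches before the newline, so "first line" counts as a line end.
$dollar = preg_match('/first line$/m', $text);

// \z ignores m and only matches at the very end of the string.
$stringEnd = preg_match('/first line\z/m', $text);

var_dump($caret);        // int(1)
var_dump($stringStart);  // int(0)
var_dump($dollar);       // int(1)
var_dump($stringEnd);    // int(0)
```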

