Regular Expressions

If there’s one thing that humans do well, it’s pattern matching. You can categorize the numbers in the following list with barely any thought:

321-40-0909
302-555-8754
3-15-66
95124-0448

You can tell at a glance which of the following words can’t possibly be valid English words by the pattern of consonants and vowels:

grunion vortenal pskov trebular elibm talus

Regular expressions are JavaScript’s way of letting your program look for patterns like these.

The Simplest Patterns

The simplest pattern to look for is a single letter. If you want to see if a variable str contains the letter e, for example, you can use this code (presuming that user input is in a text field named textInput inside a form named ex1, and that the result is shown in an alert box):

function test1( )
{
    var str = document.ex1.textInput.value;
    var pos = str.search( /e/ );
    if (pos >= 0)
    {
        alert( str + " contains the letter e at position "
            + pos );
    }
    else
    {
        alert(str + " does not contain the letter e.");
    }
}

<form name="ex1" action="#">
<p>
String: <input type="text" name="textInput" />
<input type="button" onclick="test1();" value="Test" />
</p>
</form>

The pattern is enclosed in slashes. Note that you do not put quote marks around your pattern! Notice that a pattern search will find only the first occurrence of a pattern; try typing the word center and see where it finds the letter e.

Of course, you can put more than one letter in your pattern. You can look for the word eat anywhere in a word:

str.search(/eat/)

This will successfully match the words eat, heater, and treat, but won’t match easy, metal, or hat. You may be saying, "So what? I can do the same thing with the indexOf string method." Yes, you can, but now let’s do something that isn’t so easy to do with indexOf:
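For a literal pattern like this one, search() and indexOf() report the same position. The following is a minimal sketch of that comparison; the helper name comparePositions is hypothetical, and the sample words are the ones listed above.

```javascript
// comparePositions is a hypothetical helper used only for this sketch.
// It looks for "eat" once with a regular expression and once with indexOf.
function comparePositions(word) {
    return {
        viaSearch: word.search(/eat/),    // position of the first match, or -1
        viaIndexOf: word.indexOf("eat")   // same answer for a literal substring
    };
}

var words = ["eat", "heater", "treat", "easy", "metal", "hat"];
for (var i = 0; i < words.length; i++) {
    var result = comparePositions(words[i]);
    console.log(words[i] + ": search = " + result.viaSearch
        + ", indexOf = " + result.viaIndexOf);
}
```

The two methods stop agreeing as soon as the pattern contains anything other than literal characters, which is where regular expressions start to pay off.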

Matching any single character

Let’s make a pattern that will match the letter e followed by any character at all, followed by the letter t. To say "any character at all", you use a dot. Here’s the pattern:

str.search(/e.t/)

This will match better, either, and best (the dot will match the t, i, and s in those words). It will not match beast (two letters between the e and t), ketch (no letters between the e and t), or crease (no letter t at all!).
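The words from the paragraph above can be checked in a short loop. This is only a sketch; the variable names are chosen for illustration.

```javascript
// Test each word against /e.t/: an e, any one character, then a t.
var samples = ["better", "either", "best", "beast", "ketch", "crease"];
var positions = {};

for (var i = 0; i < samples.length; i++) {
    // search() returns the position of the first match, or -1 if none.
    positions[samples[i]] = samples[i].search(/e.t/);
    console.log(samples[i] + " -> " + positions[samples[i]]);
}
```

The first three words report a position of 0 or higher; the last three report -1.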

Matching classes of characters

Now let’s find out how to narrow down the field a bit. We’d like to be able to find a pattern consisting of the letter b, any vowel (a, e, i, o, or u), followed by the letter t. To say "any one of a certain series of characters", you enclose them in square brackets:

str.search(/b[aeiou]t/)

This matches words like bat, bet, rabbit, robotic, and abutment. It won’t match boot, because there are two letters between the b and t, and the class matches only a single character. (We’ll see how to check for multiple vowels later.)

There are abbreviations for establishing a series of letters: [a-f] is the same as [abcdef]; [A-Gm-p] is the same as [ABCDEFGmnop]; [0-9] matches a single digit (same as [0123456789]).

You may also complement (negate) a class; the next two searches will look for the letter e followed by anything except a vowel, followed by the letter t; or any character except a capital letter:

str.search(/e[^aeiou]t/)
str.search(/[^A-Z]/)
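Here is a small sketch that exercises a plain class, a range, and two negated classes. The sample strings and variable names are made up for illustration.

```javascript
// A character class matches exactly one character from the listed set.
var vowelBetween = /b[aeiou]t/;
var rabbitPos = "rabbit".search(vowelBetween);   // the match is "bit"
var bootPos = "boot".search(vowelBetween);       // two vowels, so no match

// A range is shorthand for listing every character in it.
var hexPos = "xyz7".search(/[0-9a-f]/);          // first hex digit is the 7

// A leading ^ inside the brackets negates the class.
var bestPos = "best".search(/e[^aeiou]t/);       // s is not a vowel
var beatPos = "beat".search(/e[^aeiou]t/);       // a is a vowel, so no match
var lowerPos = "ABCdEF".search(/[^A-Z]/);        // first non-capital is the d

console.log(rabbitPos, bootPos, hexPos, bestPos, beatPos, lowerPos);
```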

There are some classes that are so useful that JavaScript provides quick and easy abbreviations:

| Abbreviation | Means | Same as |
| --- | --- | --- |
| \d | a digit | [0-9] |
| \w | a "word" character: uppercase letter, lowercase letter, digit, or underscore. This is actually more like a variable name character, but let’s not quibble. | [A-Za-z0-9_] |
| \s | a "whitespace" character (blank, newline, tab, and others) | [ \r\t\n\f] |

And their complements...

| Abbreviation | Means | Same as |
| --- | --- | --- |
| \D | a non-digit | [^0-9] |
| \W | a non-word character | [^A-Za-z0-9_] |
| \S | a non-whitespace character | [^ \r\t\n\f] |

Thus, this pattern matches a Social Security number; again, we’ll see a shorter way later on.

str.search(/\d\d\d-\d\d-\d\d\d\d/)
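As a quick check, the pattern can be run against the four numbers from the list at the top of this page. This is a sketch with illustrative variable names.

```javascript
// Three digits, a dash, two digits, a dash, four digits.
var ssnPattern = /\d\d\d-\d\d-\d\d\d\d/;
var candidates = ["321-40-0909", "302-555-8754", "3-15-66", "95124-0448"];
var ssnPositions = [];

for (var i = 0; i < candidates.length; i++) {
    ssnPositions.push(candidates[i].search(ssnPattern));
    console.log(candidates[i] + " -> " + ssnPositions[i]);
}
```

Only the first number fits the pattern. Because nothing ties the pattern to the start or end of the string, it would also be found in the middle of a longer string.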

Anchors

All the patterns we’ve seen so far will find a match anywhere within a string, which is usually, but not always, what we want. For example, we might insist on a capital letter, but only as the very first character in the string. Or, we might say that an employee ID number has to end with a digit. Or, we might want to find the word go only if it is at the beginning of a word, so that we will find it in "You met another, and pfft you was gone." but we won’t mistakenly find it in "I forgot my umbrella." This is the purpose of an anchor: to make sure that we are at a certain boundary before we continue the match. Unlike character classes, which match individual characters in a string, anchors do not match any character; they simply establish that we are on the correct boundaries.

The up-arrow ^ matches the beginning of a line, and the dollar sign $ matches the end of a line. Thus, ^[A-Z] matches a capital letter at the beginning of the line. Note that if we put the ^ inside the square brackets, that would mean something entirely different! A pattern \d$ matches a digit at the end of a line. These are the boundaries you will use most often. Sometimes you can have a string with multiple lines in it (since it will contain \n newlines). By default, JavaScript treats ^ and $ as the beginning and end of the entire string; if you want them to match at the beginning and end of each line instead, put the letter m (for "multiline") after the closing slash of the pattern. Some other languages use \A and \Z to mean the beginning and end of the entire string, but JavaScript does not support them.
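The following sketch shows both anchors at work, and then shows how ^ behaves on a string that contains a newline, with and without the letter m after the closing slash. The helper names and sample strings are hypothetical.

```javascript
// Hypothetical helpers for this sketch.
function startsWithCapital(str) {
    return str.search(/^[A-Z]/) === 0;
}

function endsWithDigit(str) {
    return str.search(/\d$/) >= 0;
}

console.log(startsWithCapital("Hello"));        // true
console.log(startsWithCapital("hello World"));  // false: the W is not first
console.log(endsWithDigit("EMP-1047"));         // true
console.log(endsWithDigit("1047-EMP"));         // false: the digits are not last

// A string with two lines in it.
var twoLines = "first line\nSecond line";
var wholeString = twoLines.search(/^[A-Z]/);    // ^ means start of the string
var eachLine = twoLines.search(/^[A-Z]/m);      // ^ also matches after a newline
console.log(wholeString, eachLine);
```

Without m, the capital S is not found because it is not at the start of the string; with m, it is found at the start of the second line.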

The other two anchors are \b and \B, which stand for a "word boundary" and "non-word boundary". For example, if we want to find the word met at the beginning of a word, we write the pattern /\bmet/, which will match "The metal plate" and "The metropolitan lifestyle", but not "Wear your bike helmet". The pattern /ing\b/ will match "Hiking is fun" and "Reading, writing, and arithmetic", but not "Gold ingots are heavy". Finally, the pattern /\bhat\b/ matches only "The hat is red" but not "That is the question" or "she hates anchovies" or "the shattered glass".

While \b is used to find the breakpoint between words and non-words, \B matches a position that is not such a breakpoint: between two word characters or between two non-word characters. /\Bmet/ and /ing\B/ match the opposite examples of the preceding paragraph; /\Bhat\B/ matches only "the shattered glass".
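A short sketch that runs the example sentences through the boundary patterns; the list structure and variable names are made up for illustration.

```javascript
// Each entry pairs a pattern with a sentence from the text above.
var examples = [
    { pattern: /\bmet/,    text: "The metal plate" },
    { pattern: /\bmet/,    text: "Wear your bike helmet" },
    { pattern: /\Bmet/,    text: "Wear your bike helmet" },
    { pattern: /ing\b/,    text: "Hiking is fun" },
    { pattern: /ing\B/,    text: "Gold ingots are heavy" },
    { pattern: /\bhat\b/,  text: "The hat is red" },
    { pattern: /\bhat\b/,  text: "That is the question" },
    { pattern: /\Bhat\B/,  text: "the shattered glass" }
];

var boundaryResults = [];
for (var i = 0; i < examples.length; i++) {
    var pos = examples[i].text.search(examples[i].pattern);
    boundaryResults.push(pos);
    console.log(examples[i].pattern + " in \"" + examples[i].text + "\" -> " + pos);
}
```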

Repetition

All of these classes match only one character; what if we want to match three digits in a row, or an arbitrary number of vowels? You can follow any class or character by a repetition count:

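| Pattern | Matches |
| --- | --- |
| /b[aeiou]{2}t/ | b followed by two vowels, followed by t |
| /A\d{3,}/ | The letter A followed by 3 or more digits |
| /[A-Z]{0,5}/ | Zero to five capital letters |
| /\w{3,7}/ | Three to seven word characters |

Note that the lower bound is required: JavaScript does not treat {,5} as a repetition count, so write {0,5}.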
This lets us rewrite our Social Security number pattern match as /\d{3}-\d{2}-\d{4}/.
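The sketch below tries a few repetition counts and confirms that the long and short Social Security patterns agree on a sample string. The sample strings and variable names are invented for illustration.

```javascript
// Exactly two vowels between b and t.
var bootPos = "boot".search(/b[aeiou]{2}t/);
var batPos = "bat".search(/b[aeiou]{2}t/);

// Three or more digits after the letter A.
var tooFew = "A12".search(/A\d{3,}/);
var enough = "A12345".match(/A\d{3,}/);    // match() returns what was matched

// Three to seven word characters: the match stops after seven.
var capped = "abcdefghij".match(/\w{3,7}/);

// The two ways of writing the Social Security pattern.
var longForm = /\d\d\d-\d\d-\d\d\d\d/;
var shortForm = /\d{3}-\d{2}-\d{4}/;
var sample = "SSN: 321-40-0909";
var sameResult = sample.search(longForm) === sample.search(shortForm);

console.log(bootPos, batPos, tooFew, enough[0], capped[0], sameResult);
```

A repetition count takes as many characters as it is allowed to, which is why the digit match includes all five digits and the word-character match stops at seven letters.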

There are three repetitions that are so common that JavaScript has special symbols for them: * means "zero or more," + means "one or more," and ? means "zero or one". Thus, if you want to look for lines consisting of last names followed by a first initial, you could use this pattern:

/^\w+,?\s*[A-Z]$/

This matches, starting at the beginning of the line, a word of one or more characters followed by an optional comma, zero or more whitespace characters, and a single capital letter, which must be at the end of the line.
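To see which inputs this pattern accepts, here is a sketch with a hypothetical helper named hasNameAndInitial and a few made-up names.

```javascript
// hasNameAndInitial is a hypothetical helper for this sketch.
function hasNameAndInitial(str) {
    return str.search(/^\w+,?\s*[A-Z]$/) === 0;
}

console.log(hasNameAndInitial("Smith, J"));   // true
console.log(hasNameAndInitial("Smith J"));    // true: the comma is optional
console.log(hasNameAndInitial("Smith"));      // false: no capital letter at the end
console.log(hasNameAndInitial("Smith, j"));   // false: the initial must be a capital
console.log(hasNameAndInitial("Smith, J."));  // false: the period is not allowed for
```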

Grouping

So far so good, but what if we want to scan for a last name, followed by an optional comma-whitespace-initial; thus matching only a last name like "Smith" or a full "Smith, J"? We need to put the comma, whitespace, and initial into a unit with parentheses:

/^\w+(,\s*[A-Z])?$/

Note: If you want to match a parenthesis, you have to precede it with a backslash to make it non-special.
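The effect of the group can be sketched as follows; the helper name isLastNameOrFull and the sample inputs are hypothetical.

```javascript
// isLastNameOrFull is a hypothetical helper for this sketch.
// The ? applies to the whole parenthesized unit, not just to one character.
function isLastNameOrFull(str) {
    return str.search(/^\w+(,\s*[A-Z])?$/) === 0;
}

console.log(isLastNameOrFull("Smith"));      // true: the group is skipped
console.log(isLastNameOrFull("Smith, J"));   // true: the group is used
console.log(isLastNameOrFull("Smith,"));     // false: a comma needs an initial
console.log(isLastNameOrFull("Smith J"));    // false: the comma is now required

// A backslash makes a parenthesis match a literal parenthesis.
var parenPos = "f(x)".search(/\(/);
console.log(parenPos);
```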

Modifiers

If you want a pattern match to be case-insensitive, follow the closing slash of the pattern by a lowercase letter i. The following example shows a pattern that will match any Canadian postal code in upper or lower case:

/^[A-Z]\d[A-Z]\s+\d[A-Z]\d$/i
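A sketch comparing the pattern with and without the i modifier; the sample codes and variable names are chosen only for illustration.

```javascript
var withModifier = /^[A-Z]\d[A-Z]\s+\d[A-Z]\d$/i;
var withoutModifier = /^[A-Z]\d[A-Z]\s+\d[A-Z]\d$/;

var upper = "K1A 0B1";
var lower = "k1a 0b1";

console.log(upper.search(withModifier));      // 0
console.log(lower.search(withModifier));      // 0: case is ignored
console.log(lower.search(withoutModifier));   // -1: lowercase letters rejected
console.log("K1A0B1".search(withModifier));   // -1: \s+ needs at least one space
```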

At this point, you know everything you need to test whether a string matches a particular pattern.

Advanced Pattern Matching

All we have done so far is test whether a pattern matches or not. Now that you can match a person’s last name and initial, you might want to be able to grab them out of the string so that you can change Martinez, A to A. Martinez. To accomplish this, you will need something other than search().

Rather than doing a simple positional search, you use the match() method. You’ll also have to use the grouping parentheses, which have a side effect: whenever you use parentheses to group something, the match operation stores the part of the string that matched the group so that you can use it later on. Here’s how it works:

function getNames( )
{
   var str = document.nameForm.textInput.value;
   var pattern = /^(\w+)(,\s*[A-Z])?$/;
   
   var foundArray = str.match( pattern );
   
   if (foundArray != null)
   {
      alert("Found last name: " + foundArray[1] + "\n"
         + "Found initial: " + foundArray[2] );
   }
}

<form name="nameForm" action="#">
Input: <input type="text" name="textInput" />
<input type="button" value="Test" onclick="getNames( );" />
</form>

The result of match() is null if the pattern doesn’t match, or an array if it does. The first element of the array, in this case foundArray[0], contains everything that the pattern matched. foundArray[1] contains the part of the string that the first set of grouping parentheses matched, foundArray[2] contains the part of the string matched by the second set of grouping parentheses, and so forth.

If you enter Smith, J you’ll see that the second set of grouping parentheses doesn’t give you what you want. The group stores the entire matched substring, which includes the comma. We’d like to store only the initial. We can do this two ways. First, we can include yet another set of parentheses:

/^(\w+)(,\s*([A-Z]))?$/

If we do it this way, then the capital letter is stored in foundArray[3] and the entire comma-and-initial is stored in foundArray[2].

The other way to do this is to say that the outer parentheses should group but not store the matched portion in the result array. You do that with a question mark and colon.

/^(\w+)(?:,\s*([A-Z]))?$/

In this case, the initial is in foundArray[2], since the second open parenthesis doesn’t get stored. As you can see, patterns can very quickly become difficult to read.
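The sketch below puts the three versions of the pattern side by side, and then uses the last one to turn Martinez, A into A. Martinez. The function name formatName is hypothetical, and the code runs without any form on the page.

```javascript
var input = "Smith, J";

// Every pair of parentheses stores what it matched.
var storeAll = input.match(/^(\w+)(,\s*[A-Z])?$/);
// An extra inner pair stores just the initial as a third group.
var nested = input.match(/^(\w+)(,\s*([A-Z]))?$/);
// (?: ... ) groups without storing, so the initial becomes the second group.
var nonStoring = input.match(/^(\w+)(?:,\s*([A-Z]))?$/);

console.log(storeAll[2]);     // the comma, space, and initial together
console.log(nested[3]);       // the initial alone
console.log(nonStoring[2]);   // the initial alone

// formatName is a hypothetical helper: "Martinez, A" becomes "A. Martinez".
function formatName(str) {
    var found = str.match(/^(\w+)(?:,\s*([A-Z]))?$/);
    if (found === null) {
        return null;              // the input did not match at all
    }
    if (found[2] === undefined) {
        return found[1];          // last name only: the optional group was skipped
    }
    return found[2] + ". " + found[1];
}

console.log(formatName("Martinez, A"));
console.log(formatName("Smith"));
```

When an optional group takes no part in the match, its slot in the array holds undefined, which is why formatName checks for that before building the result.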

Here is another example. Say you want to match a phone number and find the area code, prefix, and number. Note that when you want to match to a real parenthesis, you have to precede it with a backslash to make it “not part of a group.” You can do it this way:

var areaCode = "";
var prefix = "";
var number = "";

function getPhone( )
{
   var str = document.phoneForm.textInput.value;
   var pattern = /\((\d{3})\)\s*(\d{3})-(\d{4})/;
   
   var foundArray = str.match( pattern );
   
   if (foundArray != null)
   {
      areaCode = foundArray[1];
      prefix = foundArray[2];
      number = foundArray[3];
      alert("Area code: " + areaCode
         + " Prefix: " + prefix
         + " Number: " + number );
   }
   else
   {
      alert(str + " is not a valid phone number.");
   }
}

<form name="phoneForm" action="#">
Input: <input type="text" name="textInput" />
<input type="button" value="Test" onclick="getPhone( );" />
</form>
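The same pattern can also be tried without a form. In this sketch the function name parsePhone is hypothetical; it returns the three stored groups as an object instead of setting global variables.

```javascript
// parsePhone is a hypothetical helper for this sketch.
// It returns null when the string does not contain a phone number
// written in the form (ddd) ddd-dddd.
function parsePhone(str) {
    var found = str.match(/\((\d{3})\)\s*(\d{3})-(\d{4})/);
    if (found === null) {
        return null;
    }
    return {
        areaCode: found[1],   // digits between the literal parentheses
        prefix: found[2],     // three digits before the dash
        number: found[3]      // four digits after the dash
    };
}

var parsed = parsePhone("(302) 555-8754");
console.log(parsed.areaCode, parsed.prefix, parsed.number);
console.log(parsePhone("302-555-8754"));   // null: no parentheses around the area code
```

Since the pattern has no anchors, it will also find a phone number in the middle of a longer string.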

Modifiers

As you have already seen, you can put the letter i after a pattern to make the match case-insensitive. As mentioned near the beginning of this tutorial, pattern matches find only the first occurrence of a pattern. If you want to find all the matches in a string, use the g modifier, which stands for global match. You use this in conjunction with match(). The following pattern will find every occurrence of a capital letter followed by an optional dash and a single digit:

/([A-Z]-?\d)/g

Matching this pattern against the string:

"Insert tabs B3, D-7, and C6 into slot A9."

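This produces the results shown by the following code. Note: when you do a global search, each element of the resulting array is one complete match. The first element is the first match, the second element is the second match, and so on; the substrings stored by grouping parentheses are not included.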

function showGlobalMatch( )
{
    var pattern = /([A-Z]-?\d)/g;
    var str = "Insert tabs B3, D-7, and C6 into slot A9.";
    var foundArray;
    var feedback = "";
    
    foundArray = str.match( pattern );
    if (foundArray != null)
    {
        for (var i=0; i < foundArray.length; i++)
        {
            feedback = feedback + "foundArray[" 
                + i + "] = " + foundArray[i] + "\n";
        }
    }
    else
    {
        feedback = "Pattern not found.";
    }
    alert(feedback);
}

<a href="#" onclick="showGlobalMatch(); return false;">See results</a>
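To make the difference concrete, this sketch runs the same pattern with and without the g modifier and logs the results to the console rather than an alert box. The variable names are illustrative.

```javascript
var str = "Insert tabs B3, D-7, and C6 into slot A9.";

// Without g: element 0 is the whole match, element 1 is the stored group.
var firstOnly = str.match(/([A-Z]-?\d)/);
console.log(firstOnly[0], firstOnly[1], firstOnly.length);

// With g: every element is one complete match, in order of appearance.
var allMatches = str.match(/([A-Z]-?\d)/g);
for (var i = 0; i < allMatches.length; i++) {
    console.log("allMatches[" + i + "] = " + allMatches[i]);
}
```

The capital I in Insert is skipped in both cases because it is not followed by a digit.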