If there’s one thing that humans do well, it’s pattern matching. You can categorize the numbers in the following list with barely any thought:
321-40-0909 302-555-8754 3-15-66 95124-0448
You can tell at a glance which of the following words can’t possibly be valid English words by the pattern of consonants and vowels:
grunion vortenal pskov trebular elibm talus
Regular expressions are JavaScript’s method of letting your program look for patterns.
The simplest pattern to look for is a single letter. If you want to
see if a variable str contains the letter e, for
example, you can use this code (presuming that user input is in a form
field named textInput and that the results are
reported with alert()):
function test1( )
{
    var str = document.ex1.textInput.value;
    // search() returns the position of the first match, or -1 if there is none
    var pos = str.search( /e/ );
    if (pos >= 0)
    {
        alert( str + " contains the letter e at position "
            + pos );
    }
    else
    {
        alert(str + " does not contain the letter e.");
    }
}
<form name="ex1" action="#"> <p> String: <input type="text" name="textInput" /> <input type="button" onclick="test1();" value="Test" /> </p> </form>
The pattern is enclosed in slashes. Note that you do not put quote marks around your pattern! Notice that a pattern search will find only the first occurrence of a pattern; try typing the word center and see where it finds the letter e.
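If you want to try the same search without the form, a minimal sketch might look like the following. The helper name describeE is hypothetical, and console.log stands in for the alert box.

```javascript
// Report where the first letter e occurs, or that there is none.
// describeE is an illustrative name; it is not part of the page's own code.
function describeE(str) {
  const pos = str.search(/e/); // index of the first match, or -1
  if (pos >= 0) {
    return str + " contains the letter e at position " + pos;
  }
  return str + " does not contain the letter e.";
}

console.log(describeE("center")); // only the first e is reported
console.log(describeE("sky"));
```

Positions are counted from zero, so the first e in center is reported at position 1.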
Of course, you can put more than one letter in your pattern. You can
look for the word eat anywhere in a word:
str.search(/eat/)
This will successfully match the words eat, heater, and
treat, but won’t match easy, metal, or
hat. You may be saying, "So what? I can do the same thing with
the indexOf() string method." Yes, you can, but now let’s do
something that isn’t so easy to do with indexOf():
Let’s make a pattern that will match the letter e
followed by any character at all, followed by the letter
t. To say "any character at all", you use a dot.
Here’s the pattern:
str.search(/e.t/)
This will match better, either, and best (the dot will match the t, i, and s in those words). It will not match beast (two letters between the e and t), ketch (no letters between the e and t), or crease (no letter t at all!).
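The sample words above can be checked in one pass. This is only a sketch; the variable names dotWords and dotPositions are made up for the illustration.

```javascript
// Try the pattern /e.t/ against the sample words from the text.
// search() returns the index of the first match, or -1 if there is none.
const dotWords = ["better", "either", "best", "beast", "ketch", "crease"];
const dotPositions = dotWords.map(function (word) {
  return word.search(/e.t/);
});

dotWords.forEach(function (word, i) {
  console.log(word + ": " + dotPositions[i]);
});
```

The first three words report a position of zero or more; the last three report -1.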
Now let’s find out how to narrow down the field a bit. We’d like to be able to find a pattern consisting of the letter b, any vowel (a, e, i, o, or u), followed by the letter t. To say "any one of a certain series of characters", you enclose them in square brackets:
str.search(/b[aeiou]t/)
This matches words like bat, bet, rabbit, robotic, and abutment. It won’t match boot, because there are two letters between the b and t, and the class matches only a single character. (We’ll see how to check for multiple vowels later.)
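As a quick sketch of the character class in action, the following filters a word list down to the ones that match. The names vowelSamples and vowelHits are illustrative only.

```javascript
// Keep only the words that contain b, one vowel, then t.
const vowelSamples = ["bat", "bet", "rabbit", "robotic", "abutment", "boot"];
const vowelHits = vowelSamples.filter(function (word) {
  return word.search(/b[aeiou]t/) >= 0;
});

console.log(vowelHits.join(", ")); // boot is the only word left out
```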
There are abbreviations for establishing a series of letters:
[a-f] is the same as [abcdef];
[A-Gm-p] is the same as [ABCDEFGmnop];
[0-9] matches a single digit (same as [0123456789]).
You may also complement (negate) a class; the next two searches will look for the letter e followed by anything except a vowel, followed by the letter t; or any character except a capital letter:
str.search(/e[^aeiou]t/) str.search(/[^A-Z]/)
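Here is a small sketch that exercises ranges and negated classes together. The test strings and the classChecks object are invented for the example.

```javascript
// Each entry is the position of the first match, or -1 for no match.
const classChecks = {
  rangeLower: "xyzc".search(/[a-f]/),         // first letter from a to f
  mixedRange: "Zoo".search(/[A-Gm-p]/),       // Z is outside A-G, o is inside m-p
  firstDigit: "room 42".search(/[0-9]/),      // first digit
  notVowel: "best".search(/e[^aeiou]t/),      // s is not a vowel
  vowelBetween: "beat".search(/e[^aeiou]t/),  // a is a vowel, so no match
  notCapital: "ABcD".search(/[^A-Z]/),        // first non-capital character
  allCapitals: "ABCD".search(/[^A-Z]/)        // nothing but capitals
};

console.log(classChecks);
```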
There are some classes that are so useful that JavaScript provides quick and easy abbreviations:
| Abbreviation | Means | Same as |
|---|---|---|
| \d | a digit | [0-9] |
| \w | a "word" character; uppercase letter, lowercase letter, digit, or underscore. This is actually more like a variable name character, but let’s not quibble. | [A-Za-z0-9_] |
| \s | a "whitespace" character (blank, newline, tab, and others) | [ \r\t\n\f] |
| And their complements... | | |
| \D | a non-digit | [^0-9] |
| \W | a non-word character | [^A-Za-z0-9_] |
| \S | a non-whitespace character | [^ \r\t\n\f] |
Thus, this pattern matches a Social Security number; again, we’ll see a shorter way later on.
str.search(/\d\d\d-\d\d-\d\d\d\d/)
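To see the pattern at work, this sketch runs it against the four numbers from the start of the tutorial. The variable names are hypothetical.

```javascript
// Only the first candidate has the 3-2-4 digit layout of a Social Security number.
const ssnPattern = /\d\d\d-\d\d-\d\d\d\d/;
const ssnCandidates = ["321-40-0909", "302-555-8754", "3-15-66", "95124-0448"];
const ssnPositions = ssnCandidates.map(function (text) {
  return text.search(ssnPattern);
});

console.log(ssnPositions);

// The pattern is not tied to the start of the string, so it is also
// found in the middle of longer text.
const embeddedPosition = "SSN: 321-40-0909".search(ssnPattern);
console.log(embeddedPosition);
```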
All the patterns we’ve seen so far will find a match anywhere within a string, which is usually - but not always - what we want. For example, we might insist on a capital letter, but only as the very first character in the string. Or, we might say that an employee ID number has to end with a digit. Or, we might want to find the word go only if it is at the beginning of a word, so that we will find it in You met another, and pfft you was gone., but we won’t mistakenly find it in I forgot my umbrella. This is the purpose of an anchor: to make sure that we are at a certain boundary before we continue the match. Unlike character classes, which match individual characters in a string, these anchors do not match any character; they simply establish that we are on the correct boundaries.
The up-arrow ^ matches the beginning of a line, and
the dollar sign $ matches the end of a line. Thus,
^[A-Z] matches a capital letter at the beginning of the
line. Note that if we put the ^ inside the
square brackets, that would mean something entirely different!
A pattern \d$ matches a digit at the end of a line.
These are the boundaries you will use most often. Sometimes you can have
a string with multiple lines in it (since it will contain \n
newlines). JavaScript does not support the \A and \Z anchors that
some other regular expression languages provide. Instead, ^ and $
normally match only at the beginning and end of the entire string; if you
want them to match at the start and end of every line, follow the pattern
with the m (multiline) modifier.
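The following sketch shows the two anchors, including how they behave on a string that contains a newline. In JavaScript, ^ and $ refer to the whole string unless the pattern carries the m (multiline) modifier; the sample strings and the anchorChecks name are invented for the example.

```javascript
const twoLines = "first line\nSecond line";

const anchorChecks = {
  capitalAtStart: "Hello".search(/^[A-Z]/),        // capital is the first character
  capitalLater: "hello World".search(/^[A-Z]/),    // capital is not at the start
  digitAtEnd: "EMP42".search(/\d$/),               // last character is a digit
  digitNotAtEnd: "42EMP".search(/\d$/),            // last character is a letter
  wholeString: twoLines.search(/^[A-Z]/),          // ^ means start of the string
  everyLine: twoLines.search(/^[A-Z]/m)            // with m, ^ also matches after \n
};

console.log(anchorChecks);
```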
The other two anchors are \b and \B, which stand for
a "word boundary" and "non-word boundary". For example, if we want
to find the word met at the beginning of a word, we write
the pattern /\bmet/, which will match
The metal plate and The metropolitan lifestyle, but not Wear your bike helmet.
The pattern /ing\b/ will match Hiking is fun and
Reading, writing, and arithmetic, but not Gold ingots
are heavy. Finally, the pattern /\bhat\b/ matches only
The hat is red but not That is the question or
she hates anchovies or
the shattered glass.
While \b is used to find the breakpoint between words and
non-words, \B matches only where there is no such breakpoint: between
two word characters or between two non-word characters.
/\Bmet/ and /ing\B/ match the opposite
examples of the preceding paragraph;
/\Bhat\B/ matches only the shattered glass.
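The example sentences can be checked directly. This is a sketch with a made-up object name, boundaryChecks; each value is the position of the first match, or -1.

```javascript
const boundaryChecks = {
  metAtWordStart: "The metal plate".search(/\bmet/),
  metInsideWord: "Wear your bike helmet".search(/\bmet/),
  ingAtWordEnd: "Hiking is fun".search(/ing\b/),
  ingAtWordStart: "Gold ingots are heavy".search(/ing\b/),
  hatAlone: "The hat is red".search(/\bhat\b/),
  hatInThat: "That is the question".search(/\bhat\b/),
  // \B requires that there be no word boundary at that spot.
  metNotAtStart: "Wear your bike helmet".search(/\Bmet/),
  ingNotAtEnd: "Gold ingots are heavy".search(/ing\B/),
  hatBuriedInWord: "the shattered glass".search(/\Bhat\B/)
};

console.log(boundaryChecks);
```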
All of these classes match only one character; what if we want to match three digits in a row, or an arbitrary number of vowels? You can follow any class or character by a repetition count:
| Pattern | Matches |
|---|---|
| /b[aeiou]{2}t/ | b followed by two vowels, followed by t |
| /A\d{3,}/ | The letter A followed by 3 or more digits |
| /[A-Z]{0,5}/ | Zero to five capital letters (JavaScript needs the explicit 0; it treats {,5} as literal text) |
| /\w{3,7}/ | Three to seven word characters |
This lets us rewrite our Social Security number pattern match
as /\d{3}-\d{2}-\d{4}/.
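A short sketch of repetition counts follows. The sample strings and the name countChecks are illustrative. The last two entries show a JavaScript detail worth knowing: a count with no lower bound, such as {,5}, is not treated as a repetition at all (in patterns without the u modifier it is taken as literal characters), so write {0,5} instead.

```javascript
const countChecks = {
  twoVowels: "boot".search(/b[aeiou]{2}t/),           // exactly two vowels
  oneVowel: "bat".search(/b[aeiou]{2}t/),             // only one vowel
  threeOrMore: "A1234".search(/A\d{3,}/),             // four digits is enough
  tooFew: "A12".search(/A\d{3,}/),                    // two digits is not
  ssn: "321-40-0909".search(/\d{3}-\d{2}-\d{4}/),     // shorter SSN pattern
  upToFive: "ABC".search(/^[A-Z]{0,5}$/),             // zero to five capitals
  missingLowerBound: "ABC".search(/^[A-Z]{,5}$/)      // looks for a literal "{,5}"
};

console.log(countChecks);
```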
There are three repetitions that are so common that JavaScript has
special symbols for them: * means "zero or more,"
+ means "one or more," and
? means "zero or one". Thus, if you want to look
for lines consisting of last names followed by a first initial,
you could use this pattern:
/^\w+,?\s*[A-Z]$/
This matches, starting at the beginning of the line, a word of one or more characters followed by an optional comma, zero or more spaces, and a single capital letter, which must be at the end of the line.
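Here is a sketch that tries the pattern on a few lines. The sample names and the variables initialPattern and initialPositions are made up for the illustration.

```javascript
// Last name, optional comma, optional whitespace, one capital letter.
const initialPattern = /^\w+,?\s*[A-Z]$/;
const initialSamples = ["Martinez, A", "Smith J", "Smith", "Smith, john"];
const initialPositions = initialSamples.map(function (line) {
  return line.search(initialPattern);
});

initialSamples.forEach(function (line, i) {
  console.log(line + ": " + initialPositions[i]);
});
```

The first two lines match at position 0. The last two fail because the line does not end with a single capital letter.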
So far so good, but what if we want to scan for a last name, followed by an optional comma-whitespace-initial; thus matching only a last name like "Smith" or a full "Smith, J"? We need to put the comma, whitespace, and initial into a unit with parentheses:
/^\w+(,\s*[A-Z])?$/
Note: If you want to match a parenthesis, you have to precede it with a backslash to make it non-special.
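The effect of grouping can be sketched as follows; the sample strings and variable names are hypothetical. The final line shows a backslash-escaped parenthesis matching a real one.

```javascript
// The whole comma-whitespace-initial unit is optional, not its parts.
const optionalInitial = /^\w+(,\s*[A-Z])?$/;
const groupSamples = ["Smith", "Smith, J", "Smith J", "Smith,"];
const groupPositions = groupSamples.map(function (line) {
  return line.search(optionalInitial);
});

console.log(groupPositions);

// \( and \) match literal parentheses instead of starting a group.
const parenPosition = "call (555)".search(/\(\d+\)/);
console.log(parenPosition);
```

Smith J fails here because the comma is now required whenever an initial is present, and Smith, fails because a comma must be followed by an initial.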
If you want a pattern match to be case-insensitive, follow the closing
slash of the pattern by a lowercase letter i. The
following example shows a pattern that will match any Canadian postal
code in upper or lower case:
/^[A-Z]\d[A-Z]\s+\d[A-Z]\d$/i
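As a sketch, compare the pattern with and without the i modifier. The postal code K1A 0B1 is used only as sample input, and the object name postalChecks is invented.

```javascript
const postalChecks = {
  upper: "K1A 0B1".search(/^[A-Z]\d[A-Z]\s+\d[A-Z]\d$/i),
  lower: "k1a 0b1".search(/^[A-Z]\d[A-Z]\s+\d[A-Z]\d$/i),
  // Without the i modifier, lowercase letters no longer match [A-Z].
  lowerNoFlag: "k1a 0b1".search(/^[A-Z]\d[A-Z]\s+\d[A-Z]\d$/),
  // \s+ demands at least one whitespace character in the middle.
  noSpace: "K1A0B1".search(/^[A-Z]\d[A-Z]\s+\d[A-Z]\d$/i)
};

console.log(postalChecks);
```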
At this point, you know everything you need to test whether a string matches a particular pattern.
All we have done so far is testing to see whether a pattern matches
or not. Now that you can match a person’s last name and initial, you
might want to be able to grab them out of the string so that you can
change Martinez, A to A. Martinez. To accomplish this,
you will need something other than search().
Rather than doing a simple positional search, you use the
match() method. You’ll
also have to use the grouping parentheses, which have a
side effect: whenever you use parentheses to
group something, the match operation stores the part of the
string that matched the group so that you can use it later on.
Here’s how it works:
function getNames( )
{
    var str = document.nameForm.textInput.value;
    var pattern = /^(\w+)(,\s*[A-Z])?$/;
    // match() returns null on failure, or an array holding the match and its groups
    var foundArray = str.match( pattern );
    if (foundArray != null)
    {
        alert("Found last name: " + foundArray[1] + "\n"
            + "Found initial: " + foundArray[2] );
    }
}
<form name="nameForm" action="#"> Input: <input type="text" name="textInput" /> <input type="button" value="Test" onclick="getNames( );" /> </form>
The result of match() is null if the
pattern doesn’t match, or an array if it does.
The first element
of the array, in this case foundArray[0], contains
everything that the pattern matched.
foundArray[1] contains the part of the string that the
first set of grouping parentheses matched, foundArray[2] contains
the part of the string matched by the second set of grouping parentheses,
and so forth.
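The contents of the result array can be inspected without a form. This is a minimal sketch; the variable names are hypothetical.

```javascript
const namePattern = /^(\w+)(,\s*[A-Z])?$/;

const fullName = "Smith, J".match(namePattern);
console.log(fullName[0]); // everything the pattern matched
console.log(fullName[1]); // first group: the last name
console.log(fullName[2]); // second group: comma, space, and initial

// When an optional group takes no part in the match, its slot is undefined.
const lastNameOnly = "Smith".match(namePattern);
console.log(lastNameOnly[2]);

// No match at all gives null rather than an array.
const noMatch = "Smith J".match(namePattern);
console.log(noMatch);
```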
If you enter Smith, J you’ll see that the second set of grouping parentheses doesn’t give you what you want. The group stores the entire matched substring, which includes the comma. We’d like to store only the initial. We can do this two ways. First, we can include yet another set of parentheses:
/^(\w+)(,\s*([A-Z]))?$/
If we do it this way, then the capital letter is stored in
foundArray[3]
and the entire comma-and-initial is stored in
foundArray[2].
The other way to do this is to say that the outer parentheses should group but not store the matched portion in the result array. You do that with a question mark and colon.
/^(\w+)(?:,\s*([A-Z]))?$/
In this case, the initial is in
foundArray[2], since the second opening parenthesis
doesn’t get stored. As you can see, patterns can
very quickly become difficult to read.
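Putting the non-storing group to work, a sketch of the Martinez, A to A. Martinez rewrite mentioned earlier might look like this. The function name reorderName and its handling of unmatched input are assumptions made for the example.

```javascript
// Turn "Last, I" into "I. Last"; reorderName is an illustrative helper.
function reorderName(str) {
  const found = str.match(/^(\w+)(?:,\s*([A-Z]))?$/);
  if (found === null) {
    return str;            // not in the expected form: leave it alone
  }
  if (found[2] === undefined) {
    return found[1];       // last name only, no initial to move
  }
  return found[2] + ". " + found[1];
}

console.log(reorderName("Martinez, A"));
console.log(reorderName("Smith"));
```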
Here is another example. Say you want to match a phone number and find the area code, prefix, and number. Note that when you want to match to a real parenthesis, you have to precede it with a backslash to make it “not part of a group.” You can do it this way:
var areaCode = "";
var prefix = "";
var number = "";
function getPhone( )
{
    var str = document.phoneForm.textInput.value;
    // \( and \) match real parentheses; the unescaped ones store the groups
    var pattern = /\((\d{3})\)\s*(\d{3})-(\d{4})/;
    var foundArray = str.match( pattern );
    if (foundArray != null)
    {
        areaCode = foundArray[1];
        prefix = foundArray[2];
        number = foundArray[3];
        alert("Area code: " + areaCode
            + " Prefix: " + prefix
            + " Number: " + number );
    }
    else
    {
        alert(str + " is not a valid phone number.");
    }
}
<form name="phoneForm" action="#"> Input: <input type="text" name="textInput" /> <input type="button" value="Test" onclick="getPhone( );" /> </form>
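The same pattern can be wrapped in a function that returns the pieces instead of storing them in global variables. This is a sketch under that assumption; parsePhone and the shape of its return value are hypothetical.

```javascript
// Return the three parts of a phone number such as "(302) 555-8754",
// or null if the text does not contain one in that form.
function parsePhone(str) {
  const found = str.match(/\((\d{3})\)\s*(\d{3})-(\d{4})/);
  if (found === null) {
    return null;
  }
  return { areaCode: found[1], prefix: found[2], number: found[3] };
}

console.log(parsePhone("(302) 555-8754"));
console.log(parsePhone("302-555-8754")); // no parentheses, so no match
```

Because the pattern has no ^ or $ anchors, a number embedded in longer text is found as well.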
As you already have seen, you can put the letter i after
a pattern to make the match case-insensitive.
As mentioned near the beginning of this tutorial, pattern matches
find only the first occurrence of a pattern. If you want to find
all the matches in a string, use the g modifier,
which stands for global match.
You use this in conjunction
with match(). The following pattern will find all the
sets of capital letter followed by an optional
dash and a single digit:
/([A-Z]-?\d)/g
Matching this pattern against the string:
"Insert tabs B3, D-7, and C6 into slot A9."
produces the results shown by the following code. Note: when you do a global search, the resulting array simply lists every match in order: foundArray[0] is the first match, foundArray[1] is the second, and so on. The text captured by the grouping parentheses is not stored separately.
function showGlobalMatch( )
{
    var pattern = /([A-Z]-?\d)/g;
    var str = "Insert tabs B3, D-7, and C6 into slot A9.";
    var foundArray;
    var feedback = "";
    // with the g modifier, match() returns an array of every match
    foundArray = str.match( pattern );
    if (foundArray != null)
    {
        for (var i=0; i < foundArray.length; i++)
        {
            feedback = feedback + "foundArray["
                + i + "] = " + foundArray[i] + "\n";
        }
    }
    else
    {
        feedback = "Pattern not found.";
    }
    alert(feedback);
}
<a href="#" onclick="showGlobalMatch(); return false;">See results</a>
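To compare the two behaviors side by side, this sketch runs the same pattern with and without the g modifier. The variable names are made up for the illustration.

```javascript
const slotText = "Insert tabs B3, D-7, and C6 into slot A9.";

// With g: one array element per match, in order of appearance.
const allTabs = slotText.match(/([A-Z]-?\d)/g);
console.log(allTabs.join(" | "));

// Without g: element 0 is the first match and element 1 is its first group.
const firstTab = slotText.match(/([A-Z]-?\d)/);
console.log(firstTab[0] + " / " + firstTab[1]);
```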