
Regular Expressions

If there’s one thing that humans do well, it’s pattern matching. You can categorize the numbers in the following list with barely any thought:

321-40-0909
302-555-8754
3-15-66
95124-0448

You can tell at a glance which of the following words can’t possibly be valid English words by the pattern of consonants and vowels:

grunion vortenal pskov trebular elibm talus

Regular expressions are Perl’s method of letting your program look for patterns like these.

The Simplest Patterns

The simplest pattern to look for is a single letter. If you want to see if a variable $x contains the letter e, for example, you can use this code:

chomp($x = <STDIN>);   # remove the trailing newline so the messages print on one line
if ($x =~ m/e/)
{
    print "$x contains the letter e.\n";
}
else
{
    print "$x does not contain the letter e.\n";
}

The =~ means "contains the pattern"; the pattern itself is enclosed in slashes after the m operator, which stands for match. Note that you do not put quote marks around your pattern!

Of course, you can put more than one letter in your pattern. You can look for the word eat anywhere in a word:

$x =~ m/eat/

This will successfully match the words eat, heater, and treat, but won’t match easy, metal, or hat. You may be saying, "So what? I can do the same thing with the index string function." Yes, you can, but now let’s do something that isn’t so easy to do with index:

Matching any single character

Let’s make a pattern that will match the letter e followed by any character at all, followed by the letter t. To say "any character at all", you use a dot. Here’s the pattern:

$x =~ m/e.t/

This will match better, either, and best (the dot will match the t, i, and s in those words). It will not match beast (two letters between the e and t), ketch (no letters between the e and t), or crease (no letter t at all!).
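As a quick check of the claims above, here is a minimal sketch that runs the e.t pattern over the six example words. The variable names (@words, @dot_matches) are made up for this illustration.

```perl
use strict;
use warnings;

# The six example words from the paragraph above.
my @words = qw(better either best beast ketch crease);

# Keep only the words that contain e, then any one character, then t.
my @dot_matches = grep { m/e.t/ } @words;

foreach my $word (@words)
{
    my $verdict = ($word =~ m/e.t/) ? "matches" : "does not match";
    print "$word $verdict e.t\n";
}
```

Only better, either, and best end up in @dot_matches.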

Matching classes of characters

Now let’s find out how to narrow down the field a bit. We’d like to be able to find a pattern consisting of the letter b, any vowel (a, e, i, o, or u), followed by the letter t. To say "any one of a certain series of characters", you enclose them in square brackets:

$x =~ m/b[aeiou]t/

This matches words like bat, bet, rabbit, robotic, and abutment. It won’t match boot, because there are two letters between the b and t, and the class matches only a single character. (We’ll see how to check for multiple vowels later.)

There are abbreviations for establishing a series of letters: [a-f] is the same as [abcdef]; [A-Gm-p] is the same as [ABCDEFGmnop]; [0-9] matches a single digit (same as [0123456789]).

You may also complement (negate) a class; you can look for the letter e followed by anything except a vowel, followed by the letter t; or any character except a capital letter:

$x =~ m/e[^aeiou]t/
$x =~ m/[^A-Z]/
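The sketch below tries a class and two negated classes on short word lists. The lists and variable names are assumptions chosen for illustration; grep simply keeps the items for which the match succeeds.

```perl
use strict;
use warnings;

# b, exactly one vowel, then t.
my @vowel_hits = grep { m/b[aeiou]t/ } qw(bat bet rabbit robotic abutment boot);

# e, exactly one character that is not a vowel, then t.
my @non_vowel_hits = grep { m/e[^aeiou]t/ } qw(best beet heat better);

# Contains at least one character that is not a capital letter.
my @not_all_caps = grep { m/[^A-Z]/ } ("HELLO", "Hello", "ABC123");

print "b[aeiou]t:  @vowel_hits\n";      # bat bet rabbit robotic abutment
print "e[^aeiou]t: @non_vowel_hits\n";  # best better
print "[^A-Z]:     @not_all_caps\n";    # Hello ABC123
```

Note that m/[^A-Z]/ succeeds as soon as any one character falls outside A-Z, so it answers "is there something other than a capital letter?" rather than "are there no capital letters?".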

There are some classes that are so useful that Perl provides quick and easy abbreviations:

Abbreviation | Means | Same as
\d | a digit | [0-9]
\w | a "word" character: uppercase letter, lowercase letter, digit, or underscore. This is actually more like a variable name character, but let’s not quibble. | [A-Za-z0-9_]
\s | a "whitespace" character | [ \r\t\n\f]

And their complements...

Abbreviation | Means | Same as
\D | a non-digit | [^0-9]
\W | a non-word character | [^A-Za-z0-9_]
\S | a non-whitespace character | [^ \r\t\n\f]

Thus, this pattern matches a Social Security number; again, we’ll see a shorter way later on.

$x =~ m/\d\d\d-\d\d-\d\d\d\d/
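To see the pattern at work, this small sketch applies it to the four numbers listed at the top of the page. The array names are invented for the example.

```perl
use strict;
use warnings;

# The four numbers from the list at the top of this page.
my @numbers = ("321-40-0909", "302-555-8754", "3-15-66", "95124-0448");

# Three digits, a dash, two digits, a dash, four digits.
my @ssn_like = grep { m/\d\d\d-\d\d-\d\d\d\d/ } @numbers;

print "Looks like a Social Security number: $_\n" foreach @ssn_like;
```

Only 321-40-0909 is reported. Because the pattern is not anchored, it would also succeed on a longer string that merely contains such a number somewhere inside it.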

Anchors

All the patterns we’ve seen so far will find a match anywhere within a string, which is usually, but not always, what we want. For example, we might insist on a capital letter, but only as the very first character in the string. Or, we might say that an employee ID number has to end with a digit. Or, we might want to find the word go only if it is at the beginning of a word, so that we will find it in "You met another, and pfft you was gone." but won’t mistakenly find it in "I forgot my umbrella." This is the purpose of an anchor: to make sure that we are at a certain boundary before we continue the match. Unlike character classes, which match individual characters in a string, anchors do not match any character; they simply establish that we are on the correct boundaries.

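The up-arrow ^ matches the beginning of a line, and the dollar sign $ matches the end of a line. Thus, ^[A-Z] matches a capital letter at the beginning of the line. Note that if we put the ^ inside the square brackets, that would mean something entirely different (a negated class)! A pattern \d$ matches a digit at the end of a line. These are the boundaries you will use most often. When a string holds a single line of input, "line" and "string" amount to the same thing. Sometimes, though, a string has multiple lines in it (since it contains \n newlines). By default, ^ and $ still anchor only to the start and end of the whole string ($ also matches just before a final newline); adding the letter m after the closing slash makes them match at the start and end of every line inside the string. In that case, you may want to use \A and \Z to indicate that the next characters must be at the beginning or end of the entire string.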
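The following sketch shows how ^, $, and \A behave on a string that contains more than one line. It relies on Perl's documented rule that ^ and $ anchor to the whole string unless the m modifier (a letter written after the closing slash) is present; the sample strings and variable names are assumptions for illustration.

```perl
use strict;
use warnings;

my $text = "first line\nSecond line\n";

# Without the m modifier, ^ can match only at the very start of the string.
my $plain = ($text =~ m/^[A-Z]/) ? 1 : 0;        # 0: the string starts with "f"

# With the m modifier, ^ can also match just after an embedded newline.
my $multiline = ($text =~ m/^[A-Z]/m) ? 1 : 0;   # 1: "S" starts the second line

# \A always means the start of the whole string, even with the m modifier.
my $absolute = ($text =~ m/\A[A-Z]/m) ? 1 : 0;   # 0

# $ may match just before a final newline.
my $ends_with_digit = ("ID 4471\n" =~ m/\d$/) ? 1 : 0;   # 1

print "plain=$plain multiline=$multiline absolute=$absolute digit=$ends_with_digit\n";
```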

The other two anchors are \b and \B, which stand for a "word boundary" and "non-word boundary". For example, if we want to find met at the beginning of a word, we write the pattern /\bmet/, which will match The metal plate and The metropolitan lifestyle, but not Wear your bike helmet. The pattern /ing\b/ will match Hiking is fun and Reading, writing, and arithmetic, but not Gold ingots are heavy. Finally, the pattern /\bhat\b/ matches only The hat is red, but not That is the question, she hates anchovies, or the shattered glass.

While \b finds the breakpoint between a word character and a non-word character, \B matches at every other position: between two word characters or between two non-word characters. Thus /\Bmet/ and /ing\B/ match the opposite examples of the preceding paragraph (helmet and ingots), and /\Bhat\B/ matches only the shattered glass.
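Here is a minimal sketch that checks the four hat sentences against both anchors. The array names are hypothetical.

```perl
use strict;
use warnings;

my @sentences = (
    "The hat is red",
    "That is the question",
    "she hates anchovies",
    "the shattered glass",
);

# hat as a complete word: a boundary on both sides.
my @whole_word = grep { m/\bhat\b/ } @sentences;

# hat buried inside a word: no boundary on either side.
my @inside_word = grep { m/\Bhat\B/ } @sentences;

print "Whole word:  $_\n" foreach @whole_word;    # The hat is red
print "Inside word: $_\n" foreach @inside_word;   # the shattered glass
```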

Repetition

All of these classes match only one character; what if we want to match three digits in a row, or an arbitrary number of vowels? You can follow any class or character by a repetition count:

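Pattern | Matches
/b[aeiou]{2}t/ | b followed by two vowels, followed by t
/A\d{3,}/ | The letter A followed by 3 or more digits
/[A-Z]{0,5}/ | Zero to five capital letters
/\w{3,7}/ | Three to seven word characters

Write the lower bound explicitly, as in {0,5}: before Perl 5.34, {,5} is not treated as a repetition count.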

This lets us rewrite our Social Security number pattern match as /\d{3}-\d{2}-\d{4}/.
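The sketch below exercises several repetition counts on made-up word lists (the lists and variable names are assumptions for illustration). It also shows that an upper limit such as {3,7} only restricts the whole string when the pattern is anchored; unanchored, a longer run of word characters still contains a run of seven.

```perl
use strict;
use warnings;

# b, exactly two vowels, then t.
my @double_vowel = grep { m/b[aeiou]{2}t/ } qw(boot beat bat bt);

# A followed by three or more digits.
my @a_numbers = grep { m/A\d{3,}/ } qw(A12 A123 A12345 B123);

# Unanchored: any string containing at least three word characters in a row.
my @loose = grep { m/\w{3,7}/ } qw(ab abc abcdefg abcdefgh);

# Anchored: the whole string must be three to seven word characters.
my @strict = grep { m/^\w{3,7}$/ } qw(ab abc abcdefg abcdefgh);

# The long and short Social Security number patterns agree.
my $ssn  = "321-40-0909";
my $same = (($ssn =~ m/\d\d\d-\d\d-\d\d\d\d/) && ($ssn =~ m/\d{3}-\d{2}-\d{4}/)) ? 1 : 0;

print "two vowels: @double_vowel\n";   # boot beat
print "A numbers:  @a_numbers\n";      # A123 A12345
print "loose:      @loose\n";          # abc abcdefg abcdefgh
print "strict:     @strict\n";         # abc abcdefg
print "same:       $same\n";           # 1
```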

There are three repetitions that are so common that Perl has special symbols for them: * means "zero or more," + means "one or more," and ? means "zero or one". Thus, if you want to look for lines consisting of last names followed by a first initial, you’d use this pattern:

/^\w+,\s*[A-Z]$/

This matches, starting at the beginning of the line, a word of one or more characters followed by a comma, zero or more whitespace characters, and a single capital letter, which must be at the end of the line.

Grouping

So far so good, but what if we want to scan for a last name, followed by an optional comma-whitespace-initial; thus matching only a last name like "Smith" or a full "Smith, J"? We need to put the comma, whitespace, and initial into a unit with parentheses:

/^\w+(,\s*[A-Z])?$/

There’s a side effect of grouping - whenever we use parentheses to group something, the match operation stores the matched area in a buffer which we can access later on. Let’s put a group around the last name as well:

/^(\w+)(,\s*[A-Z])?$/

The last name that is matched goes into buffer number 1, and the comma-and-initial go into buffer number 2. We access them after the match with variables $1 and $2.

print "Enter name: ";
while ($info = <STDIN>)
{
    chomp $info;
    if ($info =~ m/^(\w+)(,\s*[A-Z])?$/)
    {
        print "Last name is $1\n";
        if (defined $2)   # $2 is undefined when the optional group did not match
        {
            print "Initial is $2\n";
        }
    }
    else
    {
        print "Name not in proper format.\n";
    }
    print "Next name: ";
}

Here’s a sample run of this program:

Enter name: Smith
Last name is Smith
Next name: Smith, J
Last name is Smith
Initial is , J
Next name: Smith John
Name not in proper format.

Oops. That second one isn’t what we want. The group stores the entire matched substring, which includes the comma. We’d like to store only the initial. We can do this two ways. First, we can include yet another set of parentheses:

m/^(\w+)(,\s*([A-Z]))?$/

If we do it this way, then the capital letter is stored in $3 and the entire comma-and-initial is stored in $2. The other way to do this is to say that the outer parentheses should group, but not store any result, and we do that with a question mark and colon.

m/^(\w+)(?:,\s*([A-Z]))?$/

In this case, the initial is in $2, since the second open parenthesis doesn’t use up one of the buffers. As you can see, patterns can very quickly become difficult to read.
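One way to package the non-storing group is a small subroutine, sketched here. The name parse_name is hypothetical, not something from Perl itself; it returns the last name and the initial, with the initial undefined when the name has none, and an empty list when the format is wrong.

```perl
use strict;
use warnings;

# parse_name is a made-up helper for this illustration.
sub parse_name
{
    my ($text) = @_;
    if ($text =~ m/^(\w+)(?:,\s*([A-Z]))?$/)
    {
        # $1 is the last name; $2 is the initial, or undef if it was absent.
        return ($1, $2);
    }
    return;   # empty list: not in the proper format
}

foreach my $input ("Smith", "Smith, J", "Smith John")
{
    my @parts = parse_name($input);
    if (!@parts)
    {
        print "$input: not in proper format\n";
    }
    elsif (defined $parts[1])
    {
        print "$input: last name $parts[0], initial $parts[1]\n";
    }
    else
    {
        print "$input: last name $parts[0], no initial\n";
    }
}
```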

There’s another way to store the buffers found by a match. Let’s say we want to match a phone number and find the area code, prefix, and number. Note that when we want to match to a real parenthesis, we have to precede it with a backslash to make it "not part of a group". We can do it this way:

$data =~ m/\((\d{3})\)\s*(\d{3})-(\d{4})/;   # one group each for area code, prefix, number
$area_code = $1;
$prefix = $2;
$number = $3;
print "Area code is $area_code\n";

Or we can assign the results of the match to a list on the left hand side of an equal sign:

($area_code, $prefix, $number) =
   ($data =~ m/\((\d{3})\)\s*(\d{3})-(\d{4})/);
print "Area code is $area_code\n";
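The list-assignment form can be sketched as a subroutine that hands back all three pieces at once. The name parse_phone and the sample strings are assumptions for illustration; the pattern needs one pair of capturing parentheses for each value you want back.

```perl
use strict;
use warnings;

# parse_phone is a made-up helper for this illustration.
sub parse_phone
{
    my ($data) = @_;
    # In list context a successful match returns ($1, $2, $3);
    # a failed match returns an empty list.
    my @parts = ($data =~ m/\((\d{3})\)\s*(\d{3})-(\d{4})/);
    return @parts;
}

my ($area_code, $prefix, $number) = parse_phone("Call (408) 555-1234 today");
print "Area code is $area_code, prefix is $prefix, number is $number\n";

my @nothing = parse_phone("555-1234");
print "No area code found\n" if !@nothing;
```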

Modifiers

You may follow a pattern by a modifier letter; the two that we’ll examine here are i and g. The i modifier gives a case-insensitive match. Thus, this pattern will match the word fish in any combination of upper and lower case, even FiSh:

m/fish/i

The other useful modifier is the g modifier, which finds all the matches in a string. You use this in conjunction with arrays. The following statement will find every capital letter followed by an optional dash and a single digit, and store each match in the array @results:

@results = ($info=~m/([A-Z]-?\d)/g);

Matching this pattern against the string:

"Insert tabs B3, D-7, and C6 into slot A9."

would fill the @results array with the strings "B3", "D-7", "C6" and "A9".
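The two modifiers can be tried separately and together, as in this sketch. The sample strings other than the "Insert tabs" sentence are invented for the example.

```perl
use strict;
use warnings;

# i: ignore case when matching.
my @fishy = grep { m/fish/i } ("FiSh", "fishing", "Fi sh", "SHELLFISH");

# g: collect every match of the group, not just the first one.
my $info    = "Insert tabs B3, D-7, and C6 into slot A9.";
my @results = ($info =~ m/([A-Z]-?\d)/g);

# Both at once: any letter, upper or lower case, then optional dash and digit.
my @either_case = ("tabs b3 and C6" =~ m/([a-z]-?\d)/gi);

print "fish:    @fishy\n";         # FiSh fishing SHELLFISH
print "tabs:    @results\n";       # B3 D-7 C6 A9
print "either:  @either_case\n";   # b3 C6
```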

Greedy vs. Non-Greedy Matching

Let’s say you want to find the first word between double quotes in a string. In this example, you would want the pattern to match the word hog, and you would try this pattern:

$str = qq!The words "hog" and "pig" are synonyms.!;
($word) = $str =~ m/"(.*)"/;
print "$word\n";

If you run this program, you will be surprised to see its output:

hog" and "pig

Why did that happen? Because of greedy matching. When you have a * or +, Perl matches as many characters as it can. So, when you said "(.*)", Perl found the first quote mark, and then matched as many of “any character” as it could. That means Perl matched everything to the end of the string, and then it looked for the ending ". There isn’t a " at the end of the string, so the pattern matcher backed off and tried again. It kept backing up until it was finally able to make a match. The process looks sort of like this, where the characters matched by .* are enclosed in square brackets:

The words "[hog" and "pig" are synonyms.] # no quote mark after end of string
The words "[hog" and "pig" are synonyms]. # nope, still no quote mark after this
The words "[hog" and "pig" are synonym]s. # nope, still no quote mark after this
...
The words "[hog" and "pig"] are synonyms. # nope, still no quote mark after this
The words "[hog" and "pig]" are synonyms. # Finally! the .* is followed by a quote mark

In short, the rule for + and * is to be greedy; eat up as many characters as possible, then back off one character at a time until you can make a match (if a match can be made).

This is not the behavior we want in many cases, so there is a way to write the program to specify a non-greedy match, by putting a ? after the * or +. The program now looks like this:

$str = qq!The words "hog" and "pig" are synonyms.!;
($word) = $str =~ m/"(.*?)"/;
print "$word\n";

When you run this program, the pattern matcher starts by matching zero characters and seeing if there is a match for the regular expression. If not, it extends the match one character at a time. Here’s what it looks like, with the characters matched by .*? enclosed in square brackets:

The words "[]hog" and "pig" are synonyms. # no quote mark after the first quote mark
The words "[h]og" and "pig" are synonyms. # no quote after the first letter
The words "[ho]g" and "pig" are synonyms. # no quote after two letters
The words "[hog]" and "pig" are synonyms. # Success!

And that’s what you want. Note: another method that would have worked would have been to write the pattern this way, which means to look for a quote mark, followed by zero or more things-that-aren’t-quote-marks, followed by another quote mark.

($word) = $str =~ m/"([^"]*)"/;

Even though it is greedy matching, it works, because it’s not eating up as many of “anything” as possible, it’s eating up as many non-quote-marks as possible.
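The three patterns can be compared side by side in one short sketch. The variable names are made up for illustration; the last line combines the non-greedy pattern with the g modifier to pick up every quoted word.

```perl
use strict;
use warnings;

my $str = qq!The words "hog" and "pig" are synonyms.!;

my ($greedy)  = $str =~ m/"(.*)"/;       # as much as possible
my ($lazy)    = $str =~ m/"(.*?)"/;      # as little as possible
my ($negated) = $str =~ m/"([^"]*)"/;    # greedy, but cannot cross a quote mark

# With the g modifier, the non-greedy pattern finds each quoted word in turn.
my @all_words = ($str =~ m/"(.*?)"/g);

print "greedy:  $greedy\n";      # hog" and "pig
print "lazy:    $lazy\n";        # hog
print "negated: $negated\n";     # hog
print "all:     @all_words\n";   # hog pig
```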

Substituting

It is possible to use a regular expression to match part of a string and replace it with another string. The generic form of a substitution is as follows.

$string =~ s/pattern/replacement/;

The string is matched against the pattern. If there’s a match, then the part that was matched will be replaced by replacement. Here is a simple example:

my $str = "Teh feline in teh fedora";
$str =~ s/feline/cat/;
$str =~ s/fedora/hat/;
print($str, "\n");

The first substitution will match feline and replace it with cat; the second substitution will replace fedora with hat. The resulting string will be Teh cat in teh hat.

You can use grouping to match particular parts of a string. For example, if you need to replace all occurrences of Teh or teh with the proper spelling, you could add this code:

$str =~ s/([Tt])eh/$1he/g;
print($str, "\n");

The grouping parentheses “remember” whether the word begins with a T or a t, and the $1 in the replacement string puts in the corresponding letter. Notice the g option at the end of the regular expression, which indicates that Perl should look for and substitute all the matches in the string.

Note: The replacement string is treated as though it were enclosed in double quotes, which is why the $1 interpolates properly.
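A substitution also reports how many replacements it made, which is handy for checking that something actually changed. This is a minimal sketch; the variable names $count and $changed are invented for the example.

```perl
use strict;
use warnings;

my $str = "Teh feline in teh fedora";

# With the g modifier, s/// returns the number of substitutions performed.
my $count = ($str =~ s/([Tt])eh/$1he/g);

# When nothing matches, s/// returns a false value and leaves the string alone.
my $changed = ($str =~ s/zebra/horse/) ? 1 : 0;

print "$str\n";                       # The feline in the fedora
print "Fixed $count misspellings\n";  # Fixed 2 misspellings
print "zebra replaced? $changed\n";   # zebra replaced? 0
```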

Given this information, here’s a program that converts names in the form Joe Doakes to Doakes, J., and leaves anything not in the proper format alone.

my $info;  
  
print "Type a name, or just press ENTER to quit: ";
chomp($info = <STDIN>);
while ($info =~ /\S/)
{
  $info =~ s/^([A-Z])[a-z]+\s+([A-Z][a-z]+)$/$2, $1./;
  print $info, "\n";
  print "Next name: ";
  chomp($info = <STDIN>);
}
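For trying the substitution without typing names at the keyboard, the same pattern can be wrapped in a subroutine and run over a fixed list. This is only a sketch; the name last_name_first and the sample names are hypothetical.

```perl
use strict;
use warnings;

# last_name_first is a made-up helper for this illustration.
sub last_name_first
{
    my ($name) = @_;
    # $1 is the first initial, $2 is the last name; a name that does not
    # fit the pattern is returned unchanged.
    $name =~ s/^([A-Z])[a-z]+\s+([A-Z][a-z]+)$/$2, $1./;
    return $name;
}

foreach my $name ("Joe Doakes", "joe doakes", "Mary Ann Smith")
{
    print last_name_first($name), "\n";
}
```

Only Joe Doakes is rewritten (to Doakes, J.); the other two names do not fit the capital-then-lowercase, two-word format and come back as they were.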