Regular Expressions

If there’s one thing that humans do well, it’s pattern matching. You can categorize the numbers in the following list with barely any thought:

321-40-0909
302-555-8754
3-15-66
95124-0448

You can tell at a glance which of the following words can’t possibly be valid English words by the pattern of consonants and vowels:

grunion vortenal pskov trebular elibm talus

Regular expressions are grep’s method of letting you look for patterns in a file.

The Simplest Patterns

The simplest pattern to look for is a word or words. If you want to see if a file data.txt contains the words Joe Smith, for example, you can use this command:

grep 'Joe Smith' data.txt

Notice the quote marks around Joe Smith; they prevent the shell from treating the blank as a separator between command options or parameters.

Matching any single character

Let’s make a pattern that will match the letter e followed by any character at all, followed by the letter t. To say "any character at all", you use a dot. Here’s the pattern:

grep 'e.t' data.txt

This will match better, either, and best (the dot will match the t, i, and s in those words). It will not match beast (two letters between the e and t), ketch (no letters between the e and t), or crease (no letter t at all!).
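To try the dot for yourself, here is a small sketch. The word list is just the examples from the paragraph above, fed to grep one per line on standard input instead of from data.txt; the variable name matches is made up for illustration.

```shell
# Feed the sample words to grep, one per line (hypothetical test data).
matches=$(printf '%s\n' better either best beast ketch crease | grep 'e.t')

# Show the lines grep kept: better, either, and best.
printf '%s\n' "$matches"
```

Remember that grep selects whole lines, so a line is printed as soon as the pattern is found anywhere on it.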

Matching classes of characters

Now let’s find out how to narrow down the field a bit. We’d like to be able to find a pattern consisting of the letter b, any vowel (a, e, i, o, or u), followed by the letter t. To say "any one of a certain series of characters", you enclose them in square brackets:

grep 'b[aeiou]t' data.txt

This matches lines with words like bat, bet, rabbit, robotic, and abutment. It won’t match boot, because there are two letters between the b and t, and the class matches only a single character. (We’ll see how to check for multiple vowels later.)
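Here is a quick way to check those claims, sketched with the same words supplied on standard input (the list and the variable name vowel_hits are only for illustration):

```shell
# One word per line; boot has two vowels between the b and the t.
vowel_hits=$(printf '%s\n' bat bet rabbit robotic abutment boot | grep 'b[aeiou]t')

# Prints every word except boot.
printf '%s\n' "$vowel_hits"
```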

There are abbreviations for establishing a series of letters: [a-f] is the same as [abcdef]; [A-Gm-p] is the same as [ABCDEFGmnop]; [0-9] matches a single digit (same as [0123456789]).

You may also complement (negate) a class; you can look for the letter e followed by anything except a vowel, followed by the letter t; or any character except a capital letter:

grep 'e[^aeiou]t' data.txt
grep '[^A-Z]' data.txt
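The second command is easy to misread, so here is a small sketch of both. The sample lines are invented, and setting LC_ALL=C is an assumption made here so that [A-Z] means exactly the 26 ASCII capital letters whatever your language settings are.

```shell
# e, then one character that is not a vowel, then t.
no_vowel=$(printf '%s\n' better either best beat | LC_ALL=C grep 'e[^aeiou]t')
printf '%s\n' "$no_vowel"    # prints better and best

# '[^A-Z]' keeps any line holding at least one character that is not a
# capital letter. It does NOT mean "lines with no capital letters".
not_caps=$(printf '%s\n' ABC ABc 'A B' | LC_ALL=C grep '[^A-Z]')
printf '%s\n' "$not_caps"    # prints ABc and A B (the blank counts too)
```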

There are some classes that are so useful that the POSIX standard supplies quick and easy abbreviations, among them:

Abbreviation   Means
[:digit:]      A digit
[:alpha:]      An alphabetic character
[:space:]      A "whitespace" character
[:upper:]      An uppercase letter

The square brackets are part of the abbreviation, so when you use them inside a character class specification, you will end up with two sets of brackets. Thus, this pattern matches three alphabetic characters in a row (we’ll see a better way later on):

grep '[[:alpha:]][[:alpha:]][[:alpha:]]' data.txt
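The double brackets are worth seeing in action. In this sketch the sample lines and variable names are invented; the last command shows that the outer brackets can hold more than one abbreviation at once.

```shell
sample='abc
a1b2
ab
route 66'

# Three alphabetic characters in a row: abc and route 66.
letters=$(printf '%s\n' "$sample" | grep '[[:alpha:]][[:alpha:]][[:alpha:]]')
printf '%s\n' "$letters"

# Any digit at all: a1b2 and route 66.
digits=$(printf '%s\n' "$sample" | grep '[[:digit:]]')
printf '%s\n' "$digits"

# One class built from two abbreviations: an uppercase letter or a digit.
upper_or_digit=$(printf '%s\n' abc Abc a1b2 | grep '[[:upper:][:digit:]]')
printf '%s\n' "$upper_or_digit"    # prints Abc and a1b2
```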

Anchors

All the patterns we’ve seen so far will find a match anywhere within a line, which is usually - but not always - what we want. For example, we might insist on a capital letter, but only as the very first character in the string. Or, we might say that an employee ID number has to end with a digit. Or, we might want to find the word go only if it is at the beginning of a word, so that we will find it in "You met another, and pfft you was gone", but we won’t mistakenly find it in "I forgot my umbrella". This is the purpose of an anchor: to make sure that we are at a certain boundary before we continue the match. Unlike character classes, which match individual characters in a string, anchors do not match any character; they simply establish that we are on the correct boundaries.

The up-arrow ^ matches the beginning of a line, and the dollar sign $ matches the end of a line. Thus, ^[A-Z] matches a capital letter at the beginning of the line. Note that if we put the ^ inside the square brackets, that would mean something entirely different!

A pattern [0-9]$ matches a digit at the end of a line. These are the boundaries you will use most often.
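Here is a small sketch of the two line anchors, plus the "entirely different" meaning that ^ takes on inside square brackets. The four sample lines are invented, and LC_ALL=C is an assumption made so that [A-Z] means only the ASCII capital letters.

```shell
lines='Alice 42
bob 7
Carol x
9 lives'

# A capital letter at the very start of the line: Alice 42 and Carol x.
starts_cap=$(printf '%s\n' "$lines" | LC_ALL=C grep '^[A-Z]')
printf '%s\n' "$starts_cap"

# A digit at the very end of the line: Alice 42 and bob 7.
ends_digit=$(printf '%s\n' "$lines" | grep '[0-9]$')
printf '%s\n' "$ends_digit"

# The first ^ is an anchor; the second ^, inside the brackets, negates the
# class. Together: a line whose first character is NOT a capital letter.
starts_other=$(printf '%s\n' "$lines" | LC_ALL=C grep '^[^A-Z]')
printf '%s\n' "$starts_other"    # prints bob 7 and 9 lives
```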

The other anchor is \b, which stands for a "word boundary". For example, if we want to find the letters met at the beginning of a word, we write the pattern '\bmet', which will match "The metal plate" and "The metropolitan lifestyle", but not "Wear your bike helmet". The pattern 'ing\b' will match "Hiking is fun" and "Reading, writing, and arithmetic", but not "Gold ingots are heavy". Finally, the pattern '\bhat\b' matches only "The hat is red", but not "That is the question" or "she hates anchovies" or "the shattered glass".
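You can check the word-boundary examples with a sketch like this one. It assumes GNU grep, since \b is an extension there and not part of the POSIX pattern syntax; the variable names are made up.

```shell
sentences='The metal plate
Wear your bike helmet
Hiking is fun
Gold ingots are heavy
The hat is red
she hates anchovies'

# met at the start of a word.
word_start=$(printf '%s\n' "$sentences" | grep '\bmet')
printf '%s\n' "$word_start"    # prints The metal plate

# ing at the end of a word.
word_end=$(printf '%s\n' "$sentences" | grep 'ing\b')
printf '%s\n' "$word_end"      # prints Hiking is fun

# hat as a complete word.
whole_word=$(printf '%s\n' "$sentences" | grep '\bhat\b')
printf '%s\n' "$whole_word"    # prints The hat is red
```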

Repetition

All of these classes match only one character; what if we want to match three digits in a row, or an arbitrary number of vowels? You can follow any class or character by a repetition count. (From here on, we will leave off the quote marks around the patterns. When you put them into your grep command, you should put quotes around the pattern.)

Pattern               Matches
b[aeiou]\{2\}t        b followed by two vowels, followed by t
[[:alpha:]]\{3\}      Three alphabetic characters
A[0-9]\{3,\}          The letter A followed by 3 or more digits
[A-Z]\{0,5\}          Zero to five capital letters
[[:alpha:]]\{3,7\}    Three to seven alphabetic characters

Notice that you need a \ (backslash) before the beginning and ending braces when using grep. When using egrep or grep -E, you do not need the backslashes.

This lets us write our social security number pattern match as [0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}.
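As a sketch, here is that pattern run against the four numbers from the top of this page, first in grep’s ordinary syntax and then with grep -E, where the braces need no backslashes. The variable names are made up for illustration.

```shell
numbers='321-40-0909
302-555-8754
3-15-66
95124-0448'

# Ordinary grep: braces need backslashes.
basic=$(printf '%s\n' "$numbers" | grep '[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}')

# grep -E: same pattern, no backslashes.
extended=$(printf '%s\n' "$numbers" | grep -E '[0-9]{3}-[0-9]{2}-[0-9]{4}')

printf '%s\n' "$basic"       # prints 321-40-0909
printf '%s\n' "$extended"    # prints 321-40-0909
```

The pattern has no anchors, so it would also accept a line where the number is buried inside something longer; add ^ and $ if the line must contain nothing else.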

There are three repetitions that are so common that grep has special symbols for them: * means "zero or more," \+ means "one or more," and \? means "zero or one". Thus, if you want to look for lines consisting of a last name, a comma, and a first initial, you’d use this pattern:

^[A-Za-z]\+,[[:space:]]*[A-Z]$

This matches, starting at the beginning of the line, a word of one or more alphabetic characters followed by a comma, zero or more whitespace characters, and a single capital letter, which must be at the end of the line.

Note: In egrep, you do not need the backslash before the plus sign or question mark.
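Here is a sketch comparing the two spellings on a few invented names. It assumes GNU grep, where \+ and \? are available in the ordinary syntax, and sets LC_ALL=C so the letter ranges mean only ASCII letters.

```shell
names='Smith, J
Smith,J
Smith J
smith, j
Smith, John'

# Ordinary grep: the plus sign needs a backslash.
basic=$(printf '%s\n' "$names" | LC_ALL=C grep '^[A-Za-z]\+,[[:space:]]*[A-Z]$')

# grep -E (the same thing as egrep): no backslash before the plus sign.
extended=$(printf '%s\n' "$names" | LC_ALL=C grep -E '^[A-Za-z]+,[[:space:]]*[A-Z]$')

# Both print Smith, J and Smith,J. The other lines fail because of a
# missing comma, a lowercase initial, or extra letters before the end.
printf '%s\n' "$basic"
printf '%s\n' "$extended"
```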

Grouping

So far so good, but what if we want to scan for a last name, followed by an optional comma-whitespace-initial, thus matching either a bare last name like "Smith" or a full "Smith, J"? We need to put the comma, whitespace, and initial into a unit with parentheses, preceded by backslashes in grep but not in egrep:

^[A-Za-z]\+\(,[[:space:]]*[A-Z]\)\?$
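A sketch of the grouped pattern on some invented names shows that the group is all-or-nothing: the comma and the initial must both be present or both be absent. As before, GNU grep and LC_ALL=C are assumptions.

```shell
names='Smith
Smith, J
Smith,
Smith J
Jones,K'

# The \( \) group is made optional as a unit by the \? that follows it.
grouped=$(printf '%s\n' "$names" |
    LC_ALL=C grep '^[A-Za-z]\+\(,[[:space:]]*[A-Z]\)\?$')

# Prints Smith, then Smith, J, then Jones,K.
# "Smith," has a comma but no initial; "Smith J" has an initial but no comma.
printf '%s\n' "$grouped"
```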

There’s a side effect of grouping - whenever we use parentheses to group something, the match operation stores the matched area in a buffer which we can access later on in the match. For example, let’s say you want to find all lines with repeated words on them. You type this:

\([A-Za-z]\+\)[[:space:]]\1

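This says to look for one or more letters, which grep remembers as group 1; then a single whitespace character; then \1, which stands for exactly the text that group 1 just matched.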

Thus, this will find lines with repeated words like:

Paris in the the spring.
This is very very important.

You can see exactly what grep matched by using the --only-matching option. Try putting this in a file named data.txt:

No duplicate words here.
This does not have too too much on it.
Paris in the the spring.
A sentence with all different words.
I sang it again and again.

Run these commands to see what the --only-matching option does:

grep '\([A-Za-z]\+\)[[:space:]]\1' data.txt
grep --only-matching '\([A-Za-z]\+\)[[:space:]]\1' data.txt
grep --color '\([A-Za-z]\+\)[[:space:]]\1' data.txt

If you are using Mandriva Linux, it has set an alias to make grep automatically use the --color option, which also lets you see what was matched.
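One thing --only-matching can reveal: the repeated-word pattern has no anchors, so it can also match a repeated fragment that straddles two different words. The sketch below assumes GNU grep and uses one of the sample sentences from earlier; adding \b at each end is one possible way to insist on whole words.

```shell
line='This is very very important.'

# Without anchors, the "is" at the end of This pairs up with the word is.
loose=$(printf '%s\n' "$line" |
    LC_ALL=C grep --only-matching '\([A-Za-z]\+\)[[:space:]]\1')
printf '%s\n' "$loose"     # prints "is is" and then "very very"

# With word boundaries at both ends, only true repeated words remain.
strict=$(printf '%s\n' "$line" |
    LC_ALL=C grep --only-matching '\b\([A-Za-z]\+\)[[:space:]]\1\b')
printf '%s\n' "$strict"    # prints "very very"
```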