A Quick Introduction to awk

awk lets you manipulate files that consist of fields separated by delimiters. One example is the file people.txt, which contains information from the Dead People Server and Wikipedia:

Adams, Ansel;photographer;1902-02-20;1984-04-22
Asimov, Isaac;author;1920-01-02;1992-04-06
Janney, Allison;actress;1959-11-19
La Rue, Lash;actor;1917-06-15;1996-05-21
Sagan, Carl;astronomer/writer;1934-11-09;1996-12-20
Sharif, Omar;bridge expert/actor;1932-04-10

In this case, the fields are separated by semicolons, and the fields represent, from left to right, the person's name, occupation, date of birth, and date of death (if dead).

The awk Processing Cycle

The most important thing to know about awk is the steps it takes when processing a file.

  1. Read a line from the file into a variable named $0.
  2. Split the line into fields. The awk command can be invoked with an option that tells it what the delimiter is. If you don’t give a delimiter, fields are delimited by whitespace.

    The first field is placed in variable $1, the second in $2, and so forth.

  3. Do whatever command or commands are in the braces { and }.
  4. Lather, rinse, repeat: go back to step 1 for the next line until the file runs out.
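
Here is a small sketch of that cycle in action. The two input lines are made up for illustration, and sample.txt is just a hypothetical file name. NF is a built-in awk variable that holds the number of fields on the current line.

```shell
# Two made-up lines stand in for a real data file.
printf 'alpha beta gamma\ndelta epsilon\n' > sample.txt

# No delimiter option is given, so fields are split on whitespace.
# The commands in the braces run once for every line that is read.
awk '{ print "line:", $0; print "  first field:", $1; print "  fields:", NF }' sample.txt
```

The first line should report three fields and the second line two, which shows that $0, $1, and NF are set afresh each time through the cycle.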

Delimiters

If we run awk on people.txt with just whitespace as the delimiter, we will end up with the following split of fields:

$1        $2
Adams,    Ansel;photographer;1902-02-20;1984-04-22
Asimov,   Isaac;author;1920-01-02;1992-04-06
Janney,   Allison;actress;1959-11-19
La        Rue,
Sagan,    Carl;astronomer/writer;1934-11-09;1996-12-20
Sharif,   Omar;bridge

Notice that the blank in Lash La Rue’s name and the blank in "bridge expert" each started a new field! Clearly, that's not what we want; we would like to have the fields separated by semicolons. To get that, we run awk with the -F option, which specifies the input field delimiter:

awk -F';' '{some command}' people.txt

We get this split of fields:

$1                $2                    $3            $4
Adams, Ansel      photographer          1902-02-20    1984-04-22
Asimov, Isaac     author                1920-01-02    1992-04-06
Janney, Allison   actress               1959-11-19
La Rue, Lash      actor                 1917-06-15    1996-05-21
Sagan, Carl       astronomer/writer     1934-11-09    1996-12-20
Sharif, Omar      bridge expert/actor   1932-04-10
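
If you want to see the two splits for yourself, here is a minimal sketch. It writes two of the lines into a hypothetical file named people_sample.txt (so that your real people.txt is left alone) and prints the field count plus the first two fields in square brackets.

```shell
# Two lines whose data contains blanks (file name is hypothetical).
printf '%s\n' 'La Rue, Lash;actor;1917-06-15;1996-05-21' \
              'Sharif, Omar;bridge expert/actor;1932-04-10' > people_sample.txt

# Default whitespace splitting: blanks inside the data start new fields.
awk '{ print NF ": [" $1 "] [" $2 "]" }' people_sample.txt

# Semicolon splitting: blanks inside a field are left alone.
awk -F';' '{ print NF ": [" $1 "] [" $2 "]" }' people_sample.txt
```

With whitespace splitting, the first field of the La Rue line is just [La]; with -F';' it is [La Rue, Lash].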

Simple Printing

Let’s say I wanted a list of all the people and their dates of birth. I would write an awk command like this:

awk -F';' '{print $1, "was born", $3 "."}' people.txt

This produces the output below. Notice that putting commas between the arguments to print produces a blank between the output fields. Leaving out the comma between print arguments (the $3 and ".") makes them print right next to each other.

Adams, Ansel was born 1902-02-20.
Asimov, Isaac was born 1920-01-02.
Janney, Allison was born 1959-11-19.
La Rue, Lash was born 1917-06-15.
Sagan, Carl was born 1934-11-09.
Sharif, Omar was born 1932-04-10.
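
To compare the two behaviors side by side, here is a short sketch that feeds a single line to awk through a pipe instead of reading people.txt. The blank that the comma produces is the value of awk's built-in OFS (output field separator) variable, which is a single blank unless you change it.

```shell
echo 'Sagan, Carl;astronomer/writer;1934-11-09;1996-12-20' |
awk -F';' '{
    print $1, $3      # comma: the items are separated by a blank
    print $1 $3       # no comma: the items are run together
    print $1 ": " $3  # no comma, with a separator string of our own
}'
```

The three output lines should be "Sagan, Carl 1934-11-09", "Sagan, Carl1934-11-09", and "Sagan, Carl: 1934-11-09".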

More about Delimiters

If I wanted the name in the proper order and just the year of birth, I would split the fields on semicolon (to get the main entries), comma (to get the first and last names separated), and dash (to get the dates separated). You specify all the delimiters in square brackets, like a grep character class. If dash is one of your delimiters, you must place it either first or last in the square brackets so that it is not read as a range (as in a-z).

awk -F'[;,-]' '{some command}' people.txt

This gives me the following split of fields:

$1       $2         $3                    $4     $5   $6   $7     $8   $9
Adams     Ansel     photographer          1902   02   20   1984   04   22
Asimov    Isaac     author                1920   01   02   1992   04   06
Janney    Allison   actress               1959   11   19
La Rue    Lash      actor                 1917   06   15   1996   05   21
Sagan     Carl      astronomer/writer     1934   11   09   1996   12   20
Sharif    Omar      bridge expert/actor   1932   04   10

If you look carefully, you will see a leading blank on each first name ($2); it is the blank that follows the comma in the file. There is a way to get rid of it, but it takes a little extra work. Here is an awk command that prints the data a little more nicely:

awk -F'[;,-]' '{print $2, $1, "(" $3 ") was born in", $4 "."}' people.txt

Producing this output:

 Ansel Adams (photographer) was born in 1902.
 Isaac Asimov (author) was born in 1920.
 Allison Janney (actress) was born in 1959.
 Lash La Rue (actor) was born in 1917.
 Carl Sagan (astronomer/writer) was born in 1934.
 Omar Sharif (bridge expert/actor) was born in 1932.
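
One possible way to avoid the leading blank is to treat the comma and the blank after it as a single delimiter. This is only a sketch, and it assumes your awk accepts a regular expression with alternation (the | character) as the -F value; POSIX awk treats a multi-character field separator as an extended regular expression.

```shell
# Delimiter is either "comma followed by a blank" or one of ; and -
echo 'Adams, Ansel;photographer;1902-02-20;1984-04-22' |
awk -F', |[;-]' '{print $2, $1, "(" $3 ") was born in", $4 "."}'
```

This should print "Ansel Adams (photographer) was born in 1902." with no blank at the start of the line. A name written without a blank after the comma would not be split by this pattern, so it suits this file's layout only.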

printf

If you want complete control over your output, you need to master the printf function. The first argument to printf is a format string that gives text and “placeholder” information. Think of it like the “Mad Libs” game. A format string for the preceding output would be:

[a string] [a string] ([a string]) was born in [an integer].

The format string is followed by the items that fill in the blanks. In awk, the printf for the preceding output would look like this:

awk -F'[;,-]' '{printf "%s %s (%s) was born in %d.\n", $2, $1, $3, $4}' people.txt

The placeholders always begin with a %. The letter following the % tells what kind of data you are filling in. The most popular data types (called conversion characters in the book) are %s for a string, %d for an integer (decimal number), and %f for a floating point number.

When using printf, remember that a newline is not automatically added after your data is printed; if you want a newline, you must put in a \n.

You can precede the conversion character with a length that tells the minimum number of characters that the data should occupy. Values are always right-justified unless you specify otherwise. Try these awk commands to see what they do.

# a number gives the minimum width; values are right justified
awk -F'[;,-]' '{printf "|%15s|%6d|\n", $1, $4}' people.txt

# - to left justify
awk -F'[;,-]' '{printf "|%-15s|%-6d|\n", $1, $4}' people.txt

# 0 for leading zeros if right justified
awk -F'[;,-]' '{printf "|%-15s|%06d|\n", $1, $4}' people.txt

# floating point (divide birth year by two to see effect)
awk -F'[;,-]' '{printf "|%-15s|%f|\n", $1, $4/2}' people.txt

# floating point (specify number of decimal places)
awk -F'[;,-]' '{printf "|%-15s|%.2f|\n", $1, $4/2}' people.txt

Note: A format like %7.2f does not mean seven digits to the left of the decimal and two to the right. It means that the field must be at least seven characters wide, with two to the right of the decimal point (and therefore four to the left, since the decimal point takes up one character).
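
Here is a quick sketch of what %7.2f does with a few made-up numbers. It uses a BEGIN block, which runs before any input is read, only so that the example needs no data file.

```shell
awk 'BEGIN {
    printf "|%7.2f|\n", 3.14159      # padded on the left to 7 characters
    printf "|%7.2f|\n", 1234567.891  # too wide for 7: the width is only a minimum
    printf "|%-7.2f|\n", 3.14159     # - pads on the right instead
}'
```

The first line comes out as |   3.14| (three blanks, then four characters), the second as |1234567.89| because nothing is cut off to fit the width, and the third as |3.14   |.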

awk statements in a file

Of course, if you have lots of awk commands to execute, you are better off putting them into a file. As you will learn, awk is a complete programming language, so you can create a file like this one, named people.awk:

#
#	The BEGIN block is processed before the
#	first line of the file is read.
#
BEGIN { FS="[;,-]" }

#
#	The block below is processed once for
#	every line of the file.
#
{
  $2 = substr($2, 2)  # remove leading blank on first name
  printf "%s %s (%s) ", $2, $1, $3 	# notice no newline here
  if (NF > 6) # if this line has more than 6 fields
  {
    age = $7 - $4  # year of death minus year of birth (approximate)
    printf "died in %d at the age of %d.\n", $7, age
  }
  else
  {
    printf "was born in %d and is still alive.\n", $4
  }
}

Then run this command to see it in action:

awk -f people.awk people.txt