awk
awk lets you manipulate files that consist of
fields separated by delimiters. One example of such a
file is people.txt,
which contains information from
the Dead People Server:
Adams, Ansel;photographer;1902-02-20;1984-04-22
Asimov, Isaac;author;1920-01-02;1992-04-06
Falk, Peter;actor;1927-09-16
La Rue, Lash;actor;1917-06-15;1996-05-21
Sagan, Carl;astronomer/writer;1934-11-09;1996-12-20
Sharif, Omar;bridge expert/actor;1932-04-10
In this case, the fields are separated by semicolons, and the fields represent, from left to right, the person's name, occupation, date of birth, and date of death (if dead).
awk Processing Cycle
The most important thing to know about awk is the steps it takes
when processing a file.
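For each line of input, awk reads the line and places it in the
variable $0. It then splits the line into fields. The awk command can be invoked
with an option that tells what the delimiter is. If you don’t
give a delimiter, then fields are delimited by whitespace.
The first field is placed in variable $1, the
second in $2, and so forth.
Finally, awk executes the commands that you put between { and
} and then moves on to the next line.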
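The cycle can be observed directly. The following is a minimal sketch, assuming a hypothetical two-line file named sample.txt in the same layout as people.txt. NR (the number of the current line) and NF (the number of fields on that line) are built-in awk variables; the shell variable name cycle is only for illustration.

```shell
# sample.txt is a hypothetical two-line excerpt in the people.txt layout
cat > sample.txt <<'EOF'
Falk, Peter;actor;1927-09-16
La Rue, Lash;actor;1917-06-15;1996-05-21
EOF

# No -F option is given, so fields are split on whitespace.
# The commands between { and } run once for every input line.
cycle=$(awk '{ print "line " NR ": " NF " fields, $1=[" $1 "] $0=[" $0 "]" }' sample.txt)
printf '%s\n' "$cycle"
```

Each output line shows that $0 still holds the whole line, while $1 holds only the text up to the first blank.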
If we run awk on people.txt
with just whitespace as delimiters, we will end up with
the following split of fields:
| $1 | $2 | $3 |
|---|---|---|
| Adams, | Ansel;photographer;1902-02-20;1984-04-22 | |
| Asimov, | Isaac;author;1920-01-02;1992-04-06 | |
| Falk, | Peter;actor;1927-09-16 | |
| La | Rue, | Lash;actor;1917-06-15;1996-05-21 |
| Sagan, | Carl;astronomer/writer;1934-11-09;1996-12-20 | |
| Sharif, | Omar;bridge | expert/actor;1932-04-10 |
Notice that the blank in Lash La Rue’s name
and the blank in bridge expert started a new field!
Clearly, that's not what we want; we would like to have the fields separated
by semicolons. If we run awk with the -F option, which specifies the input field
separator:
awk -F';' '{some command}' people.txt
we get this split of fields:
| $1 | $2 | $3 | $4 |
|---|---|---|---|
| Adams, Ansel | photographer | 1902-02-20 | 1984-04-22 |
| Asimov, Isaac | author | 1920-01-02 | 1992-04-06 |
| Falk, Peter | actor | 1927-09-16 | |
| La Rue, Lash | actor | 1917-06-15 | 1996-05-21 |
| Sagan, Carl | astronomer/writer | 1934-11-09 | 1996-12-20 |
| Sharif, Omar | bridge expert/actor | 1932-04-10 | |
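To check how a given separator splits the data, it can help to print the field count next to the fields. This is a small sketch, assuming a hypothetical sample.txt excerpt; NF is awk's built-in count of fields on the current line, and the shell variable name fields is only illustrative.

```shell
# sample.txt is a hypothetical excerpt: one line with a date of death, one without
cat > sample.txt <<'EOF'
Adams, Ansel;photographer;1902-02-20;1984-04-22
Falk, Peter;actor;1927-09-16
EOF

# With -F';' the name stays in one piece. A missing date of death
# just means the line has three fields and $4 is an empty string.
fields=$(awk -F';' '{ print NF " fields: name=[" $1 "] died=[" $4 "]" }' sample.txt)
printf '%s\n' "$fields"
```

The second line reports three fields and an empty pair of brackets, which matches the empty cell in the table.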
Let’s say I wanted a list of all the people and their date of birth.
I would write an awk command like this:
awk -F';' '{print $1, "was born", $3 "."}' people.txt
producing the following output. Notice that putting a comma between the
arguments to print produces a blank between them in the output. Leaving
the comma out between two print arguments
(the $3 and ".")
makes them print right next to each other:
Adams, Ansel was born 1902-02-20.
Asimov, Isaac was born 1920-01-02.
Falk, Peter was born 1927-09-16.
La Rue, Lash was born 1917-06-15.
Sagan, Carl was born 1934-11-09.
Sharif, Omar was born 1932-04-10.
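The blank that a comma produces is the value of awk's built-in output field separator variable, OFS, which is a single space by default. The sketch below compares the three cases on one line of data; the shell variable names are hypothetical and only there to keep the results apart.

```shell
line='Adams, Ansel;photographer;1902-02-20;1984-04-22'

# Comma between print arguments: joined with OFS, a single space by default
with_comma=$(printf '%s\n' "$line" | awk -F';' '{ print $1, $3 }')

# No comma: the two values are concatenated with nothing in between
no_comma=$(printf '%s\n' "$line" | awk -F';' '{ print $1 $3 }')

# Setting OFS changes what the comma inserts
custom=$(printf '%s\n' "$line" | awk -F';' -v OFS=' -> ' '{ print $1, $3 }')

printf '%s\n' "$with_comma" "$no_comma" "$custom"
```

The three lines printed are "Adams, Ansel 1902-02-20", "Adams, Ansel1902-02-20", and "Adams, Ansel -> 1902-02-20".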
If I wanted the name in the proper order and just the year of birth,
I would split the fields on semicolon (to get the main entries),
comma (to get the first and last names separated), and dash (to get the
dates separated). You specify all the delimiters in square brackets,
like a grep character class. If dash is one of your delimiters,
you must place it either first or last in the square brackets.
awk -F'[;,-]' '{some command}' people.txt
gives me this split of fields:
| $1 | $2 | $3 | $4 | $5 | $6 | $7 | $8 | $9 |
|---|---|---|---|---|---|---|---|---|
| Adams | Ansel | photographer | 1902 | 02 | 20 | 1984 | 04 | 22 |
| Asimov | Isaac | author | 1920 | 01 | 02 | 1992 | 04 | 06 |
| Falk | Peter | actor | 1927 | 09 | 16 | | | |
| La Rue | Lash | actor | 1917 | 06 | 15 | 1996 | 05 | 21 |
| Sagan | Carl | astronomer/writer | 1934 | 11 | 09 | 1996 | 12 | 20 |
| Sharif | Omar | bridge expert/actor | 1932 | 04 | 10 | | | |
If you look carefully, you will see a leading blank on the first name. There is
a way to get rid of it, but it’s not easy. Here is an awk
command that prints the data a little more nicely:
awk -F'[;,-]' '{print $2, $1, "(" $3 ") was born in", $4 "."}' people.txt
Producing this output (each line starts with the leading blank from the first name):
 Ansel Adams (photographer) was born in 1902.
 Isaac Asimov (author) was born in 1920.
 Peter Falk (actor) was born in 1927.
 Lash La Rue (actor) was born in 1917.
 Carl Sagan (astronomer/writer) was born in 1934.
 Omar Sharif (bridge expert/actor) was born in 1932.
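One possible way to avoid the leading blank, offered as a sketch and not as the approach this text builds on, is to let the separator itself swallow the blank. When the field separator is longer than one character, awk treats it as a regular expression, so alternatives can be joined with |. The sample line and the shell variable name nice are assumptions made for the illustration.

```shell
line='La Rue, Lash;actor;1917-06-15;1996-05-21'

# The separator is a regular expression: either a comma followed by one
# blank, or a single semicolon or dash. The blank is consumed together
# with the comma, so $2 no longer starts with a space.
nice=$(printf '%s\n' "$line" |
  awk -F', |[;-]' '{ print $2, $1, "(" $3 ") was born in", $4 "." }')
printf '%s\n' "$nice"
```

This prints "Lash La Rue (actor) was born in 1917." with no blank at the start of the line. It relies on every comma in the file being followed by exactly one blank.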
printf
If you want complete control over your output, you need to master the
printf function. The first argument to
printf is a format string that gives text and
“placeholder” information. Think of it like the “Mad Libs”
game. A format string for the preceding output would be:
[a string] [a string] ([a string]) was born in [an integer].
The format string is followed by the items that fill in the blanks. In
awk, the printf for the preceding output would
look like this:
awk -F'[;,-]' '{printf "%s %s (%s) was born in %d.\n", $2, $1, $3, $4}' people.txt
The placeholders always begin with a %. The letter following the
% tells what kind of data you are filling in. The most popular
data types (called conversion characters in the book) are
%s for a string, %d for an integer (decimal number),
and %f for a floating point number.
When using printf, remember that a newline is not automatically
added after your data is printed; if you want a newline, you must put in a
\n.
You can precede the conversion character with a width that tells
the minimum number of characters that the data should occupy. Values are
right-justified unless you specify otherwise. Try these
awk commands to see what they do:
awk -F'[;,-]' '{printf "|%15s|%6d|\n", $1, $4}' people.txt
# - to left justify
awk -F'[;,-]' '{printf "|%-15s|%-6d|\n", $1, $4}' people.txt
# 0 for leading zeros if right justified
awk -F'[;,-]' '{printf "|%-15s|%06d|\n", $1, $4}' people.txt
# floating point (divide birth year by two to see effect)
awk -F'[;,-]' '{printf "|%-15s|%f|\n", $1, $4/2}' people.txt
# floating point (specify number of decimal places)
awk -F'[;,-]' '{printf "|%-15s|%.2f|\n", $1, $4/2}' people.txt
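To see the effect of each width specification side by side, the following sketch runs four of the formats on a single line of data. The sample line and the shell variable name widths are assumptions made for the illustration.

```shell
line='Sagan, Carl;astronomer/writer;1934-11-09;1996-12-20'

widths=$(printf '%s\n' "$line" | awk -F'[;,-]' '{
  printf "|%15s|%6d|\n",   $1, $4     # right justified, the default
  printf "|%-15s|%-6d|\n", $1, $4     # left justified
  printf "|%-15s|%06d|\n", $1, $4     # leading zeros
  printf "|%-15s|%.2f|\n", $1, $4/2   # two decimal places
}')
printf '%s\n' "$widths"
```

In the first line of output, Sagan is preceded by ten blanks and 1934 by two. In the second, the blanks follow the values instead. The third shows 001934, and the fourth shows 967.00.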
Note: A format like
%7.2f does not mean seven digits to the left of the
decimal and two to the right. It means that the field must be at least
seven characters wide, with two to the right of the decimal point (and therefore
four to the left, since the decimal point takes up one character).
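A short sketch can confirm that the width is only a minimum. It uses a BEGIN block so that awk needs no input file; the two numbers and the shell variable name floats are arbitrary choices for the illustration.

```shell
# Width 7 is a minimum: short values are padded, long ones are not cut off
floats=$(awk 'BEGIN {
  printf "|%7.2f|\n", 3.14159      # 4 characters long, padded to 7
  printf "|%7.2f|\n", 123456.789   # 9 characters long, wider than 7
}')
printf '%s\n' "$floats"
```

The first value prints as 3.14 with three blanks in front of it; the second prints in full as 123456.79 and pushes the closing bar to the right.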
awk statements in a file
Of course, if you have lots of awk commands to execute,
you are better off putting them into a file. As you will learn,
awk is a complete programming language, so you can create
a file like this, named people.awk:
#
# The BEGIN block is processed before the
# first line of the file is read.
#
BEGIN { FS = "[;,-]" }
{
    $2 = substr($2, 2)                 # drop the leading blank; string positions begin at 1, not zero
    printf "%s %s (%s) ", $2, $1, $3   # notice no newline here
    if (NF > 6)                        # if this line has more than 6 fields
    {
        age = $7 - $4                  # difference in years only, so it can be one too high
        printf "died in %d at the age of %d.\n", $7, age
    }
    else
    {
        printf "was born in %d and is still alive.\n", $4
    }
}
Then run this command to see it in action:
awk -f people.awk people.txt
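Subtracting the birth year from the year of death gives the right age only if the birthday had already passed in the final year. The following sketch is a hypothetical variant, not part of the original people.awk, that also compares the month and day fields. The file names sample.txt and ages.awk and the wording of the messages are assumptions made for the illustration.

```shell
# sample.txt is a hypothetical three-line excerpt in the people.txt layout
cat > sample.txt <<'EOF'
Adams, Ansel;photographer;1902-02-20;1984-04-22
Falk, Peter;actor;1927-09-16
La Rue, Lash;actor;1917-06-15;1996-05-21
EOF

# ages.awk is a hypothetical variant of people.awk
cat > ages.awk <<'EOF'
BEGIN { FS = "[;,-]" }
{
    $2 = substr($2, 2)              # drop the leading blank
    if (NF > 6) {
        age = $7 - $4               # difference in years
        # one less if the birthday had not yet come around in the final year
        if ($8 < $5 || ($8 == $5 && $9 < $6))
            age--
        printf "%s %s died at the age of %d.\n", $2, $1, age
    } else {
        printf "%s %s has no date of death listed.\n", $2, $1
    }
}
EOF

ages=$(awk -f ages.awk sample.txt)
printf '%s\n' "$ages"
```

For this sample the script reports 82 for Ansel Adams and 78 for Lash La Rue, whose final date falls in May, before his June birthday. Plain year subtraction would give 79 for him.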