UNIX tip of the day: two file processing with AWK

Posted in wordpress

= 01202018 robfelty

I recently came across some AWK code from a work colleague that I did not understand at all
awk -F'\t' -v OFS='\t' 'FNR==NR{a[$1]=$1;next};$1 in a{print $1,$2,$3}' file1 file2
I usually like to understand code instead of blindly copying and pasting, so I did a little research into what this was doing. Searching for “awk FNR NR” got me to this stackoverflow page: linux – What is “NR==FNR” in awk?

And that led me in turn to this excellent article about Idiomatic awk . I’ll summarize some of the points from there

NR = record number, starting with 1. By default the record separator (RS) is a newline, so this amounts to a line number. When processing 2 files, AWK first processes the first file one record at a time, and then the second file. The NR continues to increment for both files.

FNR = file record number. This counter starts back at 1 for each file.

To explain further, I am going to use 2 example files – foo.txt and bar.txt, with contents like so

$ tail foo.txt bar.txt == foo.txt == a b c == bar.txt == a 12 blah c 42 yada

One thing that I learned from the idiomatic AWK is that I have not been writing idiomatic AWK. I have been writing awk scripts like

awk '{ if ($1=="a") { print }}' foo.txt

This works, but is not idiomatic. AWK scripts follow the general pattern of CONDITION { ACTIONS }. Thus I don’t need an if statement to create a conditional. Using idiomatic AWK, this is simply:

awk '$1=="a" { print }' foo.txt

Now I can start to understand the first script better.

FNR==NR is a condition

a[$1]=$1;next is the action

Now what is that doing?

a is a variable – in this case an associative array (hash, dict, i.e. key-value store). AWK doesn’t require you to initialize variables at all. So this is building up an array, with they key and value both set to the value of the first column. Since we have the condition FNR==NR this block will only execute when reading the first file. Finally, we have the next, which says to skip the rest of the condition-action pairs.

Now on to this part:
$1 in a{print $1,$2,$3}
This is another condition-action pair. Note that because of the first condition-action of FNR==NR {next}, this second condition-action pair will only be applied when processing the second file. $1 in a is the condition. It is saying “if the first column of this record in file2 matches a key in the array a, then print the first, second, and third column of file2

Okay – only a few more things left to explain – let’s look at the options given to awk

-F ‘\t’ this defines the input field separator to be a tab. (default is space)

-v OFS=’\t’ The -v option lets you set a variable in the script. In this case, we are setting the special variable OFS (output field separator) to tab (default is space)

Note that instead of using -F and -v, you could also specify these values in a BEGIN block, which is executed before any data is processed, like so
$ awk 'BEGIN {IFS="\t"; OFS="\t"} FNR==NR {a[$1]=$1;next}; $1 in a { print $1,$2,$3}' foo.txt bar.txt a 12 blah c 42 yada‍‍‍‍
But that requires a little more typing

One other thing to keep in mind with AWK – whitespace usually doesn’t count much for anything. When I first saw

FNR==NR{a[$1]=$1;next} I thought that the curly brace after NR was specifying an array index, like you might do in Perl. Nope – that curly brace is just specifying the action block, and you don’t need any whitespace between the condition and the action

Let’s go even one step further. What if we have more than 2 files? Let’s say we have 3 salespeople x, y, and z selling products a, b, and c. Each salesperson gives you a report of their sales. We want to find the total sales for each product. Here are their files:
$ tail [x-z].txt == x.txt == a 10 c 22 == y.txt == b 12 c 42 == z.txt == a 16 b 32

In SQL, we would use a GROUP BY with a SUM. In AWK, we will again just use an associative array, and add the values to it.
$ awk -F '\t' -v OFS='\t' '{a[$1] += $2} END { for (k in a) { print k,a[k] } }' [x-z].txt a 26 b 44 c 64 ‍‍‍‍

We use a similar approach here, except that we don’t need to do any special processing for the first versus subsequent files. We simply build up the array a, and after we have processed all the lines, we print out the totals by looping over a in an END block. Note that in this case, it doesn’t matter if the data is in separate files or in one file. We could have also just concatenated all the data together first, and then piped to AWK like so:
$ cat [x-z].txt | awk -F '\t' -v OFS='\t' '{a[$1] += $2} END { for (k in a) { print k,a[k] } }' a 26 b 44 c 64