This section discusses an advanced
feature of gawk
. If you are a novice awk
user,
you might want to skip it on the first reading.
Normally, when using FS
, gawk
defines the fields as the parts
of the record that occur in between each field separator. In other words,
FS
defines what a field is not, instead of what a field
is. However, there are times when we really want to define the fields
by what they are, and not by what they are not.
The FPAT
variable offers a solution for cases like this. The value
of FPAT
should be a string that provides a regular expression. This
regular expression describes the contents of each field.
We can explore the strengths, and some limitations, of
FPAT
using the case of comma-separated values (CSV)
data. This case is somewhat obsolete as gawk
now has
built-in CSV parsing
(see Working With Comma Separated Value Files).
Nonetheless, it remains useful as an example of what FPAT
-based field
parsing can do. It is also useful for versions of
gawk
prior to version 5.3.
Many spreadsheet programs, for example, can export their data into text files, where each record is terminated with a newline, and fields are separated by commas. If commas only separated the data, there wouldn’t be an issue with using ‘FS = ","’ to split the data into fields. The problem comes when one of the fields contains an embedded comma. In such cases, most programs embed the field in double quotes.27 So, we might have data like this:
Robbins,Arnold,,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA
In the case of CSV data as presented here, each field is either “anything that
is not a comma,” or “a double quote, anything that is not a double quote, and
a closing double quote.” We also need to bear in mind that some fields may be
empty. If written as a regular expression constant (see Regular Expressions),
we would have /([^,]*)|("[^"]+")/
.
Writing this as a string requires us to escape the double quotes, leading to:
FPAT = "([^,]*)|(\"[^\"]+\")"
Putting this to use, here is a simple program to parse the data:
BEGIN { FPAT = "([^,]*)|(\"[^\"]+\")" }
{ print "NF =", NF for (i = 1; i <= NF; i++) { printf("$%d = <%s>\n", i, $i) } }
When run, we get the following:
$ gawk -f simple-csv.awk addresses.csv -| NF = 8 -| $1 = <Robbins> -| $2 = <Arnold> -| $3 = <> -| $4 = <"1234 A Pretty Street, NE"> -| $5 = <MyTown> -| $6 = <MyState> -| $7 = <12345-6789> -| $8 = <USA>
Note the empty data field in the value of $3
and the embedded comma in
the value of $4
, in which the data remains wrapped in its enclosing
double quotes.
The use of enclosing double quotes has a consequence for fields that contain double quotes as part of the data itself. For those fields, a double quote appearing inside a field must be escaped by preceding it with another double quote, as shown in the third and fourth lines of the following example:
1,2,3 p,"q,r",s p,"q""r",s p,"q,""r",s p,"",s p,,s
Manuel Collado suggests that the simplest FPAT
expression that
recognizes this kind of CSV field is /([^,]*)|("([^"]|"")+")/
.
The following program uses this improved FPAT
expression to split the
example CSV fields above, extracts the underlying data from any double-quoted
fields, and finally prints the data as tab-separated values.
The latter is accomplished by setting OFS
to a TAB character.
BEGIN { FPAT = "([^,]*)|(\"([^\"]|\"\")+\")" OFS = "\t" # Print tab-separated values }
{ for (i = 1; i <= NF; i++) { # Extract data from double-quoted fields if (substr($i, 1, 1) == "\"") { gsub(/^"|"$/, "", $i) # Remove enclosing quotes gsub(/""/, "\"", $i) # Convert "" to " } } $1 = $1 # force rebuild of the record print }
When run, it produces:
$ gawk -f quoted-csv.awk sample.csv -| 1 2 3 -| p q,r s -| p q"r s -| p q,"r s -| p s -| p s
Some programs export CSV data that contain embedded newlines between
the double quotes, and here we run into a limitation of FPAT
: it
provides no way to deal with this. Hence, using FPAT
to do your own CSV
parsing is an elegant approach for the majority of cases, but not all.
For a more general solution to working with CSV data, see
Working With Comma Separated Value Files. If your version of gawk
is prior to version 5.3, we recommend that you use Manuel Collado’s
CSVMODE
library for
gawk
.
As with FS
, the IGNORECASE
variable (see Built-in Variables That Control awk
)
affects field splitting with FPAT
.
Assigning a value to FPAT
overrides field splitting
with FS
and with FIELDWIDTHS
.
Finally, the patsplit()
function makes the same functionality
available for splitting regular strings (see String-Manipulation Functions).
The CSV format lacked a formal standard definition for many years. RFC 4180 standardizes the most common practices.