7.2 Text Processing Commands

7.2.3 awk, nawk, gawk

awk is a pattern scanning and processing language. Its name comes from the last initials of its three authors: Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan. nawk is new awk, a newer version of the program, and gawk is GNU awk, from the Free Software Foundation. Each version is a little different. Here we'll confine ourselves to simple examples that should behave the same in all versions. On some operating systems awk is really nawk.

awk searches its input for patterns and performs the specified action on each line, or on fields of the line, that contains those patterns. You can specify the pattern-matching statements for awk either on the command line or by putting them in a file and using the -f program_file option.

Syntax

awk program [file]

where program is composed of one or more:

pattern { action }

statements. Each input line is checked against each pattern in turn, and the associated action is taken whenever a pattern matches. This continues through the full sequence of patterns; then the next line of input is checked.
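As a minimal sketch of this checking order, the program below has two pattern { action } statements. The three-line input and the labels "name match" and "large count" are made up for illustration. Every line is tested against both patterns, so a line can trigger both actions, one action, or none.

```shell
# Hypothetical input: a name and a count on each line.
# Each line is tested against both patterns, in the order they are written.
out=$(printf 'apple 3\nbanana 12\ncherry 7\n' |
  awk '/an/   { print "name match:", $1 }
       $2 > 5 { print "large count:", $1 }')
echo "$out"
```

Here "banana 12" matches both patterns, "cherry 7" matches only the second, and "apple 3" matches neither.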

Input is divided into records and fields. The default record separator is <newline>, and the variable NR keeps the record count. The default field separator is whitespace (spaces and tabs), and the variable NF keeps the count of fields in the current record. The input field separator, FS, and input record separator, RS, can be set at any time to any single character. The output field separator, OFS, and output record separator, ORS, can likewise be changed to any single character. $n, where n is an integer, represents the nth field of the input record, while $0 represents the entire input record.
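The following sketch shows these variables together, assuming a made-up, colon-separated input. FS is set to ":" so each line splits into three fields, and OFS is set to "-" so the comma-separated items in print are joined with hyphens.

```shell
# Hypothetical colon-separated input; FS and OFS are set before any input is read.
out=$(printf 'root:x:0\ndaemon:x:1\n' |
  awk 'BEGIN { FS = ":"; OFS = "-" }
       { print NR, NF, $1, $3 }')   # record number, field count, first and third fields
echo "$out"
```

Each output line shows the record number, the number of fields, and then the first and third fields of that record.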

BEGIN and END are special patterns matching the beginning of input, before the first record is read, and the end of input, after the last record is read, respectively.

Output is produced with the print statement and the formatted print statement, printf.
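A small sketch, using a made-up two-line inventory, shows BEGIN, END, print, and printf working together. The variable name total is our own choice; awk variables start out empty, which counts as zero in arithmetic.

```shell
# Hypothetical input: an item name and a quantity on each line.
out=$(printf 'pens 4\nbooks 10\n' |
  awk 'BEGIN { print "item  qty" }                       # runs once, before any input
       { printf "%-5s %3d\n", $1, $2; total += $2 }      # runs for every record
       END { print "total", total }')                    # runs once, after all input
echo "$out"
```

Note that printf does not add a newline by itself, so the format string ends with \n, while print appends one automatically.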

Patterns may be regular expressions, arithmetic relational expressions, string-valued expressions, and boolean combinations of any of these. For the latter the patterns can be combined with the boolean operators below, using parentheses to define the combination:

|| or

&& and

! not
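For example, the sketch below combines relational and string patterns with all three operators. The input (a name, an age, and a city code) is invented for illustration.

```shell
# Hypothetical input: name, age, city.
out=$(printf 'ann 25 nyc\nbob 17 sf\ncat 40 sf\n' |
  awk '($2 >= 18 && $3 == "sf") || $1 == "ann" { print $1 }
       !($2 >= 18)                             { print $1, "is under 18" }')
echo "$out"
```

The first pattern selects ann (by name) and cat (18 or over and in sf); the second, negated pattern selects bob.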

A pair of comma-separated patterns defines a range of lines to which the action applies, e.g.:

/first/,/last/

selects all lines starting with the one containing first and continuing through the one containing last, inclusive.

To select lines 15 through 20 use the pattern range:

NR == 15, NR == 20
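Both kinds of range can be tried on a small made-up input. In this sketch the line numbers are 2 through 3 rather than 15 through 20, only to keep the input short.

```shell
# Hypothetical five-line input used for both range patterns.
input='one
first
two
last
three'
# Range delimited by two regular expressions (inclusive at both ends).
by_pattern=$(printf '%s\n' "$input" | awk '/first/,/last/ { print }')
# Range delimited by record numbers.
by_number=$(printf '%s\n' "$input" | awk 'NR == 2, NR == 3 { print NR ": " $0 }')
echo "$by_pattern"
echo "$by_number"
```

The first command prints the lines first, two, and last; the second prints lines 2 and 3 with their record numbers.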

Regular expressions must be enclosed in slashes (/), and metacharacters can be escaped with a backslash (\). Regular expressions can be built up with the operators:

| or, to separate alternatives

+ one or more

? zero or one
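The sketch below tries each operator, plus a backslash escape, on a handful of made-up words. The labels printed in front of each match are ours, added only to show which pattern fired.

```shell
# Hypothetical word list, one word per line.
out=$(printf 'color\ncolour\ncat\ndog\nbird\nbook\na.b\naxb\n' |
  awk '/colou?r/     { print "optional u:", $0 }      # ? : zero or one "u"
       /^(cat|dog)$/ { print "alternative:", $0 }     # | : either word
       /o+k/         { print "one or more o:", $0 }   # + : one or more "o" before "k"
       /a\.b/        { print "literal dot:", $0 }')   # \. : a real dot, not "any character"
echo "$out"
```

Because the dot is escaped in the last pattern, a.b matches but axb does not.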

A regular expression match can be either of:

~ contains the expression

!~ does not contain the expression

So the program:

$1 ~ /[Ff]rank/

is true if the first field, $1, contains "Frank" or "frank" anywhere within the field. To match a field identical to "Frank" or "frank" use:

$1 ~ /^[Ff]rank$/
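To see the difference between the two forms, and the effect of !~, here is a sketch run against three made-up names. The words "contains", "exact", and "no match" are labels added for illustration.

```shell
# Hypothetical input: a name and a number on each line.
out=$(printf 'Frank 10\nfrankie 20\nBob 30\n' |
  awk '$1 ~ /[Ff]rank/   { print $1, "contains" }
       $1 ~ /^[Ff]rank$/ { print $1, "exact" }
       $1 !~ /[Ff]rank/  { print $1, "no match" }')
echo "$out"
```

Frank satisfies both the unanchored and the anchored pattern, frankie satisfies only the unanchored one, and Bob is selected only by !~.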

Relational expressions are allowed using the relational operators:

< less than

<= less than or equal to

== equal to

>= greater than or equal to

!= not equal to

> greater than

Offhand you don't know whether variables are strings or numbers. If neither operand is known to be numeric, then a string comparison is performed; otherwise, a numeric comparison is done. In the absence of any information to the contrary, a string comparison is done, so that:

$1 > $2

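may compare the string values. Newer versions such as gawk compare two fields numerically when both look like numbers, so the result can differ between versions. To ensure a numerical comparison in every version, do something similar to: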

( $1 + 0 ) > $2
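The difference matters for values such as 10 and 9. The sketch below forces each kind of comparison explicitly, so it should not depend on which awk is installed: adding 0 makes an operand numeric, and, as an assumption beyond what is described above, concatenating an empty string ("") makes it a string.

```shell
# Hypothetical one-line input with two numeric-looking fields.
out=$(echo '10 9' |
  awk '{
    if (($1 + 0) > ($2 + 0)) print "numeric: 10 is greater than 9"
    if (($1 "") < ($2 ""))   print "string: 10 sorts before 9"
  }')
echo "$out"
```

As strings, "10" sorts before "9" because the comparison is made character by character and "1" precedes "9".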

The mathematical functions exp, log, and sqrt are built in.

Some other built-in functions include:

index(s,t) returns the position in string s where string t first occurs, or 0 if it does not occur

length(s) returns the length of string s

substr(s,m,n) returns the n-character substring of s, beginning at position m
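A quick sketch exercises these functions on the made-up input line "hello world". Positions count from 1, so "world" starts at position 7.

```shell
# Hypothetical single-line input.
out=$(echo 'hello world' |
  awk '{
    print index($0, "world")       # position of "world" within the line
    print index($0, "xyz")         # 0, because "xyz" does not occur
    print length($1)               # length of the first field, "hello"
    print substr($0, 7, 3)         # 3 characters starting at position 7
    print sqrt(16), exp(0), log(1) # built-in math functions
  }')
echo "$out"
```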

Arrays are declared automatically when they are used, e.g.:

arr[i] = $1

assigns the first field of the current input record to the ith element of the array.
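As a small sketch, the program below stores the first field of each record in arr, using the record number as the index i, and prints the elements in reverse order once all input has been read. The three colour names are made-up input.

```shell
# Hypothetical input: one word per line.
out=$(printf 'red\ngreen\nblue\n' |
  awk '{ i = NR; arr[i] = $1 }              # no declaration needed before first use
       END { print arr[3], arr[2], arr[1] }')
echo "$out"
```

The array survives from record to record, which is why the END action can still read every element.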

Flow control statements using if-else, while, and for are allowed with C type syntax:

for (i=1; i <= NF; i++) {actions}

while (i<=NF) {actions}

if (i<NF) {actions}
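The three statements can be combined in one action. This sketch, with a made-up line of three numbers, uses for to find the largest field, while to add the fields up, and if-else to choose what to print. The variable names max and sum and the threshold of 10 are our own choices.

```shell
# Hypothetical input: three numbers on one line.
out=$(echo '3 8 5' |
  awk '{
    max = $1 + 0
    for (i = 2; i <= NF; i++) {            # for: scan fields 2..NF for a larger value
      if (($i + 0) > max) { max = $i + 0 }
    }
    i = 1; sum = 0
    while (i <= NF) { sum += $i; i++ }     # while: add up every field
    if (sum > 10) { print "max", max, "sum", sum } else { print "small" }
  }')
echo "$out"
```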

Common Options

-f program_file read the commands from program_file

-Fc use character c as the field separator character
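Both options can be seen in a short sketch. It assumes the mktemp utility is available to create a scratch file for the program; any file name would do. The same one-line program is run first from the command line and then from the file.

```shell
# Write a one-line awk program to a temporary file (assumes mktemp exists).
prog=$(mktemp)
printf '%s\n' '{ print $2 }' > "$prog"

# -F: makes the colon the field separator; the program is given on the command line.
out_cmdline=$(echo 'a:b:c' | awk -F: '{ print $2 }')
# Same separator, but the program is read from the file with -f.
out_file=$(echo 'a:b:c' | awk -F: -f "$prog")

rm -f "$prog"
echo "$out_cmdline $out_file"
```

Both commands print b, the second colon-separated field.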

Examples

% cat filex | tr a-z A-Z | awk -F: '{printf ("7R %-6s %-9s %-24s \n",$1,$2,$3)}'>upload.file

This pipeline reads filex, which is formatted as follows:

nfb791:99999999:smith

7ax791:999999999:jones

8ab792:99999999:chen

8aa791:999999999:mcnulty

It then changes all lowercase characters to uppercase with the tr utility, and awk formats each record into the following, which is written to the file upload.file:

7R NFB791 99999999  SMITH

7R 7AX791 999999999 JONES

7R 8AB792 99999999  CHEN

7R 8AA791 999999999 MCNULTY


Introduction to Unix - 14 AUG 1996