From sparkyfs!ames!sun-barr!cs.utexas.edu!tut.cis.ohio-state.edu!gem.mps.ohio-state.edu!brutus.cs.uiuc.edu!psuvax1!wuarchive!udel!burdvax!lang Thu Nov 16 16:25:12 PST 1989

It's time once again to post to this group a document that I have
which explains some important things about (vanilla) AWK that are
not elsewhere documented....

****************************************************************

\" to print this document, do ditroff -ms -Pip2 awk.supp
.RP
.TL
.B
A Supplemental Document For AWK
.sp
.R
- or -
.sp
.I
Things Al, Pete, And Brian Didn't Mention Much
.R
.AU
John W. Pierce
.AI
Department of Chemistry
University of California, San Diego
La Jolla, California 92093
jwp%chem@sdcsvax.ucsd.edu
.AB
As
.B awk
and its documentation are distributed with
.I
4.2 BSD UNIX*
.R
there are a number of bugs, undocumented features,
and features that are touched on so briefly in the documentation
that the casual user may not realize their full significance.
While this document applies primarily to the \fI4.2 BSD\fR version
of \fIUNIX\fR, it is known that the \fI4.3 BSD\fR version
does not have all of the bugs fixed, and that it does not
have updated documentation.
The situation with respect to the versions of \fBawk\fR
distributed with other versions of \fIUNIX\fR and similar systems
is unknown to the author.
.FS
*UNIX is a trademark of AT&T
.FE
.AE
.LP
In this document references to "the user manual" mean
.I
Awk - A Pattern Scanning and Processing Language (Second Edition)
.R
by Aho, Kernighan, and Weinberger.
References to "awk(1)" mean the entry for
.B awk
in the
.I
UNIX Programmer's Manual, 4th Berkeley Distribution.
.R
References to "the documentation" mean both of those.
.LP
In most examples, the outermost set of braces ('{ }') has been omitted.
They would, of course, be necessary in real scripts.
.NH
Known Bugs
.LP
There are three main bugs known to me. They involve:
.IP
Assignment to input fields.
.IP
Piping output to a program from within an \fBawk\fR script.
.IP
Using '*' in \fIprintf\fR field width and precision specifications does not
work, nor do '\\f' and '\\b' print formfeed and backspace respectively.
.NH 2
Assignment to Input Fields
.LP
[This problem is partially fixed in \fI4.3BSD\fR; see the last paragraph
of this section regarding the unfixed portion.]
.LP
The user manual states that input fields may be objects of assignment
statements.
Given the input line
.DS
field_one field_two field_three
.DE
the script
.DS
$2 = "new_field_2"
print $0
.DE
should print
.DS
field_one new_field_2 field_three
.DE
.LP
This does not work; it will print
.DS
field_one field_two field_three
.DE
That is, the script will behave as if the assignment to $2 had not
been made.
However, explicitly referencing an "assigned to" field
.I does
recognize that the assignment has been made.
If the script
.DS
$2 = "new_field_2"
print $1, $2, $3
.DE
is given the same input it will [properly] print
.DS
field_one new_field_2 field_three
.DE
Therefore, you can get around this bug with, e.g.,
.DS
$2 = "new_field_2"
output = $1                     # Concatenate output fields
for(i = 2; i <= NF; ++i)        # into a single output line
        output = output OFS $i  # with OFS between fields
print output
.DE
.LP
In \fI4.3BSD\fR, this bug has been fixed to the extent that the failing
example above works correctly.
However, a script like
.DS
$2 = "new_field_2"
var = $0
print var
.DE
still gives incorrect output.
This problem can be bypassed by using
.DS
\fIvar\fR = sprintf("%s", $0)
.DE
instead of "\fIvar\fR = $0"; \fIvar\fR will have the correct value.
.NH 2
Piping Output to a Program
.LP
[This problem appears to have been fixed in \fI4.3BSD\fR, but that has
not been exhaustively tested.]
.LP
The user manual states that
.I print
and
.I printf
statements may write to a program using, e.g.,
.DS
print | "\fIcommand\fR"
.DE
This would pipe the output into \fIcommand\fR, and it does work.
However, you should be aware that this causes
.B awk
to spawn a child process (\fIcommand\fR), and that it
.I
does not
.R
wait for the child to exit before it exits itself.
In the case of a "slow" command like
.B sort,
.B awk
may exit before
.I command
has finished.
.LP
This can cause problems in, for example, a shell script that depends on
everything done by
.B awk
being finished before the next shell command is executed.
Consider the shell script
.DS
awk -f awk_script input_file
mv sorted_output somewhere_else
.DE
and the
.B awk
script
.DS
print output_line | "sort -o sorted_output"
.DE
If
.I input_file
is large
.B awk
will exit long before
.B sort
is finished.
That means that the
.B mv
command will be executed before
.B sort
is finished, and the result is unlikely to be what you wanted.
Other than fixing the source, there is no way to avoid this problem
except to handle such pipes outside of the awk script, e.g.
.DS
awk -f awk_file input_file | sort -o sorted_output
mv sorted_output somewhere_else
.DE
which is not wholly satisfactory.
.LP
See
.I
Sketchily Documented Features
.R
below for other considerations in redirecting output from within an
.B awk
script.
.NH 2
Printf and '*', '\\f', and '\\b'
.LP
The documentation says that the \fIprintf\fR function provided is identical
to the \fIprintf\fR provided by the \fIC\fR language \fBstdio\fR package.
This is incorrect: '*' cannot be used to specify a field width or
precision, and '\\f' and '\\b' cannot be used to print formfeeds and
backspaces.
.LP
The command
.DS
printf("%*.s", len, string)
.DE
will cause a core dump.
Given \fBawk\fR's age, it is likely that its \fIprintf\fR was written
well before the use of '*' for specifying field width and precision
appeared in the \fBstdio\fR library's \fIprintf\fR.
Another possibility is that it wasn't implemented because it isn't
really needed to achieve the same effect.
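.LP
As a concrete instance of that last point, a width and precision held in
variables can be applied by assembling the format string at run time.
This is a sketch only, run from the shell with a present-day \fBawk\fR
assumed; the values 8, 3, and 42 are made up for illustration.
.DS
```shell
# Build the printf format by concatenation instead of using '*'.
out=$(awk 'BEGIN {
    wid = 8; prec = 3; val = 42
    fmt = "[%" wid "." prec "d]"    # concatenation yields "[%8.3d]"
    printf(fmt, val)
}')
echo "$out"
```
.DE
The brackets are there only to make the padding visible in the output.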
.LP
To accomplish this effect, you can utilize the fact that \fBawk\fR
concatenates variables before it does any other processing on them.
For example, assume a script has two variables \fIwid\fR and \fIprec\fR
which control the width and precision used for printing another
variable \fIval\fR:
.DS
[code to set "wid", "prec", and "val"]
printf("%" wid "." prec "d\en", val)
.DE
If, for example, \fIwid\fR is 8 and \fIprec\fR is 3, then \fBawk\fR will
concatenate everything to the left of the comma in the \fIprintf\fR
statement, and the statement will really be
.DS
printf("%8.3d\en", val)
.DE
These could, of course, have been assigned to some variable \fIfmt\fR
before being used:
.DS
fmt = "%" wid "." prec "d"
printf(fmt "\en", val)
.DE
Note, however, that the newline ("\en") in the second form \fIcannot\fR
be included in the assignment to \fIfmt\fR.
.LP
To allow use of '\\f' and '\\b', \fBawk\fR's \fIlex\fR script must be
changed.
This is trivial to do (it is done at the point where '\\n' and '\\t' are
processed), but requires having source code.
[I have fixed this and have not seen any unwanted effects.]
# .bp
.NH
Undocumented Features
.LP
There are several undocumented features:
.IP
Variable values may be established on the command line.
.IP
A
.B getline
function exists that reads the next input line and starts processing
it immediately.
.IP
Regular expressions accept octal representations of characters.
.IP
A
.B -d
flag argument produces debugging output if
.B awk
was compiled with "DEBUG" defined.
.IP
Scripts may be "compiled" and run later (providing the installer did
what is necessary to make this work).
.NH 2
Defining Variables On The Command Line
.LP
To pass variable values into a script at run time, you may use
.IP
.I variable=value
.LP
(as many as you like) between any "\fB-f \fIscriptname\fR" or
.I program
and the names of any files to be processed.
For example,
.DS
awk -f awkscript today=\e"`date`\e" infile
.DE
would establish for
.I awkscript
a variable named
.B today
that had as its value the output of the
.B date
command.
.LP
There are a number of caveats:
.IP
Such assignments may appear only between
.B -f
.I awkscript
(or \fIprogram\fR or [see below] \fB-R\fIawk.out\fR) and the name of any
input file (or '-').
.IP
Each
.I variable=value
combination must be a single argument (i.e. there must not be spaces
around the '=' sign);
.I value
may be either a numeric value or a string.
If it is a string, it must be enclosed in double quotes at the time
\fBawk\fR reads the argument.
That means that the double quotes enclosing \fIvalue\fR on the command
line must be protected from the shell as in the example above or it
will remove them.
.IP
.I Variable
is not available for use within the script until after the first record
has been read and parsed, but it is available as soon as that has
occurred so that it may be used before any other processing begins.
It does not exist at the time the
.B BEGIN
block is executed, and if there was no input it will not exist in the
.B END
block (if any).
.NH 2
Getline Function
.LP
.B Getline
immediately reads the next input line (which is parsed into \fI$1\fR,
\fI$2\fR, etc) and starts processing it at the location of the call
(as opposed to
.B next
which immediately reads the next input line but starts processing from
the start of the script).
.LP
.B Getline
facilitates performing some types of tasks such as processing files
with multiline records and merging information from several files.
To use the latter as an example, consider a case where two files, whose
lines do not share a common format, must be processed together.
Shell and \fBawk\fR scripts to do this might look something like
.sp
In the shell script
.DS
( echo DATA1; cat datafile1; echo ENDdata1; \e
  echo DATA2; cat datafile2; echo ENDdata2 \e
) | \e
    awk -f awkscript - > awk_output_file
.DE
In the
.B awk
script
.DS
/^DATA1/ {      # Next input line starts datafile1
        while (getline && $1 !~ /^ENDdata1$/)
        {
                [processing for \fIdata1\fR lines]
        }
}
.sp 1
/^DATA2/ {      # Next input line starts datafile2
        while (getline && $1 !~ /^ENDdata2$/)
        {
                [processing for \fIdata2\fR lines]
        }
}
.DE
There are, of course, other ways of accomplishing this particular task
(primarily using \fBsed\fR to preprocess the information), but they are
generally more difficult to write and more subject to logic errors.
Many cases arising in practice are significantly more difficult, if not
impossible, to handle without \fBgetline\fR.
.NH 2
Regular Expressions
.LP
The sequence "\fI\eddd\fR" (where 'd' is a digit) may be used to include
explicit octal values in regular expressions.
This is often useful if "nonprinting" characters have been used as
"markers" in a file.
It has not been tested for ASCII values outside the range 01 through 0127.
.NH 2
Debugging output
.LP
[This is unlikely to be of interest to the casual user.]
.sp
If \fBawk\fR was compiled with "DEBUG" defined, then giving it a
.B -d
flag argument will cause it to produce debugging output when it is run.
This is sometimes useful in finding obscure problems in scripts, though
it is primarily intended for tracking down problems with \fBawk\fR itself.
.NH 2
Script "Compilation"
.LP
[It is likely that this does not work at most sites.
If it does not, the following will probably not be of interest to the
casual user.]
.sp
The command
.DS
awk -S -f script.awk
.DE
produces a file named
.B awk.out.
This is a core image of
.B awk
after parsing the file
.I script.awk.
The command
.DS
awk -Rawk.out datafile
.DE
causes
.B awk.out
to be applied to \fIdatafile\fR (or the standard input if no input file
is given).
This avoids having to reparse large scripts each time they are used.
Unfortunately, the way this is implemented requires some special action
on the part of the person installing \fBawk\fR.
.LP
As \fBawk\fR is delivered with \fI4.2 BSD\fR (and \fI4.3 BSD\fR),
.I awk.out
is created by the \fBawk -S ...\fR process by calling
.B sbrk()
with '0', writing out the returned value, then writing out the core
image from location 0 to the returned address.
The \fBawk -R...\fR process reads the first word of
.I awk.out
to get the length of the image, calls
.B brk()
with that length, and then reads the image into itself starting at
location 0.
For this to work, \fBawk\fR must have been loaded with its text segment
writeable.
Unfortunately, the \fIBSD\fR default for \fBld\fR is to load with the
text read-only and shareable.
Thus, the installer must remember to take special action (e.g.
"cc -N ..." [equivalently "ld -N ..."] for \fI4BSD\fR) if these flags
are to work.
.LP
[Personally, I don't think it is a very good idea to give \fBawk\fR the
opportunity to write on its text segment; I changed it so that only the
data segment is overwritten.]
.LP
Also, due to what appears to be a lapse in logic, the first non-flag
argument following \fB-R\fIawk.out\fR is discarded.
[Disliking that behavior, I changed it so that the \fB-R\fR flag is
treated like the \fB-f\fR flag: no flag arguments may follow it.]
# .bp
.NH
Sketchily Documented Features
.LP
.NH 2
Exit
.LP
The user manual says that using the
.B exit
function causes the script to behave as if end-of-input has been reached.
Not mentioned explicitly is the fact that this will cause the
.B END
block to be executed if it exists.
Also, two things are omitted:
.IP
\fBexit(\fIexpr\fB)\fR causes the script's exit status to be set to the
value of \fIexpr\fR.
.IP
If
.B exit
is called within the
.B END
block, the script exits immediately.
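.LP
The behavior described above can be observed from the shell.
The following is only a sketch: it assumes a present-day POSIX-style
\fBawk\fR rather than the 4.2 BSD one, and the input lines and the
status value 3 are made up for illustration.
.DS
```shell
# exit(3) in a pattern-action rule ends input early; END still runs,
# and 3 becomes the exit status of awk.
out=$( { echo a; echo stop; echo b; } | awk '
    $1 == "stop" { exit(3) }
    { print "line: " $1 }
    END { print "in END" }' )
status=$?

# exit inside END leaves at once; the second print is never reached.
out2=$(awk 'END { print "before"; exit(0); print "after" }' /dev/null)

echo "$out"
echo "status=$status"
echo "$out2"
```
.DE
The line "b" is never processed, since it follows the line that
triggers the exit.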
.NH 2
Mathematical Functions
.LP
The following builtin functions exist and are mentioned in
.I awk(1)
but not in the user manual.
.IP \fBint(\fIx\fB)\fR 10
\fIx\fR truncated to an integer.
.IP \fBsqrt(\fIx\fB)\fR 10
the square root of \fIx\fR for \fIx\fR >= 0, otherwise zero.
.IP \fBexp(\fIx\fB)\fR 10
\fBe\fR-to-the-\fIx\fR for -88 <= \fIx\fR <= 88, zero for \fIx\fR < -88,
and dumps core for \fIx\fR > 88.
.IP \fBlog(\fIx\fB)\fR 10
the natural log of \fIx\fR.
.NH 2
OFMT Variable
.LP
The variable
.B OFMT
may be set to, e.g. "%.2f", and purely numerical output will be bound
by that restriction in
.B print
statements.
The default value is "%.6g".
Again, this is mentioned in
.I awk(1)
but not in the user manual.
.NH 2
Array Elements
.LP
The user manual states that "Array elements ... spring into existence
by being mentioned."
This is literally true;
.I any
reference to an array element causes it to exist.
("I was thought about, therefore I am.")
Take, for example,
.DS
if(array[$1] == "blah")
{
        [process blah lines]
}
.DE
If there is not an existing element of
.B array
whose subscript is the same as the contents of the current line's
first field,
.I
one is created
.R
and its value (null, of course) is then compared with "blah".
This can be a bit disconcerting, particularly when later processing is
using
.DS
for (i in \fBarray\fR)
{
        [do something with result of processing "blah" lines]
}
.DE
to walk the array and expects all the elements to be non-null.
Succinct practical examples are difficult to construct, but when this
happens in a 500 line script it can be difficult to determine what has
gone wrong.
.NH 2
FS and Input Fields
.LP
By default any number of spaces or tabs can separate fields (i.e. there
are no null input fields) and trailing spaces and tabs are ignored.
However, if
.B FS
is explicitly set to any character other than a space (e.g., a tab:
\fBFS = "\et"\fR), then a field is defined by each such character and
trailing field separator characters are not ignored.
For example, if '>' represents a tab then
.DS
one>>three>>five>
.DE
defines six fields, with fields two, four, and six being empty.
.LP
If
.B FS
is explicitly set to a space (\fBFS\fR = "\ "), then the default
behavior obtains (this may be a bug); that is, both spaces and tabs are
taken as field separators, there can be no null input fields, and
trailing spaces and tabs are ignored.
.NH 2
RS and Input Records
.LP
If
.B RS
is explicitly set to the null string (\fBRS\fR = ""), then the input
record separator becomes a blank line, and the newline at the end of
each input line is a field separator.
This facilitates handling multiline records.
.NH 2
"Fall Through"
.LP
This is mentioned in the user manual, but it is important enough that
it is worth pointing out here, also.
.LP
In the script
.DS
/\fIpattern_1\fR/ {
        [do something]
}
.sp
/\fIpattern_2\fR/ {
        [do something]
}
.DE
all input lines will be compared with both
.I pattern_1
and
.I pattern_2
unless the
.B next
function is used before the closing '}' in the
.I pattern_1
portion.
.NH 2
Output Redirection
.LP
Once a file (or pipe) is opened by
.B awk
it is not closed until
.B awk
exits.
This can occasionally cause problems.
For example, it means that a script that sorts its input lines into
output files named by the contents of their first fields (similar to an
example in the user manual)
.DS
{ print $0 > $1 }
.DE
is going to fail if the number of different first fields exceeds
about 10.
This problem
.I cannot
be avoided by using something like
.DS
{
        command = "cat >> " $1
        print $0 | command
}
.DE
as the value of the variable
.B command
is different for each different value of
.I $1
and is therefore treated as a different output "file".
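.LP
Later versions of \fBawk\fR, and the POSIX specification of it, provide a
.B close
function that the 4.2 BSD version described here lacks.
Where it is available, closing each output file right after writing to
it keeps the number of open files at one.
A minimal sketch under that assumption; the function name
split_by_first_field and the key1 ... key40 input are hypothetical.
.DS
```shell
# Append each input line to the file named by its first field, then close
# that file at once, so that awk holds only one output file open at a time.
split_by_first_field() {
    awk '{ print $0 >> $1; close($1) }' "$@"
}

# Try it in a scratch directory with 40 different first fields.
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1
i=1
while [ "$i" -le 40 ]; do
    echo "key$i first"
    echo "key$i second"
    i=$((i + 1))
done | split_by_first_field

set -- key*          # one file per distinct first field
nfiles=$#
cat key7
```
.DE
Appending with ">>" matters here: reopening a closed file with ">" would
truncate it and lose the earlier lines.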
.LP
[I have not been able to create a truly satisfactory fix for this that
doesn't involve having \fBawk\fR treat output redirection to pipes
differently from output to files; I would greatly appreciate hearing
of one.]
.NH 2
Field and Variable Types, Values, and Comparisons
.LP
The following is a synopsis of notes included with \fBawk\fR's source code.
.NH 3
Types
.LP
Variables and fields can be strings or numbers or both.
.NH 4
Variable Types
.LP
When a variable is set by the assignment
.DS
\fIvar\fR = \fIexpr\fR
.DE
its type is set to the type of
.I expr
(this includes +=, ++, etc).
An arithmetic expression is of type
.I number,
a concatenation is of type
.I string,
etc.
If the assignment is a simple copy, e.g.
.DS
\fIvar1\fR = \fIvar2\fR
.DE
then the type of
.I var1
becomes that of
.I var2.
.LP
Type is determined by context; rarely, but always very inconveniently,
this context-determined type is incorrect.
As mentioned in
.I awk(1)
the type of an expression can be coerced to that desired.
E.g.
.DS
{
        \fIexpr1\fR + 0
.sp 1
        \fIexpr2\fR ""    # Concatenate with a null string
}
.DE
coerces
.I expr1
to numeric type and
.I expr2
to string type.
.NH 4
Field Types
.LP
As with variables, the type of a field is determined by context when
possible, e.g.
.RS
.IP $1++ 8
clearly implies that \fI$1\fR is to be numeric, and
.IP $1\ =\ $1\ ","\ $2 16
implies that $1 and $2 are both to be strings.
.RE
.LP
Coercion is done as needed.
In contexts where types cannot be reliably determined, e.g.,
.DS
if($1 == $2) ...
.DE
the type of each field is determined on input by inspection.
All fields are strings; in addition, each field that contains only a
number is also considered numeric.
Thus, the test
.DS
if($1 == $2) ...
.DE
will succeed on the inputs
.DS
0       0.0
100     1e2
+100    100
1e-3    1e-3
.DE
and fail on the inputs
.DS
(null)  0
(null)  0.0
2E-518  6E-427
.DE
"only a number" in this case means matching the regular expression
.DS
^[+-]?[0-9]*\e.?[0-9]+(e[+-]?[0-9]+)?$
.DE
.NH 3
Values
.LP
Uninitialized variables have the numeric value 0 and the string value "".
Therefore, if \fIx\fR is uninitialized,
.DS
if(x) ...
if (x == "0") ...
.DE
are false, and
.DS
if(!x) ...
if(x == 0) ...
if(x == "") ...
.DE
are true.
.LP
Fields which are explicitly null have the string value "", and are not
numeric.
Non-existent fields (i.e., fields past \fBNF\fR) are also treated this way.
.NH 3
Types of Comparisons
.LP
If both operands are numeric, the comparison is made numerically.
Otherwise, operands are coerced to type string if necessary, and the
comparison is made on strings.
.NH 3
Array Elements
.LP
Array elements created by
.B split
are treated in the same way as fields.

----------------------------------------------------------------------------
Francois-Michel Lang
Paoli Research Center, Unisys         lang@prc.unisys.com      (215) 648-7256
Dept of Comp & Info Science, U of PA  lang@linc.cis.upenn.edu  (215) 898-9511