|STAT: Data Manipulation and Analysis Programs for MSDOS and UNIX
A Tutorial Overview of Release 5.3
Gary Perlman

Copyright 1986, 1987 Gary Perlman
Permission to copy without fee all or part of this material is granted provided that the copies are not made for direct commercial advantage, the above copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of Gary Perlman. To copy otherwise, or to republish, requires a fee and/or specific permission.

*1. Introduction

|STAT is a collection of over 25 cooperating data manipulation and analysis programs running on the MSDOS and UNIX operating systems. In this paper, I will give examples of the main features of |STAT programs and how they work together. |STAT is an interesting package for several reasons.

Portability
    |STAT programs run on all MSDOS and UNIX systems.

Familiarity
    |STAT programs integrate into existing familiar environments and do not require users to learn many new systems. |STAT input files are plain text files and can be edited with almost any text editor or word processing software, so no new input editor need be learned. |STAT programs are run from the UNIX command shells and MSDOS COMMAND.COM, so no new command system need be learned. |STAT does not introduce its own programming language, but depends on the existing capabilities of MSDOS BATCH files and UNIX shell scripts. Finally, |STAT does not need to recreate existing functionality because |STAT programs can be used with standard utilities.

Usability
    The programs are easy to use because |STAT uses simple, consistent input/output, user interface, and other conventions throughout the package, more so than most UNIX or MSDOS utilities. Most users can run simple analyses within a few minutes of first seeing the documentation.

Versatility
    The package makes heavy use of the UNIX design philosophy of building simple tools that can be combined to do complex tasks. 
Most |STAT programs are tools that can be used with other MSDOS or UNIX utilities on tasks not related to data analysis.

Applicability
    |STAT provides the most commonly used analyses, especially those taught in undergraduate statistics, although it is not a comprehensive statistical package.

Field-Tested
    Versions of |STAT have been used at hundreds of educational, research, industrial, and government sites by thousands of users since 1980.

Affordability
    The package is distributed under a liberal copyright and for a low one-time distribution cost. Unlimited copies of the programs and documentation can be made and distributed, provided such copies are not made for gain. This policy makes the package ideal for teaching, because free copies can be made for students, who can take |STAT with them at the end of a course.

*2. Overview of |STAT

*2.1. Introductory Examples

In these examples, I hope to show how |STAT programs work together. I will go through increasingly complicated examples, explaining new features as I use them. I will show how the examples would look on MSDOS, rather than UNIX, but the form of commands is similar for both systems. By the end of these examples, you should understand how to construct complicated commands by combining data manipulation and analysis tools. In all the examples, I will use upper-case names to refer to programs and files in the text. In commands set apart from the text, I will use lower-case names. The MSDOS command interpreter is insensitive to case differences in program and file names, so in practice, lower case would be typed.

To do data analysis, we need data. Rather than collect data, or make up a realistic example, I will use a series of numbers between 1 and 100, a data set that is easy to understand. There is a |STAT utility to generate a series of numbers, called SERIES. The following command prints the series from 1 to 100 (1, 2, 3, ..., 100), with each number on a new line. 
    series 1 100

This series prints out on the screen, the default or ``standard output'' for all |STAT programs. The series could be saved in a file by using ``output redirection.'' The following command generates a series of numbers from 1 to 100 and saves the output in a file called NUMBERS.DAT. The notation ``> filename'' tells the command interpreter to place the output of a command in the named file, rather than on the screen.

    series 1 100 > numbers.dat

We can get summary statistics on these numbers with the STATS command. By default, |STAT programs read their input from the keyboard, but the default or ``standard input'' can be redirected from a file. The following command tells STATS to print the mean and standard deviation (sd) of the data in the file NUMBERS.DAT. The notation ``< filename'' causes a program to read from the named file, rather than from the keyboard.

    stats mean sd < numbers.dat

STATS reads numbers in ``free-format.'' That is, it looks for numbers separated by any white space (i.e., blank spaces, tabs, new lines). The output from the STATS program is one line with the mean and standard deviation of the numbers in NUMBERS.DAT.

    50.5 29.0115

Many |STAT programs can read data like that in NUMBERS.DAT. DESC is a descriptive statistics program that prints many summary statistics about its free-format input. The following command tells DESC to read from the file NUMBERS.DAT and put its output in the file RESULTS.

    desc < numbers.dat > results

These results could be printed on paper with the standard PRINT command.

    print results

A similar result could be obtained by sending the output of DESC directly to the printer device, PRN.

    desc < numbers.dat > prn

Now suppose you want to analyze transformed data, rather than the raw data. Most |STAT programs do not have built-in transformation facilities. The most important |STAT program that specializes in transformations is DM, the data manipulator. 
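As an aside for UNIX readers: nothing in this computation depends on |STAT itself. A rough stand-in for the STATS command above, using only the standard seq and awk utilities (the awk program below is my own illustration, not |STAT code), is:

```shell
# Stand-in for: series 1 100 | stats mean sd
# seq plays the role of SERIES; awk computes the mean and the
# (n-1)-denominator standard deviation that STATS reports.
seq 1 100 |
awk '{ sum += $1; sumsq += $1 * $1; n++ }
     END {
       mean = sum / n
       sd = sqrt((sumsq - n * mean * mean) / (n - 1))
       printf "%g %g\n", mean, sd
     }'
```

This prints 50.5 29.0115, the same two numbers STATS produced above.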
Suppose we want to summarize the natural logarithms of the numbers in NUMBERS.DAT. The following command tells DM to print the logarithm of the first numeric column (x1) of its input from NUMBERS.DAT. To DM, and other |STAT programs, a column is any text on a line, separated from other text by white space.

    dm log(x1) < numbers.dat

This transformed data could then be analyzed with an analysis program, such as STATS or DESC. To transform a data set, never redirect output to the input file like this:

    dm log(x1) < numbers.dat > numbers.dat

because the input file would get erased before it could be read. The only safe way to do the job with redirection is with a temporary file. MSDOS (version 2.0 or later) and UNIX both give users the ability to take the output of one program and make it the input to another. This is called pipelining, because it is as though there is a pipe conveying the information (data) from one program directly to another. We could have computed the mean and standard deviation of the numbers from 1 to 100 with the following pipeline.

    series 1 100 | stats mean sd

The pipe symbol, the vertical bar, |, says to take the output of the preceding program (SERIES) and make it the input to the following program (STATS). The command to transform with DM and analyze the series with DESC would look like this:

    dm log(x1) < numbers.dat | desc

The heavy use of the pipe symbol is why the package is named |STAT. Piping is about the same as saving the transformed numbers in a temporary file, making the file the input to DESC, and then removing the temporary file. This longer but equivalent form is shown below.

    dm log(x1) < numbers.dat > tmp
    desc < tmp
    del tmp

Pipelines allow users to avoid creating and removing temporary files, so creating the file NUMBERS.DAT could have been avoided. The analysis of the logarithms of the first 100 natural numbers could be done with the following pipeline. 
    series 1 100 | dm log(x1) | desc

This is a long command line for most MSDOS users, and with |STAT commands it is useful to be able to edit (insert and delete characters) within the current and previous command lines. There are several public domain command line editors available for MSDOS; for UNIX, KSH (Korn, 1983) can be purchased from AT&T through the UNIX Toolchest (Brooks, 1985).

The next few examples will show how to control program outputs with options. When no options are supplied, |STAT programs assume the ones most often requested. Most programs have options to control what they produce, and all programs with options have built-in option summaries. |STAT program options follow the UNIX convention of single letters preceded by a minus sign. The DESC descriptive statistics program can print summary statistics, frequency tables, and histograms. To get a histogram of the data in the file NUMBERS.DAT, you would run DESC with the -h (histogram) option.

    desc -h < numbers.dat

If you want both summary statistics and a histogram, you would use the -s and -h options. This could be typed as:

    desc -s -h < numbers.dat

or as:

    desc -sh < numbers.dat

because logical (on/off) options can be ``bundled.'' Besides logical options, there are options for which values must be supplied, such as the width of a plot.

*2.2. Package Conventions

|STAT programs are designed for human efficiency first, and then program efficiency. A program's performance, especially on a single-user system, should not be measured without considering how long it takes a person to get it running correctly. An early reason for writing |STAT programs was that with many packages it took too long to write the request for the desired analyses, often with false attempts. The main design feature of |STAT programs is that each program is a specialized tool. These tools are designed to cooperate, so they can be combined to do many tasks. 
Functionality found in many data analysis programs, such as data transformations, is in separate specialist tools. Some overlap is necessary: to ensure that analysis program inputs are reasonable, data type and range checking must be built into all analysis programs. One result of the tool design philosophy for |STAT is that analyses take the form shown below.

    EXTRACT DATA -> TRANSFORM -> ANALYSIS -> RESULT FORMAT

An analysis begins with raw data, or data generated by a program like SERIES, which generates a series of numbers. A subset of the data is extracted (copied, really), transformed, or formatted, to be ready for input to an analysis program, which produces results.

|STAT program user interfaces are not flashy, but they are effective. The main design principles are simplicity, consistency, robustness, and feedback. After the examples, I will try to make these design principles more concrete. |STAT programs make heavy use of the standard input and standard output, using the <, |, and > notation in the MSDOS and UNIX command interpreters. One benefit of this is uniformity; users do not need to learn many rules for reading and writing files. Another benefit is that |STAT programs can work together, and with standard MSDOS and UNIX utilities (any that read the standard input and write the standard output).

As I have emphasized, |STAT programs are specialized tools, and I have tried to avoid duplicating the functions handled by existing tools. Sophisticated graphics and data entry are handled by graphics packages and generic text editing programs, respectively. |STAT programs do not assume any unusual data file format, but use human-readable files, so |STAT users do not have to learn how to use yet another text editor. |STAT programs can work with almost any program that reads plain text files. 
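For UNIX readers, the same stage pattern can be sketched with stock utilities standing in for |STAT programs (seq and awk below are ordinary UNIX tools used for illustration, not part of |STAT):

```shell
# GENERATE -> TRANSFORM -> ANALYSIS with standard tools:
# generate the integers 1..100, take natural logs, report the mean.
seq 1 100 |                      # generate, like SERIES
awk '{ print log($1) }' |        # transform, like dm log(x1)
awk '{ sum += $1; n++ }          # analyze, like stats mean
     END { printf "%.4f\n", sum / n }'
```

The result, 3.6374, reappears later in this overview as the mean of the log column in the PAIR example.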
Perhaps the greatest benefit of the |STAT adoption of the UNIX tool development philosophy is that the tools can be combined in BATCH files on MSDOS, and shell scripts on UNIX. These give |STAT users a simple, universally available programming language, examples of which are given later.

*3. Data Manipulation Programs

|STAT data manipulation programs can work with standard MSDOS and UNIX tools. |STAT data files are human-readable text, so users can use any text editor to enter or modify data. Also, all the standard file manipulation programs available on UNIX and MSDOS are compatible. In Table 1, we see that there are many existing file handling programs available. These work smoothly with |STAT programs, and are not duplicated in |STAT.

Table 1: Standard MSDOS and UNIX Utilities
------------------------------------------
MSDOS      UNIX     Purpose
cd         cd, pwd  change/print working directory
comp       diff     compare and list file differences
copy       cp       copy files
del,erase  rm       remove/delete files
dir        ls       list files in directory
echo       echo     print text to standard output
find       grep     search for pattern in files
mkdir      mkdir    make a new directory
more       more     paginate text on screen
print      print    print files on line printer
rename     mv       move/rename files
rmdir      rmdir    remove directory
sort       sort     sort lines in files
type       cat      print files to standard output

The |STAT data manipulation programs are summarized in Table 2. There are programs for data generation, numerical transformations, formatting, extraction of subsets of data, and validation. Many programs have several uses. 
Table 2: Data Manipulation Programs
-----------------------------------
abut       join data files beside each other
colex      column extraction/formatting
dm         conditional data extraction/transformation
dsort      multiple key data sorting filter
linex      line extraction
maketrix   create matrix format file from free-format input
perm       permute line order randomly, numerically, alphabetically
probdist   probability distribution functions
ranksort   convert data to ranks
repeat     repeat strings or lines in files
reverse    reverse lines, columns, or characters
series     generate an additive series of numbers
transpose  transpose matrix format input
validata   verify data file consistency

*3.1. Data Manipulation Program Descriptions

There are too many programs to describe in detail here, but I'll describe the important features of each. Full documentation is in the |STAT Handbook.

ABUT takes several files and joins the corresponding lines in each. It builds inputs for analysis programs like ANOVA and REGRESS that read multi-column inputs.

COLEX extracts ranges of white space separated columns. It also formats columns so that you can control the width of fields, left or right justification, and alignment of decimal points.

DM is by far the most important data manipulation tool, and only a small part of its versatility will be shown in examples. DM takes a series of expressions involving input columns, and for each expression, DM prints an output column containing the values of the expression. DM has most arithmetic and transcendental functions, string operations, condition testing, and control of flow. For example, DM can print all the lines with some specified text in a specified column, or print the sum of all the numerical columns on each line.

DSORT reorders lines in a data file according to the order of values in any columns. Ordering can be numerical or alphabetical, in increasing or decreasing sequence.

LINEX extracts individual lines and ranges of lines by line numbers. 
MAKETRIX creates a matrix format file by reading white space separated words and printing them so that there are an equal number of words per line.

PERM prints a permutation (reordering) of its input lines. By default, PERM prints the lines in random order, but it can also print them in numerical or alphabetical order.

PROBDIST deals with several probability distributions: uniform, binomial, normal, chi-square, t, and F. For any of these distributions, it can generate random numbers, compute the cumulative probability of obtaining a particular value, or compute the distribution value needed for a particular cumulative probability.

RANKSORT converts the data in each of its input columns to ranks to allow the application of some non-parametric methods.

REPEAT generates multiple copies of its input.

REVERSE reverses the order of lines in its input, fields (columns) in its input lines, or characters within lines, in any combination.

SERIES generates linear series between two values. The series can be ascending or descending. By default, there is a difference of one between adjacent series elements, but this can be changed to smaller or larger units. Non-linear series can be generated by transforming a linear series with DM.

TRANSPOSE flips the rows and columns of its input: the matrix transpose operation.

VALIDATA reports the data types of its input columns and where, if at all, there is a change in the number of columns per line. If a data analysis program reports an input format error, VALIDATA can help find the problem.

*3.2. Data Manipulation Examples

The following examples will show how |STAT programs can be combined to perform complicated manipulations. Suppose you want to generate a series of numbers in an inverse progression: 1/1 1/2 1/3 1/4 1/5 1/6 .... The SERIES program generates linear series, but the output from SERIES can be transformed by DM. The following command generates a series from 1 to 100 and computes the inverse of each number in the series. 
    series 1 100 | dm 1/x1

Another series contains the squares of the first 100 integers: 1, 4, 9, ..., 10000. This can be generated with SERIES and a different DM transformation. The following transformation by DM computes and prints the squares of the values in its input column.

    series 1 100 | dm x1*x1

This series of squares could be transformed back to the original linear series with another (the square root) transformation by DM.

    series 1 100 | dm x1*x1 | dm sqrt(x1)

Suppose you want to generate a 5 by 5 matrix of random numbers selected without replacement from 1 to 25. This is a reformatted random permutation of the numbers from 1 to 25. First, you would generate the numbers from 1 to 25 with SERIES. Then, you would permute the values with the PERM program. Finally, you would make a matrix format file with the MAKETRIX program.

    series 1 25 | perm | maketrix 5

To sample from the same values, but with replacement, the PROBDIST (probability distribution) program is used. The following command generates 25 values uniformly distributed from 0.0 up to, but not including, 1.0.

    probdist rand uni 25

The range of these numbers can be transformed with DM by multiplying by 25 and adding 1 to make the range 1.0 up to 25.99..., and by truncating the values to their integer parts with the floor function. Then MAKETRIX reformats the values into a matrix.

    probdist rand uni 25 | dm floor(x1*25+1) | maketrix 5

Getting summary statistics on the last column of a file is easy if the file has the same number of columns on each line. Suppose the file MATRIX.DAT has six columns. We could use COLEX to extract the sixth column.

    colex 6 < matrix.dat | stats

But if a file does not have the same number of columns per line, or if the number of columns is unknown, then REVERSE can reverse the columns on every line so that COLEX can extract the first. In the following example, REVERSE is called with the -f option (reverse ``fields'' on a line) and the output is piped to COLEX. 
    reverse -f < matrix.dat | colex 1

With no options, REVERSE reverses line order. For example,

    series 1 10 | reverse

would print 10, 9, ..., 1, with each number on a separate line. Although this is an intuitive example, it is not realistic, because SERIES can generate descending series. Calling SERIES with the operands 10 and 1 would have the same effect.

DM has special variables containing information about each line. For example, N is the number of columns, and SUM is the sum of all the numbers. One way to compute the sum of the integers from 1 to 50 is to put them all on one line, and print the SUM variable. SERIES generates one number per line, but the TRANSPOSE program can reformat an N-line 1-column file to a 1-line N-column form, just what is needed by DM.

    series 1 50 | transpose | dm SUM

A faster way would be to use the sum request for STATS:

    series 1 50 | stats sum

The MSDOS SORT utility is good for sorting lines alphabetically, but not numerically. For example, here is the output from SORT, transposed so that it fits on one line, when SORT is given the integers from -5 to 15.

    series -5 15 | sort | transpose

    -1 -2 -3 -4 -5 0 1 10 11 12 13 14 15 2 3 4 5 6 7 8 9

Numerically, the first five numbers are descending, then there is 0 and 1, then 10 through 15, and then the rest of the single digits. As character strings, they are in the right order, but not as numbers. The DSORT data sorting program can sort numbers or strings, and it is smart enough to use the right sorting method for the data.

    series -5 15 | dsort | transpose

    -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

From these examples, you can see how the |STAT programs work together, and with MSDOS utilities. There are many ways of accomplishing the same task, and the choice of one over another can be made for the sake of efficiency or readability. 
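The same string-versus-numeric contrast exists among stock UNIX tools, so the last two examples can be mimicked without |STAT. Here sort and sort -n stand in for SORT and DSORT, and paste -s stands in for TRANSPOSE (all are ordinary UNIX utilities used for illustration, not |STAT programs):

```shell
# String order, as in the MSDOS SORT example (LC_ALL=C pins byte collation):
seq -5 15 | LC_ALL=C sort | paste -s -d' ' -

# Numeric order, the method DSORT would choose for numeric data:
seq -5 15 | sort -n | paste -s -d' ' -
```

The first pipeline reproduces the scrambled string ordering shown above; the second reproduces the DSORT result.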
The possibilities for command construction increase with practice and by reading the manual entries for individual programs and the |STAT Handbook (Perlman, 1987).

*4. Data Analysis Programs

|STAT analysis programs compute descriptive and inferential statistics and simple graphs. There are programs for analyzing single variables, paired variables, several numerical variables, and categorical variables.

Table 3: Data Analysis Programs
-------------------------------
anova    multi-factor analysis of variance
calc     interactive algebraic modeling calculator
contab   contingency tables and chi-square
desc     descriptions, histograms, frequency tables
dprime   signal detection d' and beta calculations
oneway   one-way anova/t-test with error-bar plots
pair     paired data statistics, regression, scatterplots
rankind  rank order analysis for independent conditions
rankrel  rank order analysis for related conditions
regress  multiple linear regression and correlation
stats    simple summary statistics
ts       time series analysis and plots

*4.1. Data Analysis Program Descriptions

ANOVA performs a multifactor analysis of variance with one random factor, and up to nine other between-groups (nested) or within-groups (crossed) factors. Between-groups factors can have unequal cell sizes, for which the weighted means solution is used. The method of analysis follows Keppel (1973).

CALC puts the algebraic capabilities of DM into an interactive calculator. It allows the definition of named variables, and what-if style changes to these definitions.

CONTAB is a multi-way cross-tabulation program with chi-square tests of association. The method of analysis follows Siegel (1956) and Bradley (1968).

DESC prints descriptive statistics, t-tests, frequency tables, and histograms.

DPRIME computes signal detection theory values for discrimination (d') and bias (beta). The method of analysis follows Coombs, Dawes & Tversky (1970). 
ONEWAY performs a one-way analysis of variance for which groups can have unequal cell sizes. Optionally, group error-bar plots are printed.

PAIR analyzes and plots paired data. Summary statistics and t-tests are printed for each variable and their difference. A simple linear regression prints the regression equation, correlation coefficient, and significance test for the correlation/regression.

RANKIND analyzes data measured on an ordinal scale from independent groups. Order statistics are reported with tests of location such as the median, Mann-Whitney, and Kruskal-Wallis tests, as described in Siegel (1956).

RANKREL analyzes data measured on an ordinal scale from related samples. Order statistics are reported with tests of location such as the sign test, Wilcoxon signed-ranks test, and Friedman anova of ranks, as described in Siegel (1956).

REGRESS performs multiple linear regression. An optional analysis determines if a predictor significantly improves the multiple R after other predictors have been included. The method of analysis follows Kerlinger & Pedhazur (1973).

STATS prints selected summary statistics with little if any annotation.

TS does a simple time series analysis: autocorrelations, series rescaling, and plots.

*4.2. Data Analysis Examples

To understand all the examples, some knowledge of statistics is needed, but to get a feel for how the programs are used, a little experience with MSDOS or UNIX will be enough.

4.2.1. DESC: Describe a Data Set

DESC describes a single group of data. It prints summary statistics, frequency tables, and histograms. Input numbers are separated by any amount of white space, so format does not matter. Here is an input to DESC:

Input to DESC: DESC.DAT
1 2 3 4 5 6 7 8 9
3 4 5 6 7 8 7 6 5 4 3
7 8 9 8 7 8 9 8 7 8 9 8

The following command requests order and regular statistics with the -o option, and a histogram (-h) with bins one apart (-i 1), with midpoints on unit boundaries (bins start at .5). 
    desc -o -h -i 1 -m .5 < desc.dat

Output 1: DESC - Descriptive Statistics
------------------------------------------------------------
 Under Range    In Range  Over Range     Missing         Sum
          0          32           0           0     199.000
------------------------------------------------------------
       Mean      Median    Midpoint   Geometric    Harmonic
      6.219       7.000       5.000       5.657       4.811
------------------------------------------------------------
         SD   Quart Dev       Range     SE mean
      2.239       1.750       8.000       0.396
------------------------------------------------------------
    Minimum  Quartile 1  Quartile 2  Quartile 3     Maximum
      1.000       4.500       7.000       8.000       9.000
------------------------------------------------------------
       Skew     SD Skew    Kurtosis     SD Kurt
     -0.616       0.433       2.214       0.866
------------------------------------------------------------
  Null Mean           t    prob (t)           F    prob (F)
      0.000      15.709       0.000     246.760       0.000
------------------------------------------------------------
Midpt  Freq
1.000     1  *
2.000     1  *
3.000     3  ***
4.000     3  ***
5.000     3  ***
6.000     3  ***
7.000     6  ******
8.000     8  ********
9.000     4  ****

4.2.2. PAIR: Paired Data Analysis

PAIR analyzes paired data by computing paired data statistics and by plotting one variable against the other in a scatterplot. The input should have two columns of data. That is, each line should have two paired data points, separated by white space. In this example, |STAT data manipulation programs will generate inputs to PAIR. For the first 100 positive integers, we will plot their square roots against their natural logarithms. The command is shown below.

    series 1 100 | dm sqrt(x1) log(x1) | pair -sp -h 10 -w 30 -x sqrt -y log

The command is long, but the parts are easy to read. SERIES creates a series from 1 to 100, and DM prints two columns, one with the square root of the output from SERIES, the other with the natural logarithm. The output from DM is piped to PAIR, which has several options. The -s option requests summary statistics, and the -p option requests a plot. 
The -h 10 option sets the plot height to 10, and the -w 30 option sets the plot width to 30. Axis labels are given with the -x and -y options.

Output 2: PAIR - Paired Data Analysis

Analysis for 100 points:
                  sqrt         log  Difference
Minimums        1.0000      0.0000      0.6137
Maximums       10.0000      4.6052      5.3948
Sums          671.4629    363.7394    307.7235
SumSquares   5049.9989   1408.3305   1160.1631
Means           6.7146      3.6374      3.0772
SDs             2.3385      0.9281      1.4676
t(99)          28.7138     39.1938     20.9681
p               0.0000      0.0000      0.0000

Correlation   r-squared       t(98)           p
     0.9621      0.9256     34.9239      0.0000

  Intercept       Slope
     1.0736      0.3818

    |------------------------------| 4.60517
    |                       4555666|
    |                1454451       |
    |           23342              |
    |        12331                 |
    |      222                     | log
    |    12                        |
    |   12                         |
    |  1                           |
    | 1                            |
    |1                             |
    |------------------------------| 0
     1.000                   10.000
                  sqrt
    r= 0.962

Summary statistics are reported for each input column and their difference. The differences in this example are almost meaningless and should be ignored. The simple linear regression is a good example of how a high correlation (.96) does not always mean that a linear model is appropriate. From the plot, it is clear the relationship is non-linear. The scatterplot uses digits to display the number of input data points on one plot location. Options to set the axis limits can over-ride the default of using extreme data values.

4.2.3. ONEWAY: Single Factor Analysis of Variance

ONEWAY compares the means from several groups using a one-way analysis of variance. Each group's data are separated by a special value called the splitter. Here is a sample input to ONEWAY:

Input to ONEWAY: ONEWAY.DAT
1 2 3 4 5 6 7 8 9 999
3 4 5 6 7 8 7 6 5 4 3 999
7 8 9 8 7 8 9 8 7 8 9 8 999

In the file ONEWAY.DAT, there are three groups separated by the value 999. This file contains the same data as input to DESC in an earlier example. It could be formed using |STAT data manipulation programs by appending the value 999 to every line in DESC.DAT. 
    dm INPUT 999 < desc.dat > oneway.dat

This command uses DM to print each input line using the special INPUT variable, followed by the constant expression 999. Even though it is possible to add the splitter value with tools, it could also be done using a text editor on a copy of the data. The format of the input to ONEWAY does not matter as long as there is white space between each datum or splitter. The analysis of these data is simple. The -p option requests a plot, the -w option sets the plot width to 50, and the -s 999 option sets the group splitter to 999.

    oneway -p -w 50 -s 999 < oneway.dat

Output 3: ONEWAY - Analysis of Variance

Name       N      Mean        SD       Min       Max
Group-1    9     5.000     2.739     1.000     9.000
Group-2   11     5.273     1.679     3.000     8.000
Group-3   12     8.000     0.739     7.000     9.000
Total     32     6.219     2.239     1.000     9.000

Group-1 |<------============(====#=====)============------>|
Group-2 |    <---=======(==#==)========----->              |
Group-3 |                                     <-===(#=)===>|
        1.000                                        9.000

Weighted Means Analysis:
Source        SS   df       MS       F      p
Between   61.287    2   30.643   9.436  0.001 ***
Within    94.182   29    3.248

Default group names are chosen by ONEWAY, but others could have been supplied on the command line. Individual group and overall summary statistics are followed by the error bar plots showing the extreme values (< and >), standard deviation bars (= signs), standard error limits (parentheses), and the means (#). The default significance test uses the weighted means solution for unequal group sizes, but the unweighted means solution can be requested instead.

4.2.4. RANKIND: Rank-Order Analysis for Independent Groups

RANKIND is the non-parametric counterpart to ONEWAY. To demonstrate this, the same analysis will be done on the same data. In this case, the options to RANKIND are the same as for ONEWAY. Note the similarities of the measures of central tendency, the plots, and the significance levels. 
    rankind -p -w 50 -s 999 < oneway.dat

Output 4: RANKIND - Rank-Order Analysis of Independent Groups

          N   NA    Min    25%  Median    75%    Max
Cond-1    9    0   1.00   2.75    5.00   7.25   9.00
Cond-2   11    0   3.00   4.00    5.00   6.75   8.00
Cond-3   12    0   7.00   7.50    8.00   8.50   9.00
Total    32    0   1.00   4.50    7.00   8.00   9.00

Cond-1 |< --------------#--------------- >|
Cond-2 |       < ------#----------- >     |
Cond-3 |                       < ---#--- >|
       1.000                         9.000

Median-Test:
        Cond-1  Cond-2  Cond-3
above        2       1       9      12
below        6       8       0      14
             8       9       9      26
WARNING: 6 of 6 cells had expected frequencies less than 5
chisq 16.387566   df 2   p 0.000276

Kruskal-Wallis:
H (not corrected for ties)  13.123192
Tie correction factor        0.973424
H (corrected for ties)      13.481479
chisq 13.481479   df 2   p 0.001182

*5. Combining Programs in BATCH Files

In this part of the overview, I describe the possibilities for ordinary users to write their own programs with |STAT and MSDOS utilities. In so doing, I will show how |STAT data manipulation and analysis programs work together.

*5.1. Introduction to BATCH Files

Much data analysis is routine, with similar steps used for similar data, and it is useful to be able to repeat a common analysis with different data sets. |STAT programs do not have any built-in programming facility, but instead work with the standard BATCH command script facility built into MSDOS, and the shell programming languages that come with UNIX. In this section, I will describe some typical BATCH files for data analysis. A BATCH file contains a sequence of commands. The inspiration for a particular BATCH file comes from observing similar sequences of commands several times. Perhaps the input files differ, or certain variables change, but the pattern remains. When a pattern is saved in a BATCH file, the things that change (e.g., file names, column numbers) are replaced by variables that are passed to the BATCH command from the command line. I think that BATCH command scripts are underused, perhaps because BATCH is not well known, so I will summarize the main features. 
MSDOS BATCH files have names that end in .BAT. This lets the MSDOS command interpreter recognize them as files containing commands. Lines in BATCH files are commands, or labels beginning with : (colon). When a BATCH file is run, it is called like any compiled program. Parameters can be passed into a command script on the command line, and their values are accessed inside the BATCH file by expressions of the form %N, where N is the position number of the parameter. For example, if the BATCH file MYBATCH.BAT were run with the command

mybatch Hello Caroline

then, inside the BATCH file, %1 would be Hello and %2 would be Caroline. I will not be using all the BATCH commands in Table 4, but the table gives an idea of the sort of programming possible. Full details about BATCH are available in the MSDOS manual.

Table 4: BATCH Commands
-----------------------
echo    control printing of commands, print text
for     iterative execution of commands
goto    transfer control to a label
if      conditional execution of commands
shift   shift positional parameters
pause   wait for key press from user
rem     comment line

*5.2. Example BATCH Files

The following BATCH commands operate on files with several columns, some of which contain data for a variable, and some labels.

5.2.1. Describing Selected Columns in a File

To get descriptive statistics for a column in a file, COLEX can extract the column and DESC can print descriptive statistics. The following BATCH command, DESCOL, uses the name of a file and a column number. The initial ECHO command annotates the output, and shows the correct number and order of arguments.

DESCOL.BAT: Describe Selected Columns
-------------------------------------
echo descol file=%1 column=%2
colex %2 < %1 | desc -oh

5.2.2. Histograms and Scatterplot of Two Variables

Suppose we have some paired data: two variables in a multiple-column data file. A good screening of the data would include a histogram for each variable, and a scatterplot of one against the other.
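The histograms that DESC prints are simple frequency bars, one row of marks per interval. What such a text histogram involves can be sketched in Python; this is only an illustration with made-up data, not DESC's code (DESC chooses its own intervals, and its -i option controls the width).

```python
def text_histogram(data, width=1.0):
    """Count values into intervals of the given width, one * per count."""
    counts = {}
    for x in data:
        lo = int(x // width) * width   # lower edge of this value's interval
        counts[lo] = counts.get(lo, 0) + 1
    lines = []
    for lo in sorted(counts):
        lines.append("%8g %s" % (lo, "*" * counts[lo]))
    return "\n".join(lines)

print(text_histogram([1, 1, 2, 2, 2, 3, 5], width=1.0))
```

Each printed row is an interval's lower edge followed by its frequency bar, the same information a DESC -h histogram conveys.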
If this analysis is going to be done often, then it makes sense to put the commands in a BATCH file like TWOPLOT.

TWOPLOT.BAT: Paired Data Plotting
---------------------------------
echo twoplot file=%1 var1=%2 var2=%3
echo Histogram of Variable %2 in %1
colex %2 < %1 | desc -h
echo Histogram of Variable %3 in %1
colex %3 < %1 | desc -h
echo Scatterplot of %2 Against %3
colex %2 %3 < %1 | pair -p -x %1[%2] -y %1[%3]

TWOPLOT takes three arguments: the file name, the column number of the first variable, and the column number of the second. We use ECHO commands to annotate the BATCH file output. COLEX extracts the named columns for input to DESC, which produces histograms. The final command shows COLEX extracting both columns and piping them to PAIR. Labels built from the file name and the column numbers are given to the X and Y axes.

5.2.3. 2x2 Contingency Tables

The most common contingency table is one with two rows and two columns. For this simple case, the input format of the CONTAB multi-way contingency table program is cumbersome, because more indexes than data must be supplied. A BATCH file can hide many of the details.

2X2.BAT: 2 by 2 Contingency Table Analysis
------------------------------------------
echo A=%1 B=%2 C=%3 D=%4
echo 1 1 %1 1 2 %2 2 1 %3 2 2 %4 | maketrix 3 | contab %5 %6

5.2.4. Regression of Data in Many Files

Data may be kept in separate files, one variable in each one-column file. Suppose that you want to run a regression to predict the data in one file from several others. The ABUT program can be called to join the corresponding lines, and the result can be piped to REGRESS. In the following BATCH file, as many as nine files (variables) are made into a matrix format file, which is piped to REGRESS. The file names are used as the variable names in the call to REGRESS. If fewer than nine files are supplied, the later variables like %9 are empty, and are ignored by the programs.
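What ABUT does before REGRESS sees the data can be pictured as joining corresponding lines of one-column files into rows of a matrix. A hedged Python sketch with made-up columns follows; ABUT itself works on files and has its own handling of uneven lengths.

```python
# Each list stands for one one-column data file; abut joins line i of
# every file into row i of a multi-column matrix for regress.
y  = ["10", "20", "30"]   # predicted variable (first file)
x1 = ["1", "2", "3"]      # predictor files
x2 = ["5", "4", "3"]

rows = [" ".join(cols) for cols in zip(y, x1, x2)]
print("\n".join(rows))
# 10 1 5
# 20 2 4
# 30 3 3
```

The result has the predicted variable in the first column followed by the predictors, which is exactly the input format REGRESS expects.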
FREG.BAT: Regression of Variables in Files
------------------------------------------
echo freg files ...
echo Joining files %1 %2 %3 %4 %5 %6 %7 %8 %9
echo Predict %1 with %2 %3 %4 %5 %6 %7 %8 %9
abut %1 %2 %3 %4 %5 %6 %7 %8 %9 | regress %1 %2 %3 %4 %5 %6 %7 %8 %9

5.2.5. Plotting Residuals From Regression

It is good practice to plot the residuals from linear regression. Residuals are the differences between the obtained and predicted scores, and are usually plotted against the scores predicted by the regression. The correlation between predicted scores and residuals is zero, but a visual inspection of the plot can sometimes detect non-linear trends not apparent in the original data. There is no plotting option built into the REGRESS program. Instead, there is an option (-e) to save the column number of the predicted variable and the multiple regression equation for predictions in the file REGRESS.EQN. This file, along with the original data, can be used by DM to produce the original and predicted values. In the RESID BATCH file, a second pass of DM transforms the obtained and predicted scores to predicted and residual scores. A BATCH REMinder annotates each transformation.

RESID.BAT: Regression with Residual Plot
----------------------------------------
echo resid file=%1
regress -e < %1
rem print Y Y'
rem print Y' Y-Y'
rem plot Y' Y-Y'
dm Eregress.eqn < %1 | dm x2 x1-x2 | pair -p -x Predicted -y Residual

5.2.6. Plotting Functions

|STAT programs can graph mathematical functions. SERIES can generate the domain and DM can compute the range of most functions. In the PLOTFUN BATCH file, a DM expression is followed by the low and high values of the domain. PAIR plots the values, using the DM expression as the Y axis label. Optionally, an increment other than the default of 1 can be supplied to generate domain values with more or less granularity. The examples show the advantage of interactive data analysis: several minor variations on a command can be tried in a minute.
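The SERIES-then-DM pattern that PLOTFUN packages can be mimicked to show what PAIR receives: one (x, f(x)) pair per line. A Python sketch follows; the function and step size are arbitrary examples, and this stands in for the pipeline, not for the |STAT programs themselves.

```python
def series(low, high, step=1.0):
    """Yield low, low+step, ... up to and including high, like SERIES."""
    x = low
    while x <= high + 1e-9:   # small tolerance for floating point steps
        yield x
        x += step

# Mirror of: series -10 10 | dm x1 x1*x1 -- the pairs PAIR would plot,
# here with a coarse step of 5 to keep the output short.
pairs = [(x, x * x) for x in series(-10, 10, 5)]
print(pairs)  # [(-10, 100), (-5, 25), (0, 0), (5, 25), (10, 100)]
```

Shrinking the step, as PLOTFUN's optional increment argument does, simply produces more pairs and hence a more detailed plot.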
PLOTFUN.BAT: Plot a Function
----------------------------
echo plotfun function=%1 low=%2 high=%3 [increment=%4]
series %2 %3 %4 | dm x1 %1 | pair -p -x x1 -y %1

To plot the squares of the integers from -10 to 10, you would type:

plotfun x1*x1 -10 10

This would generate the command:

series -10 10 | dm x1 x1*x1 | pair -p -x x1 -y x1*x1

To get more detail by plotting points every tenth, you would run:

plotfun x1*x1 -10 10 0.1

Finally, here is a function that looks like a butterfly:

plotfun sin(x1)*x1 -20 20 .1

5.2.7. Statistical Tables

|STAT programs can generate many tables like those at the end of statistics texts. DM can produce random numbers and compute common functions like cosine, square root, inverse, and logarithm. For example, the following command generates a table of values, their squares, inverses, square roots, and logarithms.

series 1 100 .1 | dm x1 x1*x1 1/x1 sqrt(x1) log(x1)

The PROBDIST program can compute critical values for several probability distributions. For example, the following computes the critical value for a .05-level t-statistic with 35 degrees of freedom.

probdist crit t 35 .05

This is a single value, and it is easy to write a BATCH file to generate a table of critical t-distribution values. The BATCH file CRITABLE.BAT begins with ECHO commands to label the output table. The working part of the command is on the last line. A series of degrees of freedom, from low to high, is generated by SERIES. This is piped to DM, which adds the strings `crit' and `t' before the degrees of freedom from SERIES. Each line in the output from DM looks like:

crit t 35 .05

This is the format required by PROBDIST. PROBDIST produces critical values, and the -v option produces verbose output with labels on the values. The BATCH file would be called with a command line like:

critable 1 30 .01

CRITABLE.BAT: Critical t Values
-------------------------------
echo Critical values for t for alpha = %3
series %1 %2 | dm 'crit' 't' x1 %3 | probdist -v

*6.
Technical Notes

*6.1. Comparison with Other Packages

|STAT is not a comprehensive package like SPSS (Nie, et al, 1975), BMD (Dixon, 1977), or SAS (Helwig & Council, 1979), which have versions that run on PCs; it was developed to meet data analysis needs as they arose. |STAT is lacking in multivariate analysis and graphics. Still, |STAT performs the analyses most used in education and research, and is comparable to the popular MINITAB package (Ryan, Joiner, & Ryan, 1976). In a sense, |STAT is cream-skimming, because it does not try to deal with large datasets, or the needs of sophisticated users. The only multivariate analysis program is the multiple linear regression program, REGRESS. There are no factor, discriminant, or canonical correlation analyses, and no cluster or multidimensional scaling analyses. None are likely to be added in the next few years. The ANOVA program handles one variable and no covariates, although analysis of covariance can be approximated by combining ANOVA with REGRESS. Plots in |STAT programs are poverty graphics: graphics on a low budget. Although they are not pretty, they are effective. Given the simple format of |STAT files, and the availability of high quality graphics packages that interface to such files, such facilities are not likely to appear in |STAT.

Despite lacking functionality found in many packages, |STAT has strong advantages. Few packages run on both UNIX and MSDOS, two popular systems. This portability, combined with the low cost of |STAT, makes it attractive for use in research and educational facilities. |STAT programs are used even in environments where more powerful packages are available. |STAT programs do not put the user into any special environment, which makes them ideal for quick exploratory data analysis. The data manipulation programs work with many MSDOS and UNIX utilities, and the analysis programs allow extremely rapid analyses.
Few packages allow the following analysis to be done in less than 10 seconds, counting the time to start up the program, enter the data, and get results.

desc
desc: reading input from keyboard:
32 48 47 23 55 14 23
^Z

(output follows practically immediately)

This would lose the data, so a user might prefer to save it in a file first, either with a text editor, or with the MSDOS COPY utility.

copy con desc.in
32 48 47 23 55 14 23
^Z
desc < desc.in

*6.2. User Interface

The basic goal of data analysis is to draw valid conclusions about data. If data are input in the wrong format, or the wrong analysis is selected, then invalid conclusions may result. The user interface provided by |STAT is designed to promote correct analysis. Earlier, I stated that the main user interface principles in |STAT are simplicity, consistency, robustness, and feedback. In this section, I explain how these principles are realized.

6.2.1. Simplicity

|STAT is a package of simple programs. |STAT does not have the full capabilities of packages like SPSS (Nie, et al, 1975), BMD (Dixon, 1977), or SAS (Helwig & Council, 1979), so it is not as complete a statistics package, but |STAT offers the most used analyses. A full data analysis package has trouble addressing the needs of professional analysts while still presenting a simple structure to other users, and that simple structure is a primary goal of the |STAT package.

6.2.2. Consistency

|STAT programs are designed to be externally and internally consistent. To be externally consistent, the programs adopt the UNIX tool philosophy of reading the standard input and writing the standard output. Program options follow the UNIX command line option standard (Hemenway & Armitage, 1984). Once learned, this standard is easy to use, except that program options are single characters, which are not very memorable. The problem of cryptic options is alleviated by providing online help.
For example, if the short usage summary for DESC were not enough, you could get descriptions and current values of all the options by running DESC with the -O option.

Online Option Summary For DESC
------------------------------
desc: descriptive statistics and histograms
-c        cumulative frequencies or proportions    FALSE
-f        request table of frequencies             FALSE
-F Ho     F-test against mean Ho                   0
-h        request a histogram                      FALSE
-i width  interval width for tables & histograms   0
-m min    minimum allowable value                  0
-M max    maximum allowable value                  0
-o        request order statistics                 FALSE
-p        request table of proportions             FALSE
-s        request summary statistics               FALSE
-t Ho     t-test against mean Ho                   0
-v        print statistics in name=value format    FALSE

|STAT programs are used in a familiar environment, not one that is inconsistent with other program use. This is unusual. Many packages boast of having their own input editor, command language or menu interface, and programming language. In most cases, these interfaces are inconsistent with all other systems learned by users. |STAT does not assume that data analysis is done in isolation from all other systems, so when an existing system can be used, a new one is not created. The programs are internally consistent in using the same intuitive input formats and standard option naming conventions. The manual entries are also formatted consistently. Much of the consistency in the package is obtained by automatic code and documentation generation techniques (Perlman, 1986).

6.2.3. Robustness

The programs are designed to be robust against user errors. All the analysis programs check for invalid data types and ranges, and give standard error messages identifying problems. An example error message from the PAIR program follows.
pair: -w option value must be between 5 and 100

|STAT programs do all calculations in double precision, but they do not use many sophisticated numerical methods, nor the high level of adjustment for rounding errors and overflow checking found in the well known statistics packages. Even so, |STAT programs do well on evaluations of statistical software such as the Cornell University Statistical Computing Support Group test (Cornell, 1985). |STAT programs prevent many errors people make in using the big packages. For example, a common error with old packages is to write an incorrect FORTRAN format statement to read the data, sometimes resulting in the program reading the wrong data. |STAT programs avoid format problems by having analysts use human readable input formats. Even in the newer large-package programs, the specification of experimental design can be complicated, and misused by inexperienced users. Design specification languages are avoided completely in |STAT by having programs infer design information from intuitive data formatting schemes. For example, the multiple regression program, REGRESS, requires its input to be a file with the predicted variable in the first column, followed by a series of predictors. The most impressive example of design specification inference in |STAT is in the multi-factor ANOVA program. ANOVA requires that each datum be preceded by labels describing the conditions under which it was obtained. From these lines, ANOVA figures out the experimental design (nesting/crossing of factors).

6.2.4. Feedback

|STAT programs give users feedback about data formats by letting them see the format of the data at any stage of transformation or analysis. When programs are used interactively, users are given program-specific prompts for input. All |STAT messages identify the program they are from, which is important when several programs are combined in a pipeline, and they give diagnostic information to help fix errors.
To give users fast access to information about each program, there are three levels of online help. The first level is a short program usage summary. The second level includes standard help options to tell users about program version numbers (-V), program data set size limits (-L), and program options: names, descriptions, and current values (-O). For example, the limits for the PAIR program are printed with the -L option.

pair -L
pair: program limits:
1000  maximum number of pairs for plots
 100  maximum width of plot
 100  maximum height of plot
   5  minimum plot height or width
 512  maximum number of characters in lines

The latest version of REGRESS is found with the -V option.

regress -V
Program: regress  Version: 5.3  Date: 11/25/86

These built-in options give accurate information for programs even if the documentation has lagged behind. The third level of online help includes full program manual entries online; these are displayed with the MANSTAT program.

*6.3. Hardware and Software Requirements

|STAT runs on two popular operating systems: MSDOS and UNIX. |STAT runs on any operating system compatible with MSDOS versions 2 and 3, and on any UNIX system with a C compiler. The programs use few and only standard system calls, and do not require any special hardware, making |STAT very portable. |STAT has small memory requirements. The programs and online documentation take up about 900K on MSDOS, and 500K on a VAX running UNIX. The largest programs on MSDOS are about 60K on disk, so running the programs with one floppy disk is possible, although a hard disk would be faster. |STAT programs use dynamic memory allocation and try to fit all their input data in main memory, so they can handle as much data as there is memory. The programs are not designed for large data sets, but are robust with several thousand points.
|STAT programs can work with or without a math coprocessor or floating point accelerator, but computationally intensive programs will naturally run faster with special hardware.

*6.4. |STAT Distribution

|STAT is distributed under a liberal copyright. Users can make and distribute unlimited copies of any part of the package, provided |STAT is not distributed for gain. |STAT users are not permitted to create derivative works based on the C source code; source is distributed for compilation only. Experience with users writing their own programs based on |STAT has been bad: some well-meaning people wrote and distributed programs with serious bugs in them. |STAT is distributed without warranty, and program users bear all risks and costs. Any part of |STAT may be changed at any time without warning. |STAT is distributed only under these terms.

|STAT was developed at the University of California, San Diego, and at the Wang Institute of Graduate Studies. Versions of |STAT have been sent to over 500 UNIX sites, and there have been over 300 MSDOS distributions since Fred Horan first ported the 5.1 release of the package to MSDOS at Cornell University in 1985. The MSDOS edition includes executable programs and online manual entries. The package runs on almost any IBM PC compatible running version 2 or 3 of MSDOS. The UNIX version of the package is distributed on tape and includes C (Kernighan & Ritchie, 1979) source files and online manual entries. The package runs on almost any UNIX lookalike system with a C compiler. All distributions of |STAT materials include worldwide postage. The UNIX version of |STAT is distributed for $20, and includes a nine track half inch magnetic tape in 1600 bpi tar format. The MSDOS version is distributed on double-sided double-density floppy diskettes for $15. The |STAT handbook (with examples, conventions, program summaries, and DM and CALC manuals) costs $10. Orders for the programs must be prepaid to G.
Perlman at the Department of Computer and Information Science, The Ohio State University, Columbus, OH 43210 USA. Orders must include a check in U.S. funds drawn on a U.S. bank and should be accompanied by a clear international mailing label.

*7. References

Bradley, J. V. (1968) Distribution-Free Statistical Tests. Englewood Cliffs, NJ: Prentice-Hall.
Brooks, C. A. (1985) Experiences with Electronic Software Distribution. In Summer USENIX Conference. El Cerrito, CA: USENIX Association. pp. 433-436.
Coombs, C. H., Dawes, R. M., & Tversky, A. (1970) Mathematical Psychology: An Elementary Introduction. Englewood Cliffs, NJ: Prentice-Hall.
Cornell University Statistical Computing Support Group (1985) Software Evaluation Form. Ithaca, NY: Cornell University.
Dixon, W. J. (1975) BMD-P Biomedical Computer Programs. Berkeley, CA: University of California Press.
Helwig, J. T., & Council, K. A. (Eds.) (1979) SAS User's Guide. Cary, NC: SAS Institute.
Hemenway, K., & Armitage, H. (1984) Proposed Syntax Standard for UNIX System Commands. In Summer USENIX Conference (Washington, DC). El Cerrito, CA: USENIX Association.
Keppel, G. (1973) Design and Analysis: A Researcher's Handbook. Englewood Cliffs, NJ: Prentice-Hall.
Kerlinger, F. N., & Pedhazur, E. J. (1973) Multiple Regression in Behavioral Research. New York, NY: Holt Rinehart Winston.
Kernighan, B. W., & Ritchie, D. M. (1979) The C Programming Language. Englewood Cliffs, NJ: Prentice-Hall.
Korn, D. G. (1983) KSH: A Shell Programming Language. In Summer USENIX Conference. El Cerrito, CA: USENIX Association. pp. 191-202.
Nie, N. H., Jenkins, J. G., Steinbrenner, K., & Bent, D. H. (1975) SPSS: Statistical Package for the Social Sciences. New York: McGraw-Hill.
Perlman, G. (1986) Multilingual Programming: Coordinating Programs, User Interfaces, On-Line Help, and Documentation. ACM SIGDOC Asterisk, 123-129.
Perlman, G., & Horan, F. L. (1986) Report on |STAT Release 5.1 Data Analysis Programs for UNIX and MSDOS.
Behavior Research Methods, Instruments, & Computers, 18:2, 168-176.
Perlman, G. (1987) The |STAT Handbook (3rd Edition). Tyngsboro, MA: Wang Institute of Graduate Studies.
Ritchie, D. M., & Thompson, K. (1974) The UNIX Time-Sharing System. Communications of the Association for Computing Machinery, 17:7, 365-375.
Ryan, T. A., Joiner, B. L., & Ryan, B. F. (1976) MINITAB Student Handbook. North Scituate, MA: Duxbury Press.
Siegel, S. (1956) Nonparametric Statistics for the Behavioral Sciences. New York: McGraw-Hill.