|STAT: Data Manipulation and Analysis Programs for MSDOS and UNIX
A Tutorial Overview of Release 5.3
Gary Perlman

Copyright 1986, 1987 Gary Perlman
Permission to copy without fee all or part of this material is granted provided that the copies are not made for direct commercial advantage, the above copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of Gary Perlman. To copy otherwise, or to republish, requires a fee and/or specific permission.

*1. Introduction

|STAT is a collection of over 25 cooperating data manipulation and analysis programs running on the MSDOS and UNIX operating systems. In this paper, I will give examples of the main features of |STAT programs and how they work together. |STAT is an interesting package for several reasons.

Portability
    |STAT programs run on all MSDOS and UNIX systems.

Familiarity
    |STAT programs integrate into existing familiar environments and do not require users to learn many new systems. |STAT input files are plain text files and can be edited with almost any text editor or word processing software, so no new input editor need be learned. |STAT programs are run from the UNIX command shells and MSDOS COMMAND.COM, so no new command system need be learned. |STAT does not introduce its own programming language, but depends on the existing capabilities of MSDOS BATCH files and UNIX shell scripts. Finally, |STAT does not need to recreate existing functionality because |STAT programs can be used with standard utilities.

Usability
    The programs are easy to use because |STAT uses simple, consistent input/output, user interface, and other conventions throughout the package, more so than most UNIX or MSDOS utilities. Most users can run simple analyses within a few minutes of first seeing the documentation.

Versatility
    The package makes heavy use of the UNIX design philosophy of building simple tools that can be combined to do complex tasks. 
Most |STAT programs are tools that can be used with other MSDOS or UNIX utilities on tasks not related to data analysis.

Applicability
    |STAT provides the most commonly used analyses, especially those taught in undergraduate statistics, although it is not a comprehensive statistical package.

Field-Tested
    Versions of |STAT have been used at hundreds of educational, research, industrial, and government sites by thousands of users since 1980.

Affordability
    The package is distributed under a liberal copyright and for a low one-time distribution cost. Unlimited copies of the programs and documentation can be made and distributed, provided such copies are not made for gain. This policy makes the package ideal for teaching, because free copies can be made for students, who can take |STAT with them at the end of a course.

*2. Overview of |STAT

*2.1. Introductory Examples

In these examples, I hope to show how |STAT programs work together. I will go through increasingly complicated examples, explaining new features as I use them. I will show how the examples would look on MSDOS, rather than UNIX, but the form of commands is similar for both systems. By the end of these examples, you should understand how to construct complicated commands by combining data manipulation and analysis tools. In all the examples, I will use upper-case names to refer to programs and files in the text. In commands set apart from the text, I will use lower-case names. The MSDOS command interpreter is insensitive to case differences in program and file names, so in practice, lower case would be typed.

To do data analysis, we need data. Rather than collect data, or make up a realistic example, I will use a series of numbers between 1 and 100, a data set that is easy to understand. There is a |STAT utility to generate a series of numbers, called SERIES. The following command prints the series from 1 to 100 (1, 2, 3, ..., 100), with each number on a new line. 
    series 1 100

This series prints out on the screen, the default or ``standard output'' for all |STAT programs. The series could be saved in a file by using ``output redirection.'' The following command generates a series of numbers from 1 to 100 and saves the output in a file called NUMBERS.DAT. The notation ``> filename'' tells the command interpreter to place the output of a command in the named file, rather than on the screen.

    series 1 100 > numbers.dat

We can get summary statistics on these numbers with the STATS command. By default, |STAT programs read their input from the keyboard, but the default or ``standard input'' can be redirected from a file. The following command tells STATS to print the mean and standard deviation (sd) of the data in the file NUMBERS.DAT. The notation ``< filename'' causes a program to read from the named file, rather than from the keyboard.

    stats mean sd < numbers.dat

STATS reads numbers in ``free-format.'' That is, it looks for numbers separated by any white space (i.e., blank spaces, tabs, new lines). The output from the STATS program is one line with the mean and standard deviation of the numbers in NUMBERS.DAT.

    50.5 29.0115

Many |STAT programs can read data like that in NUMBERS.DAT. DESC is a descriptive statistics program that prints many summary statistics about its free-format input. The following command tells DESC to read from the file NUMBERS.DAT and put its output in the file RESULTS.

    desc < numbers.dat > results

These results could be printed on paper with the standard PRINT command.

    print results

A similar result could be obtained by sending the output of DESC directly to the printer device, PRN.

    desc < numbers.dat > prn

Now suppose you want to analyze transformed data, rather than the raw data. Most |STAT programs do not have built-in transformation facilities. The most important |STAT program that specializes in transformations is DM, the data manipulator. 
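As an aside for UNIX readers: nothing in this computation depends on |STAT itself. A rough stand-in for the STATS command above, using only the standard seq and awk utilities (the awk program below is my own illustration, not |STAT code), is:

```shell
# Stand-in for: series 1 100 | stats mean sd
# seq plays the role of SERIES; awk computes the mean and the
# (n-1)-denominator standard deviation that STATS reports.
seq 1 100 |
awk '{ sum += $1; sumsq += $1 * $1; n++ }
     END {
       mean = sum / n
       sd = sqrt((sumsq - n * mean * mean) / (n - 1))
       printf "%g %g\n", mean, sd
     }'
```

This prints 50.5 29.0115, the same two numbers STATS produced above.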
Suppose we want to summarize the natural logarithms of the numbers in NUMBERS.DAT. The following command tells DM to print the logarithm of the first numeric column (x1) of its input from NUMBERS.DAT. To DM, and other |STAT programs, a column is any text on a line, separated from other text by white space.

    dm log(x1) < numbers.dat

This transformed data could then be analyzed with an analysis program, such as STATS or DESC. To transform a data set, never redirect output to the input file like this:

    dm log(x1) < numbers.dat > numbers.dat

because the input file would get erased before it could be read. The only safe way to do the job with redirection is with a temporary file. MSDOS (version 2.0 or later) and UNIX both give users the ability to take the output of one program and make it the input to another. This is called pipelining, because it is as though there is a pipe conveying the information (data) from one program directly to another. We could have computed the mean and standard deviation of the numbers from 1 to 100 with the following pipeline.

    series 1 100 | stats mean sd

The pipe symbol, the vertical bar, |, says to take the output of the preceding program (SERIES) and make it the input to the following program (STATS). The command to transform with DM and analyze the series with DESC would look like this:

    dm log(x1) < numbers.dat | desc

The heavy use of the pipe symbol is why the package is named |STAT. Piping is about the same as saving the transformed numbers in a temporary file, making the file the input to DESC, and then removing the temporary file. This longer but equivalent form is shown below.

    dm log(x1) < numbers.dat > tmp
    desc < tmp
    del tmp

Pipelines allow users to avoid creating and removing temporary files, so creating the file NUMBERS.DAT could have been avoided. The analysis of the logarithms of the first 100 natural numbers could be done with the following pipeline. 
    series 1 100 | dm log(x1) | desc

This is a long command line for most MSDOS users, and with |STAT commands it is useful to be able to edit (insert and delete characters) within the current and previous command lines. There are several public domain command line editors available for MSDOS; for UNIX, KSH (Korn, 1983) can be purchased from AT&T through the UNIX Toolchest (Brooks, 1985).

The next few examples will show how to control program outputs with options. When no options are supplied, |STAT programs assume the ones most often requested. Most programs have options to control what they produce, and all programs with options have built-in option summaries. |STAT program options follow the UNIX convention of single letters preceded by a minus sign. The DESC descriptive statistics program can print summary statistics, frequency tables, and histograms. To get a histogram of the data in the file NUMBERS.DAT, you would run DESC with the -h (histogram) option.

    desc -h < numbers.dat

If you want both summary statistics and a histogram, you would use the -s and -h options. This could be typed as:

    desc -s -h < numbers.dat

or as:

    desc -sh < numbers.dat

because logical (on/off) options can be ``bundled.'' Besides logical options, there are options for which values must be supplied, such as the width of a plot.

*2.2. Package Conventions

|STAT programs are designed for human efficiency first, and then program efficiency. A program's performance, especially on a single-user system, should not be measured without considering how long it takes a person to get it running correctly. An early reason for writing |STAT programs was that with many packages it took too long to write the request for the desired analyses, often with false attempts. The main design feature of |STAT programs is that each program is a specialized tool. These tools are designed to cooperate, so they can be combined to do many tasks. 
Functionality found in many data analysis programs, such as data transformations, is in separate specialist tools. Some overlap is necessary: to ensure that analysis program inputs are reasonable, data type and range checking must be built into all analysis programs. One result of the tool design philosophy for |STAT is that analyses take the form shown below.

    EXTRACT DATA -> TRANSFORM -> ANALYSIS -> RESULT FORMAT

An analysis begins with raw data, or data generated by a program like SERIES, which generates a series of numbers. A subset of the data is extracted (copied, really), transformed, or formatted, to be ready for input to an analysis program, which produces results.

|STAT program user interfaces are not flashy, but they are effective. The main design principles are simplicity, consistency, robustness, and feedback. After the examples, I will try to make these design principles more concrete. |STAT programs make heavy use of the standard input and standard output, using the <, |, and > notation in the MSDOS and UNIX command interpreters. One benefit of this is uniformity; users do not need to learn many rules for reading and writing files. Another benefit is that |STAT programs can work together, and with standard MSDOS and UNIX utilities (any that read the standard input and write the standard output).

As I have emphasized, |STAT programs are specialized tools, and I have tried to avoid duplicating the functions handled by existing tools. Sophisticated graphics and data entry are handled by graphics packages and generic text editing programs, respectively. |STAT programs do not assume any unusual data file format, but use human-readable files, so |STAT users do not have to learn how to use yet another text editor. |STAT programs can work with almost any program that reads plain text files. 
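For UNIX readers, the same stage pattern can be sketched with stock utilities standing in for |STAT programs (seq and awk below are ordinary UNIX tools used for illustration, not part of |STAT):

```shell
# GENERATE -> TRANSFORM -> ANALYSIS with standard tools:
# generate the integers 1..100, take natural logs, report the mean.
seq 1 100 |                      # generate, like SERIES
awk '{ print log($1) }' |        # transform, like dm log(x1)
awk '{ sum += $1; n++ }          # analyze, like stats mean
     END { printf "%.4f\n", sum / n }'
```

The result, 3.6374, reappears later in this overview as the mean of the log column in the PAIR example.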
Perhaps the greatest benefit of the |STAT adoption of the UNIX tool development philosophy is that the tools can be combined in BATCH files on MSDOS, and shell scripts on UNIX. These give |STAT users a simple, universally available programming language, examples of which are given later.

*3. Data Manipulation Programs

|STAT data manipulation programs can work with standard MSDOS and UNIX tools. |STAT data files are human-readable text, so users can use any text editor to enter or modify data. Also, all the standard file manipulation programs available on UNIX and MSDOS are compatible. In Table 1, we see that there are many existing file handling programs available. These work smoothly with |STAT programs, and are not duplicated in |STAT.

Table 1: Standard MSDOS and UNIX Utilities
------------------------------------------
MSDOS      UNIX     Purpose
cd         cd, pwd  change/print working directory
comp       diff     compare and list file differences
copy       cp       copy files
del,erase  rm       remove/delete files
dir        ls       list files in directory
echo       echo     print text to standard output
find       grep     search for pattern in files
mkdir      mkdir    make a new directory
more       more     paginate text on screen
print      print    print files on line printer
rename     mv       move/rename files
rmdir      rmdir    remove directory
sort       sort     sort lines in files
type       cat      print files to standard output

The |STAT data manipulation programs are summarized in Table 2. There are programs for data generation, numerical transformations, formatting, extraction of subsets of data, and validation. Many programs have several uses. 
Table 2: Data Manipulation Programs
-----------------------------------
abut       join data files beside each other
colex      column extraction/formatting
dm         conditional data extraction/transformation
dsort      multiple key data sorting filter
linex      line extraction
maketrix   create matrix format file from free-format input
perm       permute line order randomly, numerically, alphabetically
probdist   probability distribution functions
ranksort   convert data to ranks
repeat     repeat strings or lines in files
reverse    reverse lines, columns, or characters
series     generate an additive series of numbers
transpose  transpose matrix format input
validata   verify data file consistency

*3.1. Data Manipulation Program Descriptions

There are too many programs to describe in detail here, but I'll describe the important features of each. Full documentation is in the |STAT Handbook.

ABUT takes several files and joins the corresponding lines in each. It builds inputs for analysis programs like ANOVA and REGRESS that read multi-column inputs.

COLEX extracts ranges of white space separated columns. It also formats columns so that you can control the width of fields, left or right justification, and alignment of decimal points.

DM is by far the most important data manipulation tool, and only a small part of its versatility will be shown in examples. DM takes a series of expressions involving input columns, and for each expression, DM prints an output column containing the values of the expression. DM has most arithmetic and transcendental functions, string operations, condition testing, and control of flow. For example, DM can print all the lines with some specified text in a specified column, or print the sum of all the numerical columns on each line.

DSORT reorders lines in a data file according to the order of values in any columns. Ordering can be numerical or alphabetical, in increasing or decreasing sequence.

LINEX extracts individual lines and ranges of lines by line numbers. 
MAKETRIX creates a matrix format file by reading white space separated words and printing them so that there are an equal number of words per line.

PERM prints a permutation (reordering) of its input lines. By default, PERM prints the lines in random order, but it can also print them in numerical or alphabetical order.

PROBDIST deals with several probability distributions: uniform, binomial, normal, chi-square, t, and F. For any of these distributions, it can generate random numbers, compute the cumulative probability of obtaining a particular value, or compute the distribution value needed for a particular cumulative probability.

RANKSORT converts the data in each of its input columns to ranks to allow the application of some non-parametric methods.

REPEAT generates multiple copies of its input.

REVERSE reverses the order of lines in its input, fields (columns) in its input lines, or characters within lines, in any combination.

SERIES generates linear series between two values. The series can be ascending or descending. By default, there is a difference of one between adjacent series elements, but this can be changed to smaller or larger units. Non-linear series can be generated by transforming a linear series with DM.

TRANSPOSE flips the rows and columns of its input: the matrix transpose operation.

VALIDATA reports the data types of its input columns and where, if at all, there is a change in the number of columns per line. If a data analysis program reports an input format error, VALIDATA can help find the problem.

*3.2. Data Manipulation Examples

The following examples will show how |STAT programs can be combined to perform complicated manipulations. Suppose you want to generate a series of numbers in an inverse progression: 1/1 1/2 1/3 1/4 1/5 1/6 .... The SERIES program generates linear series, but the output from SERIES can be transformed by DM. The following command generates a series from 1 to 100 and computes the inverse of each number in the series. 
    series 1 100 | dm 1/x1

Another series contains the squares of the first 100 integers: 1, 4, 9, ..., 10000. This can be generated with SERIES and a different DM transformation. The following transformation by DM computes and prints the squares of the values in its input column.

    series 1 100 | dm x1*x1

This series of squares could be transformed back to the original linear series with another (the square root) transformation by DM.

    series 1 100 | dm x1*x1 | dm sqrt(x1)

Suppose you want to generate a 5 by 5 matrix of random numbers selected without replacement from 1 to 25. This is a reformatted random permutation of the numbers from 1 to 25. First, you would generate the numbers from 1 to 25 with SERIES. Then, you would permute the values with the PERM program. Finally, you would make a matrix format file with the MAKETRIX program.

    series 1 25 | perm | maketrix 5

To sample from the same values, but with replacement, the PROBDIST (probability distribution) program is used. The following command generates 25 values uniformly distributed from 0.0 up to, but not including, 1.0.

    probdist rand uni 25

The range of these numbers can be transformed with DM by multiplying by 25 and adding 1 to make the range 1.0 up to 25.99..., and by truncating the values to their integer parts with the floor function. Then MAKETRIX reformats the values into a matrix.

    probdist rand uni 25 | dm floor(x1*25+1) | maketrix 5

Getting summary statistics on the last column of a file is easy if the file has the same number of columns on each line. Suppose the file MATRIX.DAT has six columns. We could use COLEX to extract the sixth column.

    colex 6 < matrix.dat | stats

But if a file does not have the same number of columns per line, or if the number of columns is unknown, then REVERSE can reverse the columns on every line so that COLEX can extract the first. In the following example, REVERSE is called with the -f option (reverse ``fields'' on a line) and the output is piped to COLEX. 
    reverse -f < matrix.dat | colex 1

With no options, REVERSE reverses line order. For example,

    series 1 10 | reverse

would print 10, 9, ..., 1, with each number on a separate line. Although this is an intuitive example, it is not realistic, because SERIES can generate descending series. Calling SERIES with the operands 10 and 1 would have the same effect.

DM has special variables containing information about each line. For example, N is the number of columns, and SUM is the sum of all the numbers. One way to compute the sum of the integers from 1 to 50 is to put them all on one line, and print the SUM variable. SERIES generates one number per line, but the TRANSPOSE program can reformat an N-line 1-column file to a 1-line N-column form, just what is needed by DM.

    series 1 50 | transpose | dm SUM

A faster way would be to use the sum request for STATS:

    series 1 50 | stats sum

The MSDOS SORT utility is good for sorting lines alphabetically, but not numerically. For example, here is the output from SORT, transposed so that it fits on one line, when SORT is given the integers from -5 to 15.

    series -5 15 | sort | transpose

    -1 -2 -3 -4 -5 0 1 10 11 12 13 14 15 2 3 4 5 6 7 8 9

Numerically, the first five numbers are descending, then there is 0 and 1, then 10 through 15, and then the rest of the single digits. As character strings, they are in the right order, but not as numbers. The DSORT data sorting program can sort numbers or strings, and it is smart enough to use the right sorting method for the data.

    series -5 15 | dsort | transpose

    -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

From these examples, you can see how the |STAT programs work together, and with MSDOS utilities. There are many ways of accomplishing the same task, and the choice of one over another can be made for the sake of efficiency or readability. 
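The same string-versus-numeric contrast exists among stock UNIX tools, so the last two examples can be mimicked without |STAT. Here sort and sort -n stand in for SORT and DSORT, and paste -s stands in for TRANSPOSE (all are ordinary UNIX utilities used for illustration, not |STAT programs):

```shell
# String order, as in the MSDOS SORT example (LC_ALL=C pins byte collation):
seq -5 15 | LC_ALL=C sort | paste -s -d' ' -

# Numeric order, the method DSORT would choose for numeric data:
seq -5 15 | sort -n | paste -s -d' ' -
```

The first pipeline reproduces the scrambled string ordering shown above; the second reproduces the DSORT result.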
The possibilities for command construction increase with practice and by reading the manual entries for individual programs and the |STAT Handbook (Perlman, 1987).

*4. Data Analysis Programs

|STAT analysis programs compute descriptive and inferential statistics and simple graphs. There are programs for analyzing single variables, paired variables, several numerical variables, and categorical variables.

Table 3: Data Analysis Programs
-------------------------------
anova    multi-factor analysis of variance
calc     interactive algebraic modeling calculator
contab   contingency tables and chi-square
desc     descriptions, histograms, frequency tables
dprime   signal detection d' and beta calculations
oneway   one-way anova/t-test with error-bar plots
pair     paired data statistics, regression, scatterplots
rankind  rank order analysis for independent conditions
rankrel  rank order analysis for related conditions
regress  multiple linear regression and correlation
stats    simple summary statistics
ts       time series analysis and plots

*4.1. Data Analysis Program Descriptions

ANOVA performs a multifactor analysis of variance with one random factor, and up to nine other between-groups (nested) or within-groups (crossed) factors. Between-groups factors can have unequal cell sizes, for which the weighted means solution is used. The method of analysis follows Keppel (1973).

CALC puts the algebraic capabilities of DM into an interactive calculator. It allows the definition of named variables, and what-if style changes to these definitions.

CONTAB is a multi-way cross-tabulation program with chi-square tests of association. The method of analysis follows Siegel (1956) and Bradley (1968).

DESC prints descriptive statistics, t-tests, frequency tables, and histograms.

DPRIME computes signal detection theory values for discrimination (d') and bias (beta). The method of analysis follows Coombs, Dawes & Tversky (1970). 
ONEWAY performs a one-way analysis of variance for which groups can have unequal cell sizes. Optionally, group error-bar plots are printed.

PAIR analyzes and plots paired data. Summary statistics and t-tests are printed for each variable and their difference. A simple linear regression prints the regression equation, correlation coefficient, and significance test for the correlation/regression.

RANKIND analyzes data measured on an ordinal scale from independent groups. Order statistics are reported with tests of location such as the median, Mann-Whitney, and Kruskal-Wallis tests, as described in Siegel (1956).

RANKREL analyzes data measured on an ordinal scale from related samples. Order statistics are reported with tests of location such as the sign test, Wilcoxon signed-ranks test, and Friedman anova of ranks, as described in Siegel (1956).

REGRESS performs multiple linear regression. An optional analysis determines if a predictor significantly improves the multiple R after other predictors have been included. The method of analysis follows Kerlinger & Pedhazur (1973).

STATS prints selected summary statistics with little if any annotation.

TS does a simple time series analysis: autocorrelations, series rescaling, and plots.

*4.2. Data Analysis Examples

To understand all the examples, some knowledge of statistics is needed, but to get a feel for how the programs are used, a little experience with MSDOS or UNIX will be enough.

4.2.1. DESC: Describe a Data Set

DESC describes a single group of data. It prints summary statistics, frequency tables, and histograms. Input numbers are separated by any amount of white space, so format does not matter. Here is an input to DESC:

Input to DESC: DESC.DAT
1 2 3 4 5 6 7 8 9
3 4 5 6 7 8 7 6 5 4 3
7 8 9 8 7 8 9 8 7 8 9 8

The following command requests order and regular statistics with the -o option, and a histogram (-h) with bins one apart (-i 1), with midpoints on unit boundaries (bins start at .5). 
    desc -o -h -i 1 -m .5 < desc.dat

Output 1: DESC - Descriptive Statistics
------------------------------------------------------------
 Under Range    In Range  Over Range     Missing         Sum
          0          32           0           0     199.000
------------------------------------------------------------
       Mean      Median    Midpoint   Geometric    Harmonic
      6.219       7.000       5.000       5.657       4.811
------------------------------------------------------------
         SD   Quart Dev       Range     SE mean
      2.239       1.750       8.000       0.396
------------------------------------------------------------
    Minimum  Quartile 1  Quartile 2  Quartile 3     Maximum
      1.000       4.500       7.000       8.000       9.000
------------------------------------------------------------
       Skew     SD Skew    Kurtosis     SD Kurt
     -0.616       0.433       2.214       0.866
------------------------------------------------------------
  Null Mean           t    prob (t)           F    prob (F)
      0.000      15.709       0.000     246.760       0.000
------------------------------------------------------------
Midpt  Freq
1.000     1  *
2.000     1  *
3.000     3  ***
4.000     3  ***
5.000     3  ***
6.000     3  ***
7.000     6  ******
8.000     8  ********
9.000     4  ****

4.2.2. PAIR: Paired Data Analysis

PAIR analyzes paired data by computing paired data statistics and by plotting one variable against the other in a scatterplot. The input should have two columns of data. That is, each line should have two paired data points, separated by white space. In this example, |STAT data manipulation programs will generate inputs to PAIR. For the first 100 positive integers, we will plot their square roots against their natural logarithms. The command is shown below.

    series 1 100 | dm sqrt(x1) log(x1) | pair -sp -h 10 -w 30 -x sqrt -y log

The command is long, but the parts are easy to read. SERIES creates a series from 1 to 100, and DM prints two columns, one with the square root of the output from SERIES, the other with the natural logarithm. The output from DM is piped to PAIR, which has several options. The -s option requests summary statistics, and the -p option requests a plot. 
The -h 10 option sets the plot height to 10, and the -w 30 option sets the plot width to 30. Axis labels are given with the -x and -y options.

Output 2: PAIR - Paired Data Analysis

Analysis for 100 points:
                  sqrt         log  Difference
Minimums        1.0000      0.0000      0.6137
Maximums       10.0000      4.6052      5.3948
Sums          671.4629    363.7394    307.7235
SumSquares   5049.9989   1408.3305   1160.1631
Means           6.7146      3.6374      3.0772
SDs             2.3385      0.9281      1.4676
t(99)          28.7138     39.1938     20.9681
p               0.0000      0.0000      0.0000

Correlation   r-squared       t(98)           p
     0.9621      0.9256     34.9239      0.0000

  Intercept       Slope
     1.0736      0.3818

    |------------------------------| 4.60517
    |                       4555666|
    |                1454451       |
    |           23342              |
    |        12331                 |
    |      222                     | log
    |    12                        |
    |   12                         |
    |  1                           |
    | 1                            |
    |1                             |
    |------------------------------| 0
     1.000                   10.000
                  sqrt
    r= 0.962

Summary statistics are reported for each input column and their difference. The differences in this example are almost meaningless and should be ignored. The simple linear regression is a good example of how a high correlation (.96) does not always mean that a linear model is appropriate. From the plot, it is clear the relationship is non-linear. The scatterplot uses digits to display the number of input data points on one plot location. Options to set the axis limits can over-ride the default of using extreme data values.

4.2.3. ONEWAY: Single Factor Analysis of Variance

ONEWAY compares the means from several groups using a one-way analysis of variance. Each group's data are separated by a special value called the splitter. Here is a sample input to ONEWAY:

Input to ONEWAY: ONEWAY.DAT
1 2 3 4 5 6 7 8 9 999
3 4 5 6 7 8 7 6 5 4 3 999
7 8 9 8 7 8 9 8 7 8 9 8 999

In the file ONEWAY.DAT, there are three groups separated by the value 999. This file contains the same data as input to DESC in an earlier example. It could be formed using |STAT data manipulation programs by appending the value 999 to every line in DESC.DAT. 
    dm INPUT 999 < desc.dat > oneway.dat

This command uses DM to print each input line using the special INPUT variable, followed by the constant expression 999. Even though it is possible to add the splitter value with tools, it could also be done using a text editor on a copy of the data. The format of the input to ONEWAY does not matter as long as there is white space between each datum or splitter. The analysis of these data is simple. The -p option requests a plot, the -w option sets the plot width to 50, and the -s 999 option sets the group splitter to 999.

    oneway -p -w 50 -s 999 < oneway.dat

Output 3: ONEWAY - Analysis of Variance

Name       N      Mean        SD       Min       Max
Group-1    9     5.000     2.739     1.000     9.000
Group-2   11     5.273     1.679     3.000     8.000
Group-3   12     8.000     0.739     7.000     9.000
Total     32     6.219     2.239     1.000     9.000

Group-1 |<------============(====#=====)============------>|
Group-2 |    <---=======(==#==)========----->              |
Group-3 |                                     <-===(#=)===>|
        1.000                                        9.000

Weighted Means Analysis:
Source        SS   df       MS       F      p
Between   61.287    2   30.643   9.436  0.001 ***
Within    94.182   29    3.248

Default group names are chosen by ONEWAY, but others could have been supplied on the command line. Individual group and overall summary statistics are followed by the error bar plots showing the extreme values (< and >), standard deviation bars (= signs), standard error limits (parentheses), and the means (#). The default significance test uses the weighted means solution for unequal group sizes, but the unweighted means solution can be requested instead.

4.2.4. RANKIND: Rank-Order Analysis for Independent Groups

RANKIND is the non-parametric counterpart to ONEWAY. To demonstrate this, the same analysis will be done on the same data. In this case, the options to RANKIND are the same as for ONEWAY. Note the similarities of the measures of central tendency, the plots, and the significance levels. 
    rankind -p -w 50 -s 999 < oneway.dat

Output 4: RANKIND - Rank-Order Analysis of Independent Groups

          N   NA    Min    25%  Median    75%    Max
Cond-1    9    0   1.00   2.75    5.00   7.25   9.00
Cond-2   11    0   3.00   4.00    5.00   6.75   8.00
Cond-3   12    0   7.00   7.50    8.00   8.50   9.00
Total    32    0   1.00   4.50    7.00   8.00   9.00

Cond-1 |< --------------#--------------- >|
Cond-2 |       < ------#----------- >     |
Cond-3 |                       < ---#--- >|
       1.000                         9.000

Median-Test:
        Cond-1  Cond-2  Cond-3
above        2       1       9      12
below        6       8       0      14
             8       9       9      26
WARNING: 6 of 6 cells had expected frequencies less than 5
chisq 16.387566   df 2   p 0.000276

Kruskal-Wallis:
H (not corrected for ties)  13.123192
Tie correction factor        0.973424
H (corrected for ties)      13.481479
chisq 13.481479   df 2   p 0.001182

*5. Combining Programs in BATCH Files

In this part of the overview, I describe the possibilities for ordinary users to write their own programs with |STAT and MSDOS utilities. In so doing, I will show how |STAT data manipulation and analysis programs work together.

*5.1. Introduction to BATCH Files

Much data analysis is routine, with similar steps used for similar data, and it is useful to be able to repeat a common analysis with different data sets. |STAT programs do not have any built-in programming facility, but instead work with the standard BATCH command script facility built into MSDOS, and the shell programming languages that come with UNIX. In this section, I will describe some typical BATCH files for data analysis. A BATCH file contains a sequence of commands. The inspiration for a particular BATCH file comes from observing similar sequences of commands several times. Perhaps the input files differ, or certain variables change, but the pattern remains. When a pattern is saved in a BATCH file, the things that change (e.g., file names, column numbers) are replaced by variables that are passed to the BATCH command from the command line. I think that BATCH command scripts are underused, perhaps because BATCH is not well known, so I will summarize the main features. 
MSDOS BATCH files have names that end in .BAT. This lets the MSDOS command interpreter recognize them as files containing commands. Lines in BATCH files are commands, or labels beginning with : (colon). When a BATCH file is run, it is called like any compiled program. Parameters can be passed into a command script on the command line, and their values are accessed inside the BATCH file by expressions of the form %N, where N is the position number of the parameter. For example, if the BATCH file MYBATCH.BAT were run with the command

mybatch Hello Caroline

then, inside the BATCH file, %1 would be Hello and %2 would be Caroline. I will not be using all the BATCH commands in Table 4, but the table gives an idea of the sort of programming possible. Full details about BATCH are available in the MSDOS manual.

Table 4: BATCH Commands
-----------------------
echo    control printing of commands, print text
for     iterative execution of commands
goto    transfer control to a label
if      conditional execution of commands
shift   shift positional parameters
pause   wait for key press from user
rem     comment line

*5.2. Example BATCH Files

The following BATCH commands operate on files with several columns, some of which contain data for a variable, and some labels.

5.2.1. Describing Selected Columns in a File

To get descriptive statistics for a column in a file, COLEX can extract the column and DESC can print descriptive statistics. The following BATCH command, DESCOL, uses the name of a file and a column number. The initial ECHO command annotates the output, and shows the correct number and order of arguments.

DESCOL.BAT: Describe Selected Columns
-------------------------------------
echo descol file=%1 column=%2
colex %2 < %1 | desc -oh

5.2.2. Histograms and Scatterplot of Two Variables

Suppose we have some paired data: two variables in a multiple-column data file. A good screening of the data would include a histogram for each variable, and a scatterplot of one against the other.
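The histograms that DESC prints are simple frequency bars, one row of marks per interval. What such a text histogram involves can be sketched in Python; this is only an illustration with made-up data, not DESC's code (DESC chooses its own intervals, and its -i option controls the width).

```python
def text_histogram(data, width=1.0):
    """Count values into intervals of the given width, one * per count."""
    counts = {}
    for x in data:
        lo = int(x // width) * width   # lower edge of this value's interval
        counts[lo] = counts.get(lo, 0) + 1
    lines = []
    for lo in sorted(counts):
        lines.append("%8g %s" % (lo, "*" * counts[lo]))
    return "\n".join(lines)

print(text_histogram([1, 1, 2, 2, 2, 3, 5], width=1.0))
```

Each printed row is an interval's lower edge followed by its frequency bar, the same information a DESC -h histogram conveys.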
If this analysis is going to be done often, then it makes sense to put the commands in a BATCH file like TWOPLOT.

TWOPLOT.BAT: Paired Data Plotting
---------------------------------
echo twoplot file=%1 var1=%2 var2=%3
echo Histogram of Variable %2 in %1
colex %2 < %1 | desc -h
echo Histogram of Variable %3 in %1
colex %3 < %1 | desc -h
echo Scatterplot of %2 Against %3
colex %2 %3 < %1 | pair -p -x %1[%2] -y %1[%3]

TWOPLOT takes three arguments: the file name, the column number of the first variable, and the column number of the second. We use ECHO commands to annotate the BATCH file output. COLEX extracts the named columns for input to DESC, which produces histograms. The final command shows COLEX extracting both columns and piping them to PAIR. Labels built from the file name and the column numbers are given to the X and Y axes.

5.2.3. 2x2 Contingency Tables

The most common contingency table is one with two rows and two columns. For this simple case, the input format of the CONTAB multi-way contingency table program is cumbersome, because more indexes than data must be supplied. A BATCH file can hide many of the details.

2X2.BAT: 2 by 2 Contingency Table Analysis
------------------------------------------
echo A=%1 B=%2 C=%3 D=%4
echo 1 1 %1 1 2 %2 2 1 %3 2 2 %4 | maketrix 3 | contab %5 %6

5.2.4. Regression of Data in Many Files

Data may be kept in separate files, one variable in each one-column file. Suppose that you want to run a regression to predict the data in one file from several others. The ABUT program can be called to join the corresponding lines, and the result can be piped to REGRESS. In the following BATCH file, as many as nine files (variables) are made into a matrix format file, which is piped to REGRESS. The file names are used as the variable names in the call to REGRESS. If fewer than nine files are supplied, the later variables like %9 are empty, and are ignored by the programs.
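What ABUT does before REGRESS sees the data can be pictured as joining corresponding lines of one-column files into rows of a matrix. A hedged Python sketch with made-up columns follows; ABUT itself works on files and has its own handling of uneven lengths.

```python
# Each list stands for one one-column data file; abut joins line i of
# every file into row i of a multi-column matrix for regress.
y  = ["10", "20", "30"]   # predicted variable (first file)
x1 = ["1", "2", "3"]      # predictor files
x2 = ["5", "4", "3"]

rows = [" ".join(cols) for cols in zip(y, x1, x2)]
print("\n".join(rows))
# 10 1 5
# 20 2 4
# 30 3 3
```

The result has the predicted variable in the first column followed by the predictors, which is exactly the input format REGRESS expects.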
FREG.BAT: Regression of Variables in Files
------------------------------------------
echo freg files ...
echo Joining files %1 %2 %3 %4 %5 %6 %7 %8 %9
echo Predict %1 with %2 %3 %4 %5 %6 %7 %8 %9
abut %1 %2 %3 %4 %5 %6 %7 %8 %9 | regress %1 %2 %3 %4 %5 %6 %7 %8 %9

5.2.5. Plotting Residuals From Regression

It is good practice to plot the residuals from linear regression. Residuals are the differences between the obtained and predicted scores, and are usually plotted against the scores predicted by the regression. The correlation between predicted scores and residuals is zero, but a visual inspection of the plot can sometimes detect non-linear trends not apparent in the original data. There is no plotting option built into the REGRESS program. Instead, there is an option (-e) to save the column number of the predicted variable and the multiple regression equation for predictions in the file REGRESS.EQN. This file, along with the original data, can be used by DM to produce the original and predicted values. In the RESID BATCH file, a second pass of DM transforms the obtained and predicted scores to predicted and residual scores. A BATCH REMinder annotates each transformation.

RESID.BAT: Regression with Residual Plot
----------------------------------------
echo resid file=%1
regress -e < %1
rem print Y Y'
rem print Y' Y-Y'
rem plot Y' Y-Y'
dm Eregress.eqn < %1 | dm x2 x1-x2 | pair -p -x Predicted -y Residual

5.2.6. Plotting Functions

|STAT programs can graph mathematical functions. SERIES can generate the domain and DM can compute the range of most functions. In the PLOTFUN BATCH file, a DM expression is followed by the low and high values of the domain. PAIR plots the values, using the DM expression as the Y axis label. Optionally, an increment other than the default of 1 can be supplied to generate domain values with more or less granularity. The examples show the advantage of interactive data analysis: several minor variations on a command can be tried in a minute.
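The SERIES-then-DM pattern that PLOTFUN packages can be mimicked to show what PAIR receives: one (x, f(x)) pair per line. A Python sketch follows; the function and step size are arbitrary examples, and this stands in for the pipeline, not for the |STAT programs themselves.

```python
def series(low, high, step=1.0):
    """Yield low, low+step, ... up to and including high, like SERIES."""
    x = low
    while x <= high + 1e-9:   # small tolerance for floating point steps
        yield x
        x += step

# Mirror of: series -10 10 | dm x1 x1*x1 -- the pairs PAIR would plot,
# here with a coarse step of 5 to keep the output short.
pairs = [(x, x * x) for x in series(-10, 10, 5)]
print(pairs)  # [(-10, 100), (-5, 25), (0, 0), (5, 25), (10, 100)]
```

Shrinking the step, as PLOTFUN's optional increment argument does, simply produces more pairs and hence a more detailed plot.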
PLOTFUN.BAT: Plot a Function
----------------------------
echo plotfun function=%1 low=%2 high=%3 [increment=%4]
series %2 %3 %4 | dm x1 %1 | pair -p -x x1 -y %1

To plot the squares of the integers from -10 to 10, you would type:

plotfun x1*x1 -10 10

This would generate the command:

series -10 10 | dm x1 x1*x1 | pair -p -x x1 -y x1*x1

To get more detail by plotting points every tenth, you would run:

plotfun x1*x1 -10 10 0.1

Finally, here is a function that looks like a butterfly:

plotfun sin(x1)*x1 -20 20 .1

5.2.7. Statistical Tables

|STAT programs can generate many tables like those at the end of statistics texts. DM can produce random numbers and compute common functions like cosine, square root, inverse, and logarithm. For example, the following command generates a table of values, their squares, inverses, square roots, and logarithms.

series 1 100 .1 | dm x1 x1*x1 1/x1 sqrt(x1) log(x1)

The PROBDIST program can compute critical values for several probability distributions. For example, the following computes the critical value for a .05-level t-statistic with 35 degrees of freedom.

probdist crit t 35 .05

This is a single value, and it is easy to write a BATCH file to generate a table of critical t-distribution values. The BATCH file CRITABLE.BAT begins with ECHO commands to label the output table. The working part of the command is on the last line. A series of degrees of freedom, from low to high, is generated by SERIES. This is piped to DM, which adds the strings `crit' and `t' before the degrees of freedom from SERIES. Each line in the output from DM looks like:

crit t 35 .05

This is the format required by PROBDIST. PROBDIST produces critical values, and the -v option produces verbose output with labels on the values. The BATCH file would be called with a command line like:

critable 1 30 .01

CRITABLE.BAT: Critical t Values
-------------------------------
echo Critical values for t for alpha = %3
series %1 %2 | dm 'crit' 't' x1 %3 | probdist -v

*6.
Technical Notes

*6.1. Comparison with Other Packages

|STAT is not a comprehensive package like SPSS (Nie, et al, 1975), BMD (Dixon, 1977), or SAS (Helwig & Council, 1979), which have versions that run on PCs; it was developed to meet data analysis needs as they arose. |STAT is lacking in multivariate analysis and graphics. Still, |STAT performs the analyses most used in education and research, and is comparable to the popular MINITAB package (Ryan, Joiner, & Ryan, 1976). In a sense, |STAT is cream-skimming, because it does not try to deal with large datasets, or the needs of sophisticated users. The only multivariate analysis program is the multiple linear regression program, REGRESS. There are no factor, discriminant, or canonical correlation analyses, and no cluster or multidimensional scaling analyses. None are likely to be added in the next few years. The ANOVA program handles one variable and no covariates, although analysis of covariance can be approximated by combining ANOVA with REGRESS. Plots in |STAT programs are poverty graphics: graphics on a low budget. Although they are not pretty, they are effective. Given the simple format of |STAT files, and the availability of high quality graphics packages that interface to such files, such facilities are not likely to appear in |STAT.

Despite lacking functionality found in many packages, |STAT has strong advantages. Few packages run on both UNIX and MSDOS, two popular systems. This portability, combined with the low cost of |STAT, makes it attractive for use in research and educational facilities. |STAT programs are used even in environments where more powerful packages are available. |STAT programs do not put the user into any special environment, which makes them ideal for quick exploratory data analysis. The data manipulation programs work with many MSDOS and UNIX utilities, and the analysis programs allow extremely rapid analyses.
Few packages allow the following analysis to be done in less than 10 seconds, counting the time to start up the program, enter the data, and get results.

desc
desc: reading input from keyboard:
32 48 47 23 55 14 23
^Z

(output follows practically immediately)

This would lose the data, so a user might prefer to save it in a file first, either with a text editor, or with the MSDOS COPY utility.

copy con desc.in
32 48 47 23 55 14 23
^Z
desc < desc.in

*6.2. User Interface

The basic goal of data analysis is to draw valid conclusions about data. If data are input in the wrong format, or the wrong analysis is selected, then invalid conclusions may result. The user interface provided by |STAT is designed to promote correct analysis. Earlier, I stated that the main user interface principles in |STAT are simplicity, consistency, robustness, and feedback. In this section, I explain how these principles are realized.

6.2.1. Simplicity

|STAT is a package of simple programs. |STAT does not have the full capabilities of packages like SPSS (Nie, et al, 1975), BMD (Dixon, 1977), or SAS (Helwig & Council, 1979), so it is not as complete a statistics package, but |STAT offers the most used analyses. A full data analysis package has trouble addressing the needs of professional analysts while still presenting a simple structure to other users, and that simple structure is a primary goal of the |STAT package.

6.2.2. Consistency

|STAT programs are designed to be externally and internally consistent. To be externally consistent, the programs adopt the UNIX tool philosophy of reading the standard input and writing the standard output. Program options follow the UNIX command line option standard (Hemenway & Armitage, 1984). Once learned, this standard is easy to use, except that program options are single characters, which are not very memorable. The problem of cryptic options is alleviated by providing online help.
For example, if the short usage summary for DESC were not enough, you could get descriptions and current values of all the options by running DESC with the -O option.

Online Option Summary For DESC
------------------------------
desc: descriptive statistics and histograms
-c        cumulative frequencies or proportions    FALSE
-f        request table of frequencies             FALSE
-F Ho     F-test against mean Ho                   0
-h        request a histogram                      FALSE
-i width  interval width for tables & histograms   0
-m min    minimum allowable value                  0
-M max    maximum allowable value                  0
-o        request order statistics                 FALSE
-p        request table of proportions             FALSE
-s        request summary statistics               FALSE
-t Ho     t-test against mean Ho                   0
-v        print statistics in name=value format    FALSE

|STAT programs are used in a familiar environment, not one that is inconsistent with other program use. This is unusual. Many packages boast of having their own input editor, command language or menu interface, and programming language. In most cases, these interfaces are inconsistent with all other systems learned by users. |STAT does not assume that data analysis is done in isolation from all other systems, so when an existing system can be used, a new one is not created. The programs are internally consistent in using the same intuitive input formats and standard option naming conventions. The manual entries are also formatted consistently. Much of the consistency in the package is obtained by automatic code and documentation generation techniques (Perlman, 1986).

6.2.3. Robustness

The programs are designed to be robust against user errors. All the analysis programs check for invalid data types and ranges, and give standard error messages identifying problems. An example error message from the PAIR program follows.
pair: -w option value must be between 5 and 100

|STAT programs do all calculations in double precision, but they do not use many sophisticated numerical methods, nor the high level of adjustment for rounding errors and overflow checking found in the well known statistics packages. Even so, |STAT programs do well on evaluations of statistical software such as the Cornell University Statistical Computing Support Group test (Cornell, 1985). |STAT programs prevent many errors people make in using the big packages. For example, a common error with old packages is to write an incorrect FORTRAN format statement to read the data, sometimes resulting in the program reading the wrong data. |STAT programs avoid format problems by having analysts use human readable input formats. Even in the newer large-package programs, the specification of experimental design can be complicated, and misused by inexperienced users. Design specification languages are avoided completely in |STAT by having programs infer design information from intuitive data formatting schemes. For example, the multiple regression program, REGRESS, requires its input to be a file with the predicted variable in the first column, followed by a series of predictors. The most impressive example of design specification inference in |STAT is in the multi-factor ANOVA program. ANOVA requires that each datum be preceded by labels describing the conditions under which it was obtained. From these lines, ANOVA figures out the experimental design (nesting/crossing of factors).

6.2.4. Feedback

|STAT programs give users feedback about data formats by letting them see the format of the data at any stage of transformation or analysis. When programs are used interactively, users are given program-specific prompts for input. All |STAT messages identify the program they are from, which is important when several programs are combined in a pipeline, and they give diagnostic information to help fix errors.
To give users fast access to information about each program, there are three levels of online help. The first level is a short program usage summary. The second level includes standard help options to tell users about program version numbers (-V), program data set size limits (-L), and program options: names, descriptions, and current values (-O). For example, the limits for the PAIR program are printed with the -L option.

pair -L
pair: program limits:
1000  maximum number of pairs for plots
 100  maximum width of plot
 100  maximum height of plot
   5  minimum plot height or width
 512  maximum number of characters in lines

The latest version of REGRESS is found with the -V option.

regress -V
Program: regress  Version: 5.3  Date: 11/25/86

These built-in options give accurate information for programs even if the documentation has lagged behind. The third level of online help includes full program manual entries online; these are displayed with the MANSTAT program.

*6.3. Hardware and Software Requirements

|STAT runs on two popular operating systems: MSDOS and UNIX. |STAT runs on any operating system compatible with MSDOS versions 2 and 3, and on any UNIX system with a C compiler. The programs use few and only standard system calls, and do not require any special hardware, making |STAT very portable. |STAT has small memory requirements. The programs and online documentation take up about 900K on MSDOS, and 500K on a VAX running UNIX. The largest programs on MSDOS are about 60K on disk, so running the programs with one floppy disk is possible, although a hard disk would be faster. |STAT programs use dynamic memory allocation and try to fit all their input data in main memory, so they can handle as much data as there is memory. The programs are not designed for large data sets, but are robust with several thousand points.
|STAT programs can work with or without a math coprocessor or floating point accelerator, but computationally intensive programs will naturally run faster with special hardware.

*6.4. |STAT Distribution

|STAT is distributed under a liberal copyright. Users can make and distribute unlimited copies of any part of the package, provided |STAT is not distributed for gain. |STAT users are not permitted to create derivative works based on the C source code; source is distributed for compilation only. Experience with users writing their own programs based on |STAT has been bad: some well-meaning people wrote and distributed programs with serious bugs in them. |STAT is distributed without warranty, and program users bear all risks and costs. Any part of |STAT may be changed at any time without warning. |STAT is distributed only under these terms.

|STAT was developed at the University of California, San Diego, and at the Wang Institute of Graduate Studies. Versions of |STAT have been sent to over 500 UNIX sites, and there have been over 300 MSDOS distributions since Fred Horan first ported the 5.1 release of the package to MSDOS at Cornell University in 1985. The MSDOS edition includes executable programs and online manual entries. The package runs on almost any IBM PC compatible running version 2 or 3 of MSDOS. The UNIX version of the package is distributed on tape and includes C (Kernighan & Ritchie, 1979) source files and online manual entries. The package runs on almost any UNIX lookalike system with a C compiler. All distributions of |STAT materials include worldwide postage. The UNIX version of |STAT is distributed for $20, and includes a nine track half inch magnetic tape in 1600 bpi tar format. The MSDOS version is distributed on double-sided double-density floppy diskettes for $15. The |STAT handbook (with examples, conventions, program summaries, and DM and CALC manuals) costs $10. Orders for the programs must be prepaid to G.
Perlman at the Department of Computer and Information Science, The Ohio State University, Columbus, OH 43210 USA. Orders must include a check in U.S. funds drawn on a U.S. bank and should be accompanied by a clear international mailing label.

*7. References

Bradley, J. V. (1968) Distribution-Free Statistical Tests. Englewood Cliffs, NJ: Prentice-Hall.
Brooks, C. A. (1985) Experiences with Electronic Software Distribution. In Summer USENIX Conference. El Cerrito, CA: USENIX Association. pp. 433-436.
Coombs, C. H., Dawes, R. M., & Tversky, A. (1970) Mathematical Psychology: An Elementary Introduction. Englewood Cliffs, NJ: Prentice-Hall.
Cornell University Statistical Computing Support Group (1985) Software Evaluation Form. Ithaca, NY: Cornell University.
Dixon, W. J. (1975) BMD-P Biomedical Computer Programs. Berkeley, CA: University of California Press.
Helwig, J. T., & Council, K. A. (Eds.) (1979) SAS User's Guide. Cary, NC: SAS Institute.
Hemenway, K., & Armitage, H. (1984) Proposed Syntax Standard for UNIX System Commands. In Summer USENIX Conference (Washington, DC). El Cerrito, CA: USENIX Association.
Keppel, G. (1973) Design and Analysis: A Researcher's Handbook. Englewood Cliffs, NJ: Prentice-Hall.
Kerlinger, F. N., & Pedhazur, E. J. (1973) Multiple Regression in Behavioral Research. New York, NY: Holt Rinehart Winston.
Kernighan, B. W., & Ritchie, D. M. (1979) The C Programming Language. Englewood Cliffs, NJ: Prentice-Hall.
Korn, D. G. (1983) KSH: A Shell Programming Language. In Summer USENIX Conference. El Cerrito, CA: USENIX Association. pp. 191-202.
Nie, N. H., Jenkins, J. G., Steinbrenner, K., & Bent, D. H. (1975) SPSS: Statistical Package for the Social Sciences. New York: McGraw-Hill.
Perlman, G. (1986) Multilingual Programming: Coordinating Programs, User Interfaces, On-Line Help, and Documentation. ACM SIGDOC Asterisk, 123-129.
Perlman, G., & Horan, F. L. (1986) Report on |STAT Release 5.1 Data Analysis Programs for UNIX and MSDOS.
Behavior Research Methods, Instruments, & Computers, 18:2, 168-176.
Perlman, G. (1987) The |STAT Handbook (3rd Edition). Tyngsboro, MA: Wang Institute of Graduate Studies.
Ritchie, D. M., & Thompson, K. (1974) The UNIX Time-Sharing System. Communications of the Association for Computing Machinery, 17:7, 365-375.
Ryan, T. A., Joiner, B. L., & Ryan, B. F. (1976) MINITAB Student Handbook. North Scituate, MA: Duxbury Press.
Siegel, S. (1956) Nonparametric Statistics for the Behavioral Sciences. New York: McGraw-Hill.