




                           CHAPTER 12

               INTRODUCTION TO MULTIPLE REGRESSION



  This program enables you to conduct both simultaneous and
hierarchical multiple regression analysis.  It is modeled largely
on the treatment of this topic in Cohen, J. & Cohen, P., 'Applied
Multiple Regression/Correlation Analysis for the Behavioral
Sciences' (2nd ed.).  Hillsdale, NJ: Lawrence Erlbaum, 1983.

  The  program enables you to employ a summary data file or a raw
data file.  A summary data file contains only the means, standard 
deviations, and correlations for the variables that will be  used
in  the  regression  analysis.  It may have 51 or fewer variables 
from which you may select any subset for use  in  conducting  the
regression  analysis.   A  raw  data  file  may  have  up  to 200 
variables and there is no limit to the number of cases.   If  you
use  a  raw  data  file  having  more than 51 variables, you must 
select 51 or fewer  of  them  for  inclusion  in  the  regression
analysis.

  If you already have a raw or summary data file that is properly
constructed  for  use with this program, you may proceed with the 
analysis of your data.  If you do not  yet  have  a  proper  data
file,  please  examine the procedures for constructing input data 
files.


             IMPORTANCE OF USING SUMMARY DATA FILES

  Suppose you have a raw data file containing, say, 837 cases and 
48 variables.  Suppose further that  you  plan  to  conduct  nine
separate  regression  analyses  using only 17 of the variables in 
that data file.  Although the multiple  regression  program  will
accept  and  process that raw data file, the program will have to 
read the  837  cases  and  48  variables  nine  different  times.
Although  the SPPC is not slow, it will take a bit of time to get
the job done.

  For a problem such as this one it is highly recommended that
you use the correlation matrix program to compute a correlation
matrix for the 17 variables that will be used in your nine
regression analyses.  The correlation matrix program will prepare
a summary data file containing the variable means, standard
deviations, and correlations.  If you do that once and save the
output to a disk file, that summary data file can then be used as
the input file for the multiple regression program.  Your nine
regression analyses will then run much more quickly, because the
raw data are read only once.  Moreover, it is easier to work with
the selected 17 variables than with all 48 variables in the raw
data file.
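
  As a rough illustration of what a summary data file holds, the
sketch below computes means, standard deviations, and correlations
from raw cases.  It is written in Python rather than in the SPPC
itself.  The function name 'summarize' and the n - 1 divisor for
the standard deviations are assumptions, not details taken from
the correlation matrix program.  The two columns are the YEARS and
PUBLICATIONS values from the salary example later in this chapter.

```python
import math

def summarize(rows):
    """Return (means, sds, corr) for a list of equal-length numeric cases."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    # Sums of squares and cross-products of deviations from the means.
    sscp = [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows)
             for j in range(k)] for i in range(k)]
    # Assumption: sample standard deviations with an n - 1 divisor.
    sds = [math.sqrt(sscp[j][j] / (n - 1)) for j in range(k)]
    corr = [[sscp[i][j] / math.sqrt(sscp[i][i] * sscp[j][j])
             for j in range(k)] for i in range(k)]
    return means, sds, corr

years = [1, 2, 5, 7, 10, 4, 3, 8, 4, 16, 15, 19, 8, 14, 28]
pubs = [2, 4, 5, 12, 5, 9, 3, 1, 8, 12, 9, 4, 8, 11, 21]
means, sds, corr = summarize([list(case) for case in zip(years, pubs)])
print(means, sds, corr[0][1])
```

  The three returned lists correspond to the three kinds of values
that a summary data file stores.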

  The multiple regression program ALWAYS works from a correlation
matrix.  Thus, when you enter a raw data file the program must
first compute the correlation matrix and then solve the
regression problem.  The program has therefore been constructed
so that you can enter a previously computed correlation matrix
as a 'summary data file'.  This can be a very efficient approach
even for those who work with modest-sized raw data files.
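
  To show why means, standard deviations, and correlations are
sufficient input, the sketch below solves a regression problem
from those summary values alone.  It is an illustration in Python
under stated assumptions, not the program's own algorithm: the
function names 'solve' and 'regress_from_summary' are hypothetical,
the summary values are invented, and Gaussian elimination stands
in for whatever method the SPPC actually uses.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def regress_from_summary(means, sds, corr):
    """Variable 0 is the dependent variable; the rest are predictors."""
    k = len(means) - 1
    rxx = [[corr[i][j] for j in range(1, k + 1)] for i in range(1, k + 1)]
    rxy = [corr[i][0] for i in range(1, k + 1)]
    beta = solve(rxx, rxy)                                  # standardized betas
    b = [beta[i] * sds[0] / sds[i + 1] for i in range(k)]   # raw score b's
    intercept = means[0] - sum(b[i] * means[i + 1] for i in range(k))
    r2 = sum(beta[i] * rxy[i] for i in range(k))            # squared multiple R
    return intercept, b, beta, r2

# Hypothetical summary values: one dependent and two independent variables.
means = [10.0, 5.0, 3.0]
sds = [2.0, 1.0, 1.0]
corr = [[1.0, 0.5, 0.4],
        [0.5, 1.0, 0.5],
        [0.4, 0.5, 1.0]]
intercept, b, beta, r2 = regress_from_summary(means, sds, corr)
print(intercept, b, beta, r2)
```

  No raw cases are read at any point, which is why a summary data
file can replace a raw data file as input.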


                         MISSING VALUES

  When you analyze a raw data file YOU MUST ENTER A MISSING
VALUES CODE.  When you create a raw data file you may use one
missing values code, which applies to all variables.  The missing
values code MUST be a positive or negative real number.  It
cannot be a character string.

  For example, if your data file contains both positive and
negative values, you may wish to use as your missing values code
some extremely large value that will always exceed any valid
numeric data: say, 1.0E200 (entered as 1e200).  On the other
hand, if you know that all your valid data values are positive
numbers, you may wish to use as your missing values code a number
as simple as, say, -1.  You may use any real number as your
missing values code, provided it can never occur as a valid data
value.

  When creating a raw data file, it is recommended that you note
the missing values code in the file header.  If, for example, you
use a missing values code of -9, it would be good practice to
place 'MV = -9' at the end of the file header.  Then, whenever
you call up the file you will see the missing values code on
screen.  This is especially useful if you put a data file aside
for several months, because by the time you return to it you may
have forgotten the missing values code.

  Even if your file has no missing values you must still give the
program  a  missing  values code.  In such cases we recommend the
consistent use of 1e200. 

  When you execute the multiple regression program  using  a  raw
data  file,  the  program  will ask you to enter a missing values 
code.  If your file has no missing values, merely enter 1e200  as
the missing values code.  YOU MUST SUPPLY A MISSING VALUES CODE.
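
  The effect of a missing values code can be sketched as follows.
This Python fragment shows one plausible treatment, in which any
case containing the code is dropped entirely (listwise deletion).
The function name 'listwise' and the sample cases are hypothetical,
and the SPPC's own handling may differ in detail.

```python
MISSING = 1e200  # the code recommended in the text for files with no missing data

def listwise(rows, missing=MISSING):
    """Keep only the cases in which no variable equals the missing code."""
    return [r for r in rows if all(v != missing for v in r)]

# Hypothetical cases; the second has a missing value in its third variable.
rows = [
    [18000.0, 1.0, 2.0],
    [19961.0, 2.0, MISSING],
    [19828.0, 5.0, 5.0],
]
complete = listwise(rows)
print(len(complete), "of", len(rows), "cases retained")
```

  Because the comparison is an exact numeric match, the code must
be a value that never occurs as valid data.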


                      REORDERING VARIABLES

  The  multiple  regression  program  always  presumes  that  the 
dependent variable is the very first variable in  your  file  and
that  it  has an order code of 0.  That is, all variables have an 
order code ranging  from  0  to  k  where  k  is  the  number  of
independent variables that you will employ.

  If  the  first  variable  in  your  file  is  NOT the dependent 
variable, you will have to re-order your variables.  The  program
will  present  an  opportunity  to re-order your variables if you 
need to do that.  If you elect to  re-order  your  variables  for
either  a  raw  or  summary  data file, complete instructions for 
doing that will  be  presented  to  you  when  you  conduct  your
analysis.
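
  The idea behind order codes can be sketched in a few lines of
Python.  The function name 'reorder' is hypothetical and the
fragment only illustrates the bookkeeping; it is not the program's
own procedure.  The order codes are those used for the
PUBLICATIONS analysis later in this chapter, and the two cases are
the first two rows of the salary example.

```python
def reorder(names, rows, order_codes):
    """Arrange columns so the variable with order code 0 comes first."""
    # Column positions sorted by their order codes: 0, 1, ..., k.
    positions = sorted(range(len(order_codes)), key=lambda j: order_codes[j])
    new_names = [names[j] for j in positions]
    new_rows = [[r[j] for j in positions] for r in rows]
    return new_names, new_rows

names = ["SALARY", "YEARS", "PUBLICATIONS", "GENDER", "CITATIONS"]
rows = [[18000, 1, 2, 0, 1],
        [19961, 2, 4, 0, 0]]
new_names, new_rows = reorder(names, rows, [1, 2, 0, 3, 4])
print(new_names)
print(new_rows[0])
```

  After reordering, PUBLICATIONS occupies the first position and is
therefore treated as the dependent variable.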


                       COMPUTING RESIDUALS

  If you use a raw data file and choose the  'listwise'  deletion
option,  you  will  then  have  the option of computing estimated 
values of the dependent variable, Y' = Xb, and the  residuals  or
error  estimates,  e  = Y - Y'.  The program will also report the
residuals as t-statistics where t = e/see and see is the standard
error of estimate. 

  If you choose to compute residuals, you will be asked for the
name of your input file.  Be sure to give the same raw data file
that you used for the regression analysis.  You may send the
residuals to the screen, the printer, or a disk file.
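
  The quantities described above can be sketched in Python as
follows.  The function name 'residual_report' and the toy data are
hypothetical.  The standard error of estimate is assumed to be the
square root of the error sum of squares divided by n - k - 1,
where k is the number of independent variables.

```python
import math

def residual_report(X, y, b):
    """X: rows of predictor values; b: [intercept, b1, ..., bk]."""
    n, k = len(y), len(b) - 1
    # Estimated values Y' = Xb (with the intercept written separately).
    y_hat = [b[0] + sum(bj * xj for bj, xj in zip(b[1:], row)) for row in X]
    e = [yi - yh for yi, yh in zip(y, y_hat)]                 # e = Y - Y'
    see = math.sqrt(sum(ei ** 2 for ei in e) / (n - k - 1))   # std. error of estimate
    t = [ei / see for ei in e]                                # t = e / see
    return y_hat, e, see, t

# Toy data with one predictor; [2.2, 0.6] are its least-squares coefficients.
X = [[1.0], [2.0], [3.0], [4.0], [5.0]]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat, e, see, t = residual_report(X, y, [2.2, 0.6])
for row in zip(y, y_hat, e, t):
    print("%8.3f %8.3f %8.3f %8.3f" % row)
```

  Each printed row gives the observed value, the estimated value,
the residual, and the residual expressed as a t-statistic.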


                      REDUNDANCY ESTIMATES

  The program will always provide a 'redundancy' estimate for
each independent variable.  Each estimate represents the degree
of overlap, or shared variance, among the independent variables,
computed in a hierarchical fashion.  The first value will always
be zero and the last one describes the overlap among all the
independent variables.  These estimates help to describe the
degree of multicollinearity among the independent variables.

  For the technically curious, the redundancy estimates are based
on  the diagonal elements of the upper triangular Cholesky factor
of the correlation matrix.
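
  The exact formula is not given here, so the following Python
sketch rests on an assumption: that the redundancy for the j-th
independent variable is 1 minus the square of the j-th diagonal
element of the Cholesky factor.  That quantity equals the squared
multiple correlation of variable j with the variables entered
before it, and it is zero for the first variable, which agrees
with the description above.  The function name 'cholesky_upper'
and the correlations are hypothetical.

```python
import math

def cholesky_upper(R):
    """Return upper triangular U with U'U = R (R symmetric positive definite)."""
    k = len(R)
    U = [[0.0] * k for _ in range(k)]
    for i in range(k):
        U[i][i] = math.sqrt(R[i][i] - sum(U[m][i] ** 2 for m in range(i)))
        for j in range(i + 1, k):
            U[i][j] = (R[i][j] - sum(U[m][i] * U[m][j] for m in range(i))) / U[i][i]
    return U

# Hypothetical correlations among three independent variables.
R = [[1.0, 0.6, 0.3],
     [0.6, 1.0, 0.5],
     [0.3, 0.5, 1.0]]
U = cholesky_upper(R)

# Assumed definition: redundancy_j = 1 - u_jj squared.
redundancy = [1.0 - U[j][j] ** 2 for j in range(len(R))]

# The product of the squared diagonal elements is the determinant of R.
determinant = math.prod(U[j][j] ** 2 for j in range(len(R)))
print(redundancy, determinant)
```

  Under this reading, a determinant near zero and redundancy
values near one both signal strong multicollinearity.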


                       TESTING THE PROGRAM

  If you wish to test the computational accuracy of the program,
or simply to see how it functions, choose the option to use a
summary data file and then use the file on your SPPC Disk 1 named
LONGLEYS.DAT.  You can also choose the raw data option and use
the raw data file named LONGLEYR.DAT.  It too is stored on your
SPPC Disk 1.

  The cited reference that will appear on screen when you call
for the file provides a good discussion of computational accuracy
and some of the problems in achieving it.


     CONDUCTING ORDINARY (SIMULTANEOUS) MULTIPLE REGRESSION

  The simplest and most straightforward application  of  multiple
regression is to fit a so-called simultaneous regression model in
which   you  wish  to  determine  how  well  several  independent 
variables  will  predict  or  account  for  a  single   dependent
variable.   For  an excellent introduction to this topic, see the 
first three chapters of the  text  by  Cohen,  J.   &  Cohen,  P.
'Applied   Multiple   Regression/Correlation   Analysis  for  the 
Behavioral Sciences', (2nd ed). Hillsdale, NJ: Lawrence  Erlbaum,
1983.

  Suppose, for example, that you wish to use several variables to
predict salaries of university professors.  A hypothetical
example is shown below, in which salary is thought to be
predictable from a knowledge of (1) number of years teaching,
(2) number of journal articles published, (3) gender, and (4) the
number of times the professor's published works have been cited
(Cohen & Cohen, p. 99).

Order Code:     0        1          2           3          4
Variable:     SALARY   YEARS   PUBLICATIONS   GENDER   CITATIONS

              18000      1          2           0          1
              19961      2          4           0          0
              19828      5          5           1          1
              17030      7         12           1          0
              19925     10          5           0          0
              19041      4          9           0          1
              27132      3          3           1          0
              27268      8          1           0          1
              32483      4          8           0          0
              27029     16         12           1          4
              25362     15          9           0          0
              28463     19          4           0          3
              32931      8          8           0          5
              28270     14         11           0          0
              38362     28         21           0          3
  

  Since  SALARY is the dependent variable, these data can be used 
without reordering the variables.  If you wish to  analyze  these
data  to  see the results, choose the raw data option and use the 
raw data file on your SPPC Disk 1  having  the  name  COHENR.DAT.
Or,  you  can  choose  the  summary  data option and then use the
summary data file on the SPPC Disk 1 having the name COHENS.DAT.

  Please note that you will automatically be given the results of
both  simultaneous  and  hierarchical multiple regression.  Since 
the research task in this example is limited to  determining  how
well the four independent variables will account for, predict, or
explain  the  dependent  variable,  you  should  ignore  all  the
hierarchical regression results.

  Another interesting way to analyze the above data is to examine 
the extent to which the number  of  publications  (productivity?)
can  be predicted from knowledge of the other four variables.  If 
you wish to address that question,  analyze  the  same  data  but
re-order the variables as follows:

 Order Code:     1       2          0           3          4
 Variable:    SALARY   YEARS   PUBLICATIONS   GENDER   CITATIONS

  Partial  results  of  this  analysis are shown as follows:

            SIMULTANEOUS MULTIPLE REGRESSION RESULTS

               Raw Score b-  Standardized
Variable Name  Coefficients      Beta       t-ratio      p <=
-------------  ------------  ------------  ---------  ----------
    Intercept       0.10117
       SALARY       0.00011       0.13998    0.45220      0.6636
        YEARS       0.45272       0.66087    2.27160      0.0445
       GENDER       2.09414       0.18656    0.79723      0.5510
    CITATIONS      -0.23785      -0.07719   -0.28770      0.7753


                       ANALYSIS OF VARIANCE

                                            Hierarchical
   VARIANCE             SUM OF       MEAN      Step-down
    SOURCE        df    SQUARES     SQUARES      F-ratio   p <=
---------------  ---  ---------  ----------  -----------  ------
         SALARY   1    78.42220    78.42220      4.26177  0.0636
          YEARS   1    94.95450    94.95450      5.16021  0.0445
         GENDER   1    10.68710    10.68710      0.58078  0.5308
      CITATIONS   1     1.52313     1.52313      0.08277  0.7753
For this set:     4   185.58700    46.39670      2.52138  0.1072

          Error  10   184.01300    18.40130
          Total  14   369.60000


                       SUMMARY STATISTICS

        Dependent Variable           =    PUBLICATIONS

        Omnibus F-ratio              =        2.521383
        Significance Level,       p <=        0.107212

        Multiple R                   =        0.708610
        Squared Multiple R           =        0.502129
        Shrunken R                   =        0.550437
        Shrunken Squared R           =        0.302980
        Determinant of Rxx           =        0.397621

        Regression Sum of Squares    =      185.587000
        Error Sum of Squares         =      184.013000
        Standard Deviation of y'     =        4.307980
        Standard Error of Regression =        4.289674

