Subject: comp.ai.neural-nets FAQ, Part 2 of 7: Learning
Supersedes: <nn2.posting_825566422@hotellng.unx.sas.com>
Date: Fri, 29 Mar 1996 04:00:14 GMT

URL: ftp://ftp.sas.com/pub/neural/FAQ2.html
Maintainer: saswss@unx.sas.com (Warren S. Sarle)

This is part 2 (of 7) of a monthly posting to the Usenet newsgroup
comp.ai.neural-nets. See part 1 of this posting for full information on
what it is all about.

========== Questions ========== 
********************************

Part 1: Introduction
Part 2: Learning

   How many learning methods for NNs exist? Which?
   What is backprop?
   What are conjugate gradients, Levenberg-Marquardt, etc.?
   How should categories be coded?
   Why use a bias input?
   Why use activation functions?
   What is a softmax activation function?
   What is overfitting and how can I avoid it?
   What is jitter? (Training with noise)
   What is early stopping?
   What is weight decay?
   What is Bayesian estimation?
   How many hidden units should I use?
   How can generalization error be estimated?
   What are cross-validation and bootstrapping?
   Should I normalize/standardize/rescale the data?
   What is ART?
   What is PNN?
   What is GRNN?
   What about Genetic Algorithms and Evolutionary Computation?
   What about Fuzzy Logic?

Part 3: Information resources
Part 4: Datasets
Part 5: Free software
Part 6: Commercial software
Part 7: Hardware

------------------------------------------------------------------------

Subject: How many learning methods for NNs exist?
=================================================
Which?
======

There are many learning methods for NNs by now, and nobody knows exactly
how many. New ones (or at least variations of existing ones) are invented
every week. Below is a collection of some of the best-known methods; it
does not claim to be complete.

The main categorization of these methods is the distinction between
supervised and unsupervised learning: 

 o In supervised learning, there is a "teacher" who in the learning phase
   "tells" the net how well it performs ("reinforcement learning") or what
   the correct behavior would have been ("fully supervised learning"). 
 o In unsupervised learning the net is autonomous: it just looks at the data
   it is presented with, finds out about some of the properties of the data
   set, and learns to reflect these properties in its output. Exactly which
   properties the network can learn to recognise depends on the particular
   network model and learning method. Usually, the net learns some
   compressed representation of the data. 

Many of these learning methods are closely connected with a certain (class
of) network topology.

Now here is the list, just giving some names:

1. UNSUPERVISED LEARNING (i.e. without a "teacher"):
     1). Feedback Nets:
        a). Additive Grossberg (AG)
        b). Shunting Grossberg (SG)
        c). Binary Adaptive Resonance Theory (ART1)
        d). Analog Adaptive Resonance Theory (ART2, ART2a)
        e). Discrete Hopfield (DH)
        f). Continuous Hopfield (CH)
        g). Discrete Bidirectional Associative Memory (BAM)
        h). Temporal Associative Memory (TAM)
        i). Adaptive Bidirectional Associative Memory (ABAM)
        j). Kohonen Self-organizing Map/Topology-preserving map (SOM/TPM)
        k). Competitive learning
     2). Feedforward-only Nets:
        a). Learning Matrix (LM)
        b). Driver-Reinforcement Learning (DR)
        c). Linear Associative Memory (LAM)
        d). Optimal Linear Associative Memory (OLAM)
        e). Sparse Distributed Associative Memory (SDM)
        f). Fuzzy Associative Memory (FAM)
        g). Counterpropagation (CPN)

2. SUPERVISED LEARNING (i.e. with a "teacher"):
     1). Feedback Nets:
        a). Brain-State-in-a-Box (BSB)
        b). Fuzzy Cognitive Map (FCM)
        c). Boltzmann Machine (BM)
        d). Mean Field Annealing (MFT)
        e). Recurrent Cascade Correlation (RCC)
        f). Backpropagation through time (BPTT)
        g). Real-time recurrent learning (RTRL)
        h). Recurrent Extended Kalman Filter (EKF)
     2). Feedforward-only Nets:
        a). Perceptron
        b). Adaline, Madaline
        c). Backpropagation (BP)
        d). Cauchy Machine (CM)
        e). Adaptive Heuristic Critic (AHC)
        f). Time Delay Neural Network (TDNN)
        g). Associative Reward Penalty (ARP)
        h). Avalanche Matched Filter (AMF)
        i). Backpercolation (Perc)
        j). Artmap
        k). Adaptive Logic Network (ALN)
        l). Cascade Correlation (CasCor)
        m). Extended Kalman Filter (EKF)
        n). Learning Vector Quantization (LVQ)
        o). Probabilistic Neural Network (PNN)
        p). General Regression Neural Network (GRNN) 

------------------------------------------------------------------------

Subject: What is backprop? 
===========================

Backprop is short for backpropagation of error. The term backpropagation
causes much confusion. Strictly speaking, backpropagation refers to the
method for computing the error gradient for a feedforward network, a
straightforward but elegant application of the chain rule of elementary
calculus (Werbos 1994). By extension, backpropagation or backprop refers
to a training method that uses backpropagation to compute the gradient. By
further extension, a backprop network is a feedforward network trained by
backpropagation. 
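To make the chain-rule computation concrete, here is a minimal Python
sketch (not from any particular package; the network size, weights, and
data are arbitrary examples): a 1-input, 2-hidden-unit, 1-output net with
tanh hidden units, a linear output, and squared error, with the analytic
gradient checked against finite differences.

```python
import math

def forward(w, x):
    # w = [w_h1, w_h2, w_o1, w_o2]: hidden weights and output weights
    h = [math.tanh(w[0] * x), math.tanh(w[1] * x)]
    y = w[2] * h[0] + w[3] * h[1]
    return h, y

def backprop_grad(w, x, t):
    """Error gradient dE/dw for E = (y-t)^2/2, by the chain rule."""
    h, y = forward(w, x)
    dE_dy = y - t                                  # output error signal
    g_out = [dE_dy * h[0], dE_dy * h[1]]           # dE/dw_o
    # propagate back through tanh: d tanh(u)/du = 1 - tanh(u)^2
    g_hid = [dE_dy * w[2] * (1 - h[0]**2) * x,
             dE_dy * w[3] * (1 - h[1]**2) * x]     # dE/dw_h
    return g_hid + g_out

def numeric_grad(w, x, t, eps=1e-6):
    """Finite-difference check of the same gradient."""
    g = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += eps; wm[i] -= eps
        Ep = (forward(wp, x)[1] - t)**2 / 2
        Em = (forward(wm, x)[1] - t)**2 / 2
        g.append((Ep - Em) / (2 * eps))
    return g

w = [0.5, -0.3, 0.8, 0.2]
ga = backprop_grad(w, x=1.5, t=1.0)
gn = numeric_grad(w, x=1.5, t=1.0)
```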

Standard backprop is a euphemism for the generalized delta rule, the
training algorithm that was popularized by Rumelhart, Hinton, and Williams
in chapter 8 of Rumelhart and McClelland (1986), which remains the most
widely used supervised training method for neural nets. The generalized
delta rule (including momentum) is called the heavy ball method in the
numerical analysis literature (Poljak 1964; Bertsekas 1995, 78-79). 

Standard backprop can be used for on-line training (in which the weights are
updated after processing each case) but it does not converge. To obtain
convergence, the learning rate must be slowly reduced. This methodology is
called stochastic approximation. 
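A minimal sketch of the idea, on the simplest possible "network" (a single
weight trained on-line; the target values are made up): with a learning
rate of 1/t, a schedule satisfying the Robbins-Monro conditions, the
weight after t cases is exactly the running mean of the targets, so the
on-line updates settle down instead of bouncing around forever.

```python
# One weight w trained on-line to minimize squared error against the
# targets, with the learning rate slowly reduced as 1/t.
targets = [2.0, 4.0, 6.0, 8.0]
w = 0.0
for t, y in enumerate(targets, start=1):
    w += (1.0 / t) * (y - w)   # delta rule with decreasing learning rate
# w is now the mean of the targets seen so far
```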

For batch processing, there is no reason to suffer through the slow
convergence and the tedious tuning of learning rates and momenta of standard
backprop. Much of the NN research literature is devoted to attempts to speed
up backprop. Most of these methods are inconsequential; two that are
effective are Quickprop (Fahlman 1989) and RPROP (Riedmiller and Braun
1993). But conventional methods for nonlinear optimization are usually
faster and more reliable than any of the "props". See "What are conjugate
gradients, Levenberg-Marquardt, etc.?". 

References on backprop: 

   Bertsekas, D. P. (1995), Nonlinear Programming, Belmont, MA: Athena
   Scientific, ISBN 1-886529-14-0. 

   Poljak, B.T. (1964), "Some methods of speeding up the convergence of
   iteration methods," Z. Vycisl. Mat. i Mat. Fiz., 4, 1-17. 

   Rumelhart, D.E., Hinton, G.E., and Williams, R.J. (1986), "Learning
   internal representations by error propagation", in Rumelhart, D.E. and
   McClelland, J. L., eds. (1986), Parallel Distributed Processing:
   Explorations in the Microstructure of Cognition, Volume 1, 318-362,
   Cambridge, MA: The MIT Press. 

   Werbos, P.J. (1994), The Roots of Backpropagation, NY: John Wiley &
   Sons. 

References on stochastic approximation: 

   Robbins, H. & Monro, S. (1951), "A Stochastic Approximation Method",
   Annals of Mathematical Statistics, 22, 400-407. 

   Kushner, H. & Clark, D. (1978), Stochastic Approximation Methods for
   Constrained and Unconstrained Systems, Springer-Verlag. 

   White, H. (1989), "Some Asymptotic Results for Learning in Single Hidden
   Layer Feedforward Network Models", J. of the American Statistical Assoc.,
   84, 1008-1013. 

References on better props: 

   Fahlman, S.E. (1989), "Faster-Learning Variations on Back-Propagation: An
   Empirical Study", in Touretzky, D., Hinton, G, and Sejnowski, T., eds., 
   Proceedings of the 1988 Connectionist Models Summer School, Morgan
   Kaufmann, 38-51. 

   Riedmiller, M. and Braun, H. (1993), "A Direct Adaptive Method for Faster
   Backpropagation Learning: The RPROP Algorithm", Proceedings of the IEEE
   International Conference on Neural Networks 1993, San Francisco: IEEE. 

------------------------------------------------------------------------

Subject: What are conjugate gradients,
======================================
Levenberg-Marquardt, etc.? 
===========================

Training a neural network is, in most cases, an exercise in numerical
optimization of a usually nonlinear function. Methods of nonlinear
optimization have been studied for hundreds of years, and there is a huge
literature on the subject in fields such as numerical analysis, operations
research, and statistical computing, e.g., Bertsekas 1995, Gill, Murray, and
Wright 1981. There is no single best method for nonlinear optimization. You
need to choose a method based on the characteristics of the problem to be
solved. For functions with continuous second derivatives (which would
include feedforward nets with the most popular differentiable activation
functions and error functions), three general types of algorithms have been
found to be effective for most practical purposes: 

 o For a small number of weights, stabilized Newton and Gauss-Newton
   algorithms, including various Levenberg-Marquardt and trust-region
   algorithms are efficient. 
 o For a moderate number of weights, various quasi-Newton algorithms are
   efficient. 
 o For a large number of weights, various conjugate-gradient algorithms are
   efficient. 

All of the above methods find local optima. For global optimization, there
are a variety of approaches. You can simply run any of the local
optimization methods from numerous random starting points. Or you can try
more complicated methods designed for global optimization such as simulated
annealing or genetic algorithms (see Reeves 1993 and "What about Genetic
Algorithms and Evolutionary Computation?"). 

For a survey of optimization software, see Moré and Wright (1993). For
more on-line information on numerical optimization see: 

 o The kangaroos, a nontechnical description of various optimization
   methods, at ftp://ftp.sas.com/pub/neural/kangaroos. 
 o John Gregory's nonlinear programming FAQ at 
   http://www.skypoint.com/subscribers/ashbury/nonlinear-programming-faq.html.
 o Arnold Neumaier's page on global optimization at 
   http://solon.cma.univie.ac.at/~neum/glopt.html. 

References: 

   Bertsekas, D. P. (1995), Nonlinear Programming, Belmont, MA: Athena
   Scientific, ISBN 1-886529-14-0. 

   Gill, P.E., Murray, W. and Wright, M.H. (1981) Practical Optimization,
   Academic Press: London. 

   Levenberg, K. (1944) "A method for the solution of certain problems in
   least squares," Quart. Appl. Math., 2, 164-168. 

   Marquardt, D. (1963) "An algorithm for least-squares estimation of
   nonlinear parameters," SIAM J. Appl. Math., 11, 431-441. 

   Moré, J.J. (1977) "The Levenberg-Marquardt algorithm: implementation
   and theory," in Watson, G.A., ed., _Numerical Analysis_, Lecture Notes in
   Mathematics 630, Springer-Verlag, Heidelberg, 105-116. 

   Moré, J.J. and Wright, S.J. (1993), Optimization Software Guide,
   Philadelphia: SIAM, ISBN 0-89871-322-6. 

   Reeves, C.R., ed. (1993) Modern Heuristic Techniques for Combinatorial
   Problems, NY: Wiley. 

   Rinnooy Kan, A.H.G., and Timmer, G.T., (1989) Global Optimization: A
   Survey, International Series of Numerical Mathematics, vol. 87, Basel:
   Birkhauser Verlag. 

------------------------------------------------------------------------

Subject: How should categories be coded? 
=========================================

First, consider unordered categories. If you want to classify cases into one
of C categories (i.e. you have a categorical target variable), use 1-of-C
coding. That means that you code C binary (0/1) target variables
corresponding to the C categories. Statisticians call these "dummy"
variables. Each dummy variable is given the value zero except for the one
corresponding to the correct category, which is given the value one. Then
use a softmax output activation function (see "What is a softmax activation
function?") so that the net, if properly trained, will produce valid
posterior probability estimates. If the categories are Red, Green, and Blue,
then the data would look like this: 

   Category  Dummy variables
   --------  ---------------
    Red        1   0   0
    Green      0   1   0
    Blue       0   0   1

When there are only two categories, it is simpler to use just one dummy
variable with a logistic output activation function; this is equivalent to
using softmax with two dummy variables. 
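A minimal sketch of 1-of-C coding in Python (the function name and the
category values are just for illustration):

```python
def one_of_c(value, categories):
    """Return C binary dummy variables; exactly one of them is 1."""
    return [1 if value == c else 0 for c in categories]

cats = ["Red", "Green", "Blue"]
codes = [one_of_c(v, cats) for v in ["Red", "Green", "Blue"]]
# codes reproduces the table above: [1,0,0], [0,1,0], [0,0,1]
```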

The common practice of using target values of .1 and .9 instead of 0 and 1
prevents the outputs of the network from being directly interpretable as
posterior probabilities. 

Another common practice is to use a logistic activation function for each
output. Thus, the outputs are not constrained to sum to one, so they are not
valid posterior probability estimates. The usual justification advanced for
this procedure is that if a test case is not similar to any of the training
cases, all of the outputs will be small, indicating that the case cannot be
classified reliably. This claim is incorrect, since a test case that is not
similar to any of the training cases will require the net to extrapolate,
and extrapolation is thoroughly unreliable; such a test case may produce all
small outputs, all large outputs, or any combination of large and small
outputs. If you want a classification method that detects novel cases for
which the classification may not be reliable, you need a method based on
probability density estimation. For example, see "What is PNN?". 

It is very important not to use a single variable for an unordered
categorical target. Suppose you used a single variable with values 1, 2, and
3 for red, green, and blue, and the training data with two inputs looked
like this: 

      |    1    1
      |   1   1
      |       1   1
      |     1   1
      | 
      |      X
      | 
      |    3   3           2   2
      |     3     3      2
      |  3   3            2    2
      |     3   3       2    2
      | 
      +----------------------------

Consider a test point located at the X. The correct output would be that X
has about a 50-50 chance of being a 1 or a 3. But if you train with a single
target variable with values of 1, 2, and 3, the output for X will be the
average of 1 and 3, so the net will say that X is definitely a 2! 
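A tiny numeric illustration of that failure (the numbers are just the
category codes from the picture): under squared error, the optimal
single-variable output at X is the conditional mean of the target code,
which lands squarely on the wrong category.

```python
# At the point X, classes 1 (category 1) and 3 (category 3) are equally
# likely.  The least-squares optimal output is the conditional mean of
# the single-variable target code:
equally_likely_codes = [1, 3]
prediction = sum(equally_likely_codes) / len(equally_likely_codes)
# prediction is 2 -- the code for a category that is impossible at X
```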

For an input with categorical values, you can use 1-of-(C-1) coding if the
network has a bias unit. This is just like 1-of-C coding, except that you
omit one of the dummy variables (doesn't much matter which one). Using all C
of the dummy variables creates a linear dependency on the bias unit, which
is not advisable unless you are using weight decay or Bayesian estimation or
some such thing that requires all C weights to be treated on an equal basis.
1-of-(C-1) coding looks like this: 

   Category  Dummy variables
   --------  ---------------
    Red        1   0
    Green      0   1
    Blue       0   0

Another possible coding is called "effects" coding or "deviations from
means" coding in statistics. It is like 1-of-(C-1) coding, except that when
a case belongs to the category for the omitted dummy variable, all of the
dummy variables are set to -1, like this: 

   Category  Dummy variables
   --------  ---------------
    Red        1   0
    Green      0   1
    Blue      -1  -1

As long as a bias unit is used, any network with effects coding can be
transformed into an equivalent network with 1-of-(C-1) coding by a linear
transformation of the weights. So the only advantage of effects coding is
that the dummy variables require no standardizing (see "Should I
normalize/standardize/rescale the data?"). 

Now consider ordered categories. For inputs, some people recommend a
"thermometer code" like this: 

   Category  Dummy variables
   --------  ---------------
    Red        1   1   1
    Green      0   1   1
    Blue       0   0   1

However, thermometer coding is equivalent to 1-of-C coding, in that for any
network using 1-of-C coding, there exists a network with thermometer coding
that produces identical outputs; the weights in the thermometer-coded
network are just the differences of successive weights in the 1-of-C-coded
network. To get a genuinely ordinal representation, you must constrain the
weights connecting the dummy variables to the hidden units to be nonnegative
(except for the first dummy variable). 
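The equivalence can be checked numerically. In this Python sketch (the
weights are arbitrary examples), the thermometer weights are the
successive differences of the 1-of-C weights, and the net inputs agree
for every category:

```python
# Dummy codes from the tables above
one_of_c = {"Red": [1, 0, 0], "Green": [0, 1, 0], "Blue": [0, 0, 1]}
thermo   = {"Red": [1, 1, 1], "Green": [0, 1, 1], "Blue": [0, 0, 1]}

w1 = [0.9, -0.4, 0.2]                    # any weights on the 1-of-C code
wt = [w1[0] - w1[1], w1[1] - w1[2], w1[2]]   # successive differences

def net(w, code):
    """Net input contributed by the dummy variables."""
    return sum(wi * ci for wi, ci in zip(w, code))
```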

It is often effective to represent an ordinal input as a single variable
like this: 

   Category  Input
   --------  -----
    Red        1
    Green      2
    Blue       3

Although this representation involves only a single quantitative input,
given enough hidden units, the net is capable of computing nonlinear
transformations of that input that will produce results equivalent to any of
the dummy coding schemes. But using a single quantitative input makes it
easier for the net to use the order of the categories to generalize when
that is appropriate. 

B-splines provide a way of coding ordinal inputs into fewer than C variables
while retaining information about the order of the categories. See Gifi
(1990, 365-370). 

Target variables with ordered categories require thermometer coding. The
outputs are thus cumulative probabilities, so to obtain the posterior
probability of any category except the first, you must take the difference
between successive outputs. It is often useful to use a proportional-odds
model, which ensures that these differences are positive. For more details
on ordered categorical targets, see McCullagh and Nelder (1989, chapter 5). 

References: 

   Gifi, A. (1990), Nonlinear Multivariate Analysis, NY: John Wiley & Sons,
   ISBN 0-471-92620-5. 

   McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd
   ed., London: Chapman & Hall. 

------------------------------------------------------------------------

Subject: Why use a bias input? 
===============================

Consider a multilayer perceptron. Choose any hidden unit or output unit.
Let's say there are N inputs to that unit, which define an N-dimensional
space. The given unit draws a hyperplane through that space, producing an
"on" output on one side and an "off" output on the other. (With sigmoid
units the plane will not be sharp -- there will be some gray area of
intermediate values near the separating plane -- but ignore this for now.) 

The weights determine where this hyperplane lies in the input space. Without
a bias input, this separating hyperplane is constrained to pass through the
origin of the space defined by the inputs. For some problems that's OK, but
in many problems the hyperplane would be much more useful somewhere else. If
you have many units in a layer, they share the same input space and without
bias would ALL be constrained to pass through the origin. 

The "universal approximation" property of multilayer perceptrons does not
hold if you omit the bias units. 
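A one-input sketch of this in Python (the weights are arbitrary
examples): the unit's "hyperplane" is the point where its net input
crosses zero, and without a bias it is stuck at the origin no matter what
the input weight is.

```python
def boundary(w, b):
    """x where the net input w*x + b crosses 0, i.e. the separating point."""
    return -b / w

# Without a bias (b = 0) the boundary is at x = 0 for every weight w:
no_bias = [boundary(w, 0.0) for w in (0.5, 1.0, 7.0)]

# A bias weight moves the boundary wherever it is needed:
with_bias = boundary(2.0, -3.0)   # boundary at x = 1.5
```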

------------------------------------------------------------------------

Subject: Why use activation functions? 
=======================================

Activation functions for the hidden units are needed to introduce
nonlinearity into the network. Without nonlinearity, hidden units would not
make nets more powerful than just plain perceptrons (which do not have any
hidden units, just input and output units). The reason is that a composition
of linear functions is again a linear function. However, it is the
nonlinearity (i.e., the capability to represent nonlinear functions) that
makes multilayer networks so powerful. Almost any nonlinear function does
the job, although for backpropagation learning it must be differentiable and
it helps if the function is bounded; the sigmoidal functions such as
logistic and tanh and the Gaussian function are the most common choices. 
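The "composition of linear functions" point can be checked directly. In
this Python sketch (arbitrary example weights), a net with two linear
hidden units computes exactly the same function as a single linear
weight, so the hidden layer adds nothing:

```python
w_h = [0.7, -1.2]          # input -> hidden weights
w_o = [2.0, 0.5]           # hidden -> output weights

def two_layer(x):
    """One input, two hidden units with identity activation, one output."""
    h = [w * x for w in w_h]             # linear "activations"
    return w_o[0] * h[0] + w_o[1] * h[1]

# The equivalent single linear weight:
w_eq = w_o[0] * w_h[0] + w_o[1] * w_h[1]
```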

For the output units, you should choose an activation function suited to the
distribution of the target values. Bounded activation functions such as the
logistic are particularly useful when the target values have a bounded
range. But if the target values have no known bounded range, it is better to
use an unbounded activation function, most often the identity function
(which amounts to no activation function). There are certain natural
associations between output activation functions and various noise
distributions which have been studied by statisticians in the context of
generalized linear models. The output activation function is the inverse of
what statisticians call the "link function". See: 

   McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd
   ed., London: Chapman & Hall. 

   Jordan, M.I. (1995), "Why the logistic function? A tutorial discussion on
   probabilities and neural networks", 
   ftp://psyche.mit.edu/pub/jordan/uai.ps.Z. 

For more information on activation functions, see Donald Tveter's 
Backpropagator's Review. 

------------------------------------------------------------------------

Subject: What is a softmax activation function? 
================================================

The purpose of the softmax activation function is to make the sum of the
outputs equal to one, so that the outputs are interpretable as posterior
probabilities. Let the net input to each output unit be q_i, i=1,...,c where
c is the number of categories. Then the softmax output p_i is: 

           exp(q_i)
   p_i = ------------
          c
         sum exp(q_j)
         j=1

Unless you are using weight decay or Bayesian estimation or some such thing
that requires the weights to be treated on an equal basis, you can choose
any one of the output units and leave it completely unconnected--just set
the net input to 0. Connecting all of the output units will just give you
redundant weights and will slow down training. To see this, add an arbitrary
constant z to each net input and you get: 

           exp(q_i+z)       exp(q_i) exp(z)       exp(q_i)    
   p_i = ------------   = ------------------- = ------------   
          c                c                     c            
         sum exp(q_j+z)   sum exp(q_j) exp(z)   sum exp(q_j)  
         j=1              j=1                   j=1

so nothing changes. Hence you can always pick one of the output units, and
add an appropriate constant to each net input to produce any desired net
input for the selected output unit, which you can choose to be zero or
whatever is convenient. You can use the same trick to make sure that none of
the exponentials overflows. 

Statisticians usually call softmax a "multiple logistic" function. It
reduces to the simple logistic function when there are only two categories.
Suppose you choose to set q_2 to 0. Then 

           exp(q_1)         exp(q_1)              1
   p_1 = ------------ = ----------------- = -------------
          c             exp(q_1) + exp(0)   1 + exp(-q_1)
         sum exp(q_j)
         j=1

and p_2, of course, is 1-p_1. 

The softmax function derives naturally from log-linear models and leads to
convenient interpretations of the weights in terms of odds ratios. You
could, however, use a variety of other nonnegative functions on the real
line in place of the exp function. Or you could constrain the net inputs to
the output units to be nonnegative, and just divide by the sum--that's
called the Bradley-Terry-Luce model. 
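Here is a minimal Python implementation of softmax (a sketch, not from
any particular package) that uses the constant-shift trick described
above to keep the exponentials from overflowing, together with the
two-category reduction to the simple logistic function:

```python
import math

def softmax(q):
    """Softmax of the net inputs q_1, ..., q_c."""
    z = max(q)                            # subtracting a constant from
    e = [math.exp(qi - z) for qi in q]    # every net input leaves the
    s = sum(e)                            # outputs unchanged, but
    return [ei / s for ei in e]           # prevents overflow

def logistic(q1):
    """Simple logistic, the two-category case with q_2 set to 0."""
    return 1 / (1 + math.exp(-q1))
```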

References: 

   Bridle, J.S. (1990a). Probabilistic Interpretation of Feedforward
   Classification Network Outputs, with Relationships to Statistical Pattern
   Recognition. In: F.Fogleman Soulie and J.Herault (eds.), Neurocomputing:
   Algorithms, Architectures and Applications, Berlin: Springer-Verlag, pp.
   227-236. 

   Bridle, J.S. (1990b). Training Stochastic Model Recognition Algorithms as
   Networks can lead to Maximum Mutual Information Estimation of Parameters.
   In: D.S.Touretzky (ed.), Advances in Neural Information Processing
   Systems 2, San Mateo: Morgan Kaufmann, pp. 211-217. 

   McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd
   ed., London: Chapman & Hall. See Chapter 5. 

------------------------------------------------------------------------

Subject: What is overfitting and how can I avoid it? 
=====================================================

The critical issue in developing a neural network is generalization: how
well will the network make predictions for cases that are not in the
training set? NNs, like other flexible nonlinear estimation methods such as
kernel regression and smoothing splines, can suffer from either underfitting
or overfitting. A network that is not sufficiently complex can fail to
detect fully the signal in a complicated data set, leading to underfitting.
A network that is too complex may fit the noise, not just the signal,
leading to overfitting. Overfitting is especially dangerous because it can
easily lead to predictions that are far beyond the range of the training
data with many of the common types of NNs. But underfitting can also produce
wild predictions in multilayer perceptrons, even with noise-free data. There
are graphical examples of overfitting and underfitting in Sarle (1995). 

The best way to avoid overfitting is to use lots of training data. If you
have at least 30 times as many training cases as there are weights in the
network, you are unlikely to suffer from overfitting. But you can't
arbitrarily reduce the number of weights, because too few weights lead to
underfitting. 

Given a fixed amount of training data, there are at least five effective
approaches to avoiding underfitting and overfitting, and hence getting good
generalization: 

 o Model selection 
 o Jittering 
 o Weight decay 
 o Early stopping 
 o Bayesian estimation 

The complexity of a network is related to both the number of weights and the
size of the weights. Model selection is concerned with the number of
weights, and hence the number of hidden units and layers. The other
approaches listed above are concerned, directly or indirectly, with the size
of the weights. 

The issue of overfitting/underfitting is intimately related to the
bias/variance tradeoff in nonparametric estimation (Geman, Bienenstock, and
Doursat 1992). 

References: 

   Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and
   the Bias/Variance Dilemma", Neural Computation, 4, 1-58. 

   Sarle, W.S. (1995), "Stopped Training and Other Remedies for
   Overfitting," to appear in Proceedings of the 27th Symposium on the
   Interface, ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very
   large compressed postscript file, 747K, 10 pages) 

------------------------------------------------------------------------

Subject: What is jitter? (Training with noise) 
===============================================

Jitter is artificial noise deliberately added to the inputs during training.
Training with jitter is closely related to regularization methods such as
weight decay and ridge regression. It is also a form of smoothing related to
kernel regression (see "What is GRNN?"). 

Training with jitter works because the functions that we want NNs to learn
are mostly smooth. NNs can learn functions with discontinuities, but the
discontinuities must be restricted to sets of measure zero if our network is
restricted to a finite number of hidden units. 

In other words, if we have two cases with similar inputs, the desired
outputs will usually be similar. That means we can take any training case
and generate new training cases by adding small amounts of jitter to the
inputs. As long as the amount of jitter is sufficiently small, we can assume
that the desired output will not change enough to be of any consequence, so
we can just use the same target value. The more training cases, the merrier,
so this looks like a convenient way to improve training. But too much jitter
will obviously produce garbage, while too little jitter will have little
effect (Koistinen and Holmström 1992). 

When studying nonlinear models such as feedforward NNs, it is often helpful
first to consider what happens in linear models, and then to see what
difference the nonlinearity makes. So let's consider training with jitter in
a linear model. Notation: 

   x_ij is the value of the jth input (j=1, ..., p) for the
        ith training case (i=1, ..., n).
   X={x_ij} is an n by p matrix.
   y_i is the target value for the ith training case.
   Y={y_i} is a column vector.

Without jitter, the least-squares weights are B = inv(X'X)X'Y, where
"inv" indicates a matrix inverse and "'" indicates transposition. Note that
if we replicate each training case c times, or equivalently stack c copies
of the X and Y matrices on top of each other, the least-squares weights are
inv(cX'X)cX'Y = (1/c)inv(X'X)cX'Y = B, same as before. 

With jitter, x_ij is replaced by c cases x_ij+z_ijk, k=1, ...,
c, where z_ijk is produced by some random number generator, usually with
a normal distribution with mean 0 and standard deviation s, and the 
z_ijk's are all independent. In place of the n by p matrix X, this
gives us a big matrix, say Q, with cn rows and p columns. To compute the
least-squares weights, we need Q'Q. Let's consider the jth diagonal
element of Q'Q, which is 

                   2           2       2
   sum (x_ij+z_ijk) = sum (x_ij + z_ijk + 2 x_ij z_ijk)
   i,k                i,k

which is approximately, for c large, 

             2     2
   c(sum x_ij  + ns ) 
      i

which is c times the corresponding diagonal element of X'X plus ns^2.
Now consider the u,vth off-diagonal element of Q'Q, which is 

   sum (x_iu+z_iuk)(x_iv+z_ivk)
   i,k

which is approximately, for c large, 

   c(sum x_iu x_iv)
      i

which is just c times the corresponding element of X'X. Thus, Q'Q equals
c(X'X+ns^2I), where I is an identity matrix of appropriate size.
Similar computations show that the crossproduct of Q with the target values
is cX'Y. Hence the least-squares weights with jitter of variance s^2 are
given by 

       2                2                    2
   B(ns ) = inv(c(X'X+ns I))cX'Y = inv(X'X+ns I)X'Y

In the statistics literature, B(ns^2) is called a ridge regression
estimator with ridge value ns^2. 

If we were to add jitter to the target values Y, the cross-product X'Y
would not be affected for large c for the same reason that the off-diagonal
elements of X'X are not affected by jitter. Hence, adding jitter to the
targets will not change the optimal weights; it will just slow down
training. 

The ordinary least squares training criterion is (Y-XB)'(Y-XB).
Weight decay uses the training criterion (Y-XB)'(Y-XB)+d^2B'B,
where d is the decay rate. Weight decay can also be implemented by
inventing artificial training cases. Augment the training data with p new
training cases containing the matrix dI for the inputs and a zero vector
for the targets. To put this in a formula, let's use A;B to indicate the
matrix A stacked on top of the matrix B, so (A;B)'(C;D)=A'C+B'D.
Thus the augmented inputs are X;dI and the augmented targets are Y;0,
where 0 indicates the zero vector of the appropriate size. The squared error
for the augmented training data is: 

   (Y;0-(X;dI)B)'(Y;0-(X;dI)B)
   = (Y;0)'(Y;0) - 2(Y;0)'(X;dI)B + B'(X;dI)'(X;dI)B
   = Y'Y - 2Y'XB + B'(X'X+d^2I)B
   = Y'Y - 2Y'XB + B'X'XB + B'(d^2I)B
   = (Y-XB)'(Y-XB)+d^2B'B

which is the weight-decay training criterion. Thus the weight-decay
estimator is: 

    inv[(X;dI)'(X;dI)](X;dI)'(Y;0) = inv(X'X+d^2I)X'Y

which is the same as the jitter estimator B(d^2), i.e. jitter with
variance d^2/n. The equivalence between the weight-decay estimator and
the jitter estimator does not hold for nonlinear models unless the jitter
variance is small relative to the curvature of the nonlinear function.
However, the equivalence of the two estimators for linear models suggests
that they will often produce similar results even for nonlinear models. 
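The augmented-data trick is also easy to verify for a linear model. This sketch (illustrative data; any least-squares routine would do) stacks dI under X and a zero vector under Y, and checks that the resulting least-squares weights match inv(X'X+d^2 I)X'Y exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 4
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)
d = 2.0  # decay parameter

# Augmented data: inputs X;dI and targets Y;0 (p artificial cases).
Xa = np.vstack([X, d * np.eye(p)])
Ya = np.concatenate([Y, np.zeros(p)])
b_aug, *_ = np.linalg.lstsq(Xa, Ya, rcond=None)

# Direct weight-decay (ridge) estimator inv(X'X + d^2 I) X'Y.
b_wd = np.linalg.solve(X.T @ X + d**2 * np.eye(p), X.T @ Y)

print(np.allclose(b_aug, b_wd))  # exact algebraic identity
```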

B(0) is obviously the ordinary least-squares estimator. It can be shown
that as s^2 increases, the Euclidean norm of B(ns^2) decreases; in
other words, adding jitter causes the weights to shrink. It can also be
shown that under the usual statistical assumptions, there always exists some
value of ns^2 > 0 such that B(ns^2) provides better expected
generalization than B(0). Unfortunately, there is no way to calculate a
value of ns^2 from the training data that is guaranteed to improve
generalization. There are other types of shrinkage estimators called Stein
estimators that do guarantee better generalization than B(0), but I'm not
aware of a nonlinear generalization of Stein estimators applicable to neural
networks. 

The statistics literature describes numerous methods for choosing the ridge
value. The most obvious way is to estimate the generalization error by
cross-validation, generalized cross-validation, or bootstrapping, and to
choose the ridge value that yields the smallest such estimate. There are
also quicker methods based on empirical Bayes estimation, one of which
yields the following formula, useful as a first guess: 

    2    p(Y-XB(0))'(Y-XB(0))
   s   = --------------------
    1      n(n-p)B(0)'B(0)

You can iterate this a few times: 

    2      p(Y-XB(0))'(Y-XB(0))
   s     = --------------------
    l+1              2     2
            n(n-p)B(s )'B(s )
                     l     l

Note that the more training cases you have, the less noise you need. 
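As a sketch of how the empirical Bayes iteration might be coded (the data are invented, and as noted above this gives a first guess for the ridge value, not a guaranteed improvement):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 5
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=p) + rng.normal(size=n)

def ridge(X, Y, ns2):
    """B(ns^2) = inv(X'X + ns^2 I) X'Y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + ns2 * np.eye(p), X.T @ Y)

# Fixed-point iteration:
#   s2_{l+1} = p (Y-XB)'(Y-XB) / [n (n-p) B'B],  with B = B(n s2_l)
b = ridge(X, Y, 0.0)   # start from B(0), ordinary least squares
s2 = 0.0
for _ in range(5):
    rss = np.sum((Y - X @ b) ** 2)
    s2 = p * rss / (n * (n - p) * (b @ b))
    b = ridge(X, Y, n * s2)

# Adding the ridge shrinks the weights relative to B(0).
print(s2, np.linalg.norm(b), np.linalg.norm(ridge(X, Y, 0.0)))
```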

References: 

   Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
   Oxford University Press. 

   Koistinen, P. and Holmstro\"m, L. (1992) "Kernel regression and
   backpropagation training with noise," NIPS4, 1033-1039. 

   Vinod, H.D. and Ullah, A. (1981) Recent Advances in Regression Methods,
   NY: Marcel-Dekker. 

------------------------------------------------------------------------

Subject: What is early stopping? 
=================================

NN practitioners often use nets with many times as many parameters as
training cases. E.g., Nelson and Illingworth (1991, p. 165) discuss training
a network with 16,219 parameters with only 50 training cases! The method
used is called early stopping or stopped training and proceeds as follows: 

1. Divide the available data into training and validation sets. 
2. Use a large number of hidden units. 
3. Use very small random initial values. 
4. Use a slow learning rate. 
5. Compute the validation error rate periodically during training. 
6. Stop training when the validation error rate "starts to go up". 
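A toy sketch of this recipe, using a linear model trained by gradient descent as a hypothetical stand-in for a network (the split proportion, learning rate, and checking interval are arbitrary choices, not recommendations):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
Y = X @ rng.normal(size=10) + rng.normal(size=100)

# Step 1: divide the data into training and validation sets.
Xtr, Ytr = X[:70], Y[:70]
Xva, Yva = X[70:], Y[70:]

w = 0.001 * rng.normal(size=10)   # step 3: very small initial weights
lr = 0.001                        # step 4: slow learning rate

best_err, best_w, history = np.inf, w.copy(), []
for epoch in range(2000):
    grad = 2 * Xtr.T @ (Xtr @ w - Ytr) / len(Ytr)
    w -= lr * grad
    if epoch % 10 == 0:           # step 5: periodic validation error
        err = np.mean((Xva @ w - Yva) ** 2)
        history.append(err)
        if err < best_err:        # remember the best point on the path
            best_err, best_w = err, w.copy()

# "Stop" by keeping the weights with the lowest validation error seen.
print(best_err)
```

Rather than literally halting when the validation error first rises, this sketch trains to the end and keeps the best intermediate weights, the "safest approach" discussed later in this answer.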

It is crucial to realize that the validation error is not a good estimate
of the generalization error. One method for getting an unbiased estimate of
the generalization error is to run the net on a third set of data, the test
set, that is not used at all during the training process. For other methods,
see "How can generalization error be estimated?" 

Early stopping has several advantages: 

 o It is fast. 
 o It can be applied successfully to networks in which the number of weights
   far exceeds the sample size. 
 o It requires only one major decision by the user: what proportion of
   validation cases to use. 

But there are several unresolved practical issues in early stopping: 

 o How many cases do you assign to the training and validation sets? Rules
   of thumb abound, but appear to be no more than folklore. The only
   systematic results known to the FAQ maintainer are in Sarle (1995), which
   deals only with the case of a single input. Amari et al. (1995) attempts
   a theoretical approach but contains serious errors that completely
   invalidate the results, especially the incorrect assumption that the
   direction of approach to the optimum is distributed isotropically. 
 o Do you split the data into training and validation sets randomly or by
   some systematic algorithm? 
 o How do you tell when the validation error rate "starts to go up"? It may
   go up and down numerous times during training. The safest approach is to
   train to convergence, then go back and see which iteration had the lowest
   validation error. For more elaborate algorithms, see section 3.3 of 
   /pub/papers/techreports/1994/1994-21.ps.gz. 

Statisticians tend to be skeptical of stopped training for two reasons: it
appears to be statistically inefficient, since the split-sample technique
means that neither training nor validation makes use of the entire sample;
and the usual statistical theory does not apply. However, there has been
recent progress addressing both of these concerns (Wang 1994). 

Early stopping is closely related to ridge regression. If the learning rate
is sufficiently small, the sequence of weight vectors on each iteration will
approximate the path of continuous steepest descent down the error function.
Early stopping chooses a point along this path that optimizes an estimate of
the generalization error computed from the validation set. Ridge regression
also defines a path of weight vectors by varying the ridge value. The ridge
value is often chosen by optimizing an estimate of the generalization error
computed by cross-validation, generalized cross-validation, or bootstrapping
(see "What are cross-validation and bootstrapping?") There always exists a
positive ridge value that will improve the expected generalization error in
a linear model. A similar result has been obtained for early stopping in
linear models (Wang, Venkatesh, and Judd 1994). In linear models, the ridge
path lies close to, but does not coincide with, the path of continuous
steepest descent; in nonlinear models, the two paths can diverge widely. The
relationship is explored in more detail by Sjöberg and Ljung (1992). 

References: 

   Amari, S., Murata, N., Müller, K.-R., Finke, M., and Yang, H. (1995),
   "Asymptotic Statistical Theory of Overtraining and Cross-Validation,"
   METR 95-06, Department of Mathematical Engineering and Information
   Physics, University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113, Japan. 

   Finnoff, W., Hergert, F., and Zimmermann, H.G. (1993), "Improving model
   selection by nonconvergent methods," Neural Networks, 6, 771-783. 

   Nelson, M.C. and Illingworth, W.T. (1991), A Practical Guide to Neural
   Nets, Reading, MA: Addison-Wesley. 

   Sarle, W.S. (1995), "Stopped Training and Other Remedies for
   Overfitting," to appear in Proceedings of the 27th Symposium on the
   Interface, ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very
   large compressed postscript file, 747K, 10 pages) 

   Sjöberg, J. and Ljung, L. (1992), "Overtraining, Regularization, and
   Searching for Minimum in Neural Networks," Technical Report
   LiTH-ISY-I-1297, Department of Electrical Engineering, Linköping
   University, S-581 83 Linköping, Sweden, http://www.control.isy.liu.se . 

   Wang, C. (1994), A Theory of Generalisation in Learning Machines with
   Neural Network Application, Ph.D. thesis, University of Pennsylvania. 

   Wang, C., Venkatesh, S.S., and Judd, J.S. (1994), "Optimal Stopping and
   Effective Machine Complexity in Learning," NIPS6, 303-310. 

   Weigend, A. (1994), "On overfitting and the effective number of hidden
   units," Proceedings of the 1993 Connectionist Models Summer School,
   335-342. 

------------------------------------------------------------------------

Subject: What is weight decay? 
===============================

Weight decay adds a penalty term to the error function. The usual penalty is
the sum of squared weights times a decay constant. In a linear model, this
form of weight decay is equivalent to ridge regression. See "What is
jitter?" for more explanation of ridge regression. 

Weight decay is a subset of regularization methods. The penalty term in
weight decay, by definition, penalizes large weights. Other regularization
methods may involve not only the weights but various derivatives of the
output function (Bishop 1995). 

The weight decay penalty term causes the weights to converge to smaller
absolute values than they otherwise would. Large weights can hurt
generalization in two different ways. Excessively large weights leading to
hidden units can cause the output function to be too rough, possibly with
near discontinuities. Excessively large weights leading to output units can
cause wild outputs far beyond the range of the data if the output activation
function is not bounded to the same range as the data. 

Other penalty terms besides the sum of squared weights are sometimes used. 
Weight elimination (Weigend, Rumelhart, and Huberman 1991) uses: 

          (w_i)^2
   sum -------------
    i  (w_i)^2 + c^2

where w_i is the ith weight and c is a user-specified constant. Whereas
decay using the sum of squared weights tends to shrink the large
coefficients more than the small ones, weight elimination tends to shrink
the small coefficients more, and is therefore more useful for suggesting
subset models (pruning). 
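A small sketch comparing the two penalties and their gradients (c=1 and the weight values are arbitrary choices) shows why weight elimination leaves large weights nearly untouched while squared-weight decay pushes hardest on them:

```python
import numpy as np

def decay_penalty(w):
    """Sum of squared weights (ordinary weight decay)."""
    return np.sum(w ** 2)

def elimination_penalty(w, c=1.0):
    """Weight elimination: sum of w^2 / (w^2 + c^2)."""
    return np.sum(w ** 2 / (w ** 2 + c ** 2))

# Gradients of the two penalties:
#   d/dw [w^2]              = 2w                   (grows with |w|)
#   d/dw [w^2/(w^2 + c^2)]  = 2w c^2/(w^2 + c^2)^2 (vanishes for large |w|)
w = np.array([0.1, 10.0])      # one small weight, one large weight
g_decay = 2 * w
g_elim = 2 * w * 1.0 / (w ** 2 + 1.0) ** 2

print(g_decay)  # decay pushes hardest on the large weight
print(g_elim)   # elimination barely touches the large weight
```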

The generalization ability of the network can depend crucially on the decay
constant, especially with small training sets. One approach to choosing the
decay constant is to train several networks with different amounts of decay
and estimate the generalization error for each; then choose the decay
constant that minimizes the estimated generalization error. Weigend,
Rumelhart, and Huberman (1991) iteratively update the decay constant during
training. 

There are other important considerations for getting good results from
weight decay. You must either standardize the inputs and targets, or adjust
the penalty term for the standard deviations of all the inputs and targets.
It is usually a good idea to omit the biases from the penalty term. 

A fundamental problem with weight decay is that different types of weights
in the network will usually require different decay constants for good
generalization. At the very least, you need three different decay constants
for input-to-hidden, hidden-to-hidden, and hidden-to-output weights.
Adjusting all these decay constants to produce the best estimated
generalization error often requires vast amounts of computation. 

Fortunately, there is a superior alternative to weight decay: hierarchical
Bayesian estimation. Bayesian estimation makes it possible to estimate
efficiently numerous decay constants. See "What is Bayesian estimation?" 

References: 

   Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
   Oxford University Press. 

   Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge:
   Cambridge University Press. 

   Weigend, A. S., Rumelhart, D. E., & Huberman, B. A. (1991).
   Generalization by weight-elimination with application to forecasting. In:
   R. P. Lippmann, J. Moody, & D. S. Touretzky (eds.), Advances in Neural
   Information Processing Systems 3, San Mateo, CA: Morgan Kaufmann. 

------------------------------------------------------------------------

Subject: What is Bayesian estimation? 
======================================

I haven't written an answer for this yet, but here are some references: 

   Bernardo, J.M., DeGroot, M.H., Lindley, D.V. and Smith, A.F.M., eds.,
   (1985), Bayesian Statistics 2, Amsterdam: Elsevier Science Publishers B.V.
   (North-Holland). 

   Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
   Oxford University Press. 

   Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (1995), Bayesian
   Data Analysis, London: Chapman & Hall, ISBN 0-412-03991-5. 

   MacKay, D.J.C. (1992), "A practical Bayesian framework for
   backpropagation networks," Neural Computation, 4, 448-472. 

   MacKay, D.J.C. (199?), "Probable networks and plausible predictions--a
   review of practical Bayesian methods for supervised neural networks," 
   ftp://mraos.ra.phy.cam.ac.uk/pub/mackay/network.ps.Z. 

   Neal, R.M. (1995), Bayesian Learning for Neural Networks, Ph.D. thesis,
   University of Toronto, ftp://ftp.cs.toronto.edu/pub/radford/thesis.ps.Z. 

   O'Hagan, A. (1985), "Shoulders in hierarchical models," in Bernardo et
   al. (1985), 697-710. 

   Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge:
   Cambridge University Press. 

   Sarle, W.S. (1995), "Stopped Training and Other Remedies for
   Overfitting," to appear in Proceedings of the 27th Symposium on the
   Interface, ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very
   large compressed postscript file, 747K, 10 pages) 

------------------------------------------------------------------------

Subject: How many hidden units should I use? 
=============================================

Some books and articles offer "rules of thumb" for choosing a topology --
Ninputs plus Noutputs divided by two, maybe with a square root in there
somewhere -- but such rules are total garbage. There is no way to determine
a good network topology just from the number of inputs and outputs. It
depends critically on the number of training cases, the amount of noise, and
the complexity of the function or classification you are trying to learn.
There are problems with one input and one output that require thousands of
hidden units, and problems with a thousand inputs and a thousand outputs
that require only one hidden unit, or none at all. 

Other rules relate to the number of cases available: use at most so many
hidden units that the number of weights in the network times 10 is smaller
than the number of cases. Such rules are only concerned with overfitting and
are unreliable as well. All one can say is that if the number of training
cases is much larger (but no one knows exactly how much larger) than the
number of weights, you are unlikely to get overfitting, but you may suffer
from underfitting. Geman, Bienenstock, and Doursat (1992) discuss how the
number of hidden units affects the bias/variance trade-off. 

An intelligent choice of the number of hidden units depends on whether you
are using early stopping or some other form of regularization. If not, you
must simply try many networks with different numbers of hidden units, 
estimate the generalization error for each one, and choose the network with
the minimum estimated generalization error. However, there is little point
in trying a network with more weights than training cases, since such a
large network is very likely to overfit. 

If you are using early stopping, it is essential to use lots of hidden units
to avoid bad local optima (Sarle 1995). There seems to be no upper limit on
the number of hidden units, other than that imposed by computer time and
memory requirements. Weigend (1994) makes this assertion, but provides only
one example as evidence. Tetko, Livingstone, and Luik (1995) provide
simulation studies that are more convincing. The FAQ maintainer obtained
similar results in conjunction with the simulations in Sarle (1995), but
those results are not reported in the paper for lack of space. On the other
hand, there seems to be no advantage to using more hidden units than you
have training cases, since bad local minima do not occur with so many hidden
units. 

If you are using weight decay or Bayesian estimation, you can also use lots
of hidden units (Neal 1995). However, it is not strictly necessary to do so,
because other methods are available to avoid local minima, such as multiple
random starts and simulated annealing (such methods are not safe to use with
early stopping). You can use one network with lots of hidden units, or you
can try different networks with different numbers of hidden units, and
choose on the basis of estimated generalization error. With weight decay or
MAP Bayesian estimation, it is prudent to keep the number of weights less
than half the number of training cases. 

Bear in mind that with two or more inputs, an MLP with one hidden layer
containing just a few units can fit only a limited variety of target
functions. Even simple, smooth surfaces such as a Gaussian bump in two
dimensions may require 20 to 50 hidden units for a close approximation.
Networks with a smaller number of hidden units often produce spurious ridges
and valleys in the output surface. Training a network with 20 hidden units
will typically require anywhere from 150 to 2500 training cases if you do
not use early stopping or regularization. Hence, if you have a smaller
training set than that, it is usually advisable to use early stopping or
regularization rather than to restrict the net to a small number of hidden
units. 

References: 

   Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and
   the Bias/Variance Dilemma", Neural Computation, 4, 1-58. 

   Neal, R.M. (1995), Bayesian Learning for Neural Networks, Ph.D. thesis,
   University of Toronto, ftp://ftp.cs.toronto.edu/pub/radford/thesis.ps.Z. 

   Sarle, W.S. (1995), "Stopped Training and Other Remedies for
   Overfitting," to appear in Proceedings of the 27th Symposium on the
   Interface, ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very
   large compressed postscript file, 747K, 10 pages) 

   Tetko, I.V., Livingstone, D.J., and Luik, A.I. (1995), "Neural Network
   Studies. 1. Comparison of Overfitting and Overtraining," J. Chem. Info.
   Comp. Sci., 35, 826-833. 

   Weigend, A. (1994), "On overfitting and the effective number of hidden
   units," Proceedings of the 1993 Connectionist Models Summer School,
   335-342. 

------------------------------------------------------------------------

Subject: How can generalization error be estimated? 
====================================================

There are many methods for estimating generalization error. 

Single-sample statistics: AIC, SBC, FPE, Mallows' C_p, etc. 
   In linear models, statistical theory provides several simple estimators
   of the generalization error under various sampling assumptions
   (Darlington 1968, Efron and Tibshirani 1993). These estimators adjust the
   training error for the number of weights being estimated, and in some
   cases for the noise variance if that is known. See 
   ftp://ftp.sas.com/pub/neural/tnn3.html for some formulas. These
   statistics can also be used as crude estimates of the generalization
   error in nonlinear models when you have a "large" training set.
   Correcting these statistics for nonlinearity requires substantially more
   computation (Moody 1992), and the theory does not always hold for neural
   networks due to violations of the regularity conditions. 
Split-sample validation. 
   The most commonly used method for estimating generalization error in
   neural networks is to reserve part of the data as a test set, which must
   not be used in any way during training. The test set must be a
   representative sample of the cases that you want to generalize to. After
   training, run the network on the test set, and the error on the test set
   provides an unbiased estimate of the generalization error, provided that
   the test set was chosen randomly. The disadvantage of split-sample
   validation is that it reduces the amount of data available for both
   training and validation. See Weiss and Kulikowski (1991). 
Cross-validation (e.g., leave one out). 
   Cross-validation is an improvement on split-sample validation that allows
   you to use all of the data for training. The disadvantage of
   cross-validation is that you have to retrain the net many times. See 
   "What are cross-validation and bootstrapping?". 
Bootstrapping. 
   Bootstrapping is an improvement on cross-validation that often provides
   better estimates of generalization error at the cost of even more
   computing time. See "What are cross-validation and bootstrapping?". 

If you use any of the above methods to choose which of several different
networks to use for prediction purposes, the estimate of the generalization
error of the best network will be optimistic. For example, if you train
several networks using one data set, and use a second (validation set) data
set to decide which network is best, you must use a third (test set) data
set to obtain an unbiased estimate of the generalization error of the chosen
network. Hjorth (1994) explains how this principle extends to
cross-validation and bootstrapping. 

References: 

   Darlington, R.B. (1968), "Multiple Regression in Psychological Research
   and Practice," Psychological Bulletin, 69, 161-182. 

   Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap,
   London: Chapman & Hall. 

   Hjorth, J.S.U. (1994), Computer Intensive Statistical Methods Validation,
   Model Selection, and Bootstrap, London: Chapman & Hall. 

   Moody, J.E. (1992), "The Effective Number of Parameters: An Analysis of
   Generalization and Regularization in Nonlinear Learning Systems", NIPS 4,
   847-854. 

   Weiss, S.M. & Kulikowski, C.A. (1991), Computer Systems That Learn,
   Morgan Kaufmann. 

------------------------------------------------------------------------

Subject: What are cross-validation and bootstrapping? 
======================================================

Cross-validation and bootstrapping are both methods for estimating
generalization error based on "resampling". In k-fold cross-validation, you
divide the data into k subsets of equal size. You train the net k times,
each time leaving out one of the subsets from training, but using only the
omitted subset to compute whatever error criterion interests you. If k
equals the sample size, this is called leave-one-out cross-validation. A
more elaborate and expensive version of cross-validation involves leaving
out all possible subsets of a given size. 
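A minimal sketch of k-fold cross-validation, using a linear least-squares fit as a hypothetical stand-in for network training (the fit and predict callables and the data are invented for illustration):

```python
import numpy as np

def kfold_mse(X, Y, k, fit, predict):
    """Estimate generalization MSE by k-fold cross-validation:
    train k times, each time testing on the omitted fold."""
    idx = np.arange(len(Y))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        model = fit(X[train], Y[train])
        errs.append(np.mean((predict(model, X[fold]) - Y[fold]) ** 2))
    return np.mean(errs)

# Usage with a linear model in place of a network:
rng = np.random.default_rng(4)
X = rng.normal(size=(60, 3))
Y = X @ np.array([1.0, 2.0, -1.0]) + 0.5 * rng.normal(size=60)
fit = lambda X, Y: np.linalg.lstsq(X, Y, rcond=None)[0]
predict = lambda b, X: X @ b
print(kfold_mse(X, Y, 10, fit, predict))
```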

Note that cross-validation is quite different from the "split-sample" or
"hold-out" method that is commonly used for early stopping in neural nets.
In the split-sample method, only a single subset (the validation set) is
used to estimate the error function, instead of k different subsets; i.e.,
there is no "crossing". While various people have suggested that
cross-validation be applied to early stopping, the proper way of doing that
is not obvious. 

Cross-validation is also easily confused with jackknifing. Both involve
omitting each training case in turn and retraining the network on the
remaining subset. But cross-validation is used to estimate generalization
error, while the jackknife is used to estimate the bias of a statistic. In
the jackknife, you compute some statistic of interest in each subset of the
data. The average of these subset statistics is compared with the
corresponding statistic computed from the entire sample in order to estimate
the bias of the latter. You can also get a jackknife estimate of the
standard error of a statistic. 
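A sketch of the jackknife bias estimate, applied to a statistic whose bias is known exactly: the maximum-likelihood variance (dividing by n). Here the jackknife correction recovers the usual unbiased n-1 estimator (the sample itself is invented):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=30)
n = len(x)

# Statistic of interest: the (biased) maximum-likelihood variance.
theta_full = np.var(x)                                   # divides by n
theta_loo = np.array([np.var(np.delete(x, i)) for i in range(n)])

# Jackknife bias estimate: (n-1) * (mean of leave-one-out stats - full stat)
bias = (n - 1) * (theta_loo.mean() - theta_full)
theta_corrected = theta_full - bias

# The corrected value equals the variance computed with divisor n-1.
print(theta_full, theta_corrected)
```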

Leave-one-out cross-validation often works well for continuous error
functions such as the mean squared error, but it may perform poorly for
noncontinuous error functions such as the number of misclassified cases. In
the latter case, k-fold cross-validation is preferred. But if k gets too
small, the error estimate is pessimistically biased because of the
difference in sample size between the full-sample analysis and the
cross-validation analyses. A value of 10 for k is popular. 

Bootstrapping seems to work better than cross-validation in many cases. In
the simplest form of bootstrapping, instead of repeatedly analyzing subsets
of the data, you repeatedly analyze subsamples of the data. Each subsample
is a random sample with replacement from the full sample. Depending on what
you want to do, anywhere from 200 to 2000 subsamples might be used. There
are many more sophisticated bootstrap methods that can be used not only for
estimating generalization error but also for estimating confidence bounds
for network outputs. 
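The simplest form of bootstrapping can be sketched as follows (the statistic, sample, and number of subsamples are arbitrary choices; for the mean, the bootstrap standard error can be checked against the textbook formula s/sqrt(n)):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(loc=5.0, scale=2.0, size=100)

# Each subsample is a random sample WITH replacement from the full
# sample; recompute the statistic on each subsample.
B = 1000
boot_means = np.array([rng.choice(x, size=len(x), replace=True).mean()
                       for _ in range(B)])

# Spread of the replicates estimates the standard error of the mean.
se_boot = boot_means.std()
se_formula = x.std(ddof=1) / np.sqrt(len(x))
print(se_boot, se_formula)  # the two estimates roughly agree
```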

References: 

   Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap,
   London: Chapman & Hall. 

   Hjorth, J.S.U. (1994), Computer Intensive Statistical Methods Validation,
   Model Selection, and Bootstrap, London: Chapman & Hall. 

   Masters, T. (1995) Advanced Algorithms for Neural Networks: A C++
   Sourcebook, NY: John Wiley and Sons, ISBN 0-471-10588-0 

   Weiss, S.M. & Kulikowski, C.A. (1991), Computer Systems That Learn,
   Morgan Kaufmann. 

------------------------------------------------------------------------

Subject: Should I normalize/standardize/rescale the data? 
==========================================================

First, some definitions. "Rescaling" a vector means to add or subtract a
constant and then multiply or divide by a constant, as you would do to
change the units of measurement of the data, for example, to convert a
temperature from Celsius to Fahrenheit. 

"Normalizing" a vector most often means dividing by a norm of the vector,
for example, to make the Euclidean length of the vector equal to one. In the
NN literature, "normalizing" also often refers to rescaling by the minimum
and range of the vector, to make all the elements lie between 0 and 1. 

"Standardizing" a vector most often means subtracting a measure of location
and dividing by a measure of scale. For example, if the vector contains
random values with a Gaussian distribution, you might subtract the mean and
divide by the standard deviation, thereby obtaining a "standard normal"
random variable with mean 0 and standard deviation 1. 
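The three definitions can be illustrated on a small vector (the numbers are arbitrary):

```python
import numpy as np

v = np.array([10.0, 20.0, 25.0, 45.0])

# Rescaling: add/subtract and multiply/divide by constants,
# e.g. converting Celsius to Fahrenheit.
fahrenheit = v * 9.0 / 5.0 + 32.0

# Normalizing, most common sense: divide by a norm of the vector
# so its Euclidean length is one.
unit = v / np.linalg.norm(v)

# Normalizing, NN-literature sense: rescale by minimum and range
# so all elements lie between 0 and 1.
zero_one = (v - v.min()) / (v.max() - v.min())

# Standardizing: subtract a location measure, divide by a scale
# measure, e.g. mean 0 and standard deviation 1.
standard = (v - v.mean()) / v.std()

print(np.linalg.norm(unit), zero_one.min(), zero_one.max())
```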

However, all of the above terms are used more or less interchangeably
depending on the customs within various fields. Since the FAQ maintainer is
a statistician, he is going to use the term "standardize" because that is
what he is accustomed to. 

Now the question is, should you do any of these things to your data? The
answer is, it depends. 

There is a common misconception that the inputs to a multilayer perceptron
must be in the interval [0,1]. There is in fact no such requirement,
although there often are benefits to standardizing the inputs as discussed
below. 

If your output activation function has a range of [0,1], then obviously you
must ensure that the target values lie within that range. But it is
generally better to choose an output activation function suited to the
distribution of the targets than to force your data to conform to the output
activation function. See "Why use activation functions?" 

When using an output activation with a range of [0,1], some people prefer to
rescale the targets to a range of [.1,.9]. I suspect that the popularity of
this gimmick is due to the slowness of standard backprop. But using a target
range of [.1,.9] for a classification task gives you incorrect posterior
probability estimates, and it is quite unnecessary if you use an efficient
training algorithm (see "What are conjugate gradients, Levenberg-Marquardt,
etc.?") 

Now for some of the gory details: note that the training data form a matrix.
You could set up this matrix so that each case forms a row, and the inputs
and target values form columns. You could conceivably standardize the rows
or the columns or both or various other things, and these different ways of
choosing vectors to standardize will have quite different effects on
training. 

Subquestion: Should I standardize the input column vectors?
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

That depends primarily on how the network combines inputs to compute the net
input to the next (hidden or output) layer. If the inputs are combined
linearly, as in a multilayer perceptron, then it is rarely strictly
necessary to standardize the inputs, at least in theory. The reason is that
any rescaling of an input vector can be effectively undone by changing the
corresponding weights and biases, leaving you with the exact same outputs as
you had before. However, there are a variety of practical reasons why
standardizing the inputs can make training faster and reduce the chances of
getting stuck in local optima. 

If the inputs are combined via some distance function, such as Euclidean
distance as in a radial basis-function network, standardizing inputs can be
crucial. Rescaling an input cannot be undone by adjusting the weights. The
contribution of an input will depend heavily on its variability relative to
other inputs. If one input has a range of 0 to 1, while another input has a
range of 0 to 1,000,000, then the contribution of the first input to the
distance will be swamped by the second input. So it is essential to rescale
the inputs so that their variability reflects their importance, or at least
is not in inverse relation to their importance. For lack of better prior
information, it is common to standardize each input to the same range or the
same standard deviation. 
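A small numeric sketch (invented values) of why unequal input scales swamp a Euclidean distance, and how standardizing each column repairs it:

```python
import numpy as np

# Two inputs on wildly different scales: [0,1] vs [0,1000000].
X = np.array([[0.1, 200000.0],
              [0.9, 800000.0],
              [0.5, 500000.0],
              [0.3, 100000.0]])

# Squared distance between cases 0 and 1 is dominated by input 2.
d2_raw = (X[0] - X[1]) ** 2
frac_raw = d2_raw[0] / d2_raw.sum()   # input 1's share: essentially zero

# Standardize each column so both inputs contribute comparably.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
d2_std = (Z[0] - Z[1]) ** 2
frac_std = d2_std[0] / d2_std.sum()   # input 1's share: now substantial

print(frac_raw, frac_std)
```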

For more details, see: ftp://ftp.sas.com/pub/neural/tnn3.html. 

Subquestion: Should I standardize the target column vectors?
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Standardizing targets is typically more a convenience for getting good
initial weights than a necessity. However, if you have two or more target
variables and your error function is scale-sensitive like the usual least
(mean) squares error function, then the variability of each target relative
to the others can affect how well the net learns that target. If one target
has a range of 0 to 1, while another target has a range of 0 to 1,000,000,
the net will expend most of its effort learning the second target to the
possible exclusion of the first. So it is essential to rescale the targets
so that their variability reflects their importance, or at least is not in
inverse relation to their importance. If the targets are of equal
importance, they should typically be standardized to the same range or the
same standard deviation. 

For more details, see: ftp://ftp.sas.com/pub/neural/tnn3.html. 

Subquestion: Should I standardize the input cases (row vectors)?
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Whereas standardizing variables is usually beneficial, the effect of
standardizing cases (row vectors) depends on the particular data. Cases are
typically standardized only across the input variables, since including the
target variable(s) in the standardization would make prediction impossible. 

There are some kinds of networks, such as simple Kohonen nets, where it is
necessary to standardize the input cases to a common Euclidean length; this
is a side effect of the use of the inner product as a similarity measure. If
the network is modified to operate on Euclidean distances instead of inner
products, it is no longer necessary to standardize the input cases. 

Standardization of cases should be approached with caution because it
discards information. If that information is irrelevant, then standardizing
cases can be quite helpful. If that information is important, then
standardizing cases can be disastrous. Issues regarding the standardization
of cases must be carefully evaluated in every application. There are no
rules of thumb that apply to all applications. 

For more details, see: ftp://ftp.sas.com/pub/neural/tnn3.html. 

------------------------------------------------------------------------

Subject: What is ART?
=====================

ART stands for "Adaptive Resonance Theory", invented by Stephen Grossberg in
1976. ART encompasses a wide variety of neural networks based explicitly on
neurophysiology. ART networks are defined algorithmically in terms of
detailed differential equations intended as plausible models of biological
neurons. In practice, ART networks are implemented using analytical
solutions or approximations to these differential equations. 

ART comes in several flavors, both supervised and unsupervised. As discussed
by Moore (1988), the unsupervised ARTs are basically similar to many
iterative clustering algorithms in which each case is processed by: 

1. finding the "nearest" cluster seed (AKA prototype or template) to that
   case 
2. updating that cluster seed to be "closer" to the case 

where "nearest" and "closer" can be defined in hundreds of different ways.
In ART, the framework is modified slightly by introducing the concept of
"resonance" so that each case is processed by: 

1. finding the "nearest" cluster seed that "resonates" with the case 
2. updating that cluster seed to be "closer" to the case 

"Resonance" is just a matter of being within a certain threshold of a second
similarity measure. A crucial feature of ART is that if no seed resonates
with the case, a new cluster is created as in Hartigan's (1975) leader
algorithm. This feature is said to solve the stability/plasticity dilemma. 
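The two-step loop with resonance can be sketched as a leader-style clustering algorithm. This is an illustrative skeleton, not any particular ART network: the Euclidean distance, the vigilance threshold, and the learning rate are arbitrary choices made for the sketch.

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def art_like_clustering(cases, vigilance, rate=0.5):
    """Leader-style clustering: assign each case to the nearest seed
    that "resonates" (lies within the vigilance distance); if no seed
    resonates, create a new cluster seeded at the case."""
    seeds = []
    for case in cases:
        if seeds:
            nearest = min(seeds, key=lambda s: euclid(s, case))
        if not seeds or euclid(nearest, case) > vigilance:
            seeds.append(list(case))          # no resonance: new cluster
        else:
            for i, x in enumerate(case):      # move seed closer to the case
                nearest[i] += rate * (x - nearest[i])
    return seeds
```

Note that presenting the same cases in a different order can yield different seeds, which illustrates the order sensitivity discussed below.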

ART has its own jargon. For example, data are called an arbitrary sequence
of input patterns. The current training case is stored in short term memory
and cluster seeds are long term memory. A cluster is a maximally compressed
pattern recognition code. The two stages of finding the nearest seed to the
input are performed by an Attentional Subsystem and an Orienting Subsystem;
the latter performs "hypothesis testing," which simply refers to the
comparison with the vigilance threshold, not to hypothesis testing in the
statistical sense. Stable learning means that the algorithm converges. So
the oft-repeated claim that ART algorithms are "capable of rapid stable
learning of recognition codes in response to arbitrary sequences of input
patterns" merely means that ART algorithms are clustering algorithms that
converge; it does not mean, as one might naively assume, that the clusters
are insensitive to the sequence in which the training patterns are
presented--quite the opposite is true. 

There are various supervised ART algorithms that are named with the suffix
"MAP", as in Fuzzy ARTMAP. These algorithms cluster both the inputs and
targets and associate the two sets of clusters. The effect is somewhat
similar to counterpropagation. The main disadvantage of the ARTMAP
algorithms is that they have no mechanism to avoid overfitting and hence
should not be used with noisy data. 

For more information, see the ART FAQ at http://www.wi.leidenuniv.nl/art/
and the "ART Headquarters" at Boston University, http://cns-web.bu.edu/. For
a different view of ART, see Sarle, W.S. (1995), "Why Statisticians Should
Not FART," ftp://ftp.sas.com/pub/neural/fart.doc. 

References: 

   Carpenter, G.A., Grossberg, S. (1996), "Learning, Categorization, Rule
   Formation, and Prediction by Fuzzy Neural Networks," in Chen, C.H., ed.
   (1996) Fuzzy Logic and Neural Network Handbook, NY: McGraw-Hill, pp.
   1.3-1.45. 

   Hartigan, J.A. (1975), Clustering Algorithms, NY: Wiley. 

   Kasuba, T. (1993), "Simplified Fuzzy ARTMAP," AI Expert, 8, 18-25. 

   Moore, B. (1988), "ART 1 and Pattern Clustering," in Touretzky, D.,
   Hinton, G. and Sejnowski, T., eds., Proceedings of the 1988
   Connectionist Models Summer School, 174-185, San Mateo, CA: Morgan
   Kaufmann. 

------------------------------------------------------------------------

Subject: What is PNN?
=====================

PNN or "Probabilistic Neural Network" is Donald Specht's term for kernel
discriminant analysis. You can think of it as a normalized RBF network in
which there is a hidden unit centered at every training case. These RBF
units are called "kernels" and are usually probability density functions
such as the Gaussian. The hidden-to-output weights are usually 1 or 0; for
each hidden unit, a weight of 1 is used for the connection going to the
output that the case belongs to, while all other connections are given
weights of 0. Alternatively, you can adjust these weights for the prior
probabilities of each class. So the only weights that need to be learned are
the widths of the RBF units. These widths (often a single width is used) are
called "smoothing parameters" or "bandwidths" and are usually chosen by
cross-validation or by more esoteric methods that are not well-known in the
neural net literature; gradient descent is not used. 
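The forward computation described above can be sketched as follows. This is an illustrative reading of the description, not Specht's code: it assumes Gaussian kernels, a single bandwidth, and uniform class priors, and the names are invented.

```python
import math
from collections import defaultdict

def pnn_classify(train_x, train_y, x, bandwidth):
    """Kernel discriminant analysis with a Gaussian kernel: one RBF
    unit is centered at each training case, and each unit contributes
    (with hidden-to-output weight 1) to the score of its own class."""
    score = defaultdict(float)
    for xi, yi in zip(train_x, train_y):
        d2 = sum((a - b) ** 2 for a, b in zip(xi, x))
        score[yi] += math.exp(-d2 / (2.0 * bandwidth ** 2))
    return max(score, key=score.get)
```

Note that classifying a single case touches every training case, which is the cost discussed in the next paragraph.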

Specht's claim that a PNN trains 100,000 times faster than backprop is at
best misleading. While they are not iterative in the same sense as backprop,
kernel methods require that you estimate the kernel bandwidth, and this
requires accessing the data many times. Furthermore, computing a single
output value with kernel methods requires either accessing the entire
training data or clever programming, and either way is much slower than
computing an output with a feedforward net. And there are a variety of
methods for training feedforward nets that are much faster than standard
backprop. So depending on what you are doing and how you do it, PNN may be
either faster or slower than a feedforward net. 

PNN is a universal approximator for smooth class-conditional densities, so
it should be able to solve any smooth classification problem given enough
data. The main drawback of PNN is that, like kernel methods in general, it
suffers badly from the curse of dimensionality. PNN cannot ignore irrelevant
inputs without major modifications to the basic algorithm. So PNN is not
likely to be the top choice if you have more than 5 or 6 nonredundant
inputs. 

But if all your inputs are relevant, PNN has the very useful ability to tell
you whether a test case is similar (i.e. has a high density) to any of the
training data; if not, you are extrapolating and should view the output
classification with skepticism. This ability is of limited use when you have
irrelevant inputs, since the similarity is measured with respect to all of
the inputs, not just the relevant ones. 

References: 

   Hand, D.J. (1982) Kernel Discriminant Analysis, Research Studies Press. 

   McLachlan, G.J. (1992) Discriminant Analysis and Statistical Pattern
   Recognition, Wiley. 

   Masters, T. (1995) Advanced Algorithms for Neural Networks: A C++
   Sourcebook, NY: John Wiley and Sons, ISBN 0-471-10588-0 

   Michie, D., Spiegelhalter, D.J. and Taylor, C.C. (1994) Machine
   Learning, Neural and Statistical Classification, Ellis Horwood. 

   Scott, D.W. (1992) Multivariate Density Estimation, Wiley. 

   Specht, D.F. (1990) "Probabilistic neural networks," Neural Networks, 3,
   110-118. 

------------------------------------------------------------------------

Subject: What is GRNN?
======================

GRNN or "General Regression Neural Network" is Donald Specht's term for
Nadaraya-Watson kernel regression, also reinvented in the NN literature by
Schi\oler and Hartmann. You can think of it as a normalized RBF network in
which there is a hidden unit centered at every training case. These RBF
units are called "kernels" and are usually probability density functions
such as the Gaussian. The hidden-to-output weights are just the target
values, so the output is simply a weighted average of the target values of
training cases close to the given input case. The only weights that need to
be learned are the widths of the RBF units. These widths (often a single
width is used) are called "smoothing parameters" or "bandwidths" and are
usually chosen by cross-validation or by more esoteric methods that are not
well-known in the neural net literature; gradient descent is not used. 
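The Nadaraya-Watson estimate underlying GRNN can be sketched directly (an illustrative version with a Gaussian kernel and a single bandwidth; the names are invented):

```python
import math

def grnn_predict(train_x, train_y, x, bandwidth):
    """Nadaraya-Watson kernel regression: the output is a weighted
    average of the training targets, with each weight given by a
    Gaussian kernel on the distance from x to that training input."""
    weights = []
    for xi in train_x:
        d2 = sum((a - b) ** 2 for a, b in zip(xi, x))
        weights.append(math.exp(-d2 / (2.0 * bandwidth ** 2)))
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, train_y)) / total
```

A query equidistant from two training inputs gets the average of their targets, since the kernel weights are then equal.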

GRNN is a universal approximator for smooth functions, so it should be able
to solve any smooth function-approximation problem given enough data. The
main drawback of GRNN is that, like kernel methods in general, it suffers
badly from the curse of dimensionality. GRNN cannot ignore irrelevant inputs
without major modifications to the basic algorithm. So GRNN is not likely to
be the top choice if you have more than 5 or 6 nonredundant inputs. 

References: 

   Caudill, M. (1993), "GRNN and Bear It," AI Expert, Vol. 8, No. 5 (May),
   28-33. 

   Haerdle, W. (1990), Applied Nonparametric Regression, Cambridge Univ.
   Press. 

   Masters, T. (1995) Advanced Algorithms for Neural Networks: A C++
   Sourcebook, NY: John Wiley and Sons, ISBN 0-471-10588-0 

   Nadaraya, E.A. (1964) "On estimating regression", Theory Probab. Applic.
   10, 186-90. 

   Schi\oler, H. and Hartmann, U. (1992) "Mapping Neural Network Derived
   from the Parzen Window Estimator", Neural Networks, 5, 903-909. 

   Specht, D.F. (1968) "A practical technique for estimating general
   regression surfaces," Lockheed report LMSC 6-79-68-6, Defense Technical
   Information Center AD-672505. 

   Specht, D.F. (1991) "A Generalized Regression Neural Network", IEEE
   Transactions on Neural Networks, 2, Nov. 1991, 568-576. 

   Watson, G.S. (1964) "Smooth regression analysis", Sankhy{\=a}, Series A,
   26, 359-72. 

------------------------------------------------------------------------

Subject: What about Genetic Algorithms?
=======================================

There are a number of definitions of GA (Genetic Algorithm). A possible one
is

  A GA is an optimization program
  that starts with
  a population of encoded procedures,       (Creation of Life :-> )
  mutates them stochastically,              (Get cancer or so :-> )
  and uses a selection process              (Darwinism)
  to prefer the mutants with high fitness
  and perhaps a recombination process       (Make babies :-> )
  to combine properties of (preferably) the successful mutants.
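The recipe above can be sketched as a toy GA for maximizing a fitness function. This is only an illustration of the four ingredients (population, mutation, selection, recombination); real GAs differ widely in encoding, selection scheme, and operators, and everything here (Gaussian mutation, averaging crossover, truncation selection) is an arbitrary choice.

```python
import random

def toy_ga(fitness, pop, generations, rate=0.1, rng=random):
    """Minimal GA: mutate, recombine, then keep the fittest."""
    size = len(pop)
    for _ in range(generations):
        # mutation: perturb each gene stochastically
        mutants = [[g + rng.gauss(0.0, rate) for g in ind] for ind in pop]
        # recombination: average the genes of two random parents
        babies = [[(a + b) / 2.0
                   for a, b in zip(rng.choice(pop), rng.choice(pop))]
                  for _ in range(size)]
        # selection: retain only the fittest individuals (Darwinism)
        pop = sorted(pop + mutants + babies, key=fitness, reverse=True)[:size]
    return pop[0]

# Example: maximize f(x) = -(x - 3)^2, whose optimum is at x = 3
best = toy_ga(lambda ind: -(ind[0] - 3.0) ** 2,
              [[random.uniform(-10.0, 10.0)] for _ in range(20)], 50)
```

Because the old population competes in the selection step, the best individual never gets worse from one generation to the next.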

Genetic Algorithms are just a special case of the more general idea of
``evolutionary computation''. There is a newsgroup that is dedicated to the
field of evolutionary computation called comp.ai.genetic. It has a detailed
FAQ posting which, for instance, explains the terms "Genetic Algorithm",
"Evolutionary Programming", "Evolution Strategy", "Classifier System", and
"Genetic Programming". That FAQ also contains lots of pointers to relevant
literature, software, other sources of information, et cetera et cetera.
Please see the comp.ai.genetic FAQ for further information. 

------------------------------------------------------------------------

Subject: What about Fuzzy Logic?
================================

Fuzzy Logic is an area of research based on the work of L.A. Zadeh. It is a
departure from classical two-valued sets and logic: it uses "soft"
linguistic system variables (e.g. large, hot, tall) and a continuous range
of truth values in the interval [0,1], rather than strict binary (True or
False) decisions and assignments.

Fuzzy logic is used where a system is difficult to model exactly (but an
inexact model is available), is controlled by a human operator or expert, or
where ambiguity or vagueness is common. A typical fuzzy system consists of a
rule base, membership functions, and an inference procedure.
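A membership function maps a crisp value to a degree of truth in [0,1]. As a minimal sketch (the triangular shape and the "tall" example are invented for illustration, not taken from any particular fuzzy system):

```python
def triangular(a, b, c):
    """Return a triangular membership function that is 0 outside
    [a, c] and rises linearly to 1 at the peak b."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)
    return mu

tall = triangular(160.0, 190.0, 220.0)   # height in cm
```

Here a height of 190 cm is "tall" to degree 1, while 175 cm is "tall" only to degree 0.5, rather than being classified as strictly tall or not tall.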

Most Fuzzy Logic discussion takes place in the newsgroup comp.ai.fuzzy, but
there is also some work (and discussion) about combining fuzzy logic with
Neural Network approaches in comp.ai.neural-nets.

References: 

   Chen, C.H., ed. (1996) Fuzzy Logic and Neural Network Handbook, NY:
   McGraw-Hill, ISBN 0-07-011189-8. 

   Klir, G.J. and Folger, T.A.(1988), Fuzzy Sets, Uncertainty, and
   Information, Englewood Cliffs, N.J.: Prentice-Hall. 

   Kosko, B.(1992), Neural Networks and Fuzzy Systems, Englewood Cliffs,
   N.J.: Prentice-Hall. 

------------------------------------------------------------------------

Next part is part 3 (of 7). Previous part is part 1. 

-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
