.ce
Fast Back-Propagation
.ce
Copyright (c) 1990 by Donald R. Tveter


.ul
Introduction

   The programs described below were produced for my own use in studying
back-propagation and for doing the experiments found in my introductory
Artificial Intelligence textbook, \fIThe Basis of Artificial
Intelligence\fR, to be published by Computer Science Press.  I have
copyrighted these files but I hereby give permission to anyone to
use them for experimentation, educational purposes or to redistribute
them on a not-for-profit basis.  Anyone who wants to use, change
or redistribute these programs for commercial purposes should contact
me by mail at:

.na
.nf
                  Dr. Donald R. Tveter
                  5228 N. Nashville Ave.
                  Chicago, Illinois   60656
                  USENET:  drt@chinet.chi.il.us
.ad
.fi

Also, I would be interested in hearing your suggestions, bug reports
and major successes or failures.

   There are four simulators that can be constructed from the
included files.  The program rbp does back-propagation using double
precision floating point weights and arithmetic.  The program bp does
back-propagation using 16-bit integer weights, 16 and 32-bit integer
arithmetic and some double precision floating point arithmetic.  The
program sbp uses 16-bit integer symmetric weights but only allows
two-layer networks.  The program srbp does the same using 64-bit
floating point weights.  The purpose of sbp and srbp is to produce
networks that can be used with the Boltzmann machine relaxation
algorithm (not included).

   In most cases, the 16-bit integer programs are the most useful,
because they are the fastest.  With a 10 MHz 68010, connections can be
processed at up to about 45,000 per second and weight changes can be
done at up to about 25,000 per second.  These values depend on the exact
problem.  The integer versions will probably be faster on most machines
than the versions that use real arithmetic.  Unfortunately, sometimes
16-bit integer weights don't have enough range or precision and then
using the floating point versions may be necessary.  Many other speed-up
techniques are included in these programs.
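
   The speed of the integer versions comes from fixed-point arithmetic,
where a real weight w is stored as the integer w * 1024.  Here is a
minimal sketch of the idea in Python (an illustration only; the names
are made up and this is not the code in bp):

```python
# Sketch of 16-bit fixed-point weight arithmetic: a real value w is
# stored as the integer round(w * 1024), giving ten fractional bits.
# Names are hypothetical; this is not the arithmetic code from bp.

SCALE = 1024          # 2**10

def to_fixed(x):
    """Encode a real value as a fixed-point integer."""
    return int(round(x * SCALE))

def to_real(f):
    """Decode a fixed-point integer back to a real value."""
    return f / SCALE

def fixed_mul(a, b):
    """Multiply two fixed-point values using a wider intermediate,
    then rescale.  Dividing by 1024 is a right shift by 10, which
    is what the -DSMART option assumes the compiler will generate."""
    return (a * b) >> 10

w = to_fixed(2.5)                  # stored as 2560
x = to_fixed(0.5)                  # stored as 512
print(to_real(fixed_mul(w, x)))    # prints 1.25
```

A product of two 16-bit values needs a 32-bit intermediate before the
shift, which is presumably why bp uses both 16 and 32-bit integer
arithmetic.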

.ul
Making the Simulators

   To make a particular executable file, use the makefile given
with the data files and make any or all of them like so:

.ce
make bp
.ce
make sbp
.ce
make rbp
.ce
make srbp

One option exists for bp and sbp.  If your compiler is smart enough
to divide by 1024 by shifting, use "-DSMART".

   To make a record of all the input and output from the programs,
the following small UNIX shell script, which I call record, can be used:

.na
.nf
trap "" 2                        # keep interrupts from killing the logging
outfile="${1}.record"
if test -f "$outfile"
   then
      rm "$outfile"
   fi
echo "$outfile"
(tee -a "$outfile" | "$@") | tee -a "$outfile"   # log both input and output
process=`ps | grep tee | cut -c1-6`
kill $process                    # stop the leftover tee reading stdin
.ad
.fi

For example, to make a record of all the input and output from the
program bp using the data file xor, use:

.ce
record bp xor


.ul
A Simple Example

  Each version would normally be called with the name of a file to read
commands from, as in:

.ce
bp xor

When no file name is specified, bp expects to take commands from the
keyboard (UNIX stdin file).  After the file name from the command line
is read and the commands in the file are executed, commands are then
taken from the keyboard.

   The commands are one-letter commands.  Most commands have
optional parameters.  The `*' character starts a comment; the
remainder of the line is ignored.  Here is an example of an input
file to do the xor problem:
           
.na
.nf
* input file for the xor problem
           
m 2 1 1           * make a 2-1-1 network
c 1 1 3 1         * add this extra connection
c 1 2 3 1         * add this extra connection
s 7               * seed the random number function
k 0 1             * give the network random weights

n 4               * read four new patterns into memory
1 0 1
0 0 0
0 1 1
1 1 0

e 0.5             * set eta to 0.5 (and eta2 to 0.05)
a 0.9             * set alpha to 0.9
.ad
.fi

In this example, the m command is a command to make a network.  The
numbers following it are the number of units for each layer.  The m
command connects adjacent layers with weights.  The following c
commands create extra connections from layer 1, unit 1 to layer 3,
unit 1 and from layer 1, unit 2 to layer 3, unit 1.  The `s' command
sets the seed for the random number function.  The `k' command then
gives the network random weights.  The `k' command has another use as
well.  It can be used to try to kick a network out of a local minimum.
Here, the meaning of "k 0 1" is to examine all the weights in the
network and for every weight equal to 0 (and they all start out at 0),
add in a random number between -1 and +1.  The `n' command
specifies four new patterns to be read into memory.  With the `n'
command, any old patterns that may have been present are removed.
There is also an `x' command that behaves like the `n' command, except
that the `x' command \fIadds\fR the extra patterns to the current
training set.  The input pattern comes first, followed by the output
pattern.
The statement, e 0.5, sets eta, the learning rate, to 0.5 and eta2 from
the differential step size algorithm to one tenth this, or 0.05.  The
last line sets alpha, the momentum parameter, to 0.9.

   The above statements set up the network and when the list of
commands runs out, commands are taken from the keyboard.  The
following messages and prompt appear:

.na
.nf
.ne2
Fast Backpropagation Copyright (c) 1990 by Donald R. Tveter
taking commands from stdin now
[?!*AabCcEefHhijklmnoPpQqRrSstWwx]?
.ad
.fi

The square brackets enclose a list of the possible commands.
The `r' command is used to run the training algorithm.  Typing in "r 200
100" as shown below, means run 200 iterations through the patterns
and print the output patterns every 100 iterations:

.na
.nf
.ne3
[?!*AabCcEefHhijklmnoPpQqRrSstWwx]? r 200 100
running . . .
.ne5
100 iterations, s 7, k 0 1.00, file = xor
  1  0.81  (0.03739)
  2  0.13  (0.01637)
  3  0.85  (0.02262)
  4  0.17  (0.02988)
.ne5
159 iterations, s 7, k 0 1.00, file = xor
  1  0.90  (0.00973)
  2  0.07  (0.00467)
  3  0.92  (0.00565)
  4  0.09  (0.00739)
patterns learned to within 0.10 at iteration 159
.ad
.fi

The program immediately prints out the "running . . ." message.  After
each 100 iterations, a header line giving some program parameters
is printed out, followed by the results that occur when each of the four
patterns is submitted to the network.  If the second number defining
how often to print out values is omitted, the values will not print
even when the learning is finished.  The values in parentheses at the
end of each line give the sum of the squared error on the output units
for each output pattern.  These error numbers are useful to see because
they give you some idea of how fast each pattern is being learned.
The program also reports that the patterns have been learned to within
the default tolerance of 0.1.  This check for the tolerance being met
is done for every learning iteration.  Sometimes in the integer version
the program will do a few extra iterations before declaring
the problem done.  This is because of truncation errors in the
arithmetic done to check for convergence.

   A particular test pattern can be input to the network with the `p'
command, as in:

.na
.nf
.ne2
[?!*AabCcEefHhijklmnoPpQqRrSstWwx]? p 1 0
     0.91 
.ad
.fi

To have the system evaluate a particular stored pattern, say pattern
number 4, use the `P' command as in:

.na
.nf
.ne2
[?!*AabCcEefHhijklmnoPpQqRrSstWwx]? P4
  4  0.09  (0.00739)
.ad
.fi

To print all the values for all the training patterns without doing
any learning, type `P':

.na
.nf
.ne5
[?!*AabCcEefHhijklmnoPpQqRrSstWwx]? P
  1  0.90  (0.00973)
  2  0.07  (0.00467)
  3  0.92  (0.00565)
  4  0.09  (0.00739)
.ad
.fi

   One thing you might want to know is the values of the weights
that have been produced.  To see them, there is the `w' command.
The `w' command gives the value of the weights leading into
a particular unit and also data about how the activation value of the
unit is computed.  Two integers after the w specify the layer and
unit number within the layer whose weights should be printed.  For
example, if you want the weights leading into the unit at layer 2,
position number 1, type:

.na
.nf
.ne6
[?!*AabCcEefHhijklmnoPpQqRrSstWwx]? w 2 1
layer unit  unit value     weight         input from unit
  1      1    1.00000     7.27930             7.27930
  1      2    1.00000    -5.66797            -5.66797
  2      t    1.00000     2.74902             2.74902
                                      sum =   4.36035
.ad
.fi

In this example, the unit at layer 2, number 1 is receiving input from
units 1 and 2 in the previous (the input) layer and from a unit, t.
Unit t is the threshold unit.  The "unit value" column gives the value
of the input units for the last time some pattern was placed on the
input units.  In this case, the fourth pattern was the last one that the
network has seen.  The next column lists the weights on the connections
into the unit at (2,1).  The final column is the result from multiplying
together the unit value and the weight.  Beneath this column, the sum of
the inputs is given.

   Another important command is the help command.  It is the letter
`h' (not `?') followed by the letter of the command.  The help command
will give a brief summary of how to use the command.  Here, we type
h h for help with help:

.na
.nf
.ne3
[?!*AabCcEefHhijklmnoPpQqRrSstWwx]? h h

h <letter> gives help for command <letter>.
.ad
.fi

   Finally, to end the program, the `q' (for quit) command is entered:

[?!*AabCcEefHhijklmnoPpQqRrSstWwx]? q

.ul
Input and Output Formats

   The programs are able to read patterns in two different formats.  The
default input format is the compressed (condensed) format.  In it, each
value is one character and it is not necessary to have blanks between
the characters.  For example, in compressed format, the patterns for xor
could be written out in either of the following ways:

.ce
101               10 1
.ce
000               00 0
.ce
011               01 1
.ce
110               11 0

The second example is preferable because it makes it
easier to see the input and the output patterns.  Compressed format can
also be used to input patterns with the `p' command.
In addition to using 1 and 0 as input, the character, `?' can be used.
This character is initially defined to be 0.5, but it can be redefined
using the Q command like so:

.ce
Q 0.7

This sets the value of ? to 0.7.  Other valid input characters are the
letters, `h', `i', `j' and `k'.  The `h' stands for `hidden'.  Its
meaning in an input string is that the value at this point in the string
should be taken from the next unit in the second layer of the network.
Normally this will be the second layer of a three-layer network.  This
notation is useful for specifying simple recurrent
networks.  Naturally, `i', `j' and `k' stand for taking input
values from the third, fourth and fifth layers (if they exist).  A
simple example of a recurrent network is given later.

   The other input format for numbers is real.  The number portion must
start with a digit (.35 is not allowed, but 0.35 is).  Exponential
notation is not allowed.  Real numbers have to be separated by a space.
The `h', `i', `j', `k' and `?' characters are also allowed with real
input patterns.  To take input in this format, it is necessary
to set the input format to be real using the `f' (format) command as in:

.ce
f ir

To change back to the compressed format, use:

.ce
f ic

Output format is controlled with the `f' command as in:

.ce
f or
.ce
f oc
.ce
f oa

The first sets the output to real numbers.  The second sets the
output to condensed mode, where the value printed will be a `1' when
the unit value is greater than 1.0 - tolerance, a `^' when the value
is above 0.5 but less than 1.0 - tolerance, and a `v' when the value is
less than 0.5 but greater than the tolerance.  Below the tolerance
value, a `0' is printed.  The tolerance can be changed using the `t'
command.  For example, to make all values greater than 0.8 print
as `1' and all values less than 0.2 print as `0', use:

.ce
t 0.2

Of course, this same tolerance value is also used to check to see if all
the patterns have converged.  The third output format is meant to
give "analog condensed" output.  In this format, a `c' is printed when
a value is close enough to its target value.  Otherwise, a `1' is
printed if the answer is close to 1, a `0' if it is close to 0, a `^'
if it is above the target but not close to 1, and a `v' if it is
below the target but not close to 0.
This output format is designed for
problems where the output is a real number, as for instance, when the
problem is to make a network learn sin(x).
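
   The condensed output rules above can be summarized in a few lines of
code.  This sketch (in Python, with a made-up function name) is just a
restatement of the rules, not the routine used by the programs:

```python
def condensed(value, tolerance=0.1):
    """Map a unit value to its condensed-format character:
    '1' above 1.0 - tolerance, '0' below the tolerance,
    '^' for the upper middle range, 'v' for the lower middle range."""
    if value > 1.0 - tolerance:
        return '1'
    if value < tolerance:
        return '0'
    return '^' if value > 0.5 else 'v'

print(condensed(0.95))   # prints 1
print(condensed(0.70))   # prints ^
print(condensed(0.30))   # prints v
print(condensed(0.05))   # prints 0
```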

   With the f command, a number of sub-commands can be put on one line
as in the following, where the input is set to real and the output
is set to analog condensed:

.ce
f ir oa

Also, for the sake of convenience, the output format (and only the
output format) can be set without using the `f', so that:

.ce
or

will also make the output format real.

   In the condensed formats, the default is to print a blank after every
10 values.  This can be altered using the `b' (for inserting breaks)
command.  The use for this command is to separate output values into
logical groups to make the output more readable.  For instance, you may
have 24 output units where it makes sense to insert blanks after the
4th, 7th and 19th positions.  To do this, specify:

.ce
b 4 7 19

Then for example, the output will look like:

.na
.nf
  1 10^0 10^ ^000v00000v0 01000 (0.17577)
  2 1010 01v 0^0000v00000 ^1000 (0.16341)
  3 0101 10^ 00^00v00000v 00001 (0.16887)
  4 0100 0^0 000^00000v00 00^00 (0.19880)
.ad
.fi

The `b' command allows up to 20 break positions to be specified.
The default output format is the real format with 10 numbers per
line.  For the output of real values, the `b' command specifies when to
print a carriage return, rather than when to print a blank.

   Sometimes the training set is so large that it is annoying to
have all the patterns print out every n iterations.  To get a summary of
how learning is going, instead of all these patterns, use "f s+".
Now, if the command in the xor problem had been "r 200 50", the
following output summary would result:

.na
.nf
    50        0 learned      4 unlearned     0.48364 error/unit
   100        0 learned      4 unlearned     0.16528 error/unit
   150        3 learned      1 unlearned     0.08813 error/unit
   159        4 learned      0 unlearned     0.08203 error/unit
patterns learned to within 0.10 at iteration 159
.ad
.fi

The program counts up how many patterns were learned or not learned
in each training pass before the weights are updated.  Therefore, the
status is one iteration out of date.  The error/unit is the average
absolute value of the error on each unit for each pattern.  To switch
back to the longer report, use "f s-".  The P command will list all the
patterns no matter what the setting of the summary parameter is.
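
   The bookkeeping behind the summary line is simple.  Here is a
sketch of it (an illustration with made-up names, not the program's
own code):

```python
def summary(outputs, targets, tolerance=0.1):
    """Count learned and unlearned patterns and compute the average
    absolute error per output unit, as in the 'f s+' summary line."""
    learned = unlearned = 0
    total_err, n_units = 0.0, 0
    for out, tgt in zip(outputs, targets):
        errs = [abs(o - t) for o, t in zip(out, tgt)]
        total_err += sum(errs)
        n_units += len(errs)
        if all(e < tolerance for e in errs):
            learned += 1
        else:
            unlearned += 1
    return learned, unlearned, total_err / n_units

# The four xor outputs from iteration 159 against their targets:
print(summary([[0.90], [0.07], [0.92], [0.09]],
              [[1.0], [0.0], [1.0], [0.0]]))
```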

.ul
Saving and Restoring Weights and Related Values

   Sometimes the amount of time and effort needed to produce a set of
weights to solve a problem is so great that it is more convenient to
save the weights rather than constantly recalculate them.  Weights can
be saved as real values (the default) or as binary, to save space.  To
save the weights enter the command, `S'.  The weights are written on a
file called "weights".  The following file comes from the
xor problem:

.na
.nf
159r  file = xor
    7.2792968750
   -5.6679687500
    2.7490234375
    5.8486328125
   -5.0400390625
  -11.8574218750
    8.3193359375
.ad
.fi

To write the weights, the program starts with the second layer and
writes out the weights leading into its units in order, with the
threshold weight last.  It then moves on to the third layer, and so
on.  To restore these weights, type an `R' for
restore.  At this time, the program reads the header line and sets the
total number of iterations the program has gone through to be the first
number it finds on the header line.  It then reads the character
immediately after the number.  The `r' indicates that the weights will
be real numbers represented as character strings.  If the weights were
binary, the character would be a `b' rather than an `r'.  Also, if the
character is `b', the next character is read.  This next character
indicates how many bytes are used per value.  The integer versions, bp
and sbp write files with 2 bytes per weight, while the real versions,
rbp and srbp write files with 8 bytes per weight.  With this notation,
weight files written by one program can be read by the other.  A binary
weight format is specified within the `f' command by using "f wb".  A
real format is specified by using "f wr".  If your program specifies
that weights should be written in one format, but the weight file you
read from is different, a warning message will be printed.  There is no
check made to see if the number of weights on the file equals the number
of weights in the network.

   The above formats specify that only weights are written out and
this is all you need once the patterns have converged.  However, if
you're still training the network and want to break off training and
pick up the training from exactly the same point later, you need to save
the old weight changes when using momentum, and the parameters for the
delta-bar-delta method if you are using this technique.  To save these
extra parameters on the weights file, use "f wR" to write the extra
values as real and "f wB" to write the extra values as binary.

   In the above example, the S command was used to save the weights
immediately.  Another alternative is to save weights at regular
intervals.  The command, S 100, will automatically save the weights
every 100 iterations, that is, whenever the total number of iterations
mod 100 = 0.  The initial rate at which to save weights is set at
100,000, which generally means that no weights will ever be saved.

   Another use for saving weights has to do with trying to find the
proper parameters to quickly solve the problem.  Ordinarily, a high
rate of learning is desirable, but often too high a rate of learning
will increase the error, rather than decrease it.  In trying to find
the answer as quickly as possible, if the network seems to be
converging with the current parameters you can save the current weights
and increase the learning rate.  If this increased learning rate ruins
the convergence, then you can restore the weights you had before you
made this increase.


.ul
Initializing Weights and Giving the Network a `Kick'

   All the weights in the network initially start out at 0.  In
symmetric networks, no learning may then result because error signals
cancel themselves out.  Even in non-symmetric
networks, the training process will often converge faster if the weights
start out at small random values.  To do this, the `k' command will
take the network and alter the weights in the following ways.  Suppose
the command given is:

.ce
k 0 0.5

Now, if a weight is exactly 0, then the weight will be changed to a
random value between +0.5 and -0.5.  The above command can therefore be
used to initialize the weights in the network.  A more complex use of
the `k' command is to decrease the magnitude of large weights in the
network by a certain random amount.  For instance, in the following
command:

.ce
k 2 8

all the weights in the network that are greater than or equal to 2 will
be decreased by a random number between 0 and 8.  Weights
less than or equal to -2 will be increased by a random number
between 0 and 8.  The seed for the random number generator can be
changed using the `s' command, as in "s 7".  The integer parameter of
the `s' command is of type unsigned.
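
   In code, the effect of the `k' command might be sketched like this
(an illustration of the rules above, not the code in bp; the names
are made up):

```python
import random

def kick(weights, threshold, size):
    """Sketch of 'k threshold size': with threshold 0, each weight
    equal to 0 gets a random value between -size and +size; otherwise
    weights >= threshold are decreased by a random amount between 0
    and size, and weights <= -threshold are increased by such an
    amount."""
    result = []
    for w in weights:
        if threshold == 0:
            if w == 0:
                w = random.uniform(-size, size)
        elif w >= threshold:
            w -= random.uniform(0, size)
        elif w <= -threshold:
            w += random.uniform(0, size)
        result.append(w)
    return result

random.seed(7)                         # like the 's 7' command
print(kick([0.0, 0.0, 3.5], 0, 0.5))   # initializes the zero weights
```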

   Another method of giving a network a kick is to add hidden layer
units.  The command:

.ce
H 2 0.5

adds one unit to layer 2 of the network, and all the weights that are
created are initialized to values between -0.5 and +0.5.

   The subject of kicking a back-propagation network out of local minima
has barely been studied and there is no guarantee that the above methods
are very useful in general.

.ul
Setting the Algorithm to Use

   A number of different variations on the original back-propagation
algorithm have been proposed in order to speed up convergence.  Some
of these have been built into these simulators.  Some of the methods
can be mixed together.  The two most important choices are the
derivative term to use and the update method to use.  The default
derivative is the one devised by Fahlman:

.ce
0.1 + s(1-s)

where s is the activation value of the unit.  The reason for adding in
the 0.1 term to the correct formula for the derivative is that when s is
close to 0 or 1, the amount of error passed back is very small and so
learning is very slow.  Adding the 0.1 speeds up the learning process.
(For the original description of this method, see "Faster Learning
Variations of Back-Propagation:  An Empirical Study", by Scott E.
Fahlman, in \fIProceedings of the 1988 Connectionist Models Summer
School\fR, Morgan Kaufmann, 1989.)  Besides Fahlman's derivative and the
original one, the differential step size method (see "Stepsize Variation
Methods for Accelerating the Back-Propagation Algorithm", by Chen and
Mars, in \fIIJCNN-90-WASH-DC\fR, Lawrence Erlbaum, 1990) takes the
derivative to be 1 in the layer going into the output units and uses the
original derivative for all other layers.  The learning rate for the
inner layers is normally set to 1/10 the rate in the outer layer.  To
set the derivative, use the `A' command as in:

.na
.nf
.ne4
   A do   * use the original derivative
   A df   * use Fahlman's derivative
   A dd   * use the differential step size derivative
.ad
.fi
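
   The three derivative terms are easy to state in code.  This Python
sketch just restates the formulas (the function name is made up; in
the `dd' case the value 1 applies only to the layer feeding the output
units, with the original derivative used elsewhere):

```python
def derivative(s, kind):
    """Derivative term for a unit with activation value s, selected
    as with the 'A d' sub-commands."""
    if kind == "o":              # original: s(1 - s)
        return s * (1.0 - s)
    if kind == "f":              # Fahlman: 0.1 + s(1 - s)
        return 0.1 + s * (1.0 - s)
    if kind == "d":              # differential step size: 1 at the
        return 1.0               # layer going into the output units
    raise ValueError(kind)

# Near saturation the original derivative nearly kills the error
# signal, while Fahlman's keeps it above 0.1:
print(derivative(0.99, "o") < 0.02)   # prints True
print(derivative(0.99, "f") > 0.1)    # prints True
```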

   The algorithm command can contain other sub-commands besides the
setting of the derivative.  The other major choice is the update method.
The choices are the original one, the differential step size method,
Jacob's delta-bar-delta method, the continuous update method and the
continuous update method with the differential step size etas.  To set
these update methods use:

.na
.nf
.ne6
   A uo   * the original update method
   A ud   * the differential step size method
   A uj   * Jacob's delta-bar-delta method
   A uc   * the continuous update method
   A uC   * the continuous update method with the differential
          * step size etas
.ad
.fi

The differential step size method uses the standard eta when updates
are made to the units leading into the output layer.  For deeper layers,
another value will be used.  The default is to use an eta, called eta2,
for the inner layers that is one-tenth the standard eta.  These etas
both get set using the `e' command (not a sub-command of the `A'
command) as in:

.ce
e 0.5 0.1

The standard eta will be set to 0.5 and eta2 will be 0.1.  If eta2
had been omitted, it would have been set to 0.05.  Jacob's
delta-bar-delta method uses a number of special parameters and these
are set using the `j' command.  Jacob's update method can actually be
used with any of the three choices for derivatives and the algorithm
will find its own value of eta for each weight.  The differential
step size derivative is often very effective with Jacob's
delta-bar-delta method.

   There are five other `A' sub-commands.  First, the activation
function can be either the piece-wise linear function or the original
smooth activation function, but the smooth function is only available
with the programs that use real weights and arithmetic.  To set the
type of function, use:

.na
.nf
.ne2
   A ap   * for the piece-wise activation function
   A as   * for the smooth activation function
.ad
.fi

The piece-wise function can save quite a lot in execution time despite
the fact that it normally increases the number of iterations required
to solve a problem.

   Second, it has been reported that using a sharper sigmoid shaped
activation function will produce faster convergence (see "Speeding Up
Back Propagation" by Yoshio Izui and Alex Pentland in the Proceedings of
\fIIJCNN-90-WASH-DC\fR, Lawrence Erlbaum Associates, 1990).  If we let
the function be:

.na
.nf
                                1
                         ----------------,
                         1 + exp (-D * x)
.ad
.fi

increasing D will make the sigmoid sharper while decreasing D will
make it flatter.  To set this parameter, to say, 8, use:

.ce
A D 8  * sets the sharpness to 8

The default value is 1.  A larger D is also useful in the integer
version of back-propagation where the weights are limited to between
-32 and +31.999.  A larger D value in effect magnifies the weights and
makes it possible for the weights to stay smaller.  Values of D less
than 1 may be useful in extracting a network from a local minimum
(see "Handwritten Numeral Recognition by Multi-layered Neural Network
with Improved Learning Algorithm" by Yamada, Kami, Temma and Tsukumo in
Proceedings of the 1989 IJCNN, IEEE Press).  Also, when you have large
input values, values of D less than 1 can be used to scale down the
activations passed on to higher-level units.
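
   As a quick illustration of the effect of D (a sketch of the formula
above, not the program's activation code):

```python
import math

def act(x, D=1.0):
    """The sigmoid activation function 1 / (1 + exp(-D * x)).
    Larger D sharpens the curve; D < 1 flattens it."""
    return 1.0 / (1.0 + math.exp(-D * x))

# With D = 8 the unit saturates at a much smaller weighted sum,
# so smaller weights can do the same job:
print(round(act(1.0), 4))       # prints 0.7311
print(round(act(1.0, 8.0), 4))  # prints 0.9997
```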

   The third miscellaneous command is the `b' command to control
whether or not to backpropagate error for units that have learned
their response to within a given tolerance.  The default is to
always backpropagate error.  The advantage to not backpropagating
error is that this can save computer time and sometimes actually
decrease the number of iterations that are required to solve the
problem.  This parameter can be set like so:

.na
.nf
.ne2
   A b+   * always backpropagate error
   A b-   * don't backpropagate error when close
.ad
.fi

   The fourth `A' sub-command allows you to limit the weights
that the network produces to some restricted range.  This can be
important in the programs with 16-bit weights.  These programs limit
the weights to be from -32 to +31.999.  When a weight near +31.999 is
increased a little it can overflow and produce a negative value.  When
one or more weights overflow, the learning usually takes a dramatic
turn for the worse, or on rare occasions, it suddenly improves.  To
have the program check for weights above 30 or below -30, enter: "A l
30".  This also limits the
absolute values of the weights to be less than or equal to 30.  The
weights are checked after they have been updated and if a weight is
greater than this limit, it is set equal to this limit.  The first time
this happens, a warning message is produced.  With this method, it is
possible, in principle, for a large weight change to cause overflow
without being caught, but this is unlikely.  To stop the weight
checking, set the limit to 0.  The default is to not check.

   The final miscellaneous `A' sub-command is `s', for skip.  Setting
s = n will have the program skip, for n iterations, whole patterns
that have been learned to within the required tolerance.  For example,
to skip learned patterns for 5 iterations, use:  "A s 5".

.ul
Jacob's Delta-Bar-Delta Method and Parameters

   Jacob's delta-bar-delta method attempts to find a learning rate
eta, for each individual weight.  The parameters are the initial
value for the etas, the amount by which to increase an eta that seems
to be too small, the rate at which to decrease an eta that is apparently
too large, a maximum value for each eta and a parameter used in keeping
a running average of the slopes.  Here are examples of setting these
parameters:

.na
.nf
   j d 0.5    * sets the decay rate to 0.5
   j e 0.1    * sets the initial etas to 0.1
   j k 0.25   * sets the amount to increase etas by (kappa) to
              * 0.25
   j m 10     * sets the maximum eta to 10
   j t 0.7    * sets the history parameter, theta, to 0.7
.ad
.fi

These settings can all be placed on one line:

.ce
j d 0.5  e 0.1  k 0.25  m 10  t 0.7

The version implemented here does not use momentum.

   The idea behind the delta-bar-delta method is to let the program find
its own learning rate for each weight.  The `e' sub-command sets the
initial value for each of these learning rates.  When the program sees
that the slope of the error surface averages out to be in the same
direction for several iterations for a particular weight, the program
increases the eta value by an amount, kappa, given by the `k' parameter.
The network will then move down this slope faster.  When the program
finds that the slope changes sign, the assumption is that the program
has stepped over to the other side of the minimum and is nearing it
from the opposite side.  Therefore, it cuts down the learning
rate, by the decay factor, given by the `d' parameter.  For instance, a
d value of 0.5 cuts the learning rate for the weight in half.  The `m'
parameter specifies the maximum allowable value for an eta.  The `t'
parameter (theta) is used to compute a running average of the slope of
the weight and must be in the range 0 <= t < 1.  The running average at
iteration i, a\di\u , is defined as:

.ce
a\di\u = (1 - t) slope\di\u + t a\di-1\u,

so small values for t make the most recent slope more important than
the previous average of the slope.  Determining the learning rate for
back-propagation automatically is, of course, very desirable and this
method often speeds up convergence by quite a lot.  Unfortunately, bad
choices for the delta-bar-delta parameters give bad results and a lot of
experimentation may be necessary.  For more, see "Increased Rates of
Convergence" by Robert A. Jacobs, in \fINeural Networks\fR, Volume 1,
Number 4, 1988.
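
   A sketch of the per-weight bookkeeping may make the method clearer.
This follows the description above (the names are made up; it is not
the code in bp, and the weight update itself is omitted):

```python
def dbd_step(eta, avg, slope, kappa=0.25, decay=0.5,
             max_eta=10.0, theta=0.7):
    """One delta-bar-delta step for a single weight: grow eta by
    kappa while the slope agrees in sign with its running average,
    cut eta by the decay factor when the sign flips, then update
    the running average a_i = (1 - theta) slope_i + theta a_{i-1}."""
    if avg * slope > 0:          # same direction: speed up
        eta = min(eta + kappa, max_eta)
    elif avg * slope < 0:        # sign change: overshot, slow down
        eta *= decay
    avg = (1.0 - theta) * slope + theta * avg
    return eta, avg

eta, avg = 0.1, 0.0
eta, avg = dbd_step(eta, avg, 0.2)   # first slope seen
eta, avg = dbd_step(eta, avg, 0.3)   # same sign: eta grows by kappa
eta, avg = dbd_step(eta, avg, -0.1)  # sign flips: eta is halved
```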

.ul
Recurrent Networks

   Recurrent back-propagation networks take values from higher level
units and use them as activation values of lower level units.  This
gives a network a simple kind of short-term memory, possibly a little
like human short-term memory.  For instance, suppose you want a network
to memorize the two short sequences, "acb" and "bcd".  In the middle of
both of these sequences is the letter, "c".  In the first case you
want a network to take in "a" and output "c".  Then take in "c" and
output "b".  In the second case you want a network to take in "b" and
output "c".  Then take in "c" and output "d".  To do this, a network
needs a simple memory of what came before the "c".

   Let the network be a 7-3-4 network, where input units 1-4 and output
units 1-4 stand for the letters a-d.  Furthermore, let there be 3 hidden
layer units.  The hidden units will feed their values back down to the
input units 5-7, where they become input for the next step.  To see why
this works, suppose the patterns have been learned by the network.
Inputting the "a" from the first string produces some random pattern of
activation on the hidden layer units and "c" on the output units.  The
pattern from the hidden units is copied down to the input layer.
Second, the letter, "c" is presented to the network together with the
random pattern, now on units 5-7.
However, if the "b" from the second string is presented first, there
will be a different random pattern on the hidden layer units.  These
values are copied to units 5-7.  These values
combine with the "c" to produce another random pattern.  This random
pattern will be different from the pattern the first string produced.
This difference can be used by the network to make the response for the
first string, "b" and the response for the second string, "d".
The training patterns for the network can be:

.na
.nf
     1000 000   0010  * "a" prompts the output, "c"
     0010 hhh   0100  * inputting "c" should produce "b"

     0100 000   0010  * "b" prompts the output, "c"
     0010 hhh   0001  * inputting "c" should produce "d"
.ad
.fi

where the first four values on each line are the normal input, the
middle three either start out all zeros or take their values from the
previous values of the hidden units.  The code for taking these values
from the hidden layer units is "h".  The last set of values represents
the output that should be produced.  To take values from the third layer
of a network, the code is "i".  For the fourth and fifth layers (if they
exist) the codes are "j" and "k".  Training recurrent networks can take
much longer than training standard networks.
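
   The handling of the `h' code amounts to a simple substitution when
a pattern is placed on the input units.  A sketch (the function name
is made up; this is not the code in bp):

```python
def build_input(pattern, hidden):
    """Expand a compressed pattern such as '0010hhh' into input
    values, taking each 'h' from the next hidden-layer unit in turn;
    '?' maps to its value, taken as the default 0.5 here."""
    values, h = [], iter(hidden)
    for ch in pattern:
        if ch == 'h':
            values.append(next(h))
        elif ch == '?':
            values.append(0.5)
        else:
            values.append(float(ch))
    return values

# After "b" left hidden values (0.2, 0.9, 0.4), presenting "c" gives:
print(build_input('0010hhh', [0.2, 0.9, 0.4]))
# prints [0.0, 0.0, 1.0, 0.0, 0.2, 0.9, 0.4]
```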

.ul
Miscellaneous Commands

   Below is a list of some miscellaneous commands, a short example of
each and a short description of the command.

.IP "   ?   ?       " 15
A `?' will print program status information.

.IP "   !   !cmd    " 15
Anything after `!' will be passed on to UNIX as a command to execute.

.IP "   C           " 15
The C command will clear the network of values, reset the number of
iterations, set the seed to 0 and reset other values so that another
run can be made with a new seed value.

.IP "   E   E 1     " 15
Entering "E 1" will echo all the input.  "E 0" will stop
echoing command input.  The default is to not echo input, since it
appears on the screen automatically.  Echoing input is useful when
commands are taken from a file of commands, using the `i' command
described below.  It can also be useful when reading commands from
a file when there is some kind of error within the file.

.IP "   i   i f     " 15
Entering "i f" will read commands from the file, f.  When there are
no more commands on a file, the program starts reading from the
keyboard.  (It's very handy to have a set of fixed commands in a file
to, in effect, create a new command.)

.IP "   l   l 2     " 15
Entering "l 2" will print the values of the units on layer 2,
or whatever layer is specified.

.IP "   T   T -3   " 15
In sbp and srbp only, "T -3" sets all the threshold weights
to -3 or whatever value is specified and freezes them at this value.

.IP "   W   W 0.9   " 15
Entering "W 0.9" will remove (whittle away) all the weights with
absolute values less than 0.9.
.in-15

In addition, when a user-generated interrupt occurs (by typing DEL)
the program will drop its current task and take the next command.

.ul
Limitations

   Weights in the bp and sbp programs are 16-bit integer weights, where
the real value of the weight has been multiplied by 1024.  The integer
versions cannot handle weights less than -32 or greater than 31.999.
Weights are only checked if the Algorithm parameter, l, is set to a
value greater than 0.  Large learning rates with the differential step
size derivative and using the continuous update method can produce
overflow.  There are other places in these programs where calculations
can possibly overflow as well, and none of these places is checked.
Overflow in these other places seems highly unlikely, however.  Input
values for the integer versions can run from -31.994 to
31.999.  Due to the method used to implement recurrent connections,
input values in the real version are limited to -31994.0 and above.
