Design notes for FPLIB Release 2.0
==================================
NUMBER REPRESENTATION
---------------------
The floating point format used is Motorola's "Fast Floating Point":
-----------------------------------------
| 24 bits mantissa |S| 7 bits exp |
-----------------------------------------
The mantissa has an implicit binary point just off to the left. The
exponent is binary, stored excess-64. In other words, the actual
mathematical exponent has 64 added, and the resulting 7-bit number is
unsigned.
Negative numbers are stored as sign-and-magnitude.
All floating-point operations expect normalized numbers, and generate
normalized numbers if the inputs are normalized. A number is
normalized if the most significant bit is 1 (except for 0.0, which is
stored as 32 bits of zero, never negative). Also, a nonzero mantissa
with a biased exponent of 0 is not normalized, letting us test the low
byte or whole long as a test for 0.0. This all means that the most
significant bit is actually redundant, and could be left off, giving
one extra bit of precision. Oh well.
Some of these routines make use of the assumption that all arguments
are normalized, and may generate slightly bogus results if that isn't
true. Release 1.0 assumed that a biased exponent of zero could have
a nonzero mantissa; that has been fixed.
Some examples: 1.0 is hex 80000041 , or 0.5 * 2 ** 1 (binary 0.1,
exponent 1 + bias of 64). -0.75 is hex C00000C0 (note the sign bit is
inverted, but no other bit is).
ACCURACY AND LIMITS
-------------------
The accuracy is in the 24-bit mantissa, which gives between 7.22 and
7.52 decimal digits of relative error. Decimal constants are given to
8 digits, and the compiler is trusted to convert them right (it
generally does a reasonable job).
The range is in the exponent. The smallest nonzero and largest
positive normalized numbers are ~5.4e-20 and ~9.2e18.
Rounding of ambiguous numbers uses round-to-even (see D. E. Knuth's
Seminumerical Algorithms for a discussion on this). This means that
if the trailing portion is exactly one half of the least significant
bit, the lsb is forced to be the nearer even number. We often keep
several insignificant bits to be sure this is done correctly.
CODING TRICKS
-------------
It's quite normal in a library of this type to resort to bit-twiddling
in the internal representation. To help this, FPLIB.H defines a C
union, thus:
union {
unsigned long ul;
float f;
char sc[4];
};
The number can be looked at as a floating or integer long (for handling
bits in the mantissa), the sign is the sign of sc[3], and the exponent
is the rest of sc[3]. Other shortcuts are commented in the inividual
routines.
MATHEMATICAL INFORMATION
------------------------
Well, this is mostly what I remember from my undergraduate days. Many
of these functions have a precise mathematical formula or algorithm
that will give the result, but is too slow for practical use. Instead,
we look for an approximation that is quick to compute, and gives us
just enough precision and no more. After all, why compute to 100
digits accuracy when you can only represent less than 8? Some of these
approximations look very strange, but they all have sound mathematical
bases. The name Chebyshev springs to mind.
Dedicated mathematicians have struggled for decades to bring you these
formulae and, just as important, a predictable accuracy of better than
7.5 decimal digits. Because of this, we can execute a known number of
operations instead of loop-and-test (as you would in, for example, a
simple Newton's iteration for square root).
The approximations are generally polynomials, usually avoid division, and
are valid only over a small range of the input parameter. We use simple
mathematical identities to extend the range. For example, the cyclic
nature of the trigonometric functions means we only need calculate in the
range 0..pi/2. For another example, we can do much of the square root
calculation by the simple expedient of dividing the binary exponent by 2.
RANDOM LIBRARY INFORMATION
--------------------------
Release 1.0 of Sozobon C had a few problems in the floating support:
- Two negative floating numbers compared wrong (returned > when it
should be <). I corrected that. However, the other routines don't
assume the correction has been installed in the library (which it
wasn't, at first...).
- The library was not optimally ordered, and had to be scanned twice.
I straightened that out.
- Add and subtract were truncating. Rounding is generally a better idea.
atof() calls sscanf(). Ick.
-------------------------------------------------------------------------------
These routines may be used without restriction, but I retain the copyright
on them, and the Only True Upgrades will be those I make. If you have any
improvements, please send them to me.
March 1989
David W. Brooks
36 Elm Street
Medfield, MA 02052
BROOKS@CSSS-A.PRIME.COM
{uunet,mit-eddie}!csss-a.prime.com!brooks