
        Documentation for ShowReal                             Page 1




        This program and text file are designed to show how a typical
        floating point format is used by a high-level language.  The
        intent is to illustrate the principles, not to suggest that
        this is _the_ format which is somehow superior to all others.
        The details vary, but once you understand how one format is
        used, it is relatively easy to transfer that knowledge to
        another format.  A pencil and paper analysis is difficult at
        best.  There are three number bases to contend with: binary,
        hexadecimal for its brevity, and decimal since that is the
        only format most of us are comfortable with for computation
        of anything except two or three digit numbers.  Also, there
        are two signs to contend with and one of them is expressed in
        an obscure fashion.  And we are dealing with binary fractions,
        not a widely used form of expression.

        The native form of number representation for a computer is a
        fixed point integer: a series of bits with an implied radix
        point (see end note 3) at the right end of the number.
        Numbers containing a decimal point can be handled by such
        methods, but with extreme effort on the part of the
        programmer.


        Useful high level languages always support computation with
        floating point numbers.  Within the language syntax this is
        usually referred to as 'real' or 'float' or something of that
        sort.  (Real means any rational or irrational number in
        addition to its special meaning within complex numbers.)
        They permit the processing of both quite large and quite
        small numbers with little effort on the part of the
        programmer.  A common range of values is 10^-38 to 10^+38
        (see end note 1). On personal computers of the ST class, the
        details are handled by software subroutines.  On personal
        computers with a math co-processor, the co-processor chip
        implements hardware routines for speed.  Large scale
        mainframes have floating-point routines as part of their
        intrinsic repertoire.  They often have multiple formats
        supporting single precision, double precision, and even
        extended precision.  The memory space used for a single
        variable ranges from as little as 32 bits on the IBM
        System/360 and its progeny to over 100 bits.

        The formats used are, in general, ad hoc, and tapes and disks
        using floating point cannot be interchanged between systems
        at will.  There are standards but they aren't in general use
        in the way that, say, ASCII is for the personal computer
        character set.  The format studied here is the one used by
        Personal Pascal (PP).


        Personal Pascal uses six bytes to store one real variable.
        The first five bytes are the mantissa (see end note 2), and
        the sixth is the exponent.  If we let m stand for mantissa
        and e for exponent, a number is expressed as m * 2^e.  Since
        a left shift of a number is equivalent to multiplying by 2
        and a right shift is the same as dividing by two, the scaling
        of a number can be changed quite rapidly.  The mantissas are
        _normalized_.  That means that the leftmost bit is always 1.
        Real numbers always have at least one 1 bit.  How is that
        possible? By fiat, a zero is _defined_ as being represented
        by all zero bits.  But this is not a part of the mathematics.
        The smallest positive number is, say, +1 * 10^-38; the
        negative number closest to zero is -1 * 10^-38.  And that
        hole is _defined_ to contain zero.  So dividing by all zero
        bits should yield an error, but dividing by 1 * 10^-38 is
        perfectly valid.  Special tests have to be made by the
        hardware or software to detect a variable with an all zeros
        _exponent_ (in the case of PP) and give such numbers special
        handling.

        So real numbers always have at least one 1 bit. And before
        putting a number back in RAM storage the leftmost one bit is
        shifted to the left boundary of the mantissa, a process
        called normalizing.  Since it is known, a priori, that this will
        happen, PP replaces the leftmost bit with the sign bit for
        the composite number.  This sounds a little like black magic
        and you may have to think about that for a while.

        The exponent (sometimes called the characteristic and
        sometimes the scale factor) indicates how many places the
        mantissa must be shifted to put the radix point 'where it
        belongs'.  Say we have an eight-bit mantissa: .10100010.  An
        exponent of 3 would mean that the number being represented is
        actually 101.00010.  That is 4 + 1 + .0625 = 5.0625.  A
        negative exponent means that leading zeros should be added.
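
        The shift described above can be checked with a few lines of
        Python.  This is only a sketch; the name scaled_value is made
        up for the illustration and is not part of ShowReal.

```python
from fractions import Fraction


def scaled_value(mantissa_bits: str, exponent: int) -> Fraction:
    """Value of the binary fraction .mantissa_bits multiplied by 2**exponent."""
    # The bits sit to the right of the radix point, so divide by 2**(bit count).
    fraction = Fraction(int(mantissa_bits, 2), 2 ** len(mantissa_bits))
    # A positive exponent moves the radix point right, a negative one left.
    return fraction * Fraction(2) ** exponent


print(float(scaled_value("10100010", 3)))   # 5.0625
print(float(scaled_value("10100010", 0)))   # 0.6328125
print(float(scaled_value("10100010", -2)))  # 0.158203125
```

        Exact fractions are used so that the demonstration itself
        adds no rounding of its own.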

        The only numbers, x, that can be represented without a shift
        in the radix point are:
                            0.5 <= x < 1.0
        These are the binary fractions representing
           2^-n: .5, .250, .125, ... and the sums of those numbers.
        Note that the sum can never reach 1.0.

        In order to add or subtract, the two numbers must be scaled
        so that their radix points are aligned.  The larger
        (absolute) number dominates the process.  This is a possible
        source of problems and we will return to this.  For
        multiplication or division, the numbers can be operated on
        immediately and the new exponent computed after the fact.
        Then the new number is normalized before it is saved in
        RAM.

        Exponents are referred to as having a bias added; in the case
        of PP this is $80 or 128 decimal.  An exponent of 128 then
        indicates no shifting is required.  The radix point is
        already in the proper place and the leftmost bit has a value
        of 2^-1, representing .5.  So the sign of the exponent is
        embedded in the number, more or less.  Exponents greater than
        128 indicate a right shift of the radix point and exponents
        less than 128 indicate a left shift (leading zeros must be
        added, at least conceptually).
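
        Putting the pieces together, a decoder for the six bytes
        might look like the Python sketch below.  It assumes exactly
        the layout described in this text (five mantissa bytes with
        the sign in the leftmost bit, then one exponent byte biased
        by 128); the name decode_pp_real is hypothetical, and the
        sample byte patterns were worked out by hand from that
        description, not captured from Personal Pascal.

```python
from fractions import Fraction

BIAS = 128  # $80, the exponent bias given in the text


def decode_pp_real(raw: bytes) -> Fraction:
    """Decode six bytes: five mantissa bytes followed by one exponent byte."""
    if len(raw) != 6:
        raise ValueError("expected exactly six bytes")
    exponent = raw[5]
    if exponent == 0:                    # an all-zero exponent is defined as zero
        return Fraction(0)
    mantissa = int.from_bytes(raw[:5], "big")
    negative = bool(mantissa >> 39)      # the leftmost bit holds the sign
    mantissa |= 1 << 39                  # put back the implied leading 1 bit
    value = Fraction(mantissa, 1 << 40) * Fraction(2) ** (exponent - BIAS)
    return -value if negative else value


print(decode_pp_real(bytes.fromhex("000000000081")))  # 1
print(decode_pp_real(bytes.fromhex("260000000085")))  # 83/4, that is 20.75
print(decode_pp_real(bytes.fromhex("800000000080")))  # -1/2
```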

        When you run the program you will be asked to provide a
        number.  The numbers -95 through -99 are reserved for special
        purposes by the program and we will get back to them later.
        You can type in any other number that satisfies the PP syntax
        and error checks.  The program will 'explode' the number into
        its constituent parts.  The top line (the top after the
        permanent display) of the screen shows the result of Pascal's
        input/output cycle of conversion.  The number was read from
        the keyboard, converted to real format, stored, retrieved,
        converted to ASCII, and displayed on the monitor.  Remember
        that this is a 'live' demonstration; nothing is dummied up or
        simulated.  Normal Pascal error checking is activated.
        Ignoring blank lines and the constant part of the display,
        the lines have the following content.

             o Line 1  Pascal output,
                       raw unaltered content of variable (hex)
             o Line 2  Mantissa on left, exponent on right
             o Line 3  Sign removed from mantissa, replaced by 1 bit
                       Exponent bias subtracted.  Result shown in
                       both hexadecimal and decimal
             o Line 4  Mantissa expressed in binary with a properly
                       placed radix point.  Leading zeros added for
                       small numbers; trailing zeros for large numbers.

        I suggest trying these numbers first: 1, 2, 4, 20, .5, .25,
        .125, .75, 20.75.  That should give you some good insight into
        what is going on.  For example, 20.75 is represented by
        2^4 + 2^2 + 2^-1 + 2^-2 or 10100.11.
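
        If you want to check a binary expansion without the program
        at hand, the usual multiply-by-two method can be sketched in
        Python as follows.  The function name to_binary and the
        40-bit cutoff are assumptions made for the illustration.

```python
from fractions import Fraction


def to_binary(text: str, max_fraction_bits: int = 40) -> str:
    """Binary expansion of a non-negative decimal string, cut off after max_fraction_bits."""
    whole, frac = divmod(Fraction(text), 1)
    digits = []
    # Doubling the fraction pushes the next binary digit past the radix point.
    while frac and len(digits) < max_fraction_bits:
        bit, frac = divmod(frac * 2, 1)
        digits.append(str(bit))
    return bin(whole)[2:] + ("." + "".join(digits) if digits else "")


print(to_binary("20.75"))    # 10100.11
print(to_binary("0.0625"))   # 0.0001
print(to_binary("0.0626"))   # runs on until the cutoff is reached
```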

        Now try .0625 and .0626. The number .0625 is represented
        exactly as 2^-4.  When you change to a slightly larger figure
        such as .0626 you see a splattering of several low order
        bits.  The sum of all those bits is just the additional
        .0001.

        Then try 1e38, 1e-38 and 0.  You can try as many numbers as
        you like; typing -99 ends the session.  I have spent hours
        trying to compute these results manually for different
        systems, a very unrewarding effort.  You need the mind of a
        bookkeeper to get the right answer.

                                 An Anomaly

        The biggest number that can be expressed would have all 1
        bits in the mantissa and a scale factor of 2^127.  The mantissa
        contains the sequence .5 + .25 + .125 + ... adding up to something
        with a lot of nines, say .99999.  So the largest number is
        .99999 * 2^127 ~ 1.70141e38.  Try typing that.  It works fine.
        Now try typing 1.70142e38 and look at the _Pascal_ conversion on
        the screen.  The wrong number.  If you input 1.9e38, an error
        will be detected.  There is a small hole in the error
        detecting logic.  This is not meant to be critical of
        Personal Pascal; I wouldn't be at all surprised to find
        similar - or worse - errors on many, if not most, programming
        packages.   The language has nothing to do with it of course.
        On any machine with an 8-bit exponent, a very common value,
        treat any number greater than 1.0e38 or smaller than 1.0e-38
        with extreme caution!

                                  Precision

        After running the program a few minutes you should have a
        good grasp of what is going on.  Note that if a small number
        is added to a large number the large number may be modified
        by only a few of the most significant bits from the small
        number, or not at all!  The same thing is true for
        subtraction.  The right answer could be _computed_ but there is
        no way to store the result with this format.  And a larger
        format would mean not only more memory is used, the program
        would also run slower.  The format used as an example is a
        good compromise between speed, cost and usability of the
        results.  But the programmer should be on guard for this
        phenomenon of truncation error.  Given a choice, the program
        should combine small numbers with each other and build up to
        the larger numbers.  A programmer is often aware of the
        relative size of the numbers in a general way and can plan
        ahead.  If he isn't aware, he has a comparison capability in
        his language!
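
        The effect is easy to reproduce.  The Python sketch below
        keeps 40 significant bits after every addition, which is my
        assumption about the mantissa width; it truncates rather than
        rounds, and I do not know which of the two Personal Pascal
        does.  The names quantize, big and small are made up for the
        example.

```python
from fractions import Fraction


def quantize(value: Fraction, bits: int = 40) -> Fraction:
    """Truncate a non-negative value to `bits` significant binary digits."""
    if value == 0:
        return value
    # Find e such that 2**(e-1) <= value < 2**e.
    e = value.numerator.bit_length() - value.denominator.bit_length()
    while value >= Fraction(2) ** e:
        e += 1
    while value < Fraction(2) ** (e - 1):
        e -= 1
    scale = Fraction(2) ** (bits - e)
    return Fraction(int(value * scale)) / scale   # int() drops the low bits


big, small = Fraction(2) ** 40, Fraction(1)

# Adding the small numbers to the large one, one at a time: each is lost.
total = big
for _ in range(4):
    total = quantize(total + small)

# Combining the small numbers first, then adding the result once.
small_sum = Fraction(0)
for _ in range(4):
    small_sum = quantize(small_sum + small)

print(total - big)                        # 0
print(quantize(big + small_sum) - big)    # 4
```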

        The format illustrated here has 39 bits for mantissa data.
        Remember that one of the 40 bits is a sign bit.  (Counting
        the implied leading 1 bit, there are 40 significant bits.)
        The designers say that it provides 11 decimal digits of
        precision, which strikes me as a conservative claim.  By way
        of comparison, ANSI BASIC only requires six digits of
        precision, hardly enough for serious work.

        Keep in mind that any floating point scheme, no matter how
        large the mantissa, cannot represent most numbers exactly.
        The number will be represented by a sum of binary fractions.
        Thus, .5, .25 and their sum .75 can be represented exactly.
        But .756 cannot be represented exactly in any finite number
        of bits, because 756/1000 does not reduce to a fraction whose
        denominator is a power of two.
        Using the PP format I wrote a test program to add .1 to
        itself 10,000 times.  The sum was 999.99999852 instead of
        1000.  Doing the same thing with .3 yields 300.00000045; the
        number was rounded up on the input conversion so the sum is
        too large.  Neither .1 nor .3 was represented exactly in 39
        bits!  The moral is that you may want to think twice before
        you write, say, a payroll program using floating point
        numbers.  People tend to get very uptight about discrepancies
        in amounts of money.
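
        Whether a decimal number can be stored exactly is easy to
        test: reduce it to lowest terms and see whether the
        denominator is a power of two.  A small Python sketch, with a
        function name of my own choosing:

```python
from fractions import Fraction


def is_exact_in_binary(text: str) -> bool:
    """True when the decimal string has a finite binary expansion."""
    denominator = Fraction(text).denominator   # Fraction reduces to lowest terms
    # A power of two has exactly one 1 bit, so d & (d - 1) is zero.
    return denominator & (denominator - 1) == 0


for sample in ("0.5", "0.25", "0.75", "0.0625", "0.1", "0.3", "0.756"):
    print(sample, is_exact_in_binary(sample))
```

        Only the first four samples pass the test; no mantissa, however
        long, holds .1, .3 or .756 exactly.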

                              Range of Numbers

        The program has provisions built into it to generate four
        special cases.  Entering the following codes will cause the
        program to generate a number satisfying the situation on the
        same line as the code.

             Code           Number Generated
              -99           None, exit the program
              -98           Smallest positive number
              -97           Biggest positive number
              -96           Smallest negative number,
                            i.e., about -1e-38
              -95           Most negative number,
                            i.e., about -1e+38

        I didn't try to enter these directly as a Pascal variable;
        but I would expect trouble on some of them.  So the program
        creates the binary number of interest and then has Pascal
        show the result after output conversion.

        The biggest positive number is probably the easiest case to
        understand.  The mantissa must be all 1s so that the sum of
        those binary digits is as large as possible.  The exponent
        should be as large as possible; it's contained in an 8-bit
        byte so that number is 255 or all 1 bits too.  The number is
        positive so the sign bit must be 0.  We end up with, using
        hexadecimal notation: $7FFFFFFFFF FF

        The smallest positive number will have a single 1 bit at the left
        end of the mantissa.  But it is replaced by a sign bit.  Since
        this is a positive number the sign bit is 0.  So the mantissa
        is all 0 bits!  Now do you see why I wrote this program??  The
        exponent must be as small as possible.  But 0 is reserved to
        define the _composite number_ whose value is 0, not a
        floating point number at all.  So the smallest exponent left
        for us is 1.  So the number is $0000000000 01.

        The extreme numbers, after conversion by Pascal are:
         -1.7014118346E+38     -2.9387358771E-39   (zero)     ...
                  ...   +2.9387358771E-39      1.7014118346E+38  .
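
        These figures can be checked from the format itself.  The
        Python sketch below assumes a 40-bit mantissa (with the
        implied leading 1 restored) and a bias of 128, as described
        earlier; the variable names are mine.

```python
from fractions import Fraction

BIAS = 128

# All 1 bits in a 40-bit mantissa, exponent byte 255.
largest = (1 - Fraction(1, 2 ** 40)) * Fraction(2) ** (255 - BIAS)

# Only the implied leading 1 bit (value .5), exponent byte 1.
smallest = Fraction(1, 2) * Fraction(2) ** (1 - BIAS)

print(f"{float(largest):.10E}")    # 1.7014118346E+38
print(f"{float(smallest):.10E}")   # 2.9387358771E-39
```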

        This range is not really large enough for some work,
        especially in the field of probability, since a number as
        'small' as 34! causes an overflow.  You can, of course,
        resort to logarithms to handle numbers with a large range.
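
        Both points can be illustrated with a short Python sketch.
        MAX_REAL is the largest value derived from the format as
        described here, and the function name is hypothetical;
        math.lgamma(n + 1) gives the natural logarithm of n!.

```python
import math

MAX_REAL = (1 - 2.0 ** -40) * 2.0 ** 127   # about 1.70141e38


def largest_factorial_that_fits(limit: float) -> int:
    """Largest n for which n! does not exceed limit."""
    n = 0
    while math.factorial(n + 1) <= limit:
        n += 1
    return n


print(largest_factorial_that_fits(MAX_REAL))   # 33

# Working with logarithms keeps the magnitude small: ln(34!) is under 100.
log_34_factorial = math.lgamma(35)
print(log_34_factorial / math.log(10))         # about 38.47, so 34! ~ 10^38.47
```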

        Note that a scientific calculator offers a range of -10^99
        to +10^99 and represents decimal fractions such as .1
        exactly.  It doesn't use binary floating point; rather it
        expresses numbers in binary coded decimal (BCD) form.  Such
        routines could be provided in a computer (after all the
        calculator _is_ a computer) but they would be very slow
        compared to what can be done with floating point.  On a
        calculator, the human being is so slow that any time the
        calculator takes is hardly noticed.

        All this talk about range and precision is not meant to leave
        you with a negative feeling about floating point numbers.
        They are essential and perfectly usable for an extremely
        large range of problems.  But the programmer must be aware of
        what the limitations are and how some of them can be
        minimized.  And that is, partly at least, the subject of this
        document.

                                  A Warning

        The program will crash if you enter a number that is too
        large or, perhaps, too small.  If that happens your system
        will be left in 'automatic overflow on' mode, which was used
        by the program to produce the screen wrap for large binary
        mantissas.  Simply run the program one more time and do a
        clean exit and your system will be returned to a normal
        state.

                                   A Wish

        The lack of a set of floating point routines makes it doubly
        difficult to do assembly language programming in the field of
        mathematics. To make a set of routines from scratch would be
        a massive undertaking.  You would need, as a minimum, input
        conversion, add, subtract, multiply, divide, sin(x), cos(x),
        arctan(x), ln(x), exp(x) and output conversion.  And the only
        trivial item in the list is add, which is a derivative of
        subtract.  Presumably some people are 'borrowing' routines
        from a high level language (after all they are in assembly
        language or C) but this would be a violation of the copyright
        on the language.  Anyway, I would love to see a set of
        floating point routines available for use with assembly
        language.

                               And Finally ...

        The Pascal source is included.  It illustrates how a Pascal
        program, when it wishes, can easily circumvent the type
        checking made by the compiler.  Note the type declaration of the
        record T723.  It uses a 'free type-union' which makes it
        impossible for the compiler to check for type consistency.
        In contrast with this usage, the normal variant record uses a
        'discriminated type-union'.

        This archived file may be freely reproduced and propagated
        providing that the content is complete and unaltered.

                    Placed in public domain in July 1991,
                             by Merlin L. Hanson

                              *** End Notes ***

        1. Why a strange number like 10^38?  The exponent is
        represented by an 8-bit byte which allows 256 combinations.
        These combinations are allocated into 127 large exponents and
        127 small exponents, plus one code for no shift and one
        reserved for zero: 127 + 127 + 2 = 256, the number of
        combinations represented by an 8-bit byte.  And 2^127 = 10^38.
        (Obviously using the '=' sign rather loosely.)

        2. I am as unhappy as you are with calling these bits the
        mantissa.  There is no involvement with logarithms here and I
        can find no definition of mantissa which would justify the
        way the word is used here.  And of course 'characteristic'
        also comes from the field of logarithms.

        3. Radix point is the generic term.  A decimal point is the
        radix point for decimal numbers.
