Path: witch!uupsi!psinntp!uunet!noc.near.net!howland.reston.ans.net!darwin.sura.net!emory!nntp.msstate.edu!olivea!charnel!rat!decwrl!usenet.coe.montana.edu!netnews.nwnet.net!news.u.washington.edu!glia!jfoy
From: jfoy@glia.biostr.washington.edu (Jeff Foy)
Newsgroups: alt.lang.basic
Subject: BASIC FAQ (as requested)
Date: 11 May 93 01:34:52 GMT
Organization: University of Washington
Lines: 939
Message-ID: <jfoy.737084092@glia>
NNTP-Posting-Host: glia.biostr.washington.edu

REPLY-TO: jfoy@glia.biostr.washington.edu




                                QUIK_BAS FAQ2.1

        ****************************************************************
        *                                                              *
        *                    "Ask Doctor Jackson"                      *
        *     The QUIK_BAS List of Frequently Asked Questions with     *
        *             Some Simple Public Domain Solutions              *
        *                                                              *
        ****************************************************************

                                Written by
                            Quinn Tyler Jackson
                      The QUIK_BAS "Keeper of the FAQ"
                                with source
                               samples  from
                             "Various Sources"

TABLE OF CONTENTS:

        q1.0    The BASICS of BASIC
                s1.0    QUIKSORT.BAS    -- recursive quicksort SUB

        q2.0    Commonly Requested Routines
                s2.0    HUTHSORT.BAS    -- iterative quicksort SUB
                s3.0    BISEARCH.BAS    -- binary search FUNCTION

        q3.0    Advanced Topics         -- "Hashing in QuickBASIC"
                t1.0    Hashing Collision Table
                s4.0    FSTPRIME.BAS    -- generates 4K+3 prime number
                t2.0    List Management System Ratings
                s5.0    WORDHASH.BAS    -- word distribution counter

        q4.0    Structured BASIC Techniques


NOTE:   Neither Quinn Tyler Jackson nor his company, JackMack Consulting
        & Development accepts responsibility for the soundness of the
        following advice.  Certain pieces of other people's public domain
        code have been spliced into this FAQ for purely demonstrative
        purposes, and it is by no means to be assumed that Quinn Tyler
        Jackson is claiming ownership of such pieces of source code.
        All source remains the property of those who originally wrote it,
        as understood by Canadian, American, and International Treaty.

        The text portion of this file itself is hereby released into the
        "Public Domain" for the purposes of education and enlightenment.


Q1.0    The BASICS of BASIC:

Q1.4    Okay, Quinn, I've figured out FUNCTIONs and SUBs, and have even
        started using them with some kind of skill.  Now, thing is, I
        come up to this thing called 'recursion.'  What's this all about,
        and can you show me some practical application of it?

A1.4    There is an old joke about the cryptic nature of dictionaries that
        goes something like this:

        re'CUR'sion (noun) 1. see recursion

        Actually, that's a pretty sad joke.  One computer scientist's
        definition states:

        "... a recursive algorithm is one that contains a copy of itself
        within one of its instructions.  Thus, a recursive algorithm is
        reminiscent of a set of mirrors in which you can see yourself
        looking at yourself looking at yourself."  [J. Glenn Brookshear]

        Recursion is a powerful programming tool, and any comprehensive
        programming language allows it.  QuickBASIC and its dialects are
        no exception.  A simple example of recursion:

        SUB recurse
        recurse
        END SUB

        This thing will go in circles until the stack is full, crashing
        the program should it ever be called.  It illustrates two of the
        main pitfalls of recursion:

                1. recursion in QuickBASIC eats the stack for breakfast
                2. there must be a terminating condition to exit the loop

        Since each call to a SUB or FUNCTION does some pushing to the
        stack, it must always be remembered that recursive routines will
        require a bit of the stack for every instance they are called.
        It is sometimes hard to know in advance how many times a recursive
        routine will end up calling itself, and therefore, one cannot
        know with any accuracy how much a given recursive routine will
        decide to rob from the stack.  Be warned!

        This also leads to the next issue: there must ALWAYS be a
        terminating condition to exit the loop.  Sometimes it is easy to
        overlook this point.  Consider the above simple example.  It
        never stops calling itself, does it?  Were a theoretical computer
        to exist with an infinitely large stack that could never be
        consumed by even the deepest level of recursion, what would
        happen if that routine went off into a corner and kept calling
        itself?  It would result in a permanent time out: a hang, which
        is just a crash in slow motion.  (The moral of this?  A bug on
        an i486 system is still a bug, just a bug that happens sooner.)

        An example of a terminating condition added to the above code:

        SUB recurse(n%)
        n% = n% + 1
        IF n% < 10 THEN
                recurse n%
        END IF
        END SUB

        This SUB will call itself only until n% reaches ten, at which
        point it hits its terminating condition and the nested calls
        unwind.  This is a simple example, I admit, but NEVER forget to
        include a terminating condition in your recursive routines, or
        you will pay for it with a crash.

        Now that we have that out of the way, let's kill two birds with one
        stone.  (It could be argued, in fact that the act of killing two
        birds with only one stone probably involves recursion somewhere in
        the solution.)  Everyone wants to know a good QuickSort algorithm,
        and most implementations of that use recursion.  So, a modified
        version of the QuickSort SUB from Microsoft, one that sorts
        an array passed to it:

S1.0    QUIKSORT.BAS [F210S01.BAS]

DEFINT A-Z
SUB QuickSortSTR (Array() AS STRING, Low, High)
'            /^\              /^\
'             |                |
'    Change these to any BASIC data type for this routine to
'    handle other types of data arrays other than strings.
'
'============================== QuickSortXXX ================================
'  QuickSortXXX works by picking a random "pivot" element in Array(), then
'  moving every element that is bigger to one side of the pivot, and every
'  element that is smaller to the other side.  QuickSortXXX is then called
'  recursively with the two subdivisions created by the pivot.  Once the
'  number of elements in a subdivision reaches two, the recursive calls end
'  and the array is sorted.
'============================================================================
'
'            Microsoft's source code modified by Quinn Tyler Jackson
'
STATIC BeenHere

IF NOT BeenHere THEN
        Low = LBOUND(Array)
        High = UBOUND(Array)
        BeenHere = -1
END IF

DIM Partition AS STRING  ' Change STRING to any BASIC data type
                         ' for this QuickSort routine to work with
                         ' things other than strings.

   IF Low < High THEN

      ' Only two elements in this subdivision; swap them if they are out of
      ' order, then end recursive calls:
      IF High - Low = 1 THEN ' we have reached the terminating condition!
         IF Array(Low) > Array(High) THEN
            SWAP Array(Low), Array(High)
            BeenHere = 0
         END IF
      ELSE

         ' Pick a pivot element at random, then move it to the end:
         RandIndex = INT(RND * (High - Low + 1)) + Low
         SWAP Array(High), Array(RandIndex)
         Partition = Array(High)
         DO

            ' Move in from both sides towards the pivot element:
            I = Low: J = High
            DO WHILE (I < J) AND (Array(I) <= Partition)
               I = I + 1
            LOOP
            DO WHILE (J > I) AND (Array(J) >= Partition)
               J = J - 1
            LOOP

            ' If we haven't reached the pivot element, it means that two
            ' elements on either side are out of order, so swap them:
            IF I < J THEN
               SWAP Array(I), Array(J)
            END IF
         LOOP WHILE I < J

         ' Move the pivot element back to its proper place in the array:
         SWAP Array(I), Array(High)

         ' Recursively call the QuickSortSTR procedure (pass the smaller
         ' subdivision first to use less stack space):
         IF (I - Low) < (High - I) THEN
            QuickSortSTR Array(), Low, I - 1
            QuickSortSTR Array(), I + 1, High
         ELSE
            QuickSortSTR Array(), I + 1, High
            QuickSortSTR Array(), Low, I - 1
         END IF
      END IF
   END IF
END SUB

=======>8 SAMPLE 1.0 ENDS HERE 8<=========

Q1.5    So that's how to use recursion!  That's great!  I think I'm
        starting to get the hang of things with QuickBASIC now, thanks.
        But, how is it possible for it to call itself over and over
        like that without all those variables interfering with
        each other?  I mean, I'm kind of used to GW-BASIC, and well,
        I just can't figure out why all those High and Low variables
        don't just write over one another.  My docs say something about
        local and global scope, but it's all kind of confusing.  What's
        the real difference between local, STATIC, COMMON, SHARED, COMMON
        SHARED, and all other flavors of variables?

A1.5    Beginners with QuickBASIC sometimes have a hard time deciphering
        all of the different types of variable scope.  Microsoft hasn't
        really helped matters with its choice of keywords; GLOBAL would
        have made more sense to most people than SHARED.
        Okay, let's look at how a QuickBASIC program is typically
        structured:

                1.  First, there is the 'module' level.  That is the
                    main part of the QuickBASIC program, the part where
                    execution starts, and most programmers declare their
                    constants, and put their main documentation.

                2.  Second, there is the SUB and FUNCTION level.  Each
                    SUB and FUNCTION could be thought of as a miniprogram
                    unto itself.  That's why SUBs are called that:
                    subprogram.

                3.  Third, if you write bigger programs, you may actually
                    have two or more modules, each one having its own
                    SUBs and FUNCTIONs.

        Okay, then, any variable used at the modular level, or level 1, is
        accessible, or in the 'scope' of the modular level.  If there is
        a variable called Foo at the modular level, with a value of 7, then
        a variable at the SUB or FUNCTION level could also be called Foo
        without interfering with the modular Foo.  Think of each module
        level variable and each SUB and FUNCTION variable as being on
        different continents.  They can have the same name with no problem.

        But, suppose you want a SUB or FUNCTION to have access to the Foo
        that was declared at the modular level.  This is where the SHARED
        declarator comes in.  In the SUB somesubprog, to have access to
        the Foo that was declared at the modular level, just add the
        declaration:

        SHARED Foo

        Any SUB or FUNCTION that doesn't want to have access to the
        modular Foo doesn't have to declare it as SHARED.  This is a
        powerful feature, once you get the hang of it and feel confident
        enough to use it wisely.

        Now, suppose that you want a number of your SUBs or FUNCTIONs to
        have access to a common group of variables.  At the modular level,
        the declaration would be:

        DIM SHARED Foo

        This would give ALL of the SUBs and FUNCTIONs of a given module
        access to the variable Foo.  Any access of Foo at any level will
        alter the global variable.

        Now, suppose you have a multimodule program that has FIRST.BAS and
        SECOND.BAS linked together.  Suppose you want them to communicate
        with one another via a common global variable.  This is where
        COMMON SHARED comes in.
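
        The declaration here is just a sketch of the idea: put the same
        COMMON SHARED line near the top of BOTH modules, and the variable
        becomes global to every SUB and FUNCTION in each of them:

        ' In FIRST.BAS:
        COMMON SHARED Foo

        ' In SECOND.BAS:
        COMMON SHARED Foo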

        Now that we've covered this, there is the issue of the STATIC
        declarator.  Normally, variables at the SUB and FUNCTION level
        are dynamic, which means they disappear when the routine returns
        to the place that it was called from.  By declaring a variable
        STATIC, we can be assured that whatever the variable's value was
        when we left the routine, it will still be there when we return.
        To declare only a few of the variables as STATIC, use the form:

        SUB FooSub ()
        STATIC Variable1, Variable2, etc.
        :
        :
        END SUB

        But, if you want ALL the variables to be STATIC, use the following
        method:

        SUB FooSub () STATIC
        :
        :
        :
        END SUB

        There are certain speed advantages to STATIC SUBs and FUNCTIONs,
        since variables are not created on the stack, but that is a more
        advanced issue.

        So, in summary:

        1.  SHARED allows SUBs and FUNCTIONs to use modular variables,
        2.  COMMON allows modules to share variables between themselves,
        3.  STATIC allows variables to retain their value between
            calls to the SUB or FUNCTION in question.
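
        A tiny demonstration of what STATIC buys you (CountCalls is an
        invented name):

        SUB CountCalls
        STATIC Calls
        Calls = Calls + 1
        PRINT "I have been called"; Calls; "times."
        END SUB

        Without the STATIC declaration, Calls would start over at zero
        on every call, and the SUB would always report one call.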

Q2.0    Commonly Requested Routines:

Q2.4    Okay, I've looked the whole thing over, Quinn, and I've realized
        something: the recursive QuickSortXXX routine eats the stack up
        pretty fast.  Is there another way?  Is there a way to implement
        a QuickSort SUB without using recursion?

A2.4    Yes, indeed there is.  Cornel Huth implemented an iterative
        quicksort algorithm, which I then tweaked a bit.  It is actually
        a bit faster than the other, and doesn't use too much of the stack.
        It accomplishes this by using an array to simulate a stack.  The
        modified version follows:

S2.0    HUTHSORT.BAS [P210S02.BAS]

' HUTHSORT.BAS written by Cornel Huth
' Iterative QuickSort Routine
'
' Tweaked by Quinn Tyler Jackson

SUB subHuthSortSTR (Array() AS STRING)
'               ^  TWEAK THESE    ^
'               | FOR OTHER TYPES |
'               `--+--------------'
'                  V
  DIM compare AS STRING

TYPE StackType
  low AS INTEGER
  hi AS INTEGER
END TYPE

DIM aStack(1 TO 128) AS StackType

  StackPtr = 1
  aStack(StackPtr).low = LBOUND(Array)
  aStack(StackPtr).hi = UBOUND(Array)
  StackPtr = StackPtr + 1

  DO
    StackPtr = StackPtr - 1
    low = aStack(StackPtr).low
    hi = aStack(StackPtr).hi
    DO
      i = low
      j = hi
      mid = (low + hi) \ 2
      compare = Array(mid)
      DO
        DO WHILE Array(i) < compare
          i = i + 1
        LOOP
        DO WHILE Array(j) > compare
          j = j - 1
        LOOP
        IF i <= j THEN
          SWAP Array(i), Array(j)
          i = i + 1
          j = j - 1
        END IF
      LOOP WHILE i <= j
      IF j - low < hi - i THEN
        IF i < hi THEN
          aStack(StackPtr).low = i
          aStack(StackPtr).hi = hi
          StackPtr = StackPtr + 1
        END IF
        hi = j
      ELSE
        IF low < j THEN
          aStack(StackPtr).low = low
          aStack(StackPtr).hi = j
          StackPtr = StackPtr + 1
        END IF
        low = i
      END IF
    LOOP WHILE low < hi
    'IF StackPtr > maxsp THEN maxsp = StackPtr
  LOOP WHILE StackPtr <> 1
END SUB

=======>8 SAMPLE 2.0 ENDS HERE 8<=========
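
        A quick note on calling it: since the routine reads its own
        bounds with LBOUND and UBOUND, you just pass in the whole array.
        For example:

        DIM Words(1 TO 3) AS STRING
        Words(1) = "PEAR": Words(2) = "APPLE": Words(3) = "MANGO"
        subHuthSortSTR Words()
        PRINT Words(1)          ' prints APPLE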

Q2.5    Now that I've got so many neat ways to sort a list, I'd sure like
        to be able to locate an entry in it quickly.  I hear that a binary
        search is fast, but I just can't figure out how to do that.  How
        do I do a binary search?

A2.5    Binary searches are the fastest overall search method for standard
        sorted lists.  Such lists can be divided in two, looked at, and
        divided again as necessary.  A good search method is demonstrated
        here:

S3.0    BISEARCH.BAS [F210S03.BAS]


DEFINT A-Z
FUNCTION BiSearchSTR (Find AS STRING, Array() AS STRING)

Min = LBOUND(Array)             'start at first element
Max = UBOUND(Array)             'consider through last

DO
  Try = (Max + Min) \ 2         'start testing in middle

  IF Array(Try) = Find THEN     'found it!
    BiSearchSTR = Try           'return matching element
    EXIT DO                     'all done
  END IF

  IF Array(Try) > Find THEN     'too high, cut in half
    Max = Try - 1
  ELSE
    Min = Try + 1               'too low, cut other way
  END IF
LOOP WHILE Max >= Min

'BiSearchSTR returns 0 if Find is not in Array, so use arrays whose
'LBOUND is 1 or more if you need to detect a failed search.

END FUNCTION

=======>8 SAMPLE 3.0 ENDS HERE 8<=========
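
        A sketch of the sort-then-search pairing, assuming both
        QuickSortSTR (from S1.0) and BiSearchSTR are in the program:

        DIM Names(1 TO 4) AS STRING
        Names(1) = "DELTA": Names(2) = "ALPHA"
        Names(3) = "CHARLIE": Names(4) = "BRAVO"
        QuickSortSTR Names(), 1, 4
        Find$ = "CHARLIE"
        Where = BiSearchSTR(Find$, Names())
        IF Where > 0 THEN PRINT "Found it in element"; Where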

Q3.0    Advanced Topics -- "Hashing in QuickBASIC"

Q3.1    That's pretty fast!  I was so used to doing a sequential search on
        an unsorted list.  Now that I have the QuickSort and the BiSearch
        routines, I can use them as a pair for faster list searches.  The
        thing is, as soon as I want to add something to the list, it
        puts everything out of order by only one entry, and that hardly
        seems worth sorting all over again, even with something as fast
        as Cornel Huth's iterative QuickSort algorithm.  Are there any
        alternatives to this way of doing things?  I've heard talk of
        something called 'hashing' but I don't have any idea of what that
        is all about.  How would I use hashing to avoid having to either
        resort the list, or use a slow insertion algorithm?  Insertion is
        horrendously slow with disk files.

A3.1    Hashing is a very efficient method of record access, be it in RAM
        or be it with a disk file.  Basically, hashed arrays or data files
        can be quickly searched for a given item by a key index.  Whenever
        you have to add an item to the list, you can do so at lightning
        speed, and since hashing "sorts" the array on-the-fly, as it were,
        there is no need to push records around to add new items to a
        hashed record.

        The first concept you must understand with hashing is the key index.
        Every data structure you design with hashing in mind has to have
        one field that is unique.  This is a prerequisite that you just can't
        get around.  Of course, you could actually combine several fields
        to generate this unique key, which effectively serves the same
        purpose.  A good application of this is a Fidonet nodelist that uses
        the node address as the hashing key.  No two alike in theory.

        But just how does this key work?  First of all, let's take a look
        at the Fidonet example.  Every full Fidonet address is unique to
        one node.  Assume that the full nodelist has about 15000 entries.
        Okay, if you want a hashing table to hold 15000 unique entries, then
        research has shown that the table should be at least 30% greater
        than the number of entries in it.  That would make 19500 table
        entries.  This means that 4500 slots in the table will be left
        empty for best hashing results.

        Now, another problem comes up.  How does the key come into play?
        Well, let's look at a simple key: 1153999.  Since the table is
        19500 entries long, we certainly can't just put this in record
        1153999.  Hashing involves dividing the key by the table size and
        using the remainder as the record number:

                           59
                    ----------  R 3499
               19500) 1153999


        Okay, 3499 is the record number in which we would put the data.
        This is the basic idea behind hashing.  There is a problem, however.
        A collision occurs whenever another node address, when divided by
        19500, also has a remainder of 3499.  That 'bucket' is already
        full!  So, what to do?  Generate another bucket number, see if
        that bucket is full, and if it is, keep generating new buckets
        until we find an empty one.
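
        In QuickBASIC the remainder comes straight from the MOD operator.
        The key here is too big for an INTEGER, so a LONG is used (the
        variable names are my own):

        Key& = 1153999
        TableSize% = 19500
        Bucket% = Key& MOD TableSize%
        PRINT Bucket%           ' prints 3499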

        To find an item in a hashed table, we get its key, divide by the
        table size, and look at the bucket that is represented by the
        remainder.  If that isn't the one, we generate the next bucket
        address, until we arrive at an empty bucket.  If we encounter
        the correct key BEFORE we arrive at an empty bucket, then we've
        found our entry.  If we arrive at an empty bucket, the record is
        not in the table.  And there you have hashing.

        A well designed hashing table will yield this number of collisions
        per insertion or search:


T1.0    Hashing Collision Table

        TABLE FULLNESS          COLLISIONS
        ==================================
             50%                   2.0
             60%                   2.5
             70%                   3.3
             90%                  10.0


=======>8 TABLE 1.0 ENDS HERE 8<=========

        Those are better results than even a binary search gives with
        large lists!

        Research has shown that the most efficient hashing tables, that is,
        the ones with the fewest collisions, have a prime
        number of entries.  A table size of 1019 should produce fewer
        collisions than one of 1000.  Research has also shown that if the
        prime is of the form 4K+3, where K is any positive integer, then
        collisions are reduced even further.  1019 also meets this second
        requirement.  But, since a table size twice the size of the maximum
        number of entries it will ever hold is inefficient, the 4K+3
        criterion should be abandoned at a certain point in favor of any
        prime number.  Since most of us aren't idiot savants who can just
        come up with that number to suit our needs, here is a FUNCTION,
        written by Charles Graham, that accepts the maximum number of
        entries a table will have, and returns the proper type of prime
        number, to be used as a hashing table size:

S4.0    FSTPRIME.BAS [F210S04.BAS]

DEFINT A-Z
' This FUNCTION returns a prime number that is at least 30% greater than
' threshold.  It will TRY to return a prime number that also fits into the
' form 4K+3, where K is any integer, but if the prime number is twice the
' size of the threshold, it will ignore this criterion.
'
'       Written by Charles Graham, Tweaked by Quinn Tyler Jackson
'
FUNCTION funFirstPrime (threshold)
CONST TRUE = -1
CONST FALSE = NOT TRUE

tp30 = INT((threshold * 1.3) + .5)
IF tp30 / 2 = tp30 \ 2 THEN
    tp30 = tp30 + 1
END IF
c = tp30 - 2
IF c < 1 THEN
    c = 1
END IF
t2 = threshold * 2
DO
    c = c + 2
    ind = TRUE                  ' assume c is prime until a divisor is found
    FOR z = 3 TO SQR(c)
        IF c / z = c \ z THEN
            ind = FALSE
            EXIT FOR
        END IF
    NEXT z
    IF ind THEN
        IF (c - 3) / 4 = INT((c - 3) / 4) OR c > t2 THEN
            funFirstPrime = c
            EXIT DO
        END IF
    END IF
LOOP
END FUNCTION

=======>8 SAMPLE 4.0 ENDS HERE 8<=========
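
        For example, for a table that must hold up to 100 entries:

        PRINT funFirstPrime(100)        ' prints 131

        131 is prime, is at least 30% greater than 100, and fits the
        4K+3 form, since 131 = 4 * 32 + 3.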

Q3.2    How do I know when to use sequential searches, when to use
        binary searches, and when to use hashing?  Are there any sort
        of guidelines?

A3.2    Well, first let's consider where hashing is in its prime.  (You'll
        pardon that one, okay?)  It is best suited to dynamic list
        generation where items need to be added on a regular basis, but
        not deleted, since deletion is fairly difficult to implement on
        a hashed list.  The main strength of a hashing system is its
        ability to quickly insert new items into the table in such a
        manner that they can be located quickly "on-the-fly."   (See T1.0
        for the average number of collisions before locating the correct
        entry.)  Since the collisions increase with the ratio of full
        buckets to empty buckets, and not with the size of the actual
        table involved, hashing is more efficient than even binary
        searches when lists start to become huge.  Also, because the
        binary method of searching demands a sorted list, insertion of
        items at a later time becomes very cumbersome, even with such
        techniques as the QuickSort and pushing all entries after the
        insertion up by one.  (Try that technique on a list of 30,000
        items, when you only want to add two new items that land near
        the beginning of the list, and you'll know what disk wear and
        tear is all about!)

        Typical applications of the hashing algorithm involve word
        distribution counts, dictionary table generators that involve
        dictionaries that will be added to dynamically, and things of
        that nature.

        Consider the word distribution count problem.  Each word is a
        unique key, and so is perfect for hashing.  Sequential methods
        only work well up until the table has so many entries in it that
        looking up entries in the table becomes a real effort.  Remember,
        words already in the list do not need to be added twice.  Binary
        methods allow for quick searching, but each case of a new word
        being added to the list requires a sort or cumbersome insertion.
        This takes time, if a text file is of even average length.

        Hashing, on the other hand, can increment the count of words
        already in the list, or add new words to the list, without
        the overhead of sorting, sequential searches, or push-type
        insertion.  Also, remember that entry deletion is a problem with
        hashing.  Word distribution counts NEVER require entries to be
        struck, and so are well-suited to hashing systems.

        A good rule of thumb to determine which method may be best for a
        given problem is to consider the points on this table:

T2.0    List Management System Ratings

                                      List  Type
                        SEQUENTIAL      BINARY          HASHED
                =====================================================
small list                  1              3              2
medium list                 3              1              2
large list                  3              2              1
huge list                   3              2              1

Insertion                   2              3              1
Modification                3              2              1
Deletion                    1              2              3
Browsing                    2              1              3

                     (Systems are ranked first, second, or third)

=======>8 TABLE 2.0 ENDS HERE 8<=========

        Using this table, we can see that the best method for short
        lists that require frequent deletions might be the sequential
        list.  The best for huge lists that require insertions,
        modifications, but not deletions (such as a nodelist index) is
        probably a hashed list.  A hashed list, however, will not do
        much for you if you regularly want to access the next item,
        first item in the list, or last item, such as in a list browsing
        system.  Hashed lists have no logical beginning or end, and
        for this reason, there is no such thing as a "first item" or
        "next item" in a hashed list.  Each entry is a single entity,
        retrievable only as a single entity, with no relation to any
        other entry in the hashed list.  This excludes applications
        that require browsing, as I have mentioned, but is perfect
        for symbol tables, dictionaries, and the like.

Q3.3    This is all pretty new to me.  Give me a practical review.

A3.3    Okay.  In the hashed list there is no sense of sequence in
        the classic sense of the concept.  Items are put into buckets
        based upon the type of calculation I have already discussed, and
        if the bucket is already in use, a new bucket is found according
        to a set system. Therefore, two similar items in a hashed table
        may actually have a physical distance of 500 entries between them.

        A practical example:

        We have a hash table 7 buckets big, and you want to store three
        entries in it, using hashing.  For simplicity, let's just store the
        characters A, B, and C, using their ASCII values as keys.  Their
        buckets would be:

        Item   Formula    Bucket
        =========================
          A    65 MOD 7     2
          B    66 MOD 7     3
          C    67 MOD 7     4

        No collisions have occurred here, since this is a simple case.
        Now, let us add just one more item: H.  The first bucket that
        H will request is 72 MOD 7, or 2, which is being used by A.
        This is a collision.  Now, we must find an empty bucket, and so,
        we apply a common method to the old bucket: we subtract an
        offset from 2.  The offset is calculated thus:

                Offset = TableSize - Bucket, or
                Offset = 7 - 2
                Offset = 5

        Okay, now, whenever a collision occurs, we recalculate a position
        using this formula:

                NewPos = OldPos - Offset
                NewPos = 2 - 5
                NewPos = -3

        In cases where NewPos is less than 0, we then add the table size
        to the interim result:

                NewPos = NewPos + TableSize, or
                NewPos = -3 + 7
                NewPos = 4

        We see that this new bucket, 4, is being used by C, and so we
        have to recalculate the bucket one more time:

                NewPos = OldPos - Offset, or
                NewPos = 4 - 5
                NewPos = -1

                NewPos < 0, so
                NewPos = NewPos + TableSize, or
                NewPos = -1 + 7
                NewPos = 6

        We see that 6 is an empty bucket, and therefore, our table now
        looks something like this:

        Entry   Bucket
        ==============
                  1 (empty bucket)
         A        2 (no collisions)
         B        3 (no collisions)
         C        4 (no collisions)
                  5 (empty bucket)
         H        6 (arrived at after two collisions)
                  7 (empty bucket)
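
        The recalculation rule boils down to one small FUNCTION (the
        name funNextBucket is my own invention):

        FUNCTION funNextBucket (OldPos, Offset, TableSize)
        NewPos = OldPos - Offset
        IF NewPos < 0 THEN
                NewPos = NewPos + TableSize
        END IF
        funNextBucket = NewPos
        END FUNCTION

        funNextBucket(2, 5, 7) gives 4, and funNextBucket(4, 5, 7) gives
        6, which are exactly the two probes calculated above.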

        Now, remember from past explanations that searches are conducted
        by comparing each entry to the key until an empty bucket is reached.
        Therefore, to find A in the table, we calculate a bucket of
        65 MOD 7, or 2.  We look in bucket 2, and see that our key of A is
        the same as the table entry A.  We have therefore found our entry in
        one look!  Now, let's look for I.  That's a bit different, since
        it isn't in the list.  How many looks are needed to tell us that
        it isn't?  Well, 73 MOD 7 is 3, and we see immediately that bucket
        3 is a B, not an I.  We recalculate the next bucket, and get:

                Offset = 4
                NewPos = (3 - 4) or -1
                Less than 0, so
                NewPos = 6

        Bucket 6 is occupied by an H, and so we calculate the next bucket:

                Offset = 4
                NewPos = (6-4) = 2

        Bucket 2 is occupied by an A, and so:

                NewPos = (2 - 4)
                NewPos = -2 + 7 = 5

        Finally, bucket 5 is empty.  Therefore, since we've arrived at
        an empty bucket BEFORE arriving at I, we can say that I is not
        in the list.  How many probes did that take?  Four.  Quite a bit
        of overhead on a short table of 7 buckets, but consider a table
        of 100,000 entries!  Four probes to settle a search is fast!
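        The probe sequence walked through above can be written out as a
        small stand-alone FUNCTION.  This is an illustrative sketch only
        (the names are mine, not from WORDHASH.BAS below), and it hashes
        a one-character key by its ASCII code, exactly as in the example:

        FUNCTION FindBucket (Key$, Table$(), TableSize)
            KeyIndex = ASC(Key$) MOD TableSize
            'calculate the offset used on every retry
            IF KeyIndex = 0 THEN
                Offset = 1
            ELSE
                Offset = TableSize - KeyIndex
            END IF
            DO
                IF Table$(KeyIndex) = "" THEN
                    FindBucket = -1          'empty bucket: key not in table
                    EXIT FUNCTION
                ELSEIF Table$(KeyIndex) = Key$ THEN
                    FindBucket = KeyIndex    'found the key in this bucket
                    EXIT FUNCTION
                END IF
                KeyIndex = KeyIndex - Offset 'collision: step back by Offset
                IF KeyIndex < 0 THEN KeyIndex = KeyIndex + TableSize
            LOOP    'assumes the table always has at least one empty bucket
        END FUNCTION

        Called with our seven-bucket table, FindBucket("A", ...) returns
        after one probe, and FindBucket("I", ...) returns -1 after the
        four probes counted above.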

Q3.3    Okay, how about a real working example of hashing in QuickBASIC,
        Quinn?  Theory is fine for CompSci freaks, but I'm a coffee and
        pizza programmer, not an egghead.

A3.3    I mentioned that one perfect use of hashing is for word distribution
        counters.  Here is one from Rich Geldreich that has been tweaked
        by me to account for some things that Rich did not know then
        about hashing table sizes.

S5.0    WORDHASH.BAS [F210S05.BAS]

'WORDHASH.BAS v1.10 By Rich Geldreich 1992
' Modified by Quinn Tyler Jackson for demonstrative purposes.
'
'Uses hashing to quickly tally up the frequency of all of the words in a
'text file. (This program assumes that words are separated by either tab
'or space characters. Also, all words are converted to uppercase before
'the search.)
'

DEFINT A-Z
DECLARE SUB Process.Line (A$)
DECLARE SUB UpdateFreq (A$, KeyIndex)
CONST TRUE = -1, FALSE = 0

DIM SHARED TableSize

Main:
 FileName$ = COMMAND$
 CLS
 LOCATE 1, 1
 PRINT "WORDHASH.BAS By Rich Geldreich 1992"
 PRINT "     Tweaked by Quinn Tyler Jackson 1993"
 OPEN FileName$ FOR INPUT AS #1 LEN = 16384

' In Rich's original version, the TableSize was set at 7000.  My version
' guesses at how large the table needs to be based on this:

' There are 5.5 characters in the average word.  Therefore, divide the
' text file length by 5.5.  For safety, assume that as many as
' half of those will be unique.  In normal text, half the words are in the
' hundred most common list, so this plays it pretty safe!  It will die
' if you take a file that is over about 50% unique words, however!  This
' is for NORMAL text files, not word dictionaries, where all entries are
' unique!
'
'SPLICE IN FROM EARLIER SAMPLE 4.0 IN THIS FAQ
'           VVVVVVVVVVVVV
TableSize = funFirstPrime(LOF(1) * .09)
REDIM SHARED WordTable$(TableSize)
REDIM SHARED Counts(TableSize)
DIM SHARED New.Words

 DO UNTIL EOF(1)
     LINE INPUT #1, A$
     Process.Line A$
     N = N + 1
     LOCATE 3, 1: PRINT N; "lines processed,"; New.Words; "words found"
 LOOP
 CLOSE #1
 END

SUB Process.Line (A$)

    ASEG = SSEG(A$) 'QuickBASIC 4.5 users change this to VARSEG(A$)
    AOFS& = SADD(A$)
    DEF SEG = ASEG + AOFS& \ 16

    AAddress = AOFS& AND 15
    Astart = AAddress
    AEndAddress = AAddress + LEN(A$)

    'get a word
    GOSUB GetAWord
    'update the frequency of the word until there aren't any words left
    DO WHILE Word$ <> ""
        UpdateFreq Word$, KeyIndex
        GOSUB GetAWord
    LOOP

    EXIT SUB

GetAWord:
    Word$ = ""

    'find a character
    P = PEEK(AAddress)
    DO WHILE (P = 32 OR P = 9) AND AAddress <> AEndAddress
        AAddress = AAddress + 1
        P = PEEK(AAddress)
    LOOP

    'if not at end of string then find a space
    IF AAddress <> AEndAddress THEN
        KeyIndex = 0
        GOSUB UpdateKeyIndex

        'remember where the character started
        WordStart = AAddress

        AAddress = AAddress + 1
        P = PEEK(AAddress)
        GOSUB UpdateKeyIndex
        'find the trailing space that ends the word
        DO UNTIL (P = 32 OR P = 9) OR AAddress = AEndAddress
            AAddress = AAddress + 1
            P = PEEK(AAddress)
            GOSUB UpdateKeyIndex
        LOOP
        KeyIndex = KeyIndex - L

        'make the word
        Word$ = UCASE$(MID$(A$, WordStart - Astart + 1, AAddress - WordStart))

    END IF
RETURN

UpdateKeyIndex:
    'add the character to the running key, folding lowercase to uppercase
    IF P >= 97 AND P <= 122 THEN
        L = P - 32              'lowercase letter: use its uppercase value
        KeyIndex = KeyIndex + L
    ELSE
        L = P
        KeyIndex = KeyIndex + L
    END IF
RETURN

END SUB

SUB UpdateFreq (A$, KeyIndex)
STATIC collisions
    'adjust the KeyIndex so it's within the table
    KeyIndex = KeyIndex MOD TableSize
    'calculate an offset for retries
    IF KeyIndex = 0 THEN
        Offset = 1
    ELSE
        Offset = TableSize - KeyIndex
    END IF
    'main loop of hashing
    DO
        'is this entry empty?
        IF WordTable$(KeyIndex) = "" THEN
            'add this entry to the hash table
            WordTable$(KeyIndex) = A$
            New.Words = New.Words + 1
            IF New.Words = TableSize THEN
                BEEP
                PRINT : PRINT "Not enough room in word table!"
                END
            END IF
            EXIT SUB
        'is this what we're looking for?
        ELSEIF WordTable$(KeyIndex) = A$ THEN
            'increment the frequency of the entry
            Counts(KeyIndex) = Counts(KeyIndex) + 1
            EXIT SUB
        'this entry contains a string other than what we're looking for:
        'adjust the KeyIndex and try again
        ELSE
            collisions = collisions + 1
            LOCATE 5, 1: PRINT "Collisions: "; collisions
            KeyIndex = KeyIndex - Offset
            'wrap back the keyindex if it's <0
            IF KeyIndex < 0 THEN
                KeyIndex = KeyIndex + TableSize
            END IF
        END IF
    LOOP

END SUB

=======>8 SAMPLE 5.0 ENDS HERE 8<=========

                          END OF QUIK_BAS FAQ2.1

-- 
Jeffery Foy -- jfoy@glia.biostr.washington.edu
               1:343/58.206 or jeffery.foy@frenchc.eskimo.com
-*- Happy as a clam to be using Professional YAM -*-  :)
