Issue #002
November, 1995


Contents:

Using C++ as a Better C Part 2 - references
Introduction to Namespaces Part 2 - using namespace
Writing Robust C++ Code Part 1 - assert() and subscript checking
Performance - handling lots of small strings via a C++ class


USING C++ AS A BETTER C - PART 2

In the last newsletter we discussed using function prototypes in C++
to eliminate a common type of error encountered in C, that of calling
a function with the wrong number or types of arguments.  Another C++
feature that can be used to reduce programming errors is known as
references.

A reference is another name for an object.  For example, in this code:

        int i;

        int& ir = i;

ir is another name for i.  To see how references are useful, and also
how they're implemented, consider writing a function that has two
return values to pass back.  In ANSI C, we might say:

        void f(int a, int b, int* sum, int* prod)
        {
                *sum = a + b;
                *prod = a * b;
        }

        void g()
        {
                int s;
                int p;

                f(37, 47, &s, &p);
        }

In C++, we would say:

        void f(int a, int b, int& sum, int& prod)
        {
                sum = a + b;
                prod = a * b;
        }

        void g()
        {
                int s;
                int p;

                f(37, 47, s, p);
        }

One way of viewing references is to consider that they have some
similarities to C pointers, but with one level of pointer removed.
Pointers are a frequent source of errors in C.

A reference must be initialized, and its value (the pointed at object)
cannot be changed after initialization.  The value of the reference
cannot change, but the value of the referenced object can, unless the
reference is declared as const.  So, for example:

        int i = 0;
        int& ir = i;
        ir = -19;                       // i gets the value -19

is acceptable, while:

        const int& irc = 47;
        irc = -37;                      // error

is not.  A constant reference that points at a value like 47 can be
implemented using a temporary.

References are useful in argument passing and return.  Another use is
illustrated below in the section on writing robust code.


INTRODUCTION TO NAMESPACES - PART 2

In the last issue we discussed one of the problems solved by
namespaces, that of conflicting class names.  You might be using class
libraries from two different vendors, and they both have a String
class in them.  Using namespaces can help to solve this problem:

        namespace Vendor1 {
                class String { ... };
        }

        namespace Vendor2 {
                class String { ... };
        }

How would you actually use the String classes in these namespaces?
There are a couple of common ways of doing so.  The first is simply to
qualify the class name with the namespace name:

        Vendor1::String s1, s2, s3;

This usage declares three strings, each of type Vendor1::String.

Another approach is to use a using directive:

        using namespace Vendor1;

Such a directive specifies that the names in the namespace can be used
in the scope where the using directive occurs.  So, for example, one
could say:

        using namespace Vendor1;

        String s1, s2, s3;

and pick up the String class found in the Vendor1 namespace.

What happens if you say:

        using namespace Vendor1;
        using namespace Vendor2;

and both namespaces contain a String class?  Will there be a
conflict?  The answer is "no", unless you try to actually use String.
The using directive doesn't add any names to the scope, but simply
makes them available.  So usage like:

        String s1;

would give an error, while:

        Vendor1::String s1;
        Vendor2::String s2;

would still work.

You might have noticed that namespaces have some similarities of
notation with nested classes.  But namespaces represent a more general
way of grouping types and functions.  For example, if you have:

        class A {
                void f1();
                void f2();
        };

then f1() and f2() are member functions of class A, and they operate
on objects of class A (via the "this" pointer).  In contrast, saying:

        namespace A {
                void f1();
                void f2();
        }

is a way of grouping functions f1() and f2(), but no objects or class
types are involved.


WRITING ROBUST C++ CODE - PART 1

Many of the techniques used in writing robust C code also apply in
C++.  For example, if you have a function that is supposed to be
passed a person's name, as a C-style string, it would be wise to say:

        #include <assert.h>

        void f(char* name)
        {
                assert(name && *name);

                ...
        }

to perform basic checks on the passed-in pointer.  assert() is a
function (actually a macro) that checks whether its argument is true
(non-zero), and aborts the program if not.

But C++ offers additional opportunities to the designer interested in
producing quality code.  For example, consider a common problem in C,
where vector bounds are not checked during a dereference operation,
and a bad location is accessed or written to.

In C++, you can partially solve this problem by defining a Vector
class, with a vector dereferencing class member defined for the
Vector, and the vector size stored:

#include <stdio.h>
#include <assert.h>

class Vector {
        int len;                // number of elements
        int* ptr;               // pointer to elements
public:
        Vector(int);            // constructor
        ~Vector();              // destructor
        int& operator[](int);   // dereferencing
};

//constructor
Vector::Vector(int n)
{
        assert(n >= 1);

        len = n;                // store length
        ptr = new int[n];       // get storage to store elements
        assert(ptr);
}

//destructor
Vector::~Vector()
{
        delete ptr;
}

//dereferencing
int& Vector::operator[](int i)
{
        assert(i >= 1 && i <= len);

        return ptr[i - 1];      // return reference to vector slot
}

//driver program
main()
{
        int i;
        const int N = 10;
        Vector v(N);

        for (i = 1; i <= N; i++)        // correct usage
                v[i] = i * i;

        for (i = 1; i <= N; i++)        // correct usage
                printf("%d %d\n", i, v[i]);

        v[0] = 0;                       // will trigger assert failure

        return 0;
}

In this example, we create a vector of 10 elements, and the vector is
indexed 1..10.  If the vector is dereferenced illegally, as in:

        v[0] = 0;

an assertion failure will be triggered.

One objection to this technique is that it can be slow.  If every
vector reference requires a function call (to Vector::operator[]),
then there may be a large performance hit.  However, performance
concerns can be dealt with by making the dereferencing function inline.

Two other comments about the above example.  We are assuming in these
newsletters that if operator new() fails, it returns a NULL pointer:

        ptr = new int[n];
        assert(ptr);            // check for non-NULL pointer

The current draft ANSI standard says that when such a failure occurs,
an exception is thrown or else a new handler is invoked.  Because many
C++ implementations still use the old approach of returning NULL, we
will stick with it for now.

The other comment concerns the use of references.  In the code:

        v[i] = i * i;

the actual code is equivalent to:

        v.operator[](i) = i * i;

and could actually be written this way (see a C++ reference book on
operator overloading for details).

Vector::operator[] returns a reference, which can be used on the
left-hand side of an assignment expression.  In C the equivalent code
would be more awkward:

#include <stdio.h>

int x[10];              // use f() to index into x[10]

int* f(int i)
{
        return &x[i - 1];
}

main()
{
        *f(5) = 37;

        printf("%d %d\n", *f(5), x[4]);

        return 0;
}


PERFORMANCE TIPS

In performance tips this issue, we will present a complete C++ class
along with its implementation code, which is appended to the end of the
discussion.

This C++ example addresses a common problem in applications that use a
lot of small strings.  The problem has to do with the overhead
associated with allocating the string via malloc() (in C) or operator
new() (C++).  Typically, such overhead is 8-16 bytes per string.  And
allocating then deallocating many small blocks will tend to fragment
memory.

The Copy class deals with the problem by internally allocating large
blocks and then shaving off small chunks for individual strings.  It
keeps track of all the large blocks allocated and deallocates them
when a given Copy object is no longer needed.  To use this system, you
would allocate a Copy object for each major subsystem in your
application that uses small strings.  For example, at one point in
your application, you might need to read in a dictionary from disk and
use it for a while.  You would allocate a Copy object and then use it
to allocate the strings for each word, then flush the strings all at
once.

In the application that this class was devised for, implementing
string copying in this way saved 50K out of a total available memory
pool of 500K.  This is with Borland C++, which rounds the number of
requested bytes for a string to the next multiple of 16, or an average
wastage of 8 bytes.  Since the Copy class uses 1024-byte chunks, on
average 512 bytes will be wasted for a given set of strings, so the
breakeven point would be 512 / 8 = 64 or more strings.

There are many variations on this theme.  For example, if you are
certain that the strings will never be freed, then you can simply grab
a large amount of memory and shave chunks off of it, without worrying
about keeping track of the allocated memory.  Or if you have many
objects of one class, such as tree nodes, you can overload operator
new() for that class to do a similar type of thing.

Note that this particular storage allocator is not general.  The
allocated storage is aligned on 1-byte boundaries.  This means that
trying to allocate other than char* objects may result in performance
degradation or a memory fault (such as "bus error" on UNIX systems).
And the performance gains of course decline somewhat with large
strings, while the wastage increases from stranding parts of the
1024-byte allocated chunks.

This same approach could be used in C or assembly language, but C++
makes it easier and encourages this particular style of programming.

An example of usage is included.  A dictionary of 20065 words with
total length 168K is read in.  Without use of the Copy class it
requires 354K, an 111% overhead.  With the Copy class it takes 194K,
an overhead of 15%.  This is a difference of 160K, or 8 bytes per
word.  The results will of course vary depending on a particular
operating system and runtime library.  And the Copy version runs about
20% faster than the conventional version on a 486 PC.

The driver program that is included will work only with Borland C++,
so you will need to write some other code to emulate the logic.

#include <string.h>
#include <assert.h>

const int COPY_BUF = 1024;                      // size of buffer to get
const int COPY_VEC = 64;                        // starting size of vector

class Copy {
        int ln;                                 // number of buffers in use
        int maxln;                              // max size of vector
        char** vec;                             // storage vector
        int freelen;                            // length free in current
        char* freestr;                          // current free string
public:
        Copy();                                 // constructor
        ~Copy();                                // destructor
        char* copy(char*);                      // copy a string
};

// constructor
Copy::Copy()
{
        ln = 0;
        maxln = 0;
        vec = 0;
        freelen = 0;
        freestr = 0;
}

// destructor
Copy::~Copy()
{
        int i;

        // delete buffers

        for (i = 0; i < ln; i++)
                delete vec[i];

        // delete vector itself

        if (vec)
                delete vec;
}

// copy a string
char* Copy::copy(char* s)
{
        int i;
        char** newvec;
        int len;
        char* p;

        assert(s && *s);

        len = strlen(s) + 1;

        // first time or current buffer exhausted?

        if (len > freelen) {
                if (!vec || ln == maxln) {

                        // reallocate vector

                        maxln = (maxln ? maxln * 2 : COPY_VEC);
                        newvec = new char*[maxln];
                        assert(newvec);
                        for (i = 0; i < ln; i++)
                                newvec[i] = vec[i];
                        if (vec)
                                delete vec;
                        vec = newvec;
                }

                // allocate new buffer

                vec[ln] = new char[COPY_BUF];
                assert(vec[ln]);
                freelen = COPY_BUF;
                freestr = vec[ln];
                ln++;
        }

        // allocate and copy string

        freelen -= len;
        p = freestr;
        freestr += len;
        strcpy(p, s);
        return p;
}

#ifdef DRIVER

#include <stdio.h>
#include <alloc.h>

main()
{
        long cl;
        const int MAXLINE = 256;
        char buf[MAXLINE];
        FILE* fp;
        char* s;
#ifdef USE_COPY
        Copy co;
#endif

        cl = coreleft();
        fp = fopen("c:/src/words", "r");
        assert(fp);
        while (fgets(buf, MAXLINE, fp) != NULL) {
#ifdef USE_COPY
                s = co.copy(buf);
#else
                s = new char[strlen(buf) + 1];
                assert(s);
                strcpy(s, buf);
#endif
        }
        fclose(fp);
        printf("memory used = %ld\n", cl - coreleft());

        return 0;
}

#endif

-------------------------

Copyright (c) 1995 Glen McCluskey.  All Rights Reserved.

This newsletter may be further distributed provided that it is copied
in its entirety, including the newsletter number at the top and the
copyright and contact information at the bottom.

Glen McCluskey & Associates
Professional C++ Consulting
Internet: glenm@glenmccl.com
Phone: (970) 490-2462
Fax: (970) 490-2463
FTP: rmii.com /pub2/glenm/newslett (for back issues)
Web: http://www.rmii.com/~glenm
