Inside Vahunz

Basically, Vahunz is just yet another search and replace tool. Still, it does not simply scan for a character sausage inside another character sausage because this would be far to slow if applied to the huge amount of text to be expected.

Instead, there are some other mechanisms working behind. For those who are interested, here is a short descriptions of them. You do not need to know this stuff for simply using this program.

Passes

Vahunz has to read all source codes passed in vahunz.files twice:

The first pass creates the list of names in memory.
The second pass read all files again, and checks for every name if it should be replaced by some other text.

Between these two passes, the dictionaries on the disk are updated, and the text for vahunzed names is created.

Collecting Names

Only whole words are taken into account. That means, characters are added to one name until a character shows up that indicates the beginning of a new name. Normal names consist of letters, digits and underscores (_).

But even after a name is complete, it does not mean that it makes sense to add it to the dictionary. Names matching one of the following criteria are stored in the vahunzed target without any change :

Short names according to --name-length.
Numeric values like 1234 and 0xbeef23
Special characters like mathematical operators

As such names do not need to be looked up in the name list, they can be processed faster.

Storing Names in Memory

If one vahunzes dozens of files, there can be thousands of names in them. When in the second pass, it has to be decided if and how to vahunz a name after reading it from the input,

That means, for every name read (with the above exceptions), the whole list of names has to be searched for it, before a decision can be made.

As this list usually contains a lot of names, an efficient data structure has to be used. In the case of Vahunz, this is an AVL-tree. Such a tree is similar to the "classic" binary tree, but avoids degeneration.

A lazy programmer does not write such tree functions himself. The tree used here is part of the Ubiqx module library. You can obtain it from http://www.interads.co.uk/~crh/ubiqx/index.html.

Vahunzed Names

When a vahunzed name is created, the following steps are preformed:

The first three random letters are always a lower case letter, a digit and an upper case letter - in exactly this sequence. This makes it very unlikely that a vahunzed name matches a reserved word or the name of a library function.

After that, possibly further random letters and digits have to be appended until the name is as long as specified with --vahunz-length.

As last step, further letters or digits are appended until no such existing name can be found in the tree. That means, every new letter results in a name lookup, which takes more time than the other steps described before.

In most cases, all vahunzed names will have three or maybe four letters. Only when there are really a lot of files and names, you will find longer ones.

Of course, all vahunzed names are also compaired with the non-vahunzed names, so there can not be any accidental matches.

Name Nodes

Every name node in the tree stores the following data:

The original name from the source code
The possible target name to be vahunzed/renamed to
The vahunz mode to be applied
A usage counter how often the name has occurred in all sources. This is only used for --no-unused.

The data structure is store in a way that there are only two calls to malloc() required because all possible strings for one node are stored within one memory area, only separated by a '\0' character. This significantly reduces memory fragmentation and the total amount of calls to the (slow) malloc() and free() functions.

The reason why there is not even only one allocation used per node is that the vahunzed names can only be created after all names have been read from the input. But the length of a vahunzed name is not known yet; --vahunz-length only gives a lower limit. Therefor every vahunzed name requires a free()/malloc() pair, which is the main reason why the "building vahunzed dictionary in memory" can take some time.