Basically, Vahunz is just yet another search and replace tool. Still, it does not simply scan for a character sausage inside another character sausage because this would be far to slow if applied to the huge amount of text to be expected.
Instead, there are some other mechanisms working behind. For those who are interested, here is a short descriptions of them. You do not need to know this stuff for simply using this program.
Vahunz has to read all source codes passed in vahunz.files twice:
Between these two passes, the dictionaries on the disk are updated, and the text for vahunzed names is created.
Only whole words are taken into account. That means, characters are
added to one name until a character shows up that indicates the
beginning of a new name. Normal names consist of letters, digits and
underscores (_
).
But even after a name is complete, it does not mean that it makes sense to add it to the dictionary. Names matching one of the following criteria are stored in the vahunzed target without any change :
--name-length
.
1234
and 0xbeef23
As such names do not need to be looked up in the name list, they can be processed faster.
If one vahunzes dozens of files, there can be thousands of names in them. When in the second pass, it has to be decided if and how to vahunz a name after reading it from the input,
That means, for every name read (with the above exceptions), the whole list of names has to be searched for it, before a decision can be made.
As this list usually contains a lot of names, an efficient data structure has to be used. In the case of Vahunz, this is an AVL-tree. Such a tree is similar to the "classic" binary tree, but avoids degeneration.
A lazy programmer does not write such tree functions himself. The tree used here is part of the Ubiqx module library. You can obtain it from http://www.interads.co.uk/~crh/ubiqx/index.html.
When a vahunzed name is created, the following steps are preformed:
The first three random letters are always a lower case letter, a digit and an upper case letter - in exactly this sequence. This makes it very unlikely that a vahunzed name matches a reserved word or the name of a library function.
After that, possibly further random letters and digits have to be appended
until the name is as long as specified with --vahunz-length
.
As last step, further letters or digits are appended until no such existing name can be found in the tree. That means, every new letter results in a name lookup, which takes more time than the other steps described before.
In most cases, all vahunzed names will have three or maybe four letters. Only when there are really a lot of files and names, you will find longer ones.
Of course, all vahunzed names are also compaired with the non-vahunzed names, so there can not be any accidental matches.
Every name node in the tree stores the following data:
--no-unused
.
The data structure is store in a way that there are only two calls
to malloc()
required because all possible strings for one
node are stored within one memory area, only separated by a
'\0'
character. This significantly reduces memory
fragmentation and the total amount of calls to the (slow)
malloc()
and free()
functions.
The reason why there is not even only one allocation used per node
is that the vahunzed names can only be created after all
names have been read from the input. But the length of a vahunzed name
is not known yet; --vahunz-length
only gives a lower
limit. Therefor every vahunzed name requires a
free()
/malloc()
pair, which is the main
reason why the "building vahunzed dictionary in memory
"
can take some time.