The results of running a multivariate linear regression on the
(somewhat massaged) output of the various tools for the source code
in a Pascal compiler (30,000 - 40,000 lines of code).  "Changes"
are the number of deltas (excluding version rollovers) in the SCCS
files.  I assumed that was a good estimator for errors.
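If you want to gather the same count yourself, the deltas can be read
straight out of the raw SCCS s-files.  Here is a minimal sketch; the
s-file fragment is made up, and filtering out version rollovers (which
I did above) would need an extra check on the SIDs that I omit here:

```python
import re

# Each delta entry in a raw SCCS s-file begins with control-A (\x01)
# followed by "d D".  Count those lines to get the delta count.
DELTA_LINE = re.compile(r"^\x01d D ")

def count_deltas(sfile_text):
    """Return the number of delta entries in raw SCCS s-file text."""
    return sum(1 for line in sfile_text.splitlines()
               if DELTA_LINE.match(line))

# Made-up fragment of a delta table (not a complete s-file):
sample = (
    "\x01d D 1.3 88/01/12 10:01:02 brian 3 2\n"
    "\x01e\n"
    "\x01d D 1.2 88/01/10 09:00:00 brian 2 1\n"
    "\x01e\n"
    "\x01d D 1.1 88/01/01 08:00:00 brian 1 0\n"
    "\x01e\n"
)
print(count_deltas(sample))  # 3 deltas in this fragment
```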

changes	= -0.1362 comments  +  0.00282 volume  -  0.1065 mccabe 
	  + 0.3178 returns  +  3.11129
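Plugged into code, the fitted equation is just a linear combination of
the four metrics.  The sketch below uses the coefficients above; the
metric values in the example call are made up:

```python
def predict_changes(comments, volume, mccabe, returns):
    """Predicted change count from the fitted regression above."""
    return (-0.1362 * comments
            + 0.00282 * volume
            - 0.1065 * mccabe
            + 0.3178 * returns
            + 3.11129)

# Made-up metric values for one module:
print(round(predict_changes(comments=40, volume=2500,
                            mccabe=15, returns=6), 2))  # -> 5.02
```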

Correlation Matrix:

Variable	changes	comments volume	mccabe	returns
changes		1.0000
comments	0.3670	1.0000
volume		0.8801	0.5137	 1.0000
mccabe		0.7912	0.4550	 0.8969	1.0000
returns		0.5319	0.2885	 0.5221	0.7635	1.0000
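A matrix like this is easy to reproduce from per-module metric columns;
numpy's corrcoef does it in one call.  A sketch, with made-up data
(one list per variable, one entry per module):

```python
import numpy as np

# Made-up per-module metrics for five modules:
changes  = [3, 7, 12, 5, 9]
comments = [10, 25, 40, 15, 30]
volume   = [800, 2100, 3500, 1200, 2600]
mccabe   = [4, 9, 16, 6, 11]
returns  = [1, 2, 5, 1, 3]

# corrcoef treats each row as a variable and returns the full
# symmetric correlation matrix, with ones on the diagonal.
corr = np.corrcoef([changes, comments, volume, mccabe, returns])
print(corr.shape)  # (5, 5)
```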

The equation was significant, with an R-squared of 0.8034 and the
following t-test values:

comments	3.5532
volume		13.3951
mccabe		3.5085
returns		4.5460
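For anyone who wants to redo the fit, here is a sketch of ordinary
least squares with an intercept, computing R-squared and the
per-coefficient t-statistics from scratch.  The per-module data is
synthetic (generated to roughly resemble the fitted equation), not
the Pascal compiler data:

```python
import numpy as np

# Synthetic per-module metrics, made up for illustration only.
rng = np.random.default_rng(0)
n = 40
comments = rng.uniform(5, 50, n)
volume   = rng.uniform(500, 4000, n)
mccabe   = rng.uniform(2, 20, n)
returns  = rng.uniform(1, 8, n)
changes  = (-0.14 * comments + 0.0028 * volume
            - 0.11 * mccabe + 0.32 * returns
            + 3.1 + rng.normal(0, 1.0, n))

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), comments, volume, mccabe, returns])
beta, *_ = np.linalg.lstsq(X, changes, rcond=None)

# R-squared: one minus residual sum of squares over total sum of squares.
resid = changes - X @ beta
rss = resid @ resid
tss = ((changes - changes.mean()) ** 2).sum()
r_squared = 1 - rss / tss

# t-statistic for each coefficient: estimate / standard error, where
# the standard errors come from the diagonal of s^2 * (X'X)^-1.
dof = n - X.shape[1]
s2 = rss / dof
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
t_stats = beta / se
print(round(r_squared, 3))
```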


So, what does all this stuff mean?  Well, that is a good question.
Note that I am not sure exactly what the mccabe and returns variables
mean.  I believe that I summed the mccabe and returns values for
each file, but I am not positive about that.

Obviously, the most important predictor of the number of changes in a
module is its halstead volume.  This is very intuitive.  Comments
apparently help cut down on changes.  This is also intuitive.  Code
complexity (mccabe) also seems to cut down on changes.  This is not at all
intuitive.  In fact, I don't understand it at all.  My guess is that there
were a few routines with much higher complexity than the others which were
not changed much, while a number of "simple" routines had many changes.

In other words, I think it is a statistical anomaly.  Clearly, additional
data is needed.  Finally, the number of returns contributes pretty
heavily to the change count.  This is also intuitive.  (I think there
are two reasons for this: (1) code with many special cases requiring
returns/exits is not well thought out ahead of time, but rather coded
as the implementor went along; and (2) changes to code with many returns
can be difficult, and can easily be incorrect.)

You might think at this point that you have a magic formula for predicting
errors.  Unfortunately, although this stuff does a very good job of
predicting problems on the Pascal project, the best R-squared I ever
got for the RSM(*1) data was still less than .4.  Apparently, there are
other factors at work.  I believe that one very important factor is the
experience/skill of the original implementor and of those who made changes
to the module.  The uids of these people can be obtained from the SCCS file.
You then need some way to associate a skill level with each person
(note that time may be important here too).  I never completed this final
step.


*1 - RSM == Remote Software Management, a distributed source distribution
and control system.  See results2 for more info.

As always, feel free to contact me with questions, or for assistance in
setting this stuff up to work on a project you want to analyze.

Brian Renaud
