Wednesday, 10 December 2008

Measuring TDD Effectiveness

There is a new video on Channel 9 about the findings of an experimental study of TDD. The researchers compared similar projects at Microsoft and IBM, developed with and without TDD, and published a paper with their findings (which can be downloaded here).

The headline results were that the TDD teams had a 40-90% lower pre-release "defect density", at the cost of a 15-35% increase in development time. To put those figures into perspective, imagine a development team working for 100 days and producing a product with 100 defects. What kind of improvement might they experience if they used TDD?

                  Days  Defects
  Non TDD          100      100
  TDD worst case   115       60
  TDD best case    135       10
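Those figures follow directly from the percentages in the paper. Here is a quick Python sketch of the arithmetic; the 100-day, 100-defect baseline is just the hypothetical team described above, and the best/worst pairings mirror the table:

    # Hypothetical baseline team: 100 days of development, 100 pre-release defects
    baseline_days = 100
    baseline_defects = 100

    # Ranges reported by the study
    time_increase    = {"worst case": 0.15, "best case": 0.35}  # 15-35% longer
    defect_reduction = {"worst case": 0.40, "best case": 0.90}  # 40-90% fewer defects

    for case in ("worst case", "best case"):
        days = baseline_days * (1 + time_increase[case])
        defects = baseline_defects * (1 - defect_reduction[case])
        print(f"TDD {case}: {days:.0f} days, {defects:.0f} defects")

    # TDD worst case: 115 days, 60 defects
    # TDD best case: 135 days, 10 defects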

Here are a few observations I have on this experiment and its results.

Limitations of measuring Time + Defects

From the table above it seems apparent that a non-TDD team could use the extra time afforded by their quick and dirty approach to do some defect fixing, and would therefore be able to comfortably match the defect count of a TDD team given the same amount of time. From a project manager's perspective, if the only measures of success are time and defect count, then TDD will not likely be perceived as a great success.

Limitations of measuring Defect Density

The study actually measures "defect density" (i.e. defects per KLOC) rather than simply "defects". This is obviously to give a normalised defect count, since the projects being compared were not exactly the same size, but there are a couple of implications of this metric.

First, if you write succinct code, your defect density is higher than that of verbose code containing the same number of defects. I wonder whether this penalises TDD, which ought to produce fewer LOC (because you never write more than is needed to pass the tests). Second, the number of defects found pre-release often says more about the testing effort put in than about the quality of the code. If we freeze the code and spend a month testing, the bug count goes up, and therefore so does the defect density, but the quality of the code has not changed one bit. The TDD team may well have failing tests that reveal more limitations of their system than a non-TDD team would have time to discover.

In short, defect density favours the non-TDD team, so the actual benefits of TDD may be even more impressive than those shown by the study.
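To make the first point concrete, here is a small Python sketch; the line counts and defect numbers are invented purely for illustration:

    # Defect density is usually quoted as defects per KLOC (thousand lines of code)
    def defect_density(defects, loc):
        return defects / (loc / 1000.0)

    # Two hypothetical implementations of the same features, both shipping
    # with exactly the same number of defects
    defects = 10
    succinct_loc = 5000    # e.g. a lean, test-driven code base
    verbose_loc = 20000    # e.g. a copy-and-paste heavy code base

    print(defect_density(defects, succinct_loc))  # 2.0 defects per KLOC
    print(defect_density(defects, verbose_loc))   # 0.5 defects per KLOC

    # The succinct code base looks four times worse on this metric,
    # even though users would encounter exactly the same number of bugs.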

TDD results in better quality code

The lower defect density results of this study vindicate the idea that TDD produces better quality code. I thought it was disappointing that the experiment didn't report any code quality metrics, such as cyclomatic complexity, average class length, class coupling and so on. I think these might have revealed an even bigger difference between the two approaches. As it is, the study seems to assume that quality is simply proportional to defect density, which in my view is a mistake. You can have a horrendous code-base that through sheer brute force has had most of its defects fixed, but that does not mean it is high quality.

TDD really needs management buy-in

The fact that TDD takes longer means that you really need management buy-in to introduce TDD in your workplace. Taking 35% longer does not go down well with most managers, unless they really believe it will result in significant quality gains. I should also add that to do TDD effectively you need a fast computer, which also requires management buy-in. TDD is based on minute-by-minute cycles of writing a failing unit test and then writing just enough code to make it pass. There are two compile steps in that cycle. If your compile takes several minutes, then TDD is not a practical way of working.
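To illustrate how tight that cycle is, a single iteration might look something like the following sketch (a hypothetical example in Python using the built-in unittest module, not code from the study):

    import unittest

    # Step 1 (red): write a small failing test that pins down the next bit of behaviour
    class PriceCalculatorTests(unittest.TestCase):
        def test_orders_over_100_get_a_ten_percent_discount(self):
            self.assertAlmostEqual(calculate_price(quantity=11, unit_price=10), 99.0)

    # Step 2 (green): write just enough production code to make the test pass
    def calculate_price(quantity, unit_price):
        total = quantity * unit_price
        if total > 100:
            total *= 0.9  # 10% discount on orders over 100
        return total

    # Step 3: refactor with the tests as a safety net, then start the next cycle
    if __name__ == "__main__":
        unittest.main()

Each trip around that loop involves at least one compile-and-run of the test suite, which is why build speed matters so much.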

TDD is a long-term win

Two of the biggest benefits of TDD relate to what happens the next time you revisit your code-base to add more features. First is the comprehensive suite of unit tests that allows you to refactor with confidence. Second is the fact that applications written with TDD tend to be much more loosely coupled, and thus easier to extend and modify. This is another thing the study failed to address at all (perhaps a follow-up could measure it). Even if the TDD approach were worse than non-TDD for the first release of your software, it would still pay for itself many times over when you come round for the second drop.
