QUOTE(ff123 @ Mar 5 2004, 05:34 PM)
To be absolutely correct, a codec wins with 95% confidence, for that group of listeners and set of samples, when the bars do not overlap. Or to put it another way, 19 times out of 20, those results would not occur by chance. Any overlap reduces that confidence. If the bars just barely overlap, there is still quite a high likelihood that that result did not occur by chance. A reasonable way to describe this situation would be to say that the results are suggestive (if not significant). Actually, in an ideal world, the graphs would speak for themselves, and there would be no "interpretation" to cause controversy.
If this were a drug test or something else where there is a lot at stake for making the right decision, everything below 95% confidence (or whatever threshold is chosen) would not be considered to be significant.
Also, the test would be corrected for comparing multiple samples, which would make the error bars overlap more. I personally don't think it's a real big deal if the type I errors in this sort of test (falsely identifying a codec as being better than another) are higher than they would be in a more conservative analysis. But others, for example on slashdot, can (and do) complain about this sort of thing.
ff123
Right, well, with 95% confidence for the tested 12 samples:
iTunes is better than Real, FAAC and Compaact
Nero is better than Real and Compaact
With lower confidence for the tested 12 samples:
Nero is better than FAAC (small overlap)
With even lower confidence for the tested 12 samples:
iTunes is better than Nero (a bit bigger overlap than with Nero-FAAC)
Correct?
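To make the rule I'm applying explicit, here is a rough Python sketch of the "do the bars overlap?" check. The codec names and numbers are made up for illustration and are NOT the actual test results. I'm also assuming each bar is symmetric around the mean, and the labels "significant" and "suggestive at best" are just my shorthand for the wording in the quote above.

```python
from itertools import combinations

# Hypothetical (mean score, half-width of the 95% error bar) pairs.
# These are placeholder values, not the real listening test results.
scores = {
    "codec_a": (4.2, 0.15),
    "codec_b": (3.9, 0.2),
    "codec_c": (3.4, 0.2),
}

def bar_gap(a, b):
    """Distance between two error bars.

    Positive: the bars are separated. Negative: they overlap by that amount.
    """
    hi, lo = (a, b) if a[0] >= b[0] else (b, a)
    return (hi[0] - hi[1]) - (lo[0] + lo[1])

def compare(scores):
    """Label every pair as (better, worse) -> verdict using the overlap rule."""
    results = {}
    for x, y in combinations(scores, 2):
        gap = bar_gap(scores[x], scores[y])
        better, worse = (x, y) if scores[x][0] >= scores[y][0] else (y, x)
        # No overlap: a win at the chosen confidence level.
        # Any overlap: lower confidence, and lower still as the overlap grows.
        results[(better, worse)] = "significant" if gap > 0 else "suggestive at best"
    return results

if __name__ == "__main__":
    for (better, worse), verdict in compare(scores).items():
        print(f"{better} > {worse}: {verdict}")
```

This only covers the uncorrected bars. A correction for multiple comparisons, as mentioned in the quote, would widen the bars and turn some "significant" pairs into overlapping ones.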