I wish our resident statistics expert, ff123, could put all of this to rest.

(Although IMHO tangent, Garf, and JohnV have already given good enough arguments to settle the matter, there are others in this forum who apparently disagree.)
QUOTE
Originally posted by shday
Picking 12/16 and above as pass and 11/16 and below as fail is potentially throwing away alot of information!
12/16 is chosen as a cutoff because a listener who is purely guessing would score that well or better less than 5% of the time, i.e. it gives >95% confidence that the ABXer heard a difference rather than getting lucky. 95% is a standard cutoff in statistics; 12/16 gives about 96%, while 11/16 gives only about 90%.
The 95% cutoff is arbitrary, but it works well. It still means that a listener who hears no difference has up to a 1 in 20 chance of passing simply by guessing well, but usually the listener does hear a difference and simply uses the ABX to prove it. Preferably, the listener will score even better than 12/16 (for example, 13/16 gives 99% and 16/16 gives >99.9%), but most people on this forum will accept 12/16, along with a good description of the problem, as proof that the listener hears a difference. Also, there is no reason why the listener should perform exactly 16 trials, although 16 does give a good balance between statistical significance and listener fatigue.
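For anyone who wants to check these percentages, here is a minimal Python sketch. It assumes each trial is an independent 50/50 guess when the listener hears nothing, so the score follows a binomial distribution. The function name guess_probability is just an illustrative choice, not something from any ABX tool.

```python
from math import comb

def guess_probability(correct, trials):
    """Chance that pure guessing (50/50 per trial) scores `correct` or better."""
    tail = sum(comb(trials, k) for k in range(correct, trials + 1))
    return tail / 2 ** trials

# Scores discussed in this thread, all out of 16 trials.
for score in (10, 11, 12, 13, 16):
    p = guess_probability(score, 16)
    print(f"{score}/16: a guesser does this well {p:.2%} of the time "
          f"(confidence {1 - p:.2%})")
```

Under that assumption, 12/16 is the lowest score out of 16 whose guessing probability falls below 5%, which is why it serves as the cutoff.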
QUOTE
Originally posted by shday
Ten people score 10/16, each giving a ~77% probability that their test wasn't a fluke.
Ten people at ~77% each gives an overall confidence of more that 99.9% (much more I think) that they, as a group, didn't just get lucky.
Honestly, I don't mean to pick on shday at all.

His conclusions do at first seem to have good logic behind them, but I believe the math and statistics contradict these claims. If I weren't so rusty with my statistics, or if I had a statistics book handy, I could put the math right here.
Here's (if I remember correctly) the correct conclusion.
If a listener scores 10/16 on an ABX test, there is about a 1 in 4 chance that pure guessing would produce a score at least that good. However, if the next two ABX tests are also 10/16, that gives a total ABX score of 30/48, which works out to roughly 94% confidence, just shy of the 95% cutoff. Note that the same addition gives a very different answer if the listener gets 10/16, 3/16, 13/16, 10/16 on four sequential tests. Even with a 13/16 on an individual test, the total ABX score through the third test is 26/48 (about 67% confidence) and through the fourth test is 36/64 (about 81% confidence). By picking and choosing ABX tests, one invalidates his results.
All ABX tests performed must be included in the statistical probability.
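As a rough check on the running totals, the sketch below adds up sequential sessions from one listener and recomputes the guessing probability after each one. It assumes independent 50/50 trials under pure guessing. The helper names (guess_probability, running_totals) are made up for illustration.

```python
from math import comb

def guess_probability(correct, trials):
    """Chance that pure guessing (50/50 per trial) scores `correct` or better."""
    tail = sum(comb(trials, k) for k in range(correct, trials + 1))
    return tail / 2 ** trials

def running_totals(sessions):
    """Pool every session so far; return (correct, trials, guess probability) per step."""
    results = []
    correct = trials = 0
    for session_correct, session_trials in sessions:
        correct += session_correct
        trials += session_trials
        results.append((correct, trials, guess_probability(correct, trials)))
    return results

# Four sequential sessions by the same listener: 10/16, 3/16, 13/16, 10/16.
sessions = [(10, 16), (3, 16), (13, 16), (10, 16)]
for correct, trials, p in running_totals(sessions):
    print(f"{correct}/{trials}: confidence {1 - p:.1%}")

# Reporting only the 13/16 session would look far better than the pooled total.
print(f"cherry-picked 13/16: confidence {1 - guess_probability(13, 16):.1%}")
```

The pooled total never gets close to the 95% cutoff, while the 13/16 session on its own looks convincing. That gap is exactly why every session has to be counted.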
This logic can be extended to an ideal listening group. Assuming 20 identical listeners in 20 identical environments (and this is not the case on this forum), there is a certain statistical probability that one will get 12/16 correct on an ABX test purely by guessing. There is some math involved that I can't recall, but I do know that there is significance if one individual can get 12/16 ABX, but if that person is part of a group of many identical people, then that single 12/16 ABX test carries much less significance. This makes sense; flipping a coin five times in a row will probably not give five heads in a row on the first try, but if one keeps flipping, there is a good chance that five sequential heads will eventually occur.
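The math in question is probably along these lines. This is a sketch under two assumptions: every listener in the group is purely guessing, and their tests are independent of each other. The function names are illustrative only.

```python
from math import comb

def guess_probability(correct, trials):
    """Chance that pure guessing (50/50 per trial) scores `correct` or better."""
    tail = sum(comb(trials, k) for k in range(correct, trials + 1))
    return tail / 2 ** trials

def chance_of_lucky_pass(listeners, correct=12, trials=16):
    """Chance that at least one of `listeners` pure guessers reaches the cutoff."""
    p_single = guess_probability(correct, trials)
    return 1 - (1 - p_single) ** listeners

for group_size in (1, 5, 20):
    print(f"{group_size} guessing listener(s): "
          f"{chance_of_lucky_pass(group_size):.1%} chance of at least one 12/16")
```

With 20 such listeners, the chance that somebody reaches 12/16 by luck alone is better than even, so one pass out of an identical group says much less than one pass from a single listener.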
This does not mean that if no one but JohnV can hear a difference, then JohnV is wrong; no two listeners are alike, so we do not have an ideal listening group, and the logic in the previous paragraph does not apply to this forum.
If ten people each score 10/16 ABX, it does not mean that the aggregate score was 100/160; it means that no one person could reliably hear the difference between the two samples. However, if a single person takes ten sequential ABX tests and gets a 10/16 on each, that does constitute a 100/160 ABX, which is very significant.
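To see how much stronger 100/160 from one listener is than a single 10/16, here is a quick sketch using the same binomial assumption (independent 50/50 trials under pure guessing). guess_probability is an illustrative helper, not part of any ABX software.

```python
from math import comb

def guess_probability(correct, trials):
    """Chance that pure guessing (50/50 per trial) scores `correct` or better."""
    tail = sum(comb(trials, k) for k in range(correct, trials + 1))
    return tail / 2 ** trials

single_session = guess_probability(10, 16)     # one 10/16 test
ten_sessions = guess_probability(100, 160)     # ten 10/16 tests, same listener

print(f"10/16:   a guesser does this well {single_session:.2%} of the time")
print(f"100/160: a guesser does this well {ten_sessions:.3%} of the time")
```

The hit rate is the same 62.5% in both cases, but over 160 trials a guesser would almost never keep it up.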
Now, another important concept to understand is that the ABX does not test the relative quality of the audio files; it only says whether or not a person can hear any difference at all. ABX gives a "yes or no" response, not a "more so or less so" response. This is why ff123 conducts preference tests by having listeners rate each sample on a scale from 1 to 5 in 0.1 increments.
What is true is that you can conclude from a group of ABX results whether one audio file has a better chance of being audibly different from the original than another. If, from a group of 30 people, 20 of them reliably ABX sample A but only 10 of them reliably ABX sample B, then you can conclude that sample A is more likely to be detected than sample B. You cannot, however, get a user's preference in this way; that is what user comments and preference tests are for.
I think you can make the following statement though: if you tell 30 listeners to only listen for high-frequency roll-off in the sample files A and B (and this isn't realistic, but assume that roll-off is the only artifact the listeners can detect), and 20 of them ABX sample A with a high significance but only 10 ABX sample B with a high significance, then you can say that sample A has a more audible high-frequency roll-off.
Phew, this is quite a post! But hopefully it clears up some misconceptions, rather than adding new ones.