QUOTE(threepointone @ Mar 5 2007, 22:41)

All the p-values I've seen take a SUM of everything from a certain number of successful trials upward. That is, the p-values we use don't represent the chance that, say, exactly 46 out of 97 trials are randomly guessed correctly, but that at least 46 trials are randomly guessed correctly. I know getting the probability for a single number of trials will give weird results, and then we'd need a different cutoff p-value for each graph. But I just don't entirely understand how this way of using the binomial distribution can model the ABX test. For example, getting 0 right out of 100 trials clearly doesn't have a 100% chance of being random, as it should be as difficult to guess everything wrong as it is to guess everything right.
Now just to clarify, I'm not really trying to argue anything here, I'm just a little confused and need a bit of clarification.
What you want to read up on is one-tailed vs. two-tailed p-values. ABX programs use a one-tailed p-value. That follows from the hypothesis being tested: that you can correctly match X to either A or B. Only the upper tail (scores at least as good as yours) counts as evidence for it. If you choose wrong too many times, you clearly aren't identifying which of A or B is X, even if you're wrong far more than 50% of the time, so a very low score comes out with a p-value near 1. If you got 0 out of 100 trials correct on an ABX test, it's most likely intentional, but the fact is, you're not making the connection when you should be.
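To make the one-tailed calculation concrete, here is a minimal Python sketch. The function name `one_tailed_p` is just made up for illustration, and it assumes every trial is an independent 50/50 guess; it is not taken from any particular ABX program.

```python
from math import comb

def one_tailed_p(correct: int, trials: int) -> float:
    """Chance of getting AT LEAST `correct` right out of `trials`
    when every answer is an independent 50/50 guess."""
    # Count the outcomes with k correct answers for k = correct..trials,
    # then divide by the total number of equally likely outcomes.
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials

print(one_tailed_p(0, 100))    # 1.0: "at least 0 right" always happens
print(one_tailed_p(46, 97))    # above 0.5: 46 is below the expected 48.5
print(one_tailed_p(100, 100))  # 0.5 ** 100, vanishingly small
```

So 0 out of 100 gives p = 1.0 here not because that score is likely by chance, but because the one-tailed test only asks whether you did at least that well.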
A two-tailed p-value takes the sum from your result out to the nearer end of the distribution and doubles it. This is for binomial tests with no directional hypothesis, where a departure from chance in either direction counts. For example, if you thought you were psychic and wanted to guess heads or tails before a coin was flipped, you would use this to prove or disprove psychic ability. If you couldn't get ANY correct out of a large number of trials, you'd still have a low p-value, because it's highly improbable you'd get that many wrong by chance.
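The doubling rule described above can be sketched the same way. Again the name `two_tailed_p` is hypothetical, each trial is assumed to be an independent 50/50 guess, and capping the result at 1 is my own addition so that scores near the middle don't come out above 100%.

```python
from math import comb

def two_tailed_p(correct: int, trials: int) -> float:
    """Double the smaller tail of a fair-coin binomial, capped at 1."""
    total = 2 ** trials
    # Upper tail: at least `correct` right. Lower tail: at most `correct` right.
    upper = sum(comb(trials, k) for k in range(correct, trials + 1)) / total
    lower = sum(comb(trials, k) for k in range(0, correct + 1)) / total
    return min(1.0, 2 * min(upper, lower))

print(two_tailed_p(0, 100))    # tiny: all wrong is as unlikely as all right
print(two_tailed_p(100, 100))  # same value as the line above
print(two_tailed_p(50, 100))   # 1.0: exactly what chance predicts
```

With this version, getting everything wrong is just as significant as getting everything right, which is the behaviour the original question was expecting.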
A two-tailed p-value calculation is more appropriate for AB testing than for ABX testing. If you were trying to figure out which of two versions of a song sounded better and were consistent but wrong about it (even if the options were a Blade MP3 and FLAC), a two-tailed p-value would be the most appropriate calculation to use to prove (or disprove) that you can hear a difference.
Because the one-tailed test doesn't have to double the tail, using ABX rather than AB testing lets us establish that you can tell a difference in fewer trials, which is why we prefer the ABX test. Hopefully this helps.
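As a rough illustration of the "fewer trials" point, this sketch finds the shortest perfect run that gets under a cutoff for each kind of test. The helper `min_perfect_trials` and the 0.05 cutoff are assumptions picked for the example, not something any ABX tool prescribes.

```python
def min_perfect_trials(alpha: float, tails: int) -> int:
    """Smallest number of all-correct 50/50 trials whose p-value is below alpha.

    tails=1 is the one-tailed test, tails=2 doubles the tail."""
    n = 1
    # A perfect run of n trials has one-tailed p = 0.5 ** n.
    while tails * 0.5 ** n >= alpha:
        n += 1
    return n

print(min_perfect_trials(0.05, 1))  # 5 (p = 1/32)
print(min_perfect_trials(0.05, 2))  # 6 (p = 2/64)
```

The gap is only one trial for a perfect score, and this only covers the best case where every answer is right.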