ABX p-values / binomial distributions
threepointone
I'm going to speak in non-statistics language here, since I honestly have not taken statistics yet and don't know many details about the binomial distribution.

I'll probably be referring to a graph of the binomial distribution for the sake of ease in explaining what I mean. Binomial Calculator: http://www.stat.tamu.edu/~west/applets/binomialdemo.html

All the p-values I've seen take a SUM of everything at or above a certain number of successful trials. That is, the p-values we use don't represent the chance that, say, exactly 46 out of 97 trials are randomly guessed correctly, but the chance that at least 46 trials are randomly guessed correctly. I know that using the probability for a single number of trials would give weird results, and then we'd need a different cutoff p-value for each graph. But I just don't entirely understand how this way of using the binomial distribution can model the ABX test. For example, getting 0 right out of 100 trials clearly doesn't have a 100% chance of being random, as it should be as difficult to guess everything wrong as it is to guess everything right.

Now just to clarify, I'm not really trying to argue anything here, I'm just a little confused and need a bit of clarification.
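To make the two quantities concrete, here is a minimal Python sketch of the difference between the probability of exactly 46 correct out of 97 and the probability of at least 46 correct, assuming pure guessing (a 50% chance per trial). The function names `binom_pmf` and `upper_tail` are made up for this illustration and don't come from any ABX program.

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k correct out of n when each trial succeeds with chance p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def upper_tail(k, n, p=0.5):
    """Probability of at least k correct out of n (the sum of the bars from k up to n)."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

n = 97
exactly_46 = binom_pmf(46, n)    # height of a single bar in the graph
at_least_46 = upper_tail(46, n)  # area of all bars from 46 to 97

print(f"P(exactly 46 of {n})  = {exactly_46:.4f}")
print(f"P(at least 46 of {n}) = {at_least_46:.4f}")
print(f"P(at least 0 of {n})  = {upper_tail(0, n):.4f}")
```

The last line is the case I'm asking about: "at least 0 correct" covers every possible outcome, so that sum comes out as 1.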
Dynamic
Quickly, what you try to do in an ABX test is correctly state that either:
1. Sample X is Sample A or...
2. Sample X is Sample B

If you get it wrong 16 times out of 16, for example, you're either:

a. Unable to tell the difference and, by an extreme fluke of luck, didn't get even one right by chance. (The reported p-value is the chance of getting at least that many right by guessing. With 0 correct it is 100%, and for scores far below half of 97 trials it is so close to 100% that software rounding to about 4 significant figures is likely to show 100%.)

b. Deliberately departing from the test method by intentionally picking the wrong answer (the statistics assume that you're trying to complete the test properly; if you aren't, they become invalid).

In one case of an easy ABX test, I had done about 7 trials and X had been the same sample every time (e.g. X=B, correctly identified 7 times in a row), so I carried on to see whether that was just luck. Indeed, from about the 9th to the 16th trial I got the expected mix of X=A and X=B. So it was just luck: I had hit one of the 2 sequences out of 128 that look non-random because the hidden sample X happens to be assigned to the same reference sample every time.
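The "2 out of 128" figure can be checked by brute force. This is only a small Python sketch, assuming each trial assigns X to A or B independently with equal chance (I don't know how any particular ABX program actually randomises).

```python
from itertools import product

# Every possible way X can be assigned to A or B over 7 trials.
sequences = list(product("AB", repeat=7))

# Keep only the sequences where X is the same reference sample every time.
all_same = [s for s in sequences if len(set(s)) == 1]

probability = len(all_same) / len(sequences)
print(len(all_same), len(sequences), probability)  # prints: 2 128 0.015625
```

So a run like that turns up in about 1 session in 64, which is rare but not suspicious.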
sthayashi
QUOTE(threepointone @ Mar 5 2007, 22:41)
All the p-values I've seen take a SUM of everything at or above a certain number of successful trials. That is, the p-values we use don't represent the chance that, say, exactly 46 out of 97 trials are randomly guessed correctly, but the chance that at least 46 trials are randomly guessed correctly. I know that using the probability for a single number of trials would give weird results, and then we'd need a different cutoff p-value for each graph. But I just don't entirely understand how this way of using the binomial distribution can model the ABX test. For example, getting 0 right out of 100 trials clearly doesn't have a 100% chance of being random, as it should be as difficult to guess everything wrong as it is to guess everything right.

Now just to clarify, I'm not really trying to argue anything here, I'm just a little confused and need a bit of clarification.

What you want to look up and read about is one-tailed vs. two-tailed p-values. ABX programs use a one-tailed p-value. This is based on the hypothesis that you can correctly match X to either A or B. The assumption here is that if you choose incorrectly too many times, you clearly cannot identify which of A or B matches X, even if you're wrong far more than 50% of the time. If you got 0 out of 100 trials correct on an ABX test, it's most likely intentional, but the fact remains that you're not making the connection when you should be.

A two-tailed p-value takes the sum from the observed score to the closer edge of the distribution and doubles it. This is for binomial tests with no directional hypothesis. For example, if you thought you were psychic and wanted to guess heads or tails before a coin was flipped, you would use this to prove or disprove psychic ability. If you couldn't get ANY correct out of a large number of trials, you'd still have a low p-value, because it's highly improbable that you'd get that many wrong by chance.
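To put numbers on the difference, here is a rough Python sketch of both calculations for 100 trials, assuming pure guessing (a 50% chance per trial). The helper names are invented for this example, and the two-tailed version uses the simple "double the closer tail" rule described above, capped at 1. Real ABX software may calculate things differently.

```python
from fractions import Fraction
from math import comb

def upper_tail(k, n):
    """Exact chance of at least k correct out of n under pure guessing."""
    return Fraction(sum(comb(n, i) for i in range(k, n + 1)), 2**n)

def lower_tail(k, n):
    """Exact chance of at most k correct out of n under pure guessing."""
    return Fraction(sum(comb(n, i) for i in range(0, k + 1)), 2**n)

def one_tailed(k, n):
    """One-tailed p-value: only scores of k or higher count as evidence."""
    return upper_tail(k, n)

def two_tailed(k, n):
    """Two-tailed p-value: double the closer tail, never exceeding 1."""
    return min(Fraction(1), 2 * min(upper_tail(k, n), lower_tail(k, n)))

for k in (0, 50, 100):
    print(k, float(one_tailed(k, 100)), float(two_tailed(k, 100)))
```

With 0 correct out of 100, the one-tailed value is exactly 1 while the two-tailed value is tiny, which matches the intuition that getting everything wrong is as hard as getting everything right.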

A two-tailed p-value calculation is more appropriate for AB testing than for ABX testing. If you were trying to figure out which song was better and were consistent, but wrong about it (even if the options were Blade MP3 and FLAC), a two-tailed p-value would be the most appropriate calculation to use to prove (or disprove) that you can hear a difference.

Using ABX over AB testing allows us to use fewer trials to establish that you can tell a difference, which is why we prefer the ABX test over AB testing. Hopefully this helps.
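One way to see the cost of the two-tailed calculation is to ask how many correct answers are needed for the same number of trials. This Python sketch assumes 16 trials, a cutoff of p < 0.05 (my choice for the example, not a rule of ABX testing), and the "double the tail" rule for the two-tailed case. The function names are made up.

```python
from math import comb

def upper_tail(k, n):
    """Chance of at least k correct out of n under pure guessing."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

def min_correct(n, alpha, two_sided):
    """Smallest above-half score whose p-value falls below alpha, or None."""
    factor = 2 if two_sided else 1
    for k in range(n // 2 + 1, n + 1):
        if factor * upper_tail(k, n) < alpha:
            return k
    return None

print(min_correct(16, 0.05, two_sided=False))
print(min_correct(16, 0.05, two_sided=True))
```

For 16 trials this gives 12 correct for the one-tailed test and 13 for the two-tailed one, so the one-tailed test reaches the same cutoff with a slightly lower score.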
fluffy
Wait a minute, doesn't ABX use T-values instead of P?