Topic: Is double-blind testing potentially biased?

Is double-blind testing potentially biased?

I've come up with a thought experiment/question which leads me to believe that comparisons drawn between the results of two separate ABX tests may have somewhat compromised significance under what are normally considered acceptable testing conditions.

Let's say you have a two-codec comparison test for transparency, with encoders A and B being blind-tested against the original material. Let us further suppose that encoder A has a very high reputation for transparency, while encoder B is either relatively unknown or has a lesser reputation for transparency for whatever reason.

Now, if Joe Blind Tester walks into this test, I'm going to assert that Joe will spend less time scrutinizing encoder A's transparency than encoder B's. I mean, Joe already knows that A is transparent, and B is of unproven transparency, so couldn't that inject an unconscious stimulus to spend more time trying to pick apart B? This may lead to A's results being inflated.

The general issue is: Can testers systematically spend less effort comparing option A to option C, as opposed to comparing option B to option C, in double-blind tests? I believe it is possible, and if so, the existing "safe practices" for blind testing would not be sufficient to prevent biased results.

ABC/HR would, I believe, be immune to this effect because it is capable of testing multiple samples simultaneously. foobar2000's ABX component wouldn't, and in fact you'd need to blindly choose between the AB/AC sample sets in order to mitigate the effect in foobar2000 testing. Hardware testing between three different hardware components in a straight A/B blind test would require that the listener not know which two components were under test - i.e., it needs to be a 3-component test.
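For what it's worth, here is a minimal Python sketch of how that blind pair selection could be automated, so the listener never learns whether the encoded file under test came from A or B. The filenames are hypothetical, purely for illustration, and this is just one possible way to do it.

```python
# Hypothetical filenames; the point is only that the script, not the
# listener, decides which encoder's output goes into the ABX session.
import random
import shutil

original = "sample_original.wav"
candidates = {"A": "sample_encoderA.wav",   # the "reputable" encoder
              "B": "sample_encoderB.wav"}   # the lesser-known encoder

chosen = random.choice(list(candidates))    # hidden from the listener
shutil.copyfile(candidates[chosen], "sample_X.wav")
shutil.copyfile(original, "sample_ref.wav")

# The listener ABXes sample_ref.wav against sample_X.wav without knowing
# whether X came from encoder A or encoder B; the key is revealed only
# after all trials have been logged.
with open("key.txt", "w") as key:
    key.write(f"X = encoder {chosen}\n")
print("Blinded pair prepared; open key.txt only after logging the trials.")
```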

Comments?

Is double-blind testing potentially biased?

Reply #1
Yes, that's possibly a source of bias. Under ideal circumstances, the tester doesn't know what's under comparison.

ff123

Is double-blind testing potentially biased?

Reply #2
It's not biased unless you draw conclusions from both positive and negative results. If you only draw a conclusion in the case of a positive result, there can be no bias.

Is double-blind testing potentially biased?

Reply #3
Also, couldn't such a test be done in which there are sample sets 1, 2, 3, etc., and the tester (assuming it's software-based, like using FB2K's plugin) does not know what he is testing for, except that one is the original and one is not?

Is double-blind testing potentially biased?

Reply #4
If I remember right, in the previous large-scale DBTs on ha.org the listeners did NOT know which codec was which (roberto's multiformat 128 kbit test comes to mind).

So, I think the raised question does not affect DBTs in general, nor large-scale listening tests if done right. But it is a good reminder to also be careful about this in personal comparison tests or other small-scale tests.

- Lyx
I am arrogant and I can afford it because I deliver.

Is double-blind testing potentially biased?

Reply #5
The problem is not that you know which codec is which, but that you usually know which codecs are under scrutiny. So, if you are testing MPC --standard, MP3 -V4 and WMA 8 Std at 96 kbps, you already expect one codec to be very probably transparent, one sounding good but with some probability of artifacting, and one quite bad. This information could in some circumstances lead to less-than-optimal testing behaviour, with subject expectations that ideally should be absent. For example, if you expect a high level of transparency on one sample, you might give up more easily when you encounter one that sounds transparent on the first listen. Conversely, if you are testing a file that you know should be around your threshold of artifact detection, you may make a greater effort.
Of course, in multi-format tests at more or less equivalent qualities this skew is greatly diminished, since you don't know which one is the transparent one and which the almost imperceptible one, but the (theoretical) bias is still there.

Is double-blind testing potentially biased?

Reply #6
The experiment in question (ABX) is different from codec testing. First, you only have two samples; it's not an ABC test. Second, the ABX experiment does not attempt to answer "which one sounds better", only to answer "can the subject find an audible difference". Expectations as to which is supposed or expected to sound better are irrelevant. Note that a positive result only implies that there exists a subject that can tell the difference (so we can say "there is a difference"), but one cannot draw any conclusion about the audibility of the sample in the (human) population at large.

The only thing that can happen in an ABX test is that the subject doesn't try hard enough, so he reports no difference. This, however, does not prove that "there is no difference": the negation of "there exists a subject that can tell the difference" is "no subject can tell the difference", which a single failed test cannot establish. To prove the latter (in statistical terms), you need to run the experiment on a large enough population of subjects.

Third, any conclusion of the kind "A sounds better than B" is purely subjective. In the ABC/HR codec test, this becomes "statistically objective" only as in "a significant majority prefers A to B". It does not mean that any subject will prefer A to B.

So, an individual ABX test is not biased in any way by individual expectations as far as proving that the individual can tell the difference between A & B.
The earth is round (P < 0.05).  -- Cohen J., 1994

Is double-blind testing potentially biased?

Reply #7
Suppose you decide to ABX  lossless vs. Musepack --insane on a random sample. You know MPC -q7 is very likely to be transparent on most samples and on most people. You start the first trial and find no clear artifact you can focus on. Knowing the high transparency of the sample, after a couple tries on different passages you are likely to say "Nah, it's transparent" and give up.
Now you ABX lossless vs. LAME -V4. For most regular folks here (suppose you are one), -V4 is going to sound pretty good, probably transparent most of the time, but you know the likelihood of (slight) artifacting on some passage is still somewhat high. Therefore, you look more for potentially problematic passages and concentrate harder on finding differences, since everybody likes to be able to say "Yeah, I ABXed this and that".
Thus, conditions that are expected/hoped to be ABXable could possibly be overrepresented due to increased effort. I think the global effect of this is not important, but the bias is still there.

Is double-blind testing potentially biased?

Reply #8
So, after all, it IS about knowing which test file belongs to which encoder.

Unless of course there is another way to nullify the mentioned problem while still using the ABX method - but I cannot think of any.

- Lyx
I am arrogant and I can afford it because I deliver.

Is double-blind testing potentially biased?

Reply #9
Quote
Suppose you decide to ABX  lossless vs. Musepack --insane on a random sample. You know MPC -q7 is very likely to be transparent on most samples and on most people. You start the first trial and find no clear artifact you can focus on. Knowing the high transparency of the sample, after a couple tries on different passages you are likely to say "Nah, it's transparent" and give up.


This is where you are wrong. I clearly stated above that in an ABX experiment, you can only "prove" p = "there is a difference", but you cannot prove non-p if the experiment fails. In statistical "logic", lack of significance for p does NOT imply statistical significance for non-p; it may simply mean that there's another hypothesis that is true, e.g. "the subject was drunk" or "the subject didn't bother to try". You'll need a difference experiment to prove non-p = "there is no difference".

Short of getting a statistical inference book, look here.

Quote
Now you ABX lossless vs. LAME -V4. For most regular folks here (suppose you are one), -V4 is going to sound pretty good, probably transparent most of the time, but you know the likelihood of (slight) artifacting on some passage is still somewhat high. Therefore, you look more for potentially problematic passages and concentrate harder on finding differences, since everybody likes to be able to say "Yeah, I ABXed this and that".
Thus, conditions that are expected/hoped to be ABXable could possibly be overrepresented due to increased effort. I think the global effect of this is not important, but the bias is still there.
[a href="index.php?act=findpost&pid=251830"][{POST_SNAPBACK}][/a]


The test you described: (1) AB of A vs. B, then (2) AB of A vs. C is flawed because you know that in test (1) there's no C, and in test (2) there's no B. You cannot "add" two AB tests like that and get a valid ABC test. The "compound" test is NOT blind, so it is biased.
The earth is round (P < 0.05).  -- Cohen J., 1994

Is double-blind testing potentially biased?

Reply #10
Since I was probably oversimplifying too much in the 1st part of my previous post, here is a more detailed explanation:

Statistical inference is based on eliminating error, not on finding truth. So, if you want to determine whether "the subject can tell the difference between A & B" (hypothesis H1), the experiment tries to reject the null hypothesis "A & B sound the same to the subject" (H0 = not H1). If enough trials are performed successfully, i.e. X is guessed correctly, the chi-square (or binomial) test rejects the null hypothesis H0, thus supporting the hypothesis H1. But if the null hypothesis H0 fails to be rejected with statistical significance, it does not mean H0 is true, so nothing can be said about H1. Failure to reject H0 does not disprove H1.
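For concreteness, the arithmetic behind that binomial test fits in a few lines of Python; the 12/16 and 10/16 scores below are just illustrative numbers, not results from any actual test.

```python
# Under H0 ("the subject is guessing"), each ABX trial is a fair coin flip,
# so the one-sided p-value for k correct answers out of n trials is
# P(K >= k) with K ~ Binomial(n, 0.5).
from scipy.stats import binom

def abx_p_value(k_correct, n_trials):
    return binom.sf(k_correct - 1, n_trials, 0.5)   # survival fn = P(K >= k)

print(abx_p_value(12, 16))   # ~0.038: H0 rejected at the usual 0.05 level
print(abx_p_value(10, 16))   # ~0.23: H0 not rejected -- which, as argued
                             # above, says nothing about H1 being false
```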

Thus, ABX cannot ever be used to prove that "A & B sound the same", or that "there is no difference between A & B", even if you restrict the statements to the subject being tested. So you can never prove that some codec is transparent by ABX testing. Odd, but true.
The earth is round (P < 0.05).  -- Cohen J., 1994

Is double-blind testing potentially biased?

Reply #11
Quote
This is where you are wrong. I clearly stated above that in an ABX experiment, you can only "prove" p = "there is a difference", but you cannot prove non-p if the experiment fails. In statistical "logic", lack of significance for p does NOT imply statistical significance for non-p; it may simply mean that there's another hypothesis that is true, e.g. "the subject was drunk" or "the subject didn't bother to try". You'll need a difference experiment to prove non-p = "there is no difference".

What do you mean by "difference experiment"? In the present context, isn't an ABX or ABC test the only valid way of producing statistical results? Or are you referring to a test which controls for a specific type II error, just as most ABX tests aim for a specific type I error?

In any case, this doesn't make the overall issue any less relevant, because transparent encoders have not been rigorously tested against a transparency hypothesis (be it difference experiment or calculated type II or whatnot) in any test I am aware of. In fact, there isn't even a consensus on what would constitute such a test. IOW, "transparent" encoders have not been statistically proven here to be transparent. (Obviously problem samples always temper such a statement, but the problem still stands.)

Not that that's a terribly bad thing, it just needs to be made clear.

Quote
The test you described: (1) AB of A vs. B, then (2) AB of A vs. C is flawed because you know that in test (1) there's no C, and in test (2) there's no B. You cannot "add" two AB tests like that and get a valid ABC test. The "compound" test is NOT blind, so it is biased.


That's what I'm saying, and virtually all the tests conducted here are thus biased. At the very least, the entire 3.96 vs 3.90.3 thread (http://www.hydrogenaudio.org/forums/index.php?showtopic=20715) is guilty of this reasoning.

And yet another disclaimer, just to make sure the subjectivists don't go parading around saying "HA's tests are flawed! HA's tests are flawed!": this may only weaken the test results. Positive ABX results are still rock solid, and negatives are still very strong, especially multiply validated ones.

Is double-blind testing potentially biased?

Reply #12
Quote
Quote
This is where you are wrong. I clearly stated above that in an ABX experiment, you can only "prove" p = "there is a difference", but you cannot prove non-p if the experiment fails. In statistical "logic", lack of significance for p does NOT imply statistical significance for non-p; it may simply mean that there's another hypothesis that is true, e.g. "the subject was drunk" or "the subject didn't bother to try". You'll need a difference experiment to prove non-p = "there is no difference".

What do you mean by "difference experiment"?


A noninferiority test. I'm too lazy to explain it here, but google found this article.

Quote
Or are you referring to a test which controls for a specific type II error, just as most ABX tests aim for a specific type I error?

In any case, this doesn't make the overall issue any less relevant, because transparent encoders have not been rigorously tested against a transparency hypothesis (be it difference experiment or calculated type II or whatnot) in any test I am aware of. In fact, there isn't even a consensus on what would constitute such a test. IOW, "transparent" encoders have not been statistically proven here to be transparent. (Obviously problem samples always temper such a statement, but the problem still stands.)

Not that that's a terribly bad thing, it just needs to be made clear.


Since you cannot quantify the hearing difference in one person, for a non-inferiority test you'd need multiple participants and would quantify the difference based on the number of subjects that can tell the difference. Oh, and you'd need identical equipment for all of them. Which probably precludes any such test on HA.
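To illustrate what such a multi-subject non-inferiority style test might look like, here is a rough Python sketch. The margin, panel size, and number of successful subjects are all made-up numbers, and this is not a procedure anyone in this thread has actually run.

```python
# Sketch only: pick in advance the largest tolerable proportion of
# listeners allowed to hear a difference (the margin p0), test n subjects,
# and count how many individually pass their own ABX.
from scipy.stats import binom

p0 = 0.10          # margin: at most 10% of listeners may detect a difference
n_subjects = 100   # assumed panel size
k_detectors = 4    # subjects who individually passed their ABX (made up)

# One-sided test of H0: true detection rate >= p0 vs H1: rate < p0.
# p-value = P(K <= k) with K ~ Binomial(n, p0).
p_value = binom.cdf(k_detectors, n_subjects, p0)
print(p_value)     # ~0.024 here, so H0 would be rejected: the detection
                   # rate is credibly below the chosen margin
```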

Quote
Quote
The test you described: (1) AB of A vs. B, then (2) AB of A vs. C is flawed because you know that in test (1) there's no C, and in test (2) there's no B. You cannot "add" two AB tests like that and get a valid ABC test. The "compound" test is NOT blind, so it is biased.


That's what I'm saying, and virtually all the tests conducted here are thus biased.
[a href="index.php?act=findpost&pid=251901"][{POST_SNAPBACK}][/a]


Hold your horses. All I'm saying is that doing only AB and AC tests is not the same as an ABC test. But if you add a BC test, then it almost is. I say almost, because pairwise difference is not equivalent to difference among all pairs. If you scale this pairwise experiment to many pairs, you may find a pair that is different solely by accident. To put it in practical terms: a pairwise t-test is not equivalent to ANOVA. Further reading: http://www.physics.csbsju.edu/stats/anova.html

I think you were thinking that ABC=ABC/HR, when I meant ABC = ANOVA. 

ABC/HR is a different (sic) matter. Once you prove that you can tell the difference between A & B, then you are free to formulate a preference. I haven't used ABC/HR, but, if I understand it correctly, all it does is ask the subject to rank the differences between A (the original) and B, C, D, ..., and then runs a non-parametric test to determine if the rankings are significant. And you need multiple subjects because you are aiming for a significant result over a population sample. The only attack on validity I see is the fact that subjects used different equipment.
The earth is round (P < 0.05).  -- Cohen J., 1994

Is double-blind testing potentially biased?

Reply #13
Quote
Since you cannot quantify the hearing difference in one person, for a non-inferiority test you'd need multiple participants and would quantify the difference based on the number of subjects that can tell the difference. Oh, and you'd need identical equipment for all of them. Which probably precludes any such test on HA.

Well, the basic idea of what I think you're saying is that if testers can rate the quality of the encoded sample, using a 1-5 method like ABC/HR, or maybe even use a straight ABX test and calculate some sort of power level based on how positive the test result is, we can define a result interval of the average score of the testers inside which we consider the encoding transparent. I think I see how this would work.

Why would you need identical equipment? The whole notion of having different listeners is that they will all have different ears and different listening sensitivities. Why wouldn't having different equipment be roughly the same as having different ears? (once the obvious objections are taken care of like equalizing stuff beyond the ATH)
Quote
Hold your horses. All I'm saying is that doing only AB and AC tests is not the same as an ABC test. But if you add a BC test, then it almost is. I say almost, because pairwise difference is not equivalent to difference among all pairs. If you scale this pairwise experiment to many pairs, you may find a pair that is different solely by accident. To put it in practical terms: a pairwise t-test is not equivalent to ANOVA. Further reading.

I think you were thinking that ABC=ABC/HR, when I meant ABC = ANOVA. 

Gotcha. Still, isn't ANOVA only valid in the context of a variable? I don't quite understand what exactly we can use as a variable for ABX tests. The number of correct responses seems like it would have some bias either for results that are too strong (close to N) or weak (towards the average). Maybe the power of the test is utilized?

The fact that pairwise testing makes for a valid test makes quite a few more tests significant, because some people are doing BC tests. They probably know more statistics than I do. Still, wouldn't you agree that AB/AC tests are not quite fully significant when naively combined, compared to ABC/ANOVA?

Quote
ABC/HR is a different (sic) matter. Once you prove that you can tell the difference between A & B, then you are free to formulate a preference. I haven't used ABC/HR, but, if I understand it correctly, all it does is ask the subject to rank the differences between A (the original) and B, C, D, ..., and then runs a non-parametric test to determine if the rankings are significant. And you need multiple subjects because you are aiming for a significant result over a population sample. The only attack on validity I see is the fact that subjects used different equipment.
[a href="index.php?act=findpost&pid=251907"][{POST_SNAPBACK}][/a]

I pretty much agree, although again, I don't quite understand the issues you raise with equipment. It's obviously useful for a clinical context, and especially when you need to cap the number of tests per listener - I could see the significance of the results being degraded slightly with varying equipment. But I don't see it as fundamentally eliminating the possibility of statistical results.

Is double-blind testing potentially biased?

Reply #14
Quote
Thus, ABX cannot ever be used to prove that "A & B sound the same", or that "there is no difference between A & B", even if you restrict the statements to the subject being tested. So you can never prove that some codec is transparent by ABX testing. Odd, but true.
[a href="index.php?act=findpost&pid=251898"][{POST_SNAPBACK}][/a]

Lemme try this hypothesis on for size:

If there really is no difference between A & B, then you're still going to get results where 12/16s or what not are produced. They'll be rare, but they will exist. These results, in the context of ABC/HR testing, will follow a predictable distribution down from the peak (representing transparency). Couldn't this be used to construct a mean rating, and couldn't an interval be defined around that to produce a noninferiority test? Additionally, wouldn't the standard deviation around that mean change as well (with a large enough sample size) if the encoding was not transparent?

EDIT: now that I think about it more, all that it would seem is necessary to use is the proportion of correct responses, which is what I assume gaboo was talking about in the first place. I see how this could be used to derive a mean even for a bare-bones ABX test.
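As a quick numeric sanity check of that intuition: if there truly is no audible difference, individual 16-trial ABX scores are simply Binomial(16, 0.5), so the occasional 12/16 is expected to turn up. A short Python illustration:

```python
from scipy.stats import binom

n = 16
for k in range(9, 17):
    print(f"{k}/16: P(exactly k) = {binom.pmf(k, n, 0.5):.4f}, "
          f"P(at least k) = {binom.sf(k - 1, n, 0.5):.4f}")
# P(at least 12/16) comes out around 0.038 -- rare, but bound to show up
# now and then among many guessing listeners.
```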

Is double-blind testing potentially biased?

Reply #15
Quote
Quote
Suppose you decide to ABX  lossless vs. Musepack --insane on a random sample. You know MPC -q7 is very likely to be transparent on most samples and on most people. You start the first trial and find no clear artifact you can focus on. Knowing the high transparency of the sample, after a couple tries on different passages you are likely to say "Nah, it's transparent" and give up.


This is where you are wrong. I clearly stated above that in an ABX experiment, you can only "prove" p = "there is a difference", but you cannot prove non-p if the experiment fails. In statistical "logic", lack of significance for p does NOT imply statistical significance for non-p; it may simply mean that there's another hypothesis that is true, e.g. "the subject was drunk" or "the subject didn't bother to try". You'll need a difference experiment to prove non-p = "there is no difference".


You're missing the point. I am aware of the statistical caveats of the test and what you can or cannot prove with it. However, my post wasn't really about the methodology or statistical significance issues. My point is that previous knowledge (of what is being tested) generates expectations, and expectations can influence tester behaviour. Different behaviour when it ideally should be the same introduces error. The error may or may not be important. That's all. When did I ever suggest that you can "prove" non-p?

Quote
Quote
Now you ABX lossless vs. LAME -V4. For most regular folks here (suppose you are one), -V4 is going to sound pretty good, probably transparent most of the time, but you know the likelihood of (slight) artifacting on some passage is still somewhat high. Therefore, you look more for potentially problematic passages and concentrate harder on finding differences, since everybody likes to be able to say "Yeah, I ABXed this and that".
Thus, conditions that are expected/hoped to be ABXable could possibly be overrepresented due to increased effort. I think the global effect of this is not important, but the bias is still there.
[a href="index.php?act=findpost&pid=251830"][{POST_SNAPBACK}][/a]


The test you described: (1) AB of A vs. B, then (2) AB of A vs. C is flawed because you know that in test (1) there's no C, and in test (2) there's no B. You cannot "add" two AB tests like that and get a valid ABC test. The "compound" test is NOT blind, so it is biased.
[a href="index.php?act=findpost&pid=251892"][{POST_SNAPBACK}][/a]

Again... what? When did I say my example was meant to describe a "compound test"? It was never my intention to draw any conclusions about the relationship between B and C. The tests can be completely independent. Again, the example was just meant to illustrate how expectations may affect the results of individual ABX tests.

Is double-blind testing potentially biased?

Reply #16
Quote
Quote
Quote
Suppose you decide to ABX  lossless vs. Musepack --insane on a random sample. You know MPC -q7 is very likely to be transparent on most samples and on most people. You start the first trial and find no clear artifact you can focus on. Knowing the high transparency of the sample, after a couple tries on different passages you are likely to say "Nah, it's transparent" and give up.


This is where you are wrong. I clearly stated above that in an ABX experiment, you can only "prove" p = "there is a difference", but you cannot prove non-p if the experiment fails. In statistical "logic", lack of significance for p does NOT imply statistical significance for non-p; it may simply mean that there's another hypothesis that is true, e.g. "the subject was drunk" or "the subject didn't bother to try". You'll need a difference experiment to prove non-p = "there is no difference".


You're missing the point. I am aware of the statistical caveats of the test and what you can or cannot prove with it. However, my post wasn't really about the methodology or statistical significance issues. My point is that previous knowledge (of what is being tested) generates expectations, and expectations can influence tester behaviour. Different behaviour when it ideally should be the same introduces error. The error may or may not be important. That's all. When did I ever suggest that you can "prove" non-p?


You used the word ABX. If I understand you correctly, you are proposing that we use ABX testing to prove transparency, which is the null hypothesis in ABX. Unless you use ABX as a substitute for "generic" DBT, you do imply an experiment and a specific statistical test. You cannot use a failed ABX as a successful non-inferiority test.

As for the bias induced by knowledge of the encoder, you are correct: knowledge of the encoder being tested does increase type II error if people believe the codec is supposed to be transparent.

Quote
Quote
Quote
Now you ABX lossless vs. LAME -V4. For most regular folks here (suppose you are one), -V4 is going to sound pretty good, probably transparent most of the time, but you know the likelihood of (slight) artifacting on some passage is still somewhat high. Therefore, you look more for potentially problematic passages and concentrate harder on finding differences, since everybody likes to be able to say "Yeah, I ABXed this and that".
Thus, conditions that are expected/hoped to be ABXable could possibly be overrepresented due to increased effort. I think the global effect of this is not important, but the bias is still there.



The test you described: (1) AB of A vs. B, then (2) AB of A vs. C is flawed because you know that in test (1) there's no C, and in test (2) there's no B. You cannot "add" two AB tests like that and get a valid ABC test. The "compound" test is NOT blind, so it is biased.
[a href="index.php?act=findpost&pid=251892"][{POST_SNAPBACK}][/a]

Again... what? When did I say my example was meant to describe a "compound test"? It was never my intention to draw any conclusions about the relationship between B and C. The tests can be completely independent. Again, the example was just meant to illustrate how expectations may affect the results of individual ABX tests.
[a href="index.php?act=findpost&pid=251977"][{POST_SNAPBACK}][/a]


Well, the caveat is that someone less knowledgeable about statistics may look at the independent test results and draw such conclusions (i.e. B vs C).

If you bother to make both experiments, why not perform a test that does allow conclusions across all of A, B & C? The Kruskal-Wallis test is a non-parametric test that does just that. Friedman's test also accounts for repeated measurements of the same individual. But these only test the universal null hypothesis; they are not pairwise tests. If you want to push it further and test pairwise differences, there are a bunch of tests you can do. I see this topic has been debated here in this thread (http://www.hydrogenaudio.org/forums/index.php?showtopic=73&hl=friedman), so I won't regurgitate it.

For the inquiring minds reading this thread: you can turn a set of pairwise null tests into a universal null test by strengthening the required p-value in each test; the procedure is called the Bonferroni adjustment.
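For readers who would rather see it than read it, here is a minimal Python sketch of both ideas, using fabricated listener ratings rather than any real HA test data.

```python
# Friedman's test for the universal null across three codecs rated by the
# same listeners, followed by pairwise Wilcoxon signed-rank tests at a
# Bonferroni-adjusted alpha. All numbers are invented for illustration.
from scipy.stats import friedmanchisquare, wilcoxon

# Each list holds one codec's ratings; positions correspond to listeners.
codec_A = [4.8, 4.5, 5.0, 4.7, 4.9, 4.6]
codec_B = [4.2, 4.0, 4.5, 3.9, 4.3, 4.1]
codec_C = [3.5, 3.8, 3.9, 3.4, 3.7, 3.6]

stat, p = friedmanchisquare(codec_A, codec_B, codec_C)
print(f"Friedman: statistic={stat:.2f}, p={p:.4f}")   # universal null only

# If pairwise comparisons follow, keep the family-wise error rate at 0.05
# by testing each of the three pairs at alpha / 3.
alpha_pairwise = 0.05 / 3
pairs = {"A vs B": (codec_A, codec_B),
         "A vs C": (codec_A, codec_C),
         "B vs C": (codec_B, codec_C)}
for name, (x, y) in pairs.items():
    _, p_pair = wilcoxon(x, y)
    print(f"{name}: p={p_pair:.4f}, significant at "
          f"alpha={alpha_pairwise:.4f}: {p_pair < alpha_pairwise}")
```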
The earth is round (P < 0.05).  -- Cohen J., 1994

Is double-blind testing potentially biased?

Reply #17
Quote
Quote
Thus, ABX cannot ever be used to prove that "A & B sound the same", or that "there is no difference between A & B", even if you restrict the statements to the subject being tested. So you can never prove that some codec is transparent by ABX testing. Odd, but true.
[a href="index.php?act=findpost&pid=251898"][{POST_SNAPBACK}][/a]

Lemme try this hypothesis on for size:

If there really is no difference between A & B, then you're still going to get results where 12/16s or what not are produced. They'll be rare, but they will exist. These results, in the context of ABC/HR testing, will follow a predictable distribution down from the peak (representing transparency). Couldn't this be used to construct a mean rating, and couldn't an interval be defined around that to produce a noninferiority test? Additionally, wouldn't the standard deviation around that mean change as well (with a large enough sample size) if the encoding was not transparent?

EDIT: now that I think about it more, all that it would seem is necessary to use is the proportion of correct responses, which is what I assume gaboo was talking about in the first place. I see how this could be used to derive a mean even for a bare-bones ABX test.
[a href="index.php?act=findpost&pid=251912"][{POST_SNAPBACK}][/a]


Axon, I'm an engineer, thus no expert on perceptual testing. The way I see it, the main problem in setting up a non-inferiority perceptual test is exactly the fact that the effort required to get a "1" bit (i.e. different) is greater than the effort to get a "0" bit. You can get a bunch of deaf people to declare that any two encoders are equal. 

Maybe ff123 (who has read at least one relevant book) has an idea how to do this...

I do know a Ph.D. candidate in sociology, I'll ask for outside help if all else fails. 
The earth is round (P < 0.05).  -- Cohen J., 1994

Is double-blind testing potentially biased?

Reply #18
OT from the original thread, but:

Somehow, there's this idea that "you can't [statistically] prove that something sounds the same as another."

Yet, the food industry does stuff like this all the time, except using taste instead of hearing.  For example, if they change the formula of a product (new Coke comes to mind), they want to know if it will be detectable.  So they choose an acceptable proportion of the population, pd, who are allowed to be able to detect the difference, set the type I and II error risks, which then tells them how many random people they need to test.

If the results then come out null, they can say with a certain amount of confidence that their product still tastes the same.

With a single person testing multiple times, pd becomes the effect size, or the proportion of times that a person is able to detect a difference, but the idea is the same.  Of course, the results of 1 person only apply to that person.
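A rough Python sketch of that sample-size arithmetic, under the simplifying assumption that each randomly chosen subject gives a single forced-choice answer: guessers are right half the time, while the fraction pd of genuine detectors is always right, so the correct-answer rate is 0.5 under the null and 0.5 + pd/2 otherwise. The normal-approximation formula is standard; the specific pd, alpha, and power values are just examples.

```python
from math import ceil, sqrt
from scipy.stats import norm

def subjects_needed(pd, alpha=0.05, power=0.90):
    # Correct-answer rate is 0.5 if nobody can tell the difference,
    # and 0.5 + pd/2 if a fraction pd of the population reliably can.
    p0, p1 = 0.5, 0.5 + pd / 2
    z_a = norm.ppf(1 - alpha)   # one-sided type I risk
    z_b = norm.ppf(power)       # power = 1 - type II risk
    n = ((z_a * sqrt(p0 * (1 - p0)) + z_b * sqrt(p1 * (1 - p1)))
         / (p1 - p0)) ** 2
    return ceil(n)

print(subjects_needed(pd=0.20))   # ~211 subjects if 20% detectors is the limit
print(subjects_needed(pd=0.10))   # ~853 subjects if the limit is 10%
```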

ff123

Is double-blind testing potentially biased?

Reply #19
Before I get my ass flamed, I rush to specify that by ABX in the posts above I mean theta=1 ABX (as was originally implemented). I saw today that ff123 has a new ABX where you can set theta to any desired value.
The earth is round (P < 0.05).  -- Cohen J., 1994

Is double-blind testing potentially biased?

Reply #20
Quote
You used the word ABX. If I understand you correctly, you are proposing that we use ABX testing to prove transparency, which is the null hypothesis in ABX. Unless you use ABX as a substitute for "generic" DBT, you do imply an experiment and a specific statistical test. You cannot use a failed ABX as a successful non-inferiority test.

I never proposed anything. That was just an example of how people might react with a priori knowledge of the codecs being tested. The hypothetical guy who exclaims "Nah, it's transparent" while giving up the test should not be taken as a formal experimenter of perceptual audio coding. It's just a regular HA.org reader who wants to know what settings are transparent enough for his tastes. After his failed attempt, he just considers the sample transparent [to him] and gives up trying to find artifacts. Sure, he didn't prove statistically the non-inferiority of the codec, but he doesn't care, since he is convinced he cannot hear a difference (and so far, hasn't).

Is double-blind testing potentially biased?

Reply #21
Quote
Quote
You used the word ABX. If I understand you correctly, you are proposing that we use ABX testing to prove transparency, which is the null hypothesis in ABX. Unless you use ABX as a substitute for "generic" DBT, you do imply an experiment and a specific statistical test. You cannot use a failed ABX as a successful non-inferiority test.

I never proposed anything. That was just an example of how people might react with a priori knowledge of the codecs being tested. The hypothetical guy who exclaims "Nah, it's transparent" while giving up the test should not be taken as a formal experimenter of perceptual audio coding. It's just a regular HA.org reader who wants to know what settings are transparent enough for his tastes. After his failed attempt, he just considers the sample transparent [to him] and gives up trying to find artifacts. Sure, he didn't prove statistically the non-inferiority of the codec, but he doesn't care, since he is convinced he cannot hear a difference (and so far, hasn't).
[a href="index.php?act=findpost&pid=252283"][{POST_SNAPBACK}][/a]

Gaaaah! You guys are saying the same thing. I think the point gaboo is trying to make by being so pedantic is that, in a strictly statistical sense, none of us are really "testing for transparency", and in fact, very few stringent tests for transparency have been done on HA.

I think that is a very important outstanding problem which has no place in this topic. This whole side discussion on what constitutes a "proper" transparency test is fascinating, and it's apparently rather unexplored territory.

In any case, it is more or less concluded: combining two ABX tests, between two different encoders, does not yield statistically valid results and will lead to increased type II error. However, if a 3-way pairwise blind test is done (ABX, ACX, BCX), and the results of BCX are taken into account, then the results are a lot more valid.