QUOTE(Halcyon @ Jun 15 2004, 12:22 AM)
Excellent work ff123!
You are working your way there much faster than I am.
Your argumentation on rao was also illuminating to read, as you work your way through the logic step by step. It is very unfortunate to see that AK, who was gung-ho on statistics initially, starts to move the goalposts when his arguments are proven to be misguided (eventually claiming the discussion is only about statistics). That again is unfortunate, because the discussion is about improving the accuracy of statistical inference and the implications that has for practical testing! It's not just developing the methods for "mental masturbation" as somebody put it.
I knew going in that convincing Arny to change PC-ABX would be a lost cause. But to see such disregard for the valid criticisms pointed out was stunning.
His general recommendation of 14/16 correct amounts to requiring a p-value of about 0.002! This is a ridiculously low value, which raises the bar unfairly against the detection of very subtle differences.
Arny's assertion that more trials don't help to hear subtler differences, which he claims to base on experience, is contradicted by the fact that he recommends 14/16 at all. Why not just recommend 9/9 trials instead? That would give the same p-value of about 0.002 (ignoring the non-standard criterion of significance), without the need for an extra 7 trials. The answer is that 16 trials are better than 9 trials because one might make mistakes. Why would somebody make mistakes? Because the differences are subtle! He fails to carry the argument all the way through, stopping at the point where he and his cohorts in the 70's decided that even subtler differences were unimportant.
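For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes a one-sided binomial test against guessing (p = 0.5). The listener's true hit rate of 0.9 is a made-up number for illustration only, and the function and variable names are my own.

```python
from math import comb

def binom_tail(k, n, p=0.5):
    """Probability of k or more correct out of n trials, each correct with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# p-values under pure guessing (p = 0.5)
p_14_16 = binom_tail(14, 16)   # 137/65536, about 0.0021
p_9_9 = binom_tail(9, 9)       # 1/512, about 0.0020

# Hypothetical listener who hears the difference 90% of the time (assumed value)
theta = 0.9
power_16 = binom_tail(14, 16, theta)  # chance of passing the 14/16 criterion
power_9 = binom_tail(9, 9, theta)     # chance of passing the 9/9 criterion

print(f"p-value 14/16: {p_14_16:.4f}, p-value 9/9: {p_9_9:.4f}")
print(f"power 14/16: {power_16:.3f}, power 9/9: {power_9:.3f}")
```

With that assumed hit rate, the two criteria are about equally strict against a guesser, but the listener passes 14/16 roughly 79% of the time and 9/9 only about 39% of the time. The extra trials buy room for mistakes, which is the point.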
I should point out that in tests for similarity (according to Sensory Evaluation Techniques), small values of theta would be 0.625 or less. They assume a large number of different testers rather than an individual tester performing many trials (theta differs between the two types of tests, as Corbett pointed out).
An example, ignoring the theta transformation:
Let's say I want to show that two cables sound the same with a type II error risk of 0.05, and a theta of 0.625. To control N somewhat, I'll allow the type I error risk to rise to 0.2. That still calls for an N of about 100! No wonder Arny doesn't highlight the statistics. He wouldn't be able to carry out his vendetta against "snake oil" with such intensity if his victims knew what it takes to really show that "there is no difference" with confidence.
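A rough way to reproduce that N is a brute-force search over exact binomial probabilities. This is only a sketch: it treats theta directly as the proportion correct (no theta transformation, as above), and the function names are my own. An exact binomial search can land a little off from a published table value, so take the result as "in the neighborhood of 100" rather than a definitive figure.

```python
from math import comb

def binom_tail(k, n, p):
    """Probability of k or more correct out of n trials, each correct with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def smallest_n(theta, alpha, beta, n_max=500):
    """Smallest n (with its critical count c) meeting both error risks.

    Type I risk:  P(X >= c | p = 0.5)   <= alpha
    Type II risk: P(X <  c | p = theta) <= beta
    """
    for n in range(1, n_max + 1):
        # Lowest critical count that keeps the type I risk within alpha
        c = next((k for k in range(n + 1) if binom_tail(k, n, 0.5) <= alpha), None)
        if c is None:
            continue  # no criterion is strict enough at this n
        # Power at theta must be at least 1 - beta
        if binom_tail(c, n, theta) >= 1 - beta:
            return n, c
    return None

result = smallest_n(theta=0.625, alpha=0.2, beta=0.05)
print(result)  # (N, number correct needed to call a difference)
```

Because the binomial is discrete, the required N does not shrink smoothly as the risks are relaxed, so a neighboring N can occasionally fail where a smaller one passes. The search above simply reports the first N that works.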
His point about controlling fatigue is valid, but nobody said that many trials have to be performed in one sitting.
QUOTE
I'm not sure if you looked up the two-tailed test by Leventhal yet, but it shines a little more light on the same issue (in terms of methods), imho.
What was the reference again? I looked up two articles by Burstein, which John Corbett recommended, but not the Leventhal.
QUOTE
PS What are you using as statistics references? I've been recommended the following two as up-to-date books (but have not gotten them yet):
http://tinyurl.com/3dhl5
http://tinyurl.com/2a2yc
Still using Sensory Evaluation Techniques, which has contained all of the information on both test techniques and statistics I've needed to date.