Topic: ABC/HR rating scale accuracy

ABC/HR rating scale accuracy

I am far from a statistics expert and am hoping to learn more about testing multiple codecs. The ABX explanation states that an ABC/HR test is performed by comparing two files A and B to the original C and rating both A and B on a scale from 1 to 5. From my naive reading, the linked Friedman test (I found The Friedman Test for 3 or More Correlated Samples easier to follow) ignores the magnitudes of the ratings and records only the fact that, say, A was rated above B. Does anyone have links or references on choosing between the Friedman test and the linked ANOVA test? My first reaction is that relying on the listener to rate the samples accurately on a given scale has problems similar to the ratings voting method (from a site advocating Ranked Pairs, which appears very similar to the Friedman test). Given the blind nature of the ABC/HR test, it would be immune to the intentional gaming found in elections (unless the listener can identify specific codecs), but it would seem susceptible to the concern: "Consider that a person who tends to see things in extremes, and therefore rates candidates toward the top or bottom of the scale may influence the race between certain candidates many times more powerfully than people who are more measured in their opinions."
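To check that reading, here is a small Python sketch (scipy assumed available; the ratings are invented purely for illustration) showing that the Friedman statistic depends only on each listener's rank ordering, never on the rating magnitudes:

```python
from scipy.stats import friedmanchisquare

# Hypothetical 1-5 ratings from four listeners for three codecs.
# Each list holds one codec's ratings, one entry per listener.
moderate_a = [3.0, 3.5, 3.0, 4.0]
moderate_b = [2.0, 2.5, 2.0, 3.0]
moderate_c = [4.0, 4.5, 4.0, 4.5]

# Same per-listener rank ordering (c > a > b), but extreme magnitudes.
extreme_a = [3.0, 3.0, 3.0, 3.0]
extreme_b = [1.0, 1.0, 1.0, 1.0]
extreme_c = [5.0, 5.0, 5.0, 5.0]

# Both calls print the identical statistic and p-value, because the test
# replaces each listener's ratings with ranks before doing anything else.
print(friedmanchisquare(moderate_a, moderate_b, moderate_c))
print(friedmanchisquare(extreme_a, extreme_b, extreme_c))
```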

Also, assuming the ABC/HR test seeks from each tester a rated (and hence ranked) list of the codecs under test, where can I find more information on the rationale behind the ABC/HR implementation (testing each codec and the original, as A and B, in turn against the original (C) sample)? Specifically, I am wondering why not test all codecs against the original at once, in an "ABCDEFG...[original] vs. [original]" arrangement? Or perhaps I am just misunderstanding and that is in fact what is done? I guess the real question is: is the listener more likely to rate consistently if he can listen to all the samples and switch among them at will for comparison, rather than being asked to rate one at a time?

Thanks

ABC/HR rating scale accuracy

Reply #1
Short answer:

Using anchors is recommended - you touched on the reasons yourself: they give the listener a reference for what "really bad" and "really good" sound like, and they discourage unreasonably extreme ratings for the items under test.

If you ignore the ratings and use only the rankings, you move towards nonparametric statistics, and you lose power. There can be good reasons to do that when the test results are not normally distributed, but for most listening tests this does not appear to be the case.
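To make "lose power" concrete, here is a rough simulation sketch (Python with numpy/scipy assumed; the listener count and effect sizes are invented for illustration). On normally distributed ratings, a one-way repeated-measures ANOVA rejects the null more often than the Friedman test run on the same data:

```python
import numpy as np
from scipy.stats import f, friedmanchisquare

rng = np.random.default_rng(0)

def rm_anova_p(x):
    """p-value of a one-way repeated-measures ANOVA on an
    (n listeners) x (k codecs) rating matrix, computed by hand."""
    n, k = x.shape
    grand = x.mean()
    ss_treat = n * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_subj = k * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_error = ((x - grand) ** 2).sum() - ss_treat - ss_subj
    df_treat, df_error = k - 1, (k - 1) * (n - 1)
    f_stat = (ss_treat / df_treat) / (ss_error / df_error)
    return f.sf(f_stat, df_treat, df_error)

n_listeners, trials, alpha = 10, 2000, 0.05
codec_means = np.array([3.0, 3.3, 3.6])  # invented true quality means
hits = {"rm-anova": 0, "friedman": 0}

for _ in range(trials):
    # per-listener offsets make the columns correlated, as in a real panel
    data = (codec_means
            + rng.normal(0.0, 0.3, n_listeners)[:, None]
            + rng.normal(0.0, 0.5, (n_listeners, 3)))
    if rm_anova_p(data) < alpha:
        hits["rm-anova"] += 1
    if friedmanchisquare(data[:, 0], data[:, 1], data[:, 2]).pvalue < alpha:
        hits["friedman"] += 1

for name, count in hits.items():
    print(name, count / trials)  # rm-anova should come out a bit higher
```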

The test does allow the listener to switch between the samples at will, but only AFTER he has identified which one is the encoded sample. The reason is simple: we want to make sure he is actually hearing a difference before we allow him to rate the sample, so the grading is not done on an imagined difference.