Need help with a Stereophile article, An ABX debate that I want to understand.
sthayashi
post Jul 26 2004, 20:27
Post #1





Group: Members
Posts: 494
Joined: 16-April 03
From: Pittsburgh, PA
Member No.: 5997



Stereophile's article talks about the problems with ABX'ing. Now I'm a big fan of the ABX test and apply a healthy amount of skepticism whenever such a test is NOT performed (as HA has taught me).

But on this page, they talk about Type 2 errors. The author doesn't explain how he derived these numbers, and he claims that Type 2 errors give the probability that an audible difference existed but was not heard.

I don't understand what he's talking about, and I don't trust the fact that he references himself.

WARNING: This article is a pain to read, but I think all Stereophile articles are like that.
ff123
post Jul 26 2004, 21:23
Post #2


ABC/HR developer, ff123.net admin


Group: Developer (Donating)
Posts: 1396
Joined: 24-September 01
Member No.: 12



QUOTE (sthayashi @ Jul 26 2004, 11:27 AM)
Stereophile's article talks about the problems with ABX'ing. Now I'm a big fan of the ABX test and apply a healthy amount of skepticism whenever such a test is NOT performed (as HA has taught me).

But on this page, they talk about Type 2 errors. The author doesn't explain how he derived these numbers, and he claims that Type 2 errors give the probability that an audible difference existed but was not heard.

I don't understand what he's talking about, and I don't trust the fact that he references himself.

WARNING: This article is a pain to read, but I think all Stereophile articles are like that.

You can download a spreadsheet I made which shows how type 2 errors are calculated here:

http://ff123.net/export/TestSensitivityAnalyzer.xls

A type II error is the probability that, given an effect which is known to be audible to a certain percent of the population (or equivalently, audible a certain percent of the time), an ABX test will mistakenly declare that there is no audible difference. The type II error is dependent on the size of the effect, as well as on the type I error risk, and the number of listeners (or number of trials).

In short, it's the risk of declaring no difference when there actually is one.
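For a single listener doing n trials, that risk can be sketched numerically with the binomial distribution. This is a generic illustration, not the spreadsheet's exact formulas; `p_true` is an assumed per-trial probability of answering correctly when a difference really is audible:

```python
from math import comb

def binom_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def abx_type2_risk(n_trials, p_true, alpha=0.05):
    """Type II risk (beta) of an n-trial ABX test at significance level alpha.

    p_true is the assumed chance of a correct answer per trial when a
    difference really is audible (0.5 would be pure guessing).
    """
    # Smallest score that is significant under the null hypothesis p = 0.5.
    k_crit = next(k for k in range(n_trials + 1)
                  if binom_tail(n_trials, 0.5, k) <= alpha)
    # Beta: chance of scoring below k_crit even though p_true > 0.5.
    return 1 - binom_tail(n_trials, p_true, k_crit)
```

For example, with 16 trials at alpha = 0.05 a listener needs 12 correct; someone who truly hears the difference 70% of the time still fails about 55% of the time, which is exactly the "declaring no difference when there is one" risk described above.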

I have created an ABX application which includes type II risk, though I haven't finished polishing it up (eliminating all the minor bugs). It's here:

http://ff123.net/export/abchr1.1beta2.zip

Setting up the various type I and type II risks prior to an ABX test greatly depends on the goals of that test. If one wants to really make sure that he doesn't falsely declare "no difference," he needs to make sure that he chooses a low-enough effect size and a low-enough beta (type II risk). The value chosen for alpha (type I risk) also makes a difference. Typically for very small effects, the required number of listeners or trials is very large. This has to be balanced against listener fatigue.

For codec testing, we don't really care about "no difference" errors. So the number of trials can be reasonable.

ff123
boojum
post Jul 26 2004, 22:01
Post #3





Group: Members (Donating)
Posts: 819
Joined: 8-November 02
From: Astoria, OR
Member No.: 3727



There is a Latin expression to the effect of "Cuit bonit?", I believe. My Latin is rusty. It translates as, "Who benefits?" When reading Stereophile it is good to remember that they support high-end systems and declare that some $10,000 amplifiers are better than $500 ones. There is little or no ABX'ing of these claims. They have no incentive to persuade readers that differences are measurable or quantifiable. A less charitable way of saying this is that they are snake-oil salesmen.

Bob Carver was able to tweak a transistor amp he built well enough that not one person at that magazine could tell it apart from a high-end tube amp (a Conrad-Johnson, I think). So much for their golden ears and objectivity. When they have to put up or shut up they fall flat.

That is just my opinion, FWIW.


--------------------
Nov schmoz kapop.
mithrandir
post Jul 27 2004, 02:16
Post #4





Group: Members
Posts: 669
Joined: 15-January 02
From: SE Pennsylvania
Member No.: 1032



Type II errors can be rather large in the criminal justice system. For instance, "society" generally considers it more upsetting to find an innocent person guilty (a type I error) than to let a guilty person go free (a type II error). That's why you'll hear phrases like "reasonable doubt". We accept type II errors because we want to protect the innocent, at the expense of letting some criminals slip through. I think the infamous O.J. Simpson case applies here.
kennedyb4
post Jul 27 2004, 03:18
Post #5





Group: Members
Posts: 772
Joined: 3-October 01
Member No.: 180



QUOTE (boojum @ Jul 26 2004, 04:01 PM)
So much for their golden ears and objectivity. When they have to put up or shut up they fall flat.

That is just my opinion, FWIW.


Big time. The last thing Conrad-Johnson, Mark Levinson, etc., want is objective analysis amongst reviewers.

They need to sell their product, bottom line.
Cygnus X1
post Jul 27 2004, 03:30
Post #6





Group: Members (Donating)
Posts: 676
Joined: 5-June 02
From: New York
Member No.: 2224



I'd personally be more worried about alpha inflation with sequential ABX tests than about beta (Type II) errors. People seem to forget that when you repeat the exact same test in sequence, the Type I error rate increases: two sequential tests, each with a significance cutoff of p < .05, have a combined false-positive rate of nearly .10, which is no longer significant at the 95% confidence level. I think this warrants more attention than the possibility of committing a Type II error.
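The inflation described here is easy to check: assuming independent repeats, the chance of at least one false positive across n runs is 1 - (1 - alpha)^n. A minimal sketch:

```python
def familywise_alpha(alpha, n_tests):
    """Chance of at least one Type I error across n independent repeats
    of the same test, each run at significance level alpha."""
    return 1 - (1 - alpha) ** n_tests

# Two back-to-back ABX sessions, each judged at p < .05:
print(round(familywise_alpha(0.05, 2), 4))  # 0.0975, i.e. nearly .10
```

One common remedy (not mentioned in the thread, just a standard correction) is Bonferroni-style adjustment: run each of the n repeats at alpha/n so the family-wise rate stays near the intended alpha.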

Just my statistical 2 cents.

This post has been edited by Cygnus X1: Jul 27 2004, 03:34
sthayashi
post Jul 27 2004, 05:56
Post #7





Group: Members
Posts: 494
Joined: 16-April 03
From: Pittsburgh, PA
Member No.: 5997



ff123, that's a nice spreadsheet you have there. But in answering my question, you raised another one for me. What role does the proportion distinguisher (pd) play? How do we determine that value for a real life situation?

Sorry for bringing up all these questions that should have been answered if I had actually stayed awake in my probability distributions class (I got a C, but I maintain that the professor was confusing). The reason I'm asking is two-fold. First, in the interest and pursuit of the actual truth, a skeptic must be prepared to address all reasonable arguments, no matter how biased the source.

Second, I got into a cable quality debate with a friend online and I plan on testing his ears in the future (the date is not yet determined) on the difference between his silver cables and some thick copper cables that I'll pick up on the way to his place. But I want to ensure that I minimize the chance of error while simultaneously keeping the test(ing) reasonable. Beta-errors are something I had never heard of before and had not considered.
ff123
post Jul 27 2004, 06:29
Post #8


ABC/HR developer, ff123.net admin


Group: Developer (Donating)
Posts: 1396
Joined: 24-September 01
Member No.: 12



QUOTE (sthayashi @ Jul 26 2004, 08:56 PM)
ff123, that's a nice spreadsheet you have there.  But in answering my question, you raised another one for me.  What role does the proportion distinguisher (pd) play?  How do we determine that value for a real life situation?


The rule of thumb given by Sensory Evaluation Techniques is:
pd < 25% represents a small effect;
25% < pd < 35% represents a medium-sized effect; and
pd > 35% represents a large effect.

When you plug in small values for pd, you'll see that enormous numbers of listeners (or trials) can be required to achieve a low type II risk, even when the type I risk is allowed to rise.

Here is a real-world example from the food industry:

"A food company wants to test for similarity of the blended table syrup produced with corn syrups from the current and alternate suppliers. The sensory analyst and the project director note that to obtain maximum protection against falsely concluding similarity, for example by setting beta at 0.1% (i.e., beta = 0.001) relative to the alternative hypothesis, that the true proportion of the population able to detect a difference between the samples is at least 20% (i.e. pd = 0.20), then to preserve a modest alpha risk of 0.10 they need to have at least 260 assessors. They decide to compromise at alpha = 0.20, beta = 0.01, and pd = 30% which requires 64 assessors."

QUOTE
Sorry for bringing up all these questions that should have been answered if I had actually stayed awake in my probability distributions class (I got a C, but I maintain that the professor was confusing).  The reason I'm asking is two-fold.  First, in the interest and pursuit of the actual truth, a skeptic must be prepared to address all reasonable arguments, no matter how biased the source.

Second, I got into a cable quality debate with a friend online and I plan on testing his ears in the future (the date is not yet determined) on the difference between his silver cables and some thick copper cables that I'll pick up on the way to his place. But I want to ensure that I minimize the chance of error while simultaneously keeping the test(ing) reasonable. Beta-errors are something I had never heard of before and had not considered.


If your friend is up on statistics, he should insist on a large number of trials to keep the type II error down and to allow for small effects.

ff123

Edit: Note that proportion of distinguishers (i.e., pd) for a test involving multiple listeners is not exactly the same as the percent of the time that an effect can be heard by an individual listener using multiple trials. There is a conversion formula. But for the sake of simplicity, I would just assume that the two things are equivalent.

This post has been edited by ff123: Jul 27 2004, 06:31
CSMR
post Jul 29 2004, 01:28
Post #9





Group: Members
Posts: 758
Joined: 10-May 04
Member No.: 14009



QUOTE (boojum @ Jul 26 2004, 01:01 PM)
There is a Latin expression to the effect of "Cuit bonit?", I believe.  My Latin is rusty. It translates as, "Who benefits?"

Cui bono, to whose good

Very educated discussion there at Stereophile
krabapple
post Aug 2 2004, 23:53
Post #10





Group: Members
Posts: 2157
Joined: 18-December 03
Member No.: 10538



QUOTE (ff123 @ Jul 26 2004, 09:29 PM)
If your friend is up on statistics, he should insist on a large number of trials to keep the type II error down and to allow for small effects.



I agree in principle, but the reality is that 'golden ears' have often already claimed to be able to easily hear a difference between 'A' and 'B'. Just read any review in any high-end mag -- phrases like 'it was like a veil was lifted', 'suddenly it all came into focus', or 'it was no contest' -- all indicators of *big* subjective differences -- are not uncommon.

So in this case, how many trials are necessary not to confidently *dispense with* the original claim, but to simply cast it into doubt...to shake his confidence? I predict that his friend will have no problem identifying his cables by 'sound' 100% of the time when he knows which ones are in the circuit. And that after a small number of trials under blinded conditions, it will be apparent to his friend that 'sightedness' has a biasing role in reports of audible difference.

Many 'objectivists' started out as 'subjectivists' who had a humbling experience. A common one is the 'phantom switch' -- it happened to me. I thought I had changed something (threw a switch); the difference was obvious to my ears; later I realized that I had actually, accidentally, *failed* to make the switch, and had been listening to the exact same thing all along.

For an intellectually honest person, that sort of experience should be enough to dash excessive confidence in sighted listening perceptions. I find it hard to believe that at least some of Stereophile's staff *hasn't* had such experiences.
krabapple
post Aug 2 2004, 23:55
Post #11





Group: Members
Posts: 2157
Joined: 18-December 03
Member No.: 10538



QUOTE (CSMR @ Jul 28 2004, 04:28 PM)
QUOTE (boojum @ Jul 26 2004, 01:01 PM)
There is a Latin expression to the effect of "Cuit bonit?", I believe.  My Latin is rusty. It translates as, "Who benefits?"

Cui bono, to whose good

Very educated discussion there at Stereophile




In modern parlance: follow the money.



This post has been edited by krabapple: Aug 2 2004, 23:55
jaustin
post Aug 3 2004, 19:41
Post #12





Group: Members
Posts: 39
Joined: 26-July 04
Member No.: 15785



QUOTE (krabapple @ Aug 2 2004, 05:53 PM)
For an intellectually honest person, that sort of experience should be enough to dash excessive confidence in sighted listening perceptions. I find it hard to believe that at least some of Stereophile's staff *hasn't* had such experiences.

Not sure if I qualify... I'm an occasional freelance contributor to Stereophile, but I don't do component reviews, and I've never written that "it was like a veil was lifted" or anything like that. Never had the opportunity to.

Yes, I've had experiences like the ones you describe, and I'm willing to bet that most reviewers have, too. And, yes, such an experience should dash excessive confidence. But having one's excessive confidence dashed doesn't necessarily require jumping the subjectivist ship and swimming to the objectivist shore. The psychology of subjective experiences like listening is complex, but so is the science of audio. Quite possibly, there's a lot more going on there than we're able to explain with the simplest scientific models. Keeping an open mind can make you vulnerable to all sorts of psychological bullshit. But relying entirely on objective methods can close your mind to a very wide range of very real human experiences. This objectivist/subjectivist phenomenon is, I believe, more interesting than you make it out to be.

Yeah, I think the folks at Stereophile--the reviewers--ought to do a little ABXing every now and then, just to keep themselves honest. But, though ABXing may be necessary to PROVE what you hear, it isn't necessary to KNOW what you hear, at least not for a well-trained, adequately self-critical reviewer. Do the ones at Stereophile meet that criterion? I dunno. Some yes, some no, I suspect.

Jim
Audible!
post Aug 4 2004, 03:48
Post #13





Group: Members
Posts: 523
Joined: 28-June 03
From: CA, USA
Member No.: 7426



QUOTE (jaustin)
Yeah, I think the folks at Stereophile--the reviewers--ought to do a little ABXing every now and then, just to keep themselves honest. But, though ABXing may be necessary to PROVE what you hear, it isn't necessary to KNOW what you hear, at least not for a well-trained, adequately self-critical reviewer.


In my opinion you have understated the case.
Differences that reviewers of esoteric equipment claim to be stunningly obvious (like the "veil" comment, and a large number of others I've read at Stereophile, especially regarding speaker wire and high-end amplifiers) need to be verified objectively, if at all possible, if a reviewer is to be taken seriously.

If a sound quality difference is claimed to be very substantial or dramatic, it should be rather trivial to verify objectively via an A/B/X test. If such differences cannot be successfully ABX'ed, those perceived differences should not be claimed to be dramatic.

That is, if ostensibly dramatic differences cannot be detected in a statistically significant fashion by the reviewer sight unseen, then that reviewer has no business claiming that the differences they can "hear" only when they know which is which are somehow dramatic.

Any individual who claims such nonetheless is simply not "adequately self-critical". Being self-critical does not prevent one from thinking one hears something one does not. Being adequately self-critical in my view would be to test subjective impressions in any and every feasible objective manner, every single time.

And then report both the subjective impressions and the ABX results.

I cannot help but suspect that the manufacturers who make $1000 a meter speaker cable would be less inclined to send a publication sample product if such methods were rigorously implemented.

This post has been edited by Audible!: Aug 4 2004, 03:49
jaustin
post Aug 4 2004, 04:04
Post #14





Group: Members
Posts: 39
Joined: 26-July 04
Member No.: 15785



QUOTE (Audible! @ Aug 3 2004, 09:48 PM)
Differences that are claimed to be stunningly obvious (like the "veil" comment, and a large number of others I've read at Stereophile, especially regarding speaker wire and high-end amplifiers) by reviewers of esoteric equipment need to be verified objectively if at all possible if a reviewer is to be taken seriously.

I think that's a completely reasonable opinion, but obviously it's one the editors and reviewers of Stereophile--and their readers--don't share. Clearly they won't convince you with their methods--haven't so far, anyway--and I'm sure they're okay with that.

QUOTE
If a sound quality difference is claimed to be very substantial or dramatic, it should be rather trivial to objectively verify via an A/B/X test.

Logistically, ABX tests on real-world components are far from trivial. Statistically valid tests are extremely time-consuming and a royal pain in the ass. When you're switching out wires you don't have the luxury of hearing two "samples", repeatedly, within seconds--or fractions of seconds--of each other. Apart from the time delay, with most components there's the issue of level matching, which typically requires an even longer delay between "samples." So it's not just a matter of logistical difficulty--the time delay between "samples" presents serious methodological difficulties. And you have to have a friend. Even a relatively high-profile mag like Stereophile or TAS probably doesn't have the resources to support such tests. And even if they did, they probably wouldn't be able to get any reviewers--with audio knowledge and writing skill--to work for them.

ABX with real-world components is logically trivial, but it's a logistical nightmare. Most reviewers choose other methods because, given these real-world constraints, the other methods are (they believe) more effective. And, as proving the validity of their tests to skeptical readers isn't their goal, there's little to be said (in that context) for ABX tests.

QUOTE
Being adequately self-critical in my view would be to test subjective impressions in any and every feasible objective manner, every single time. And then report both the subjective impressions and the ABX results.

Logically it's a fine idea, but practically it's prohibitive.

Cheers,
Jim
Audible!
post Aug 4 2004, 04:49
Post #15





Group: Members
Posts: 523
Joined: 28-June 03
From: CA, USA
Member No.: 7426



QUOTE (jaustin)
I think that's a completely reasonable opinion, but obviously it's one the editors and reviewers of Stereophile--and their readers--don't share. Clearly they won't convince you with their methods--haven't so far, anyway--and I'm sure they're okay with that.


I think you have a fine career ahead of you as a politician, should you choose it.

I also think Stereophile's editors know where their advertising revenue comes from. Additionally I think they know their readership would drop precipitously if their objective measurements repeatedly did not support their subjective claims.

QUOTE
ABX with real-world components is logically trivial, but it's a logistical nightmare. Most reviewers choose other methods because, given these real-world constraints, the other methods are (they believe) more effective.


You have misinterpreted my meaning here.
I was not referring to the difficulty of setting up a physical ABX system for the components, but to the difficulty involved in actually detecting the differences which were previously, subjectively, "obvious" or "dramatic" or "like night and day". If the differences are obvious, dramatic, or like night and day, then they should be readily ABXable. My experience shows that subjective impressions often disappear when subjected to ABXing, though this is probably not a result of the objective methodology.

When reviewers state the audible differences between products are obvious and dramatic, but then cannot detect those differences in ABX testing, the differences are simply not obvious and dramatic, and anyone claiming they are is selling smoke or smoking what may be illegal to sell.

Regarding logistical difficulty I have no doubt there are problems involved.

However, once assembled, the project ABX box does not appear to be a "logistical nightmare" to me, and I believe it features level matching capabilities:

QUOTE (Part 2: The Manual Remote and its Operating Procedure)
c)
On the controller box, mute the output using the switch provided. Connect a voltmeter to the calibration terminals. Insert the test tone CD, play a 0dB calibration tone and note the voltage. Use SW1 on the remote to change to the other channel and adjust the volume so that the same voltage is displayed. If stepped volume controls are in use, you might need to trim using VR1 on the controller. (Set VR1 to the maximum before calibrating and use it to attenuate the voltage.) Once the voltage is adjusted, remove the test-tone CD and replace the music CD and turn off the muting switch. Do not touch the volume controls again for the duration of the test.
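A small numeric aside on that calibration step: matching "the same voltage" is how level is controlled, since a level difference in dB corresponds to a voltage ratio of 10^(dB/20). A commonly cited target for blind tests is matching levels within about 0.1 dB; the sketch below (generic math, not part of the project's manual) shows what that implies for the voltmeter reading:

```python
import math

def db_to_voltage_ratio(db):
    """Voltage ratio corresponding to a level difference in dB."""
    return 10 ** (db / 20)

def voltage_ratio_to_db(ratio):
    """Level difference in dB for a given voltage ratio."""
    return 20 * math.log10(ratio)

# A 0.1 dB match means the two voltages must agree to about 1%:
print(round(db_to_voltage_ratio(0.1), 4))  # 1.0116
```

So if the reference channel reads, say, 1.000 V, the other channel must be trimmed to within roughly 12 mV of it to stay inside a 0.1 dB window.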


I find the editors notes on the project ABX box to be particularly relevant:
QUOTE
All in all, this is an ambitious project, but one that every hi-fi reviewer should make (or have made) - I expect that if this were done, a great many of the glowing reviews we currently see would diminish.  They may even vanish altogether.

Needless to say, the tester can be also used to verify that the expensive capacitors you bought really don't make any difference, or that all well constructed interconnects sound the same.  This is all very confronting, but it is necessary if we are to get hi-fi back on track, and eliminate the snake oil.


Snake oil is a good term to use here given the historical context.
Does actively shielded speaker cable also solve Dropsy, the Gout and "feminine" and "male problems"?

Probably!

This post has been edited by Audible!: Aug 4 2004, 05:39
Pio2001
post Aug 4 2004, 11:44
Post #16


Moderator


Group: Super Moderator
Posts: 3936
Joined: 29-September 01
Member No.: 73



If the goal is to test interconnect cable worth $1000 per meter, that rules out the use of the ABX box, unless all of its components are themselves of much higher quality than the $1000/m cable, so that the box is transparent compared to the tested cable (or plugs, or other devices).

It would be the same thing as uploading all our test samples encoded in Vorbis 128 kbps, since it is transparent most of the time. However, we require lossless originals.
In the same way, the most basic rigor in cable testing would be to connect the cables from the source to the amplifier without any switch in the signal path.

The test should also be double blind, which increases the difficulty to an insane level. Basically, double blind means that there should be no operator in the room who knows which source is being played! That leaves two solutions: get the listeners out of the room while the sources are switched and get the operators out while the listeners are in, or prepare enough identical complete listening systems (source + amplifier + speakers) in one or more rooms and leave the listeners with them to guess which is which.
2Bdecided
post Aug 4 2004, 13:17
Post #17


ReplayGain developer


Group: Developer
Posts: 4945
Joined: 5-November 01
From: Yorkshire, UK
Member No.: 409



The discussion of type 1 and type 2 errors is interesting.

However, I wonder if it's strictly fair to apply them to ABX testing in the way described in Stereophile.

Imagine that the difference between A and B is so subtle that it will only be picked out, on average, 60% of the time. That's only 10% better than chance. This gives a substantial probability of missing it in a typical Hydrogen Audio ABX test, and apparently justifies why people miss subtle details in ABX tests that they would discover in non-blind tests.

However, if something is so subtle that, on average, you'd only pick it out 10% more than chance, anyone who has done a software ABX test knows that they actually do several mini ABX tests before deciding. You click A, you click B, you click X. You're not sure, so you click A and B again. You decide you've chosen a lousy part of the file for comparison so you move the slider. You try again. Then, when you finally feel sure, you click "X is A" or "X is B". That's 1 decision. Then you make the next 15 in a similar manner.

So your "60% chance" sample (p=.6) actually gives much better than 60% correct values in ABX. So it's not actually p=.6! The key question is: for a sample where you can only give correct ABX results 60% of the time (i.e. where p really does equal .6), what is the chance of hearing the difference during normal listening? That's the million dollar question. Objectivists would say it's very small, subjectivists would say it's larger!
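The premise can be made concrete. Assuming a per-trial success probability of exactly 0.6 and the usual 12-of-16 criterion for p < .05, the binomial arithmetic gives:

```python
from math import comb

# A listener who genuinely hears the difference 60% of the time,
# taking a 16-trial ABX test that requires 12 correct for p < .05.
p, n, k_needed = 0.6, 16, 12
pass_prob = sum(comb(n, k) * p**k * (1 - p)**(n - k)
                for k in range(k_needed, n + 1))
print(f"chance of passing: {pass_prob:.1%}")   # 16.7%
print(f"chance of missing: {1 - pass_prob:.1%}")  # 83.3%
```

So a true "p = .6" listener fails the standard 16-trial test about five times out of six, which is why the effective per-decision p in software ABX, with free re-listening, matters so much.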

Interesting read...

Cheers,
David.
krabapple
post Aug 4 2004, 19:55
Post #18





Group: Members
Posts: 2157
Joined: 18-December 03
Member No.: 10538



QUOTE (Pio2001 @ Aug 4 2004, 02:44 AM)
If the goal is to test interconnect cable worth $1000 per meter, that rules out the use of the ABX box, unless all of its components are themselves of much higher quality than the $1000/m cable, so that the box is transparent compared to the tested cable (or plugs, or other devices).


No, it only requires that the ABX box itself be shown to be transparent when added to the signal chain.
Furthermore, high-end audio has shown over and over that a price like $1000/meter does NOT imply 'high quality'; there is no price/quality correlation at that level.

QUOTE
It would be the same thing as uploading all our test samples encoded in Vorbis 128 kbps, since it is transparent most of the time. However, we require lossless originals.


And this is because Vorbis 128 kbps *has* been experimentally shown not to be transparent in some cases. Moreover, from 'first principles' it's reasonable to assume that a lossy codec will introduce an audible artifact. Should we *assume* the same about a well-built ABX box? Audiophiles say yes, because they tend to believe that *everything* makes an audible difference. But is there reasoning from scientific principle, or experimental evidence, to support that assumption?

I'd say no, and that the codec analogy is flawed for that reason. The assumption of an audible artifact is reasonable for Vorbis at 128 kbps, but not for a good solid-state switcher.

QUOTE
In the same way, the most basic rigor in cable testing would be to connect the cables from the source to the amplifier without any switch in the signal path.


It can be done that way too, but the scientific evidence is that, if anything, you *lose* sensitivity to audible differences doing things that way, by introducing latency between A and B. Results from such a trial could be attacked on those grounds (though it would be hypocritical for a 'subjectivist' to do so, since quick-switching is rare in high-end reviews).


QUOTE
The test should also be double blind, which increases the difficulty to an insane level.


Insane? No. Having the operator leave the room while the source is played is not much of a problem.
Audiophiles go to *far* more insane lengths to achieve the 'absolute sound'! Is doing a proper DBT any harder than setting up a high-end turntable properly?

Btw, some audio magazines -- Sound & Vision comes to mind -- *do* run the occasional DBT, and do feature writers who swear by them (e.g. Tom Nousaine, David Ranada).
krabapple
post Aug 4 2004, 19:59
Post #19





Group: Members
Posts: 2157
Joined: 18-December 03
Member No.: 10538



QUOTE (2Bdecided @ Aug 4 2004, 04:17 AM)
The discussion of type 1 and type 2 errors is interesting.

However, I wonder if it's strictly fair to apply them to ABX testing in the way described in stereophile.

Imagine that the difference between A and B is so subtle that it will only be picked out, on average, 60% of the time. That's only 10% better than chance. This gives a substantial probability of missing it in a typical Hydrogen Audio ABX test, and apparently justifies why people miss subtle details in ABX tests that they would discover in non-blind tests.

However, if something is so subtle that, on average, you'd only pick it out 10% more than chance, anyone who has done a software ABX test knows that they actually do several mini ABX tests before deciding. You click A, you click B, you click X. You're not sure, so you click A and B again. You decide you've chosen a lousy part of the file for comparison so you move the slider. You try again. Then, when you finally feel sure, you click "X is A" or "X is B". That's 1 decision. Then you make the next 15 in a similar manner.

So your "60% chance" sample (p=.6) actually gives much better than 60% correct values in ABX. So it's not actually p=.6! The key question is: for a sample where you can only give correct ABX results 60% of the time (i.e. where p really does equal .6), what is the chance of hearing the difference during normal listening? That's the million dollar question. Objectivists would say it's very small, subjectivists would say it's larger!

Interesting read...

Cheers,
David.


If the chance of hearing the difference 'for real' is 60%, then that chance is the same whether the listening is done blind or sighted. In proposing your thought experiment, you beg the question of how one knows that the chance of hearing a real difference is 60% in the first place.
ff123
post Aug 4 2004, 20:17
Post #20


ABC/HR developer, ff123.net admin


Group: Developer (Donating)
Posts: 1396
Joined: 24-September 01
Member No.: 12



QUOTE (krabapple @ Aug 4 2004, 10:59 AM)
If the chance of hearing the difference 'for real' is 60%, then that chance is the same whether the listening is done blind or sighted. In proposing your thought experiment, you beg the question of how one knows that the chance of hearing a real difference is 60% in the first place.


I think David was commenting that the size of the effect (or the chance of an individual hearing a difference) varies depending on how the test is conducted. If you just listen to A, and then to B, and then to X, in that exact order and to a set length of music, that's quite a different situation from being allowed to switch between A, B, or X at will, starting from and ending at arbitrary points inside a particular selection of music. The second situation is benefiting from a mini-training effect, and also from choosing the most difficult section of the music. And that doesn't happen in normal listening.

That line of reasoning implies that ABX'ing the way it's done around here is exaggerating the small defects out of proportion to their real-world relevance. But it can only help those who are trying to get at really small differences.

ff123
krabapple
post Aug 4 2004, 20:41
Post #21





Group: Members
Posts: 2157
Joined: 18-December 03
Member No.: 10538



QUOTE (ff123 @ Aug 4 2004, 11:17 AM)
That line of reasoning ;) implies that ABX'ing the way it's done around here exaggerates small defects out of proportion to their real-world relevance. But it can only help those who are trying to detect really small differences.

ff123
*


Oh, absolutely. But the Stereophile/TAS perspective is the opposite of that: it's that bias-controlled comparison *masks* differences that have real-world relevance (i.e., it reports no difference when there is a real one). Ignoring decades of data on the fallibility of perception, and on the need for independent verification of sense impressions, they've *equated* 'real-world relevance' with perception, so the conflict seems inevitable.
Pio2001
post Aug 4 2004, 21:59
Post #22


Moderator


Group: Super Moderator
Posts: 3936
Joined: 29-September 01
Member No.: 73



QUOTE (krabapple @ Aug 4 2004, 07:55 PM)
No,  it only requires that the ABX box itself be shown to be transparent when added to the signal chain.
*


In other words, it requires the test result to be negative!
In testing an expensive cable, we are testing the hypothesis that ordinary cables and interconnects, such as those in an ABX box, have an effect on the sound.
It is completely flawed to assume that they don't in order to prepare the test!
Pio2001
post Aug 4 2004, 23:36
Post #23


Moderator


Group: Super Moderator
Posts: 3936
Joined: 29-September 01
Member No.: 73



I finally had the time to read the whole article.
It points out the importance of adding some recommendations to our guidelines for blind tests.
  • I advise never running sequential ABX tests, because the result is very difficult to calculate (I don't know if many people agree, though; in the last discussion about it, I stood alone in favor of fixing the number of trials, and Guruboolez stood alone in favor of the sequential method).
  • We require tests to be run only once (or all results to be taken into account).
  • In addition, in order to minimize type II errors, we should advise people to give only answers of which they are absolutely certain, however long that takes.
This way, the "p" or "pd" parameter is equal to 1, and the risk of not finding a difference when there is one is zero.
Note that, as pointed out by Leventhal in his answer to Clarke on page 8, this is not the risk of concluding that no differences exist, but the risk of not concluding that a difference exists.
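To make the pd = 1 point concrete, here is a small sketch (my own code, using the same binomial model as Leventhal's tables; the 12-of-16 criterion is only an example):

```python
from math import comb

def type2_risk(n_trials, n_required, pd):
    """Chance of scoring below the passing criterion, given that the listener
    truly hears the difference on a fraction pd of trials (and guesses 50/50
    on the rest)."""
    p = pd + (1 - pd) / 2          # per-trial probability of a correct answer
    return sum(comb(n_trials, k) * p**k * (1 - p)**(n_trials - k)
               for k in range(n_required))

print(type2_risk(16, 12, 0.5))     # difference heard only half the time
print(type2_risk(16, 12, 1.0))     # every answer certain: risk is exactly 0
```

When pd = 1 every trial is answered correctly, so the probability of falling short of any criterion is zero, which is the point of the recommendation above.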
krabapple
post Aug 5 2004, 00:56
Post #24





Group: Members
Posts: 2157
Joined: 18-December 03
Member No.: 10538



QUOTE (Pio2001 @ Aug 4 2004, 12:59 PM)
QUOTE (krabapple @ Aug 4 2004, 07:55 PM)
No,  it only requires that the ABX box itself be shown to be transparent when added to the signal chain.
*


In other words, it requires the test result to be negative !


No -- *if* you assume that an audible effect from the ABX box is a reasonable concern, it requires that, *before* you compare the sound of cable A vs cable B when connected to system C, you compare the sound of the ABX box inserted between A and C, versus the sound of A connected directly to C.
(For thoroughness' sake I suppose you could also repeat the test with B and C).

Objective measurements of the signal coming into and out of the ABX box could, of course, also provide evidence for transparency (or not).

QUOTE
Testing an expensive cable, we are testing the hypothesis that normal cables and interconnects, like an ABX box, have an effect on the sound.


...or that the *expensive* cable does! The fact is that a very few expensive cables are patently not transparent -- they are designed to roll off high frequencies, though this isn't revealed in the advertising.

QUOTE
It is completely flawed to assume that they don't in order to prepare the test !
*


At what point do one's assumptions about test conditions stop being flawed? Is it flawed to discount the possible effect of the day of the week, the cycle of the moon, etc., on the test results? The problem with the 'audiophile' stance is that it's *infinitely* skeptical when it comes to science, but quite the opposite when it comes to 'personal experience'.

If one wants to reinvent the wheel for every journey, I suppose one is free to do so. That doesn't make the assumption that it's necessary to do so reasonable, however. I have indicated that one could test the transparency of the ABX box itself, both via listening and via measurements, before moving on to test the difference between cables using the ABX box. (The original designers of ABX boxes ran such tests, btw.)

It's interesting to note that for all the hundreds of claims of audible difference between cables, none has been backed by published, controlled listening test evidence; no company has ever offered such results in support of claims for its product. One might think that if their cables were as audibly superior as they claim, they would. Meanwhile, there is a large body of scientific literature (and a small body of controlled cable comparisons) supporting the assumption that audible differences between competently designed cables are more likely figments of the mind than real.

For your consideration, here are the thoughts of John Dunlavy, noted speaker (and cable) designer, on this topic:

http://www.verber.com/mark/cables.html

This post has been edited by krabapple: Aug 5 2004, 00:57
ff123
post Aug 5 2004, 01:16
Post #25


ABC/HR developer, ff123.net admin


Group: Developer (Donating)
Posts: 1396
Joined: 24-September 01
Member No.: 12



QUOTE (Pio2001 @ Aug 4 2004, 02:36 PM)
I finally had the time to read all the article.
It points out the importance of adding a recommendation to our guidelines for blind tests.
  • I advise never running sequential ABX tests, because the result is very difficult to calculate (I don't know if many people agree, though; in the last discussion about it, I stood alone in favor of fixing the number of trials, and Guruboolez stood alone in favor of the sequential method).


I agree with you (that's why I'm rewriting ABC/hr), but practically speaking I'm not sure it makes all that much difference. The error only gets big for a large number of trials, where we can already infer that the difference, even if it is falsely detected, must be pretty damn small.
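To put a number on the sequential problem, here is a toy simulation (my own sketch, not from the article): a pure guesser who checks the running score after every trial, and stops the moment it looks "significant" for its length, ends up "passing" far more often than the nominal 5%:

```python
import random
from math import comb

def p_value(correct, trials):
    """One-sided binomial p-value for `correct` successes under pure guessing."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

def critical_count(trials, alpha):
    """Smallest score out of `trials` with p <= alpha (unreachable if none)."""
    for c in range(trials + 1):
        if p_value(c, trials) <= alpha:
            return c
    return trials + 1

def sequential_false_positive_rate(max_trials=40, alpha=0.05,
                                   runs=20000, seed=1):
    crit = [critical_count(t, alpha) for t in range(max_trials + 1)]
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        correct = 0
        for t in range(1, max_trials + 1):
            correct += rng.random() < 0.5     # guessing listener
            if correct >= crit[t]:            # peek after every trial
                hits += 1                     # stops with a "significant" score
                break
    return hits / runs

print(sequential_false_positive_rate())  # well above the nominal 0.05
```

This is why scoring a sequential test as if it were a fixed-length one is misleading: the per-look 5% risks accumulate across the peeks.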

QUOTE
  • We require tests to be run only once (or all results to be taken into account).


Yes.

QUOTE
  • In addition, in order to minimize type II errors, we should advise people to give only answers of which they are absolutely certain, however long that takes.
This way, the "p" or "pd" parameter is equal to 1, and the risk of not finding a difference when there is one is zero.
Note that, as pointed out by Leventhal in his answer to Clarke on page 8, this is not the risk of concluding that no differences exist, but the risk of not concluding that a difference exists.
*


I think for codec testing, we're not really concerned with type II errors, but that's just my opinion. The presets I've put in the new ABC/hr reflect this assumption, however.

ff123
