
Should we care about ABX test results?

Hello,

I have just posted an article titled "Should we care about ABX test results?" to my blog. In the article, I describe the limitations of ABX tests as a tool for evaluating lossy compression formats. In a nutshell, I make two arguments:

  • That the lack of audible differences in an ABX test does not imply that two samples are equally good (ABX tests focus on conscious experience and so fail to take into account subconscious effects).
  • Conversely, that the presence of slight audible differences does not imply that the compressed sample is inferior (the tacit assumption in ABX tests seems to be that the original sample is always best).


I invite you to read the whole article and offer your comments.

Best wishes,

Tomasz

Should we care about ABX test results?

Reply #1
So you are saying that one possible desirable characteristic of a lossy compression is to make it sound better than the source?

Should we care about ABX test results?

Reply #2
Subliminal effects would be a convincing argument if anybody could show that they actually make a measurable difference in listener response. Your burning-house example can be measured in terms of subject response, but the subliminal tests of e.g. Oohashi (if they were ever real to begin with; they were never reproduced, iirc) could not prove any sort of difference in terms of listener response.

More generally, if such a difference ever manifested itself in a long-term response, a blind test could in principle still be conducted around it. Such long-term tests have occasionally been conducted, at audiophiles' demand, and to the best of my knowledge, the results are universally no better than those of short-term tests. In fact, there are extremely good psychoacoustic reasons to believe that rapid switching of stimuli may be more sensitive than long-term listening. (Other people here can probably delve into the details of that stuff better than I can, though.) So your car example is not really applicable either.

Nobody disputes that lossy distortions could euphonically affect the sound (in fact I've argued that exact point to incredulous audiophiles who swear that all MP3 distortion is bad/harsh/etc). So the weaker point to make - that such distortions may not be euphonic, but they could be perceived as no better or no worse than the original sound, merely "different" - is also pretty trivially conceded.

This gets into the larger questions about the goals of transparency. The "euphonicity" of a distortion is subjective, and what one person may like, another person may dislike. But whether or not the distortion is audible is comparatively quite objective, relying on blind tests and statistics. Just speaking on the basis of the sociology of the audio community here: more people will agree on the need for transparency than a need for the result to "sound good". The latter criterion represented the state of MP3 encoding prior to HA, and that provably led to worse sound (less transparent, more distortion).

If you have obvious distortions (e.g. low-bitrate codecs) then that isn't really an issue, and it really does turn into the situation you describe, where different classes of distortion are traded off in different codecs. But you must still require some sort of proof that the listener can actually hear the differences being ascribed, mandating the use of a blind or hidden-reference test.

At higher bitrates you can think in terms of how often an audible flaw or distortion might get hit with a codec, how bad it is, etc. Along those lines you really can show that MP3 will most likely fail on a specific class of problem samples - lending high probability to the idea that a codec like AAC or Vorbis, which lacks those flaws, is superior - while simultaneously maintaining transparency for something like 99.9999999% of the music out there, and possibly 100% for listeners who cannot ABX those problem samples - thus the claim that MP3 is transparent can still be considered accurate.
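
(To put that figure in perspective, here is a back-of-the-envelope sketch that takes the 99.9999999% number at face value; the library size is a made-up illustration:)

Code:
# Back-of-the-envelope: if a codec were audibly flawed on only
# (1 - 0.999999999) of tracks, how likely is a listener to ever hit one?
p_flaw = 1.0 - 0.999999999      # per-track probability of an audible flaw
library = 20_000                # hypothetical personal music library size

p_at_least_one = 1.0 - (1.0 - p_flaw) ** library
print(f"P(at least one audible track) = {p_at_least_one:.1e}")   # ~2.0e-05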

Such games are more or less theoretical most of the time. The bigger point to make is that euphonicity is how high bitrate codecs used to be tuned in the internet codec scene prior to HA, and such tunings are provably inferior.

Should we care about ABX test results?

Reply #3
I guess there will be a flood of comments in no time, but after reading the whole article, I believe you either have too much time or too many worries.


Your first concern: I guess you put the example of car corrosion hand in hand with listening fatigue. It has not really been demonstrated that lossy encodings produce fatigue, and I bet a lossless metal song will cause more listening fatigue than a lossy file of classical music.

It is obvious that a lossy file is not "as good as" the original for every practical purpose. That's by design; lossless codecs attain that goal. Lossy encoding explicitly targets the limits of human hearing perception.
A positive ABX result, if the test is conducted properly, is undeniable proof of a difference between two inputs (in this case, audio samples). It never says whether something "is good enough".
Multiple negative ABX results by multiple subjects on multiple inputs are an *indication* that a difference, if it exists, is small enough to be negligible. And yes, this is all that matters.
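
(As an aside, the statistics behind "conducted properly" are simple: under the null hypothesis that the listener is purely guessing, the number of correct ABX trials follows a binomial distribution with p = 0.5. A minimal Python sketch using SciPy's exact binomial test; the 13-of-16 score is purely illustrative:)

Code:
# Exact binomial test for an ABX session: under the null hypothesis
# (pure guessing), each trial is a fair coin flip.
from scipy.stats import binomtest

correct = 13   # illustrative score: 13 correct answers...
trials = 16    # ...out of 16 trials

# One-sided test: is the score significantly better than chance?
result = binomtest(correct, trials, p=0.5, alternative="greater")
print(f"p-value = {result.pvalue:.4f}")   # ~0.0106, below the usual 0.05 threshold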


For your second concern: the purpose of lossy encoding is to have as few (humanly) audible differences as possible between the encoding and the original.
Denying that (or asking whether) the original is the preferred sound is simply out of place. Lossy encoding does not pretend to improve the sound. There are DSP processors for that (apply them before or after, at your discretion).
So if there is a difference, the lossy sample is necessarily inferior to the original, since the goal of transparency has not been reached.


Lastly, don't forget ABC/HR. It goes hand in hand with ABX, and all the listening tests are based on that tool, not on ABX. Whenever you want to say something more than "I hear a difference", you need this other tool.

I am saying this because you seem to be focusing on what ABX lacks.

[edit:small typo]

Should we care about ABX test results?

Reply #4
You play with the word "good" without defining it clearly in your posts, and apparently your subconscious uses whichever meaning suits the argument (your car example especially fails on this point).

On the other hand, if we take subliminal information into account, we should also start doubting 16-bit/44.1 kHz WAV and every other kind of sampling (no matter how high the quality). Maybe vinyl records subconscious sensations better than digital media does.

Should we care about ABX test results?

Reply #5
I make two arguments:

  • That the lack of audible differences in an ABX test does not imply that two samples are equally good (ABX tests focus on conscious experience and so fail to take into account subconscious effects).
  • Conversely, that the presence of slight audible differences does not imply that the compressed sample is inferior (the tacit assumption in ABX tests seems to be that the original sample is always best).


To begin with, the argumentation is fundamentally flawed, as ABX is not a tool for judging a level of quality - rather, it establishes the distinguishability of two sources (without taking into account any difference in perceived, subjective "quality").
What you need to look into is AB testing. I think there's a tool for you named ABC/HR which is useful for your task at hand.
For a start, I bid you welcome to the following links:

http://wiki.hydrogenaudio.org/index.php?title=ABX
http://en.wikipedia.org/wiki/Transparency_(data_compression)
http://en.wikipedia.org/wiki/Double-blind#...le-blind_trials

I believe ff123 has made this statement very clear in the past and I'll try to locate this (or these) post(s).


Should we care about ABX test results?

Reply #6
What you need to look into is AB testing. I think there's a tool for you named ABC/HR which is useful for your task at hand.

Or MUSHRA (MUlti-Stimulus test with Hidden Reference and Anchor). I don't know why this is hardly ever mentioned on this forum. It arguably has become the standard tool for non-transparent audio codec tests in the scientific world. Similar to ABC/HR, with two major differences:

- at least one (low) anchor must be the original signal with a 3.5 kHz low-pass filter applied (a sketch of generating such an anchor appears below);
- the grading scale is not according to ITU BS.1116, but defines the ranges "excellent", "good", "fair", "poor", and "bad" on a scale from 0 to 100.
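
(For the curious, generating such a low anchor is straightforward; a minimal Python sketch assuming SciPy and a WAV-file reference - the 8th-order Butterworth design and the file names are my own illustrative choices, not part of the MUSHRA specification:)

Code:
# Build a MUSHRA-style low anchor: the reference signal passed
# through a 3.5 kHz low-pass filter.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfilt

fs, x = wavfile.read("reference.wav")                    # hypothetical input file
sos = butter(8, 3500, btype="low", fs=fs, output="sos")  # 8th-order Butterworth low-pass
anchor = sosfilt(sos, x.astype(np.float64), axis=0)      # filter along time axis, per channel
wavfile.write("anchor_lp3500.wav", fs, anchor.astype(x.dtype))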

Tomasz, regarding point 1 in your blog: that's why we (at least we at Fraunhofer) try to put as many critical items into our tests as we can find. The more critical items you have, the more likely it is to find a difference between two codecs, or between a codec and the original. So your point "That the lack of audible differences in an ABX test does not imply that two samples are equally good" is correct: it depends on the test items. Except maybe when the bitrate of a codec is so high that it is incredibly unlikely that any item will not be transparent, e.g. stereo AAC at 576 kbps.

Haven't read the other points yet

Chris
If I don't reply to your reply, it means I agree with you.

Should we care about ABX test results?

Reply #7
Questioning "Should we care about ABX test results?" directly leads to "Should we care about double-blind test results?", as ABX tests are double-blind. Questioning the double-blind tests leads to "Should we care about carefully-controlled testing in experimentation?", as double-blind test results are carefully-controlled tests in experimentation. And once we stop caring about carefully controlling our testing during experimentation, we've effectively stopped caring about modern science.

The answer to the question is, therefore, obviously "yes". Answering "no" makes us call into question double-blind test results, which calls into question controlled testing, which calls into question all of modern science.

Now if you want to be a rebel and question all of modern science, go right ahead. I don't have the energy.

Should we care about ABX test results?

Reply #8
...try to put as many critical items into our tests as we can find.


You know your Karl Popper!

Do I understand correctly that MUSHRA is targeted at the more profound differences and ABC/HR at the very small ones?
TheWellTemperedComputer.com

Should we care about ABX test results?

Reply #9
Thank you for your responses. I will comment on the most crucial passages below:

Quote
Axon: Such long-term tests have occasionally been conducted, at audiophiles' demand, and to the best of my knowledge, the results are universally no better than those of short-term tests.


Thank you for your post. If you could point me to some references, I'd be very interested.

Quote
JAZ: Multiple negative ABX results by multiple subjects on multiple inputs are an *indication* that a difference, if it exists, is small enough to be negligible. And yes, this is all that matters.


ABX tests only test what they are designed to test, i.e. the perceptibility of a difference. But there could also be imperceptible differences that could affect you. Your statement is like saying that if 10 people taste no difference between a Coke and a Coke with radioactive polonium dust, then there is a negligible difference between the two.

Quote
JAZ: The purpose of lossy encoding is to have as few (humanly) audible differences as possible between the encoding and the original. Denying that (or asking whether) the original is the preferred sound is simply out of place. Lossy encoding does not pretend to improve the sound.


Okay, so the people who make codecs have chosen transparency as their goal. Is this a good goal? Is it relevant to anyone's happiness? Surely I'm allowed to ask such questions...

Quote
Alexxander: You play with the word "good" without defining it clearly in your posts


Very well. Let's say "good" = "makes people happy".


Some of you have mentioned ABC/HR tests. AFAIK, the way such tests work is that the subject is supposed to judge the "degree of impairment" of compressed samples, so there is an implicit assumption that compressed samples are always worse than or equal to the original samples. I would like to see a test where subjects are simply asked "which would you rather listen to - A or B?".

Best,

Tomasz

Should we care about ABX test results?

Reply #10
What you need to look into is AB testing. I think there's a tool for you named ABC/HR which is useful for your task at hand.

Or MUSHRA (MUlti-Stimulus test with Hidden Reference and Anchor). I don't know why this is hardly ever mentioned on this forum. It arguably has become the standard tool for non-transparent audio codec tests in the scientific world.


MUSHRA attempts to put many variables into one scale of 'good to bad'.

For that reason, I find it a somewhat flawed tool.
-----
J. D. (jj) Johnston

Should we care about ABX test results?

Reply #11
Okay, so the people who make codecs have chosen transparency as their goal. Is this a good goal? Is it relevant to anyone's happiness? Surely I'm allowed to ask such questions...


Having a large music collection playback with amazing quality on a device I can easily slip in my pocket makes me happy.

Should we care about ABX test results?

Reply #12
Quote
Failing an ABX test tells you that you are unable to consciously tell the difference between two music samples. It does not mean that the information isn’t in your brain somewhere — it just means that your conscious processes cannot access it.

Firstly, this is an unfounded assumption. You state it as if it were purely factual. Much more data would be required to begin drawing such a conclusion.

Secondly, how do we listen to music, or any other kind of audio? Do we listen with our conscious mind or with our unconscious mind? If the answer is the former (which it is), then various unexplainable phenomena occurring in our subconscious mind are irrelevant to how we perceive and experience sound, are they not? Given this, what do ABX tests tell us?

Quote
So the fact that you cannot tell the difference between an MP3 and a CD in an ABX test does not mean that an MP3 is as good as a CD.

Well, obviously. The way you've phrased this makes the article appear blatantly charged.

Should we care about ABX test results?

Reply #13
ABX tests only test what they are designed to test, i.e. the perceptibility of a difference. But there could also be imperceptible differences that could affect you. Your statement is like saying that if 10 people taste no difference between a Coke and a Coke with radioactive polonium dust, then there is a negligible difference between the two.

False comparison. The conclusion of the test would be that there is no difference in the taste, not that there is no difference between the two. A test about taste tests taste, a test about hearing tests hearing. If people can hear no difference, they can hear no difference. If there are other differences that are relevant, I'd like you to tell me what they are and how you have measured them.

Quote
Okay, so the people who make codecs have chosen transparency as their goal. Is this a good goal? Is it relevant to anyone's happiness? Surely I'm allowed to ask such questions...

Yes, this is a good goal. Trying to achieve the same sound as the original seems the most desirable, as it gives the most predictability of what the output will or should be. Pre- and post-processing are better ways to give the sound more pleasing characteristics.
"We cannot win against obsession. They care, we don't. They win."

Should we care about ABX test results?

Reply #14
MUSHRA attempts to put many variables into one scale of 'good to bad'.
For that reason, I find it a somewhat flawed tool.


Which method would you recommend?
TheWellTemperedComputer.com

Should we care about ABX test results?

Reply #15
MUSHRA attempts to put many variables into one scale of 'good to bad'.
For that reason, I find it a somewhat flawed tool.


Which method would you recommend?


If you want to compare preference between two sources, do a blind A/B preference test.

If all you care about is difference/no difference, use ABX.

If you want to add a rating scale, use ABC/HR.
-----
J. D. (jj) Johnston

Should we care about ABX test results?

Reply #16
ABX tests only test what they are designed to test, i.e. the perceptibility of a difference. But there could also be imperceptible differences that could affect you. Your statement is like saying that if 10 people taste no difference between a Coke and a Coke with radioactive polonium dust, then there is a negligible difference between the two.


The above unfortunate statement is typical of the OP's entire article. It's the kind of dogmatic posturing that comes up again and again, and it shows a great deal about the important things about testing that ABX critics don't seem to know.

The author appears to be saying that ABX tests are somehow specially flawed in that they test only what they are designed to test.

The fact of the matter is that well-designed tests test only what they are designed to test.

So the OP's criticism of ABX tests on the grounds that they test only what they are designed to test only serves to show how little the author knows about testing.

Furthermore, the author's singling out of ABX tests as somehow unique in this regard again shows how little the author knows about testing, because the world is quite fortunate in that it is full of tests that test only what they are designed to test.

Should we care about ABX test results?

Reply #17
Yeah, ABX tests will tell you if there's an audible (or taste) difference between those two samples. If you wanna know which of those two samples will give you cancer, you need a completely different set of tests!

Should we care about ABX test results?

Reply #18
There are these Gremlins! Oh my god! They always jump in exactly when I do an ABX test.
So I know, for sure, that I FEEL subconsciously that there is a difference, and I know that I can prove it.
But these Gremlins won't let me do it! Oh, almighty subconscious higher being, please help me!

Should we care about ABX test results?

Reply #19
Oh, yes!
I know. I will use spectral graphs to prove that there is a subconscious difference.

Should we care about ABX test results?

Reply #20
I wonder why, as of late, we've been persuaded by so many audiophools who just want to prove that they are right and we're wrong.

In my opinion this entire topic is in violation of the established concepts mentioned in the TOS, although some may disagree.

Actually, I like the coke example. This is actually what we expect when we encode material in *lossy* ways. We don't expect it to be a pure copy of the original; we can live with it being composed of toxic materials if that makes it small, portable, and easy to use. We just don't want to be able to distinguish it from the original - ABX indeed makes this possible, just as a similar test could be done with coke. If they successfully made a toxic coke that was indistinguishable from regular coke, the goal would have been met!

Stop crying about how bad lossy encodings may be and use lossless instead if it bothers you. Scratch that, I almost forgot about this stupid example as well...
Can't wait for an HD-AAC encoder :P


Should we care about ABX test results?

Reply #21
Do I understand correctly that MUSHRA is targeted at the more profound differences and ABC/HR at the very small ones?

Not quite. According to the MUSHRA standard, the ABC/HR (BS.1116) method is better suited for near-transparent codecs, i.e. small differences between codecs and the original. MUSHRA is supposed to be better for "intermediate quality", i.e. obvious differences between codecs and the original, but small differences between codecs.

Tomasz:
Quote
Okay, so the people who make codecs have chosen transparency as their goal.

Not always true. There are very low-bit-rate codecs (less than 32 kbps stereo) which will never be transparent. They are just developed to "sound good" -> MUSHRA territory, I would say.

To make it more confusing, Fraunhofer (and the EBU in some recent multi-channel tests) have also used MUSHRA to test near-transparent codecs. The results were just as reliable. The more important thing is how you conduct the test sessions (how many items are listened to in a row, when during the day the sessions are performed, how long the breaks in between are, how long the test items are, etc.).

Chris
If I don't reply to your reply, it means I agree with you.

Should we care about ABX test results?

Reply #22
I wonder why, as of late, we've been persuaded by so many audiophools who just want to prove that they are right and we're wrong.



Have we? (been persuaded, I mean).

Quote
In my opinion this entire topic is in violation of the established concepts mentioned in the TOS, although some may disagree.


I read the OP's webpage about ABX... it's quite reasonable as these things go, though it makes some speculative leaps and a dubious analogy and appears to grasp at straws. Yet it's hardly Robert Harley redux. And if you poke around the OP's website you'll see he's more HA material than not... he's done far more investigation into MP3 than most 'audiophiles', for example (and found that 196 kbps is effectively transparent to him... using ABX tests).

Should we care about ABX test results?

Reply #23
Do I understand correctly that MUSHRA is targeted at the more profound differences and ABC/HR at the very small ones?

Not quite. According to the MUSHRA standard, the ABC/HR (BS.1116) method is better suited for near-transparent codecs, i.e. small differences between codecs and the original. MUSHRA is supposed to be better for "intermediate quality", i.e. obvious differences between codecs and the original, but small differences between codecs.

Tomasz:
Quote
Okay, so the people who make codecs have chosen transparency as their goal.

Not always true. There are very low-bit-rate codecs (less than 32 kbps stereo) which will never be transparent. They are just developed to "sound good" -> MUSHRA territory, I would say.

To make it more confusing, Fraunhofer (and the EBU in some recent multi-channel tests) have also used MUSHRA to test near-transparent codecs. The results were just as reliable. The more important thing is how you conduct the test sessions (how many items are listened to in a row, when during the day the sessions are performed, how long the breaks in between are, how long the test items are, etc.).

Chris



re: MUSHRA
This paper reviews some possible pitfalls to beware of in performing/interpreting MUSHRA tests. 

http://www.aes.org/e-lib/browse.cfm?elib=14393




Should we care about ABX test results?

Reply #24
re: MUSHRA
This paper reviews some possible pitfalls to beware of in performing/interpreting MUSHRA tests. 

http://www.aes.org/e-lib/browse.cfm?elib=14393

Thanks for pointing me to that article! Very interesting, and not only for MUSHRA users. In case someone does a 96 kbps test here at HA one day, maybe he/she should rethink the choice of test items and anchors based on the reports in that article.

Back to the original topic: Tomasz, I like your reference to the "hemispatial neglect" incident. I've done a few MUSHRA tests at very high bitrates at work recently, so high that we expected the codecs under test to be transparent. I allowed the participants to take a guess at which stimulus is a codec under test and which is the hidden reference in case they thought they couldn't hear a difference. Looking at each listener's ratings, it really appeared as if everyone was purely guessing on most of the items (i.e. huge overlaps in confidence intervals between the hidden reference and the codecs). Averaged over all items and listeners, however, no codec was transparent (the confidence intervals did not overlap with those of the hidden reference). So the test showed - unexpectedly for me, the conductor - that codecs B and C were "not as good" as the original A. So did we reveal the subconscious? Maybe. Anyway, mission accomplished.
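
(The overlapping-confidence-interval criterion described above is easy to compute; a minimal Python sketch with invented ratings, using the usual t-based 95% interval over listener means - the numbers are not from the Fraunhofer test:)

Code:
# 95% confidence interval of one codec's mean MUSHRA score across listeners.
import numpy as np
from scipy import stats

# Hypothetical per-listener scores (each already averaged over items):
scores = np.array([92, 88, 95, 90, 85, 93, 89, 91], dtype=float)

mean = scores.mean()
sem = stats.sem(scores)   # standard error of the mean (ddof=1)
lo, hi = stats.t.interval(0.95, len(scores) - 1, loc=mean, scale=sem)  # df = n - 1
print(f"mean = {mean:.1f}, 95% CI = [{lo:.1f}, {hi:.1f}]")
# If this interval overlaps the hidden reference's interval, the codec
# cannot be called audibly worse at this confidence level.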

So, my two cents:

Should we care about ABX(HR) test results for a single test item from a single person? No, because they do not allow us to generalize to other items and listeners.
What about ABX(HR) results for many items from many listeners? Yes. If you then average over all items and listeners and find that two devices A and B get about the same rating, you can conclude that, for a large number of people similar to those in the test, A will most likely be as good as B.

Chris
If I don't reply to your reply, it means I agree with you.