Full Version: An interesting approach to ABX testing
2Bdecided
Someone has just suggested an interesting approach to ABX testing, which they believed was standard ( and they should know! ). It's something that I haven't heard of before.

Here at HA, we usually do this:

A=Original sample ( O )
B=Coded sample ( C )
X=random choice of the above ( A or B = O or C )

So you'll get three clips in each trial, either ABA or ABB ( OCO or OCC )

This means you don't really need to listen to the first and second clips if the difference is obvious - just listen to the third clip, usually labelled X, and say what it is.

(There is also the version where you have X and Y, meaning you have ABAB or ABBA - more to listen to, but statistically identical (?).)


However, someone has just suggested the following to me:

For each trial, decide whether
A=original sample ( O )
B=coded sample ( C )
or
A=coded sample ( C )
B=original sample ( O )

Then have X=random choice of A or B ( O or C ).

This means you have four possibilities: OCO, OCC, COO, COC.

(Or, to put it another way, if A=O and B=C, then it's ABA, ABB, BAA, BAB)
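
As a rough illustration of the two schemes described above, here is a minimal Python sketch that lists the possible clip orders. The function names are made up for this example, and "O" and "C" stand for the original and coded samples as in the post.

```python
def standard_abx_presentations():
    """Standard ABX: A is always the original (O), B always the coded sample (C)."""
    return ["OC" + x for x in ("O", "C")]


def rabx_presentations():
    """rABX: A and B are O and C in either order; X repeats one of them."""
    presentations = []
    for first, second in (("O", "C"), ("C", "O")):
        for x in (first, second):
            presentations.append(first + second + x)
    return presentations


print(standard_abx_presentations())  # ['OCO', 'OCC']
print(rabx_presentations())          # ['OCO', 'OCC', 'COC', 'COO']
```

In the standard scheme the third clip alone tells you the answer once you know what the coded version sounds like. In the rABX scheme the same third clip can be a match for either the first or the second position.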


This means that, for each trial, you must listen to the first and second audio samples, then decide whether the third audio sample is the same as the first or the second.

This means you can't so easily carry an idea of what the coded version's faults are in your head, because you don't know which is the coded version - but then, it doesn't matter which the coded version is, because ( like ABX ) all you need to do is match a pair to win!

I was told this is called "ABX", but I think it needs a different name, maybe rABX to indicate that A and B are also randomised, or XYZ?

Whatever you call it, I'd imagine this kind of testing is harder for people who are very familiar with "our kind" of ABX.

Mad fool that I am, I'm interested to know the relative sensitivity of ABX, ABXY, rABX, rABXY, and simple ABC pick the odd one out ( used in most psychoacoustic tests ).

The statistics for "really detecting a difference to 95% confidence" should be the same, but unless you have perfect concentration, I suspect the chances of passing a test with a given artefact are slightly different depending on the test methodology.

Is this measurable? Does it matter? Do you think there'll be any difference in practice?

Cheers,
David.

P.S. The rABX approach was suggested as more indicative of real world listening. There's a chance that it might be less sensitive ( because you can’t be sure that B contains any possible artefact, while A does not, so you will be less certain going into the test ), but also there's a chance it may be more sensitive: in situations where the switching is out of your control, so you get A,B,X at predefined intervals, A is always preceded by you giving your previous response, then complete silence, while B is always preceded by A then some silence. If the simple fact that A is always first makes it sound different enough from B that you miss the ( real but tiny ) difference in B, swapping the two around may allow you to focus on the real difference, rather than the imagined one.
idioteque
QUOTE(2Bdecided @ Mar 24 2005, 11:02 AM)
The statistics for "really detecting a difference to 95% confidence" should be the same, but unless you have perfect concentration, I suspect the chances of passing a test with a given artefact are slightly different depending on the test methodology.
*



Isn't the point of the ABX test to show that there is a discernible difference between two samples? It isn't a test of one's concentration, although concentration is necessary. This method seems to just make the tester tire more quickly.
Sebastian Mares
Isn't that what ABC/HR is? Where the reference is hidden so you don't know what A and B are?
SirGrey
>>Isn't that what ABC/HR is? Where the reference is hidden so you don't know what A and B are?
It seems so. At least, I always thought so.
wimms
QUOTE(2Bdecided @ Mar 24 2005, 08:02 AM)
Whatever you call it, I'd imagine this kind of testing is harder for people who are very familiar with "our kind" of ABX.

Mad fool that I am, I'm interested to know the relative sensitivity of ABX, ABXY, rABX, rABXY, and simple ABC pick the odd one out ( used in most psychoacoustic tests ).

The statistics for "really detecting a difference to 95% confidence" should be the same, but unless you have perfect concentration, I suspect the chances of passing a test with a given artefact are slightly different depending on the test methodology.

Is this measurable? Does it matter? Do you think there'll be any difference in practice?

I don't see any benefit in that method. What is the purpose of an ABX test? It is to *disprove* the null hypothesis, nothing else. So it actually makes sense to be "helpful" to the listener in noticing a difference, and to deal with false successes (type 1 errors) by purely statistical means.
I'm afraid that by obfuscating the reference, the only thing achieved is an increased risk of type 2 errors with a corresponding decrease in type 1 errors. When the difference is very subtle, the risk of type 2 errors is already too high, so why would you want to increase it even more? The easiest way to do that would be to force 10 seconds of silence between replaying any samples. Would you want to do that? I don't think so; that would drastically reduce the sensitivity of detecting a difference. In a sense, I consider this method to have a similar effect.

In practice the difference is this: to reduce type 2 errors you need a fairly large number of trials, and by making each trial so much harder, you actually reduce the number of trials a listener will be willing to take.
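
To put rough numbers on the link between trial count and type 2 errors, here is a minimal Python sketch. It assumes a one-sided binomial test at the 5% level and a hypothetical listener who answers each trial correctly with a fixed probability `p_true`. Both are assumptions made for illustration, not something taken from an actual test.

```python
from math import comb


def tail(k_min, n, p):
    """P(at least k_min successes in n independent trials with success chance p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


def pass_mark(n, alpha=0.05):
    """Smallest score whose probability under pure guessing (p = 0.5) is <= alpha."""
    for k in range(n + 1):
        if tail(k, n, 0.5) <= alpha:
            return k
    return None  # too few trials to ever reach significance


def type2_risk(n, p_true, alpha=0.05):
    """Chance that a listener who is right with probability p_true fails the test."""
    k = pass_mark(n, alpha)
    return 1.0 if k is None else 1 - tail(k, n, p_true)


# Hypothetical listener who hears a subtle difference 70% of the time.
for n in (8, 16, 32):
    print(n, pass_mark(n), round(type2_risk(n, 0.7), 3))
```

Under these assumptions the risk of failing despite hearing a real difference drops as the number of trials grows, which is why a harder task that shortens the session works against the listener.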

Sounds like the method might be preferred by people who are trying to "prove" the null hypothesis... or simply to show that the majority of the population doesn't give a damn.
ff123
QUOTE(2Bdecided @ Mar 24 2005, 08:02 AM)

However, someone has just suggested the following to me:

For each trial, decide whether
A=original sample ( O )
B=coded sample ( C )
or
A=coded sample ( C )
B=original sample ( O )

Then have X=random choice of A or B ( O or C ).

This means you have four possibilities: OCO, OCC, COO, COC.

(Or, to put it another way, if A=O and B=C, then it's ABA, ABB, BAA, BAB)

*



I believe this is sometimes referred to as a triangle test, sometimes called "odd man out" by people here. I think kikeG's ABX program can do a version of it. The statistics are more efficient than the ABX we're used to (I think, I'd have to look it up), but since the task is more complex, I'd bet it's less sensitive than ABX.

ff123
echo
This is a triangle test as used in sensory evaluation. The probability of guessing correctly in each trial is 1 out of 3 (~33%) instead of the 1 out of 2 (50%) of our regular ABX tests (which are duo-trio tests).

In a duo-trio test you can get a significant result (probability of guessing < 5%) with 5/5, i.e. if you get everything right.
In a triangle test a significant result is reached with only 3/3. The downside is that if you make a mistake, the probability that the score came from guessing rises very fast. So it can be helpful if you want to find out with fewer trials whether you can detect differences.
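
The 5/5 and 3/3 figures can be checked with a one-sided binomial tail. This is a minimal Python sketch. The function name `guess_p_value` is made up for this example.

```python
from math import comb


def guess_p_value(correct, trials, p_guess):
    """Probability of scoring at least `correct` out of `trials` by pure guessing."""
    return sum(
        comb(trials, k) * p_guess**k * (1 - p_guess) ** (trials - k)
        for k in range(correct, trials + 1)
    )


print(guess_p_value(5, 5, 1 / 2))  # duo-trio / ABX, 5 of 5: 0.03125
print(guess_p_value(3, 3, 1 / 3))  # triangle, 3 of 3: about 0.037
print(guess_p_value(4, 5, 1 / 2))  # ABX with one mistake: 0.1875
print(guess_p_value(3, 4, 1 / 3))  # triangle with one mistake: about 0.111
```

Both perfect scores fall below 5%, and a single mistake at these short lengths pushes either test back above it.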
Woodinville
QUOTE(idioteque @ Mar 24 2005, 08:28 AM)
QUOTE(2Bdecided @ Mar 24 2005, 11:02 AM)
The statistics for "really detecting a difference to 95% confidence" should be the same, but unless you have perfect concentration, I suspect the chances of passing a test with a given artefact are slightly different depending on the test methodology.
*



Isn't the point of the ABX test to show that there is a discernible difference between two samples? It isn't a test of one's concentration, although concentration is necessary. This method seems to just make the tester tire more quickly.
*



Swapping A and B in an ABX test can only confuse the listener. Maintaining the listener's reference conditions has always seemed to be the best way to ensure a sensitive test. A and B can, furthermore, be known, i.e. A is original, B is processed. X, of course, must be randomized per-trial.

An ABC/hr test puts the known reference in A every time, and one of B and C is a hidden copy of the reference. An ABC/hr test also allows for difference grading.

Strictly speaking, an ABX test only tests for audibility, no more.
2Bdecided
Thanks for all your replies - I knew someone would be interested!

I'm not sure the proposed test is a triangle test. In a triangle test, you must identify the odd one out, so the options are ABB BAB BBA BAA ABA AAB, with the correct answers to these being 1 2 3 1 2 3 in that order.

The test I've mentioned is ABX in form, but the As and Bs are independent across tests. It's not a triangle test, because there are two options in the triangle test that will never occur in this ABX form test (BBA, AAB). Also, the participant's attention is being directed to the identification of the third sample (which must be the same as one of the first two), rather than picking an odd one out.
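
The comparison between the two sets of sequences can be spelled out with a short Python sketch. It simply enumerates the orders described above, and the variable and function names are made up for this example.

```python
from itertools import product

# Triangle test: every three-clip sequence in which exactly one clip is the odd one.
triangle = sorted(
    "".join(seq) for seq in product("AB", repeat=3) if len(set(seq)) == 2
)

# ABX-form test with A and B swapped at random: the first two clips always
# differ, and the third repeats one of them.
rabx = sorted(seq for seq in triangle if seq[0] != seq[1])


def odd_position(seq):
    """1-based position of the clip that occurs only once in the sequence."""
    for i, clip in enumerate(seq):
        if seq.count(clip) == 1:
            return i + 1


print(triangle)                            # six sequences
print(rabx)                                # four sequences
print(sorted(set(triangle) - set(rabx)))   # ['AAB', 'BBA']
```

The two sequences that never occur are exactly the ones where the odd one out is in third place, so the third clip in the ABX-form test always matches one of the first two.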

Surely this means that the statistics don't allow proof in a smaller number of trials than HA ABX? Or is there still an increase in statistical efficiency with this method?


Google (triangle test sensory evaluation) also informed me that a 3 interval forced choice test is different from a triangle test, because the odd one out is always B, and the participant is typically told, or taught, the specific way in which B is different from A.

So it seems that the proposed ABX form test is like a triangle test in one way: the participant does not know or get to learn about the unique properties of B compared to A (unless the differences are obvious) - whereas in a typical HA style ABX test, and a typical 3 interval forced choice test, the participant does get to learn about the unique properties of B compared to A.


Now (and someone please help me here!), if this test is no more statistically efficient than HA style ABX, and if this test makes the task harder, then why use it?!

QUOTE(wimms @ Mar 24 2005, 08:06 PM)
Sounds like the method might be preferred by people who are trying to "prove" the null hypothesis... or simply to show that the majority of the population doesn't give a damn.
*



Yes, the same thought occurred to me! Which is why I was asking.

I see no one has commented on the P.S. from the original post, which was a paraphrase from the person who proposed this test.

If I speak to this person (an academic) again, I'll argue these points and see what they say.

Cheers,
David.
wimms
QUOTE(2Bdecided @ Mar 29 2005, 04:26 AM)
I see no one has commented on the P.S. from the original post, which was a paraphrase from the person who proposed this test.
QUOTE
in situations where the switching is out of your control, so you get A,B,X at predefined intervals
This may be the key phrase. It seems to imply public ABX tests where many people audition a common source. Then there is indeed a risk that the rhythm of the listening session skews preference in a nonrandom and individual manner. Like "first is best, second is next, third is worst. So, um, third must be B". Barabim-barabam-barabom. Or "3 is my lucky number". This might be distracting even when the listener acknowledges it, so they decide to give both A and B a fair chance and equal conditions in every imaginable way. Sorta physical dither.

Doesn't apply when switching is in your control and you have all the time on earth.
echo
Hi David,

having read your original post again (more carefully this time), I agree that this is not a triangle test as I claimed in my previous post.

In a typical ABX test the probability of getting a correct answer by guessing in each trial is 50%. In your proposed "rABX" test the probabilities don't change. You still have a 50% chance of getting it right, since you assign X to be either A or B. It doesn't matter which one is the reference. So the same number of trials is needed as in a typical ABX test.
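
A minimal Python sketch of that argument, assuming every assignment of A, B and X is equally likely and that the listener gives a fixed blind answer ("X matches the first clip" or "X matches the second clip"). The function name and its parameters are made up for this example.

```python
from fractions import Fraction
from itertools import product


def guess_rate(randomise_ab, answer):
    """Chance that a fixed blind answer ('first' or 'second') is correct.

    randomise_ab=False models the usual ABX (A is always the original);
    randomise_ab=True models rABX (A is the original or the coded sample).
    """
    orders = [("O", "C"), ("C", "O")] if randomise_ab else [("O", "C")]
    # Each case: which sample is A, which is B, and whether X repeats the first clip.
    cases = [
        (a, b, x_is_first)
        for (a, b), x_is_first in product(orders, (True, False))
    ]
    hits = sum(1 for _, _, x_is_first in cases if (answer == "first") == x_is_first)
    return Fraction(hits, len(cases))


print(guess_rate(False, "first"), guess_rate(True, "first"))    # 1/2 1/2
print(guess_rate(False, "second"), guess_rate(True, "second"))  # 1/2 1/2
```

Which sample sits in A never enters the count of hits, which is the point made above: only the random choice of X decides whether a blind answer is right.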

When the artifacts are obvious, you are right to think that just by listening to X you can determine whether it is the reference or the encoded file. And yes, this makes the rABX test more difficult and it would make listeners tire sooner. But when the artifacts become less and less discernible, I think it is just the same. You still have to listen to the reference or the coded file several times in a typical ABX test to decide which one X is, because you can't tell for sure whether you heard an artifact in X or not.

QUOTE
So it seems that the proposed ABX form test is like a triangle test in one way: the participant does not know or get to learn about the unique properties of B compared to A (unless the differences are obvious)

In a sensory evaluation test the testers are always briefed about the possible differences between items and discuss them prior to the test. This is also the case for triangle tests. Where did you read that in a triangle test the tester does not learn about the properties of the compared items? The briefing happens because the test's aim is usually not just to see whether testers can tell the items apart. That is only the first step. After that they are asked to evaluate the items on several properties, the ones they discussed at the beginning (something like ABC/HR, but with several properties for every item and not just "preference").

QUOTE
so you get A,B,X at predefined intervals, A is always preceded by you giving your previous response, then complete silence, while B is always preceded by A then some silence.

I doubt this could work for listening tests. At higher bitrates the differences are usually so small that you have to listen to all the samples several times to make up your mind. And I don't know whether the ear responds well, as far as artifact identification goes, with added silence in between. Sometimes it is easier to pick out an artifact (like noise or ringing) just by switching samples while playing. I don't think that "resetting" the eardrum (that's what the silence is about) can be compared to rinsing the mouth between tasting different food samples (which is compulsory in a food sensory evaluation test). And I believe that we should try to make the listener's job easier, not harder, without of course compromising the validity of the results.

QUOTE
If the simple fact that A is always first makes it sound different enough from B that you miss the ( real but tiny ) difference in B, swapping the two around may allow you to focus on the real difference, rather than the imagined one.

Yes, that happens. Especially in sensory evaluation tests, the testers are always given the test items in a predefined order, which is different in every run, just to avoid bias in their answers. This mostly has to do with property evaluation tests and not difference tests. In difference tests people usually go back and forth, from one item to another, until they can detect something. That said, I always try to listen to either one first, then the other, pause, and then try the other way around, repeating this several times to make up my mind about what an artifact is. In this respect maybe the rABX test would be a nice tool to eliminate the possible (but not certain) bias.

QUOTE
This may be the key phrase. It seems to imply public ABX tests where many people audition a common source. Then there is indeed a risk that the rhythm of the listening session skews preference in a nonrandom and individual manner. Like "first is best, second is next, third is worst. So, um, third must be B". Barabim-barabam-barabom. Or "3 is my lucky number". This might be distracting even when the listener acknowledges it, so they decide to give both A and B a fair chance and equal conditions in every imaginable way. Sorta physical dither.

Well, I always thought this could be an issue with ABC/HR tests too, and it relates to my previous point. People will usually try sample 1 first, 2 second and so on. This could lead to systematic listening fatigue between samples that could bias the results. To avoid that issue, in sensory evaluation tests the items are labelled with random 3-digit numbers (like 543, 238, 984...) instead of being named 1, 2, 3... or A, B, C... etc. The testers are told that these numbers were chosen randomly and don't mean anything at all, and this is actually easier for them to accept. How could this translate to listening tests? Maybe the samples could be randomized for each listener by the ABC/HR program instead of sample 1 always being sample 1 and so on, so the bias would not be systematic between listeners.
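
As a rough idea of what per-listener randomization could look like, here is a minimal Python sketch. It is not how any existing ABC/HR program works. The function name, the sample names and the per-listener seeding are all assumptions made for illustration.

```python
import random


def blind_labels(samples, rng):
    """Return (label, sample) pairs in shuffled order with random 3-digit labels."""
    labels = rng.sample(range(100, 1000), len(samples))  # distinct 3-digit numbers
    order = list(samples)
    rng.shuffle(order)  # presentation order differs from listener to listener
    return list(zip(labels, order))


# Hypothetical sample names, not from any real test.
samples = ["codec_1", "codec_2", "codec_3", "hidden_reference"]

for listener in ("listener_a", "listener_b"):
    rng = random.Random(listener)  # assumption: seed per listener so the key can be rebuilt
    print(listener, blind_labels(samples, rng))
```

The listener would only see the numeric labels. The mapping back to the real sample names stays with the test program, so any order effect is spread across samples instead of always hitting the same one.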

Wow... this must be my longest post ever...

Cheers,

George