Full Version: Listening Test: LAME 192 CBR vs OGG Q4.99
Hydrogenaudio Forums > Lossy Audio Compression > Ogg Vorbis > Ogg Vorbis - General
johanfo
In my previous poll, opinions were divided on which was better: LAME @192 (lame -b 192 -h) or ogg -q 4.99.

I have now created 6 wav files, from two different tracks. Each track has one file for the original wav, one for lame, and one for ogg.

Tracks (Destiny's Child / The Writing's on the Wall):
Jumpin Jumpin [original wav named 101]
Bills, Bills, Bills [original wav named 200]

The tracks encode at 225 and 234 kbps using lame aps. That setting is not what is being tested here, however; it is only mentioned to show that the tracks are quite complex.

Those of you who thought you could hear a difference should now be able to point out which one is which. When you have done so, please let me know. I'll present the complete results here later.

URL: http://www.ohman.no/various/ogglametest/
See the Instructions.html file there as well.

PLEASE DON'T POST TEST SPOILERS (frequency analysis etc). SAY "IF AND HOW" YOU HEAR A DIFFERENCE WITHOUT SPECIFYING THE FILE. COMPLETE AND SPECIFIC TEST RESULTS SHOULD BE MAILED TO ME:

Good luck :)
JensRex
Well... if there's one thing this test has proved to me, it's that I don't need to worry about reencoding all my --aps rips any time soon :)

I don't know if I should be happy about that, or sad because of my untrained ears.

I have very good equipment actually. The only thing lacking is my headphones, which are the crappy ear-plug type. I'll get myself some HD600s and start out with some Xing 128 kbit files and work my way up :)
johanfo
QUOTE
Originally posted by Zalkalin
Well... if there's one thing this test has proved to me, it's that I don't need to worry about reencoding all my --aps rips any time soon :)


In other words: you couldn't tell them apart?
JensRex
[SNIP-E-DI-DOO-DA]

To answer your question - no I can't tell them apart at all.
johanfo
No, please don't spoil the test :) But you can mail it to me. The more information I have, the better the report I can create. It doesn't matter if it "might" be wrong.
Jon Ingram
Are you going to tell us which is the original? Knowing this, I can perform the two most useful tests - an ABX of original-vs-ogg, and an ABX of original-vs-wav.
johanfo
QUOTE
Originally posted by Jon Ingram
Are you going to tell us which is the original? Knowing this, I can perform the two most useful tests - an ABX of original-vs-ogg, and an ABX of original-vs-wav.


I didn't plan to do that initially. The test will first be about finding the various types of audio without knowing what is what. But of course, if you have done so and sent me the results, just mail me and I'll give you the names of the original wavs so that you can ABX them too. The more analysis done the better!
Garf
QUOTE
Originally posted by johanfo

I didn't plan to do that initially. The test will first be about finding the various types of audio without knowing what is what.


I totally miss the point of this. There's no way to tell which is more true to the original if the tester doesn't know what the original is.

What are you testing? What is the _goal_ of this test?

--
GCP
johanfo
QUOTE
Originally posted by Garf

I totally miss the point of this. There's no way to tell which is more true to the original if the tester doesn't know what the original is.
What are you testing? What is the _goal_ of this test?
-- 
GCP


Well, you have a good point there! I haven't done this kind of test before. I thought maybe it was possible to single out the encoded files without knowing what the reference file is.

But there is no problem in changing the test a little.

101 and 200 are the reference files (originals)!
tangent
You should read about the proper ways of conducting listening tests first.

http://ff123.net
http://sjeng.sourceforge.net/audio/codecs.html
johanfo
QUOTE
Originally posted by tangent
You should read about the proper ways of conducting listening tests first.


I have read those pages before. My test was initially to see if someone could instantly recognize the encoded tracks without knowing anything about them except that there is one wav, one ogg and one lame mp3.

You may say it is impossible to do so since you don't know what the reference sounds like. Let's say I had encoded something at 56 kbit instead; then everyone would easily find the encoded track... I would say it is possible to rate a track without having a reference. Clearness, crispness, preecho etc...

Anyway, I have now revealed which files are the reference wavs. So those of you who want to ABX them can do so.

So far, no one has been able to separate the tracks. So OGG 4.99 or Lame 192CBR is not THAT bad :)
Garf
QUOTE
Originally posted by johanfo

I would say it is possible to rate a track without having a reference. Clearness, crispness, preecho etc...


No. For example, clearness and crispness may be added by a codec, and in that case, they're artifacts.

In the extreme case, you'll be able to tell it's exaggerated, but both of these settings should be relatively HQ, so it won't work.

Edit: Sent my results. This was harder than I thought.

--
GCP
cd-rw.org
What's the point of comparing CBR MP3 to a VBR OGG?
rjamorim
QUOTE
Originally posted by cd-rw.org
What's the point of comparing CBR MP3 to a VBR OGG?


CBR MP3 @192kbps is used by the "Scene" for music distribution.

Ogg can't be fairly compared to other formats using CBR, since it's much better at VBR.

Regards;

Roberto.
JohnV
Results sent.
johanfo
QUOTE
Originally posted by JohnV
Results sent.


Thanks for the results. I have received 4 result sets so far.
I'll wait some more so that more people can participate if they want.
johanfo
The results of the test are in. The purpose of the test was to compare a common Ogg Vorbis setting with the standard 192 CBR full stereo that is used by almost all "Release/Rip groups".

Although well over 20 people have downloaded all the files, only 4 results were sent in. You will have to interpret this yourself as well.

Files:
100 == Ogg
101 == WAV
102 == MP3

200 == WAV
201 == Ogg
202 == MP3


The various people and their comments

Jens Rex:
QUOTE
The only thing I have been able to identify as of now is that 100 might have slightly higher volume than the other two. It might just as well be my brain playing tricks on me :)


Erik Stenborg:
QUOTE
Hi!
I tried the test but am sorry to say that I didn't hear any differences
at all. Good initiative anyway.


Now, for those who did hear a difference. No surprise, it was the moderators on this list:

Garf:
QUOTE
 
100: 12/13  distortion/smear on one of the ticks 
102:  /    can't ABX
201:  9/10  I can hear loss on the very HF (lowpass?)
202: 12/16  Same as above


The most comprehensive answer was from JohnV:
QUOTE
 
100 - Jumping jumping: 
ABX 9/9. Background noise is too bright and a bit amplified during 2-10 seconds. This already reveals that this is most probably a Vorbis sample, because this is typical to Vorbis. Otherwise quite fine, at least not too obvious artifacts to my ear.
 
102 - Jumping jumping:
ABX 9/9. I ABXed this only from the slightly increased swishing sound during 22-23 seconds. Otherwise not too obvious problems.
 
201 - Bills, bills, bills: 
ABX 9/9. Very easy to ABX. Additional noise with "triangles" during 3-4sec. General loss of punch, maybe because of slight smearing of synth harp transients. The lower synth harp sound at 23sec is quite clearly distorted.
 
202 - Bills, bills, bills: 
ABX 9/9. Very easy to ABX. Loss of general punch, probably because of slight smearing of synth harp transients, and few very clear problems. Hihat creates clear pre-echo, especially _bad_ during 6-7 seconds, hihat sounds somewhat smeared almost all the time.
 
I'd rate 102 clearly higher than 100. And 201 a bit higher than 202.


Now, compare these statements with the true content of the files. JohnV found the 192CBR to be better than the Ogg file in "Jumpin, Jumpin". However, in Bills Bills Bills, 201, which was an Ogg file, was considered better than the CBR 192.

Garf's results clearly indicated that CBR 192 is much better on the Jumpin track. On the Bills track, his ABX suggests that the MP3 track is slightly better. Maybe a little contrary to JohnV's rating of 201.

QUOTE
This is silly - there's too big a bitrate difference. Change the Vorbis option to -q 5, and I'd vote for 'the same'.
--Jon Ingram

I think this has indicated that comparing Ogg -q 4.99 and Lame CBR 192 is not "silly", although Lame seems to have the "edge" after all. Maybe Ogg RC4 will even the score. I'd be glad to host one more test when it is released if there is any interest (any inside information on a possible release date, anybody?).
2Bdecided
Ooops - I emailed you after finding the test on r3mix, and then read the results here!

Here's what I wrote:

I tried your test using the Jumpin Jumpin samples.

I listened to them a few times first, and had a feeling that 101 was the
original. I loaded 100 vs 101 into PCABX, and it warned me that they
were a different length, and that this would prejudice the test.

I loaded all three into CEP to get them synchronized and the same
length. I had to cut many samples off the end of 100 and 102, and also
many samples off the start of 102 - you should have done this before
distributing the samples!

So, anyway, having seen them in CEP I knew 101 was the original, so I
ABXd 101 vs 100, and 101 vs 102. I could differentiate the first pair
(the hi hats in 100 are slightly smeared, and the track sounds slightly
brighter), but I could NOT differentiate the second pair (102 sounds the
same as the original to me).

Hope this is some help,
Cheers,
David.
http://www.David.Robinson.org/

P.S. the original track is heavily clipped (I know this is just the way
the CD is mastered) - but this means that encoding/decoding will add
more clipping, and this may give some people an extra clue as to which
samples are encoded and which is the original. In these circumstances,
you should reduce the amplitude of the ripped version, and then use this
copy as the "original" and encode from it.
johanfo
QUOTE
Originally posted by 2Bdecided
I had to cut many samples off the end of 100 and 102, and also
many samples off the start of 102 - you should have done this before distributing the samples!


You are absolutely right... I'm learning from my mistakes :)
tangent
QUOTE
Originally posted by johanfo
Garf's results clearly indicated that CBR 192 is much better on the Jumpin track. On the Bills track, his ABX suggests that the MP3 track is slightly better. Maybe a little contrary to JohnV's rating of 201.

Well, apparently you don't understand ABX at all. You cannot use ABX trial results of 2 individual samples compared against the original to conclude that one is better than another. In other words, the only thing Garf's 9/10 and 12/16 results show is that he can distinguish the sample from the original with 98.9% and 96.2% confidence respectively. This says nothing about whether one is better than another. To do that, you have to compare one sample against the other with reference to the original and do 16 preference trials.
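
For anyone who wants to check those two percentages, here is a minimal Python sketch. It assumes the confidence is one minus the chance of scoring at least that many correct by pure guessing (a one-sided binomial test with p = 0.5). The function name abx_confidence is made up for illustration and does not come from any ABX tool.

```python
from math import comb

def abx_confidence(correct: int, trials: int) -> float:
    """Return 1 minus the probability of getting at least `correct`
    answers right out of `trials` by coin-flip guessing (p = 0.5)."""
    # Count the guessing outcomes that score `correct` or better.
    tail = sum(comb(trials, k) for k in range(correct, trials + 1))
    return 1 - tail / 2 ** trials

for correct, trials in [(9, 10), (12, 16)]:
    print(f"{correct}/{trials}: {abx_confidence(correct, trials):.1%}")
# prints:
# 9/10: 98.9%
# 12/16: 96.2%
```

The plain fractions 9/10 = 90% and 12/16 = 75% are hit rates, not confidence levels, which is why they differ from the figures above.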
johanfo
QUOTE
Originally posted by tangent

Well, apparently you don't understand ABX at all. 


Sure I know how ABX works. On the Jumpin track he couldn't ABX the LAME file (compared to the original), but he could do that with the Ogg file. That IS a good indication that the Lame is better. You are assuming that there is no correlation between the percentage of the confidence and the quality of the track. Well, I think there is!

On the other hand, the data set on the second track (9/10 and 12/16) is maybe too close to infer anything solid from. I would agree that it doesn't say much there.

I have given all the raw data, so draw your own conclusions if you are dissatisfied with mine.
tangent
QUOTE
Originally posted by johanfo
Sure I know how ABX works. On the Jumpin track he couldn't ABX the LAME file (compared to the original), but he could do that with the Ogg file. That IS a good indication that the Lame is better. You are assuming that there is no correlation between the percentage of the confidence and the quality of the track. Well, I think there is!

On the other hand, the data set on the second track (9/10 and 12/16) is maybe too close to infer anything solid from. I would agree that it doesn't say much there.

I have given all the raw data, so draw your own conclusions if you are dissatisfied with mine.


I have no arguments about the jumping track, but there is definitely NO correlation between the confidence and the quality of a track. To insinuate that there is a correlation is to take advantage of the common reader's lack of knowledge about ABX testing to feed them misinformation.

The only result you can conclude from Garf's test of Bills is that he can distinguish the encoding from the original for both encodings with significant (>95%) confidence. Since he did not mention which one he found better, you cannot even begin to conclude which one is better.

Common readers still have trouble understanding how you can arrive at 98.9% for 9/10 and 96.2% for 12/16 and insist that it should be 90% and 75% respectively. Let's not mislead them further by trying to draw conclusions about quality from confidence. To ask them to draw their own conclusions when they don't understand how to use the data, knowing that they are likely to draw the wrong conclusions is just as bad.
johanfo
QUOTE
Originally posted by tangent

...but there is definitely NO correlation between the confidence and the quality of a track. 


Let's pretend 100 people compare track A with C. A is an encoded version of C. In this case the average ABX is 5/15.

Now let them compare track B (encoded too) with C. The average ABX is now 14/15. To me it seems quite logical that track A is "of a higher quality" than track B since it is much harder to distinguish from the original.

I'd really like to see your statements backed by any papers or by someone else here. If not to convince me, you should do that for some of the "common readers" that might agree with me.
tangent
Like I said, you have just shown that you don't understand how ABX is supposed to be used, because you used the results from the ABX of 100 people in the wrong manner.

Each person's ABX test is his own test and there is no purpose in averaging across everybody's results. Averaging defeats the purpose of the test, which is to determine someone's confidence of being able to hear the difference between the sample and the original. That means a person has to achieve at least 12/16 to be considered to have passed the test with enough confidence (>95%).

In other words, this is the wrong way to use the ABX results: A is better than B because the average ABX result is 5/16 for A and 14/16 for B.

This is the correct way to use the ABX results: A is better than B because 93 out of 100 of the test audience are able to ABX at least 12/16 for B while only 5 out of the 100 are able to ABX at least 12/16 for A.

Here's a simple proof to show why ABX results do not have an acceptable correlation to quality. Let's say you cannot tell the difference at all, so you are guessing all the way. How much are you likely to score? You have no idea, but it averages around 8. If you understand binomial distribution, you will probably know the odds of scoring 8 out of 16, which will be the most likely outcome. But you are also likely to score 7 out of 16 or 9 out of 16, maybe 6/16 or 10/16 which are less likely but still quite likely. Does that mean a sample scoring 6/16 has better quality than a sample scoring 10 out of 16? Of course not.
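
To put numbers on that guessing argument, here is a small Python sketch. It assumes each of the 16 answers is an independent coin flip; guess_probability is a hypothetical helper name used only for this illustration.

```python
from math import comb

def guess_probability(score: int, trials: int = 16) -> float:
    """Probability of exactly `score` correct answers out of `trials`
    when every answer is a coin flip."""
    return comb(trials, score) / 2 ** trials

# Print the middle of the distribution with a crude text bar chart.
for score in range(4, 13):
    p = guess_probability(score)
    print(f"{score:2d}/16  {p:6.1%}  {'#' * round(p * 100)}")
```

Under this assumption 8/16 turns up about 19.6% of the time, and 6/16 and 10/16 about 12.2% each, so a pure guesser lands on either of those two scores equally often.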

Results from one ABX test can only be used one way. 0 to 11 out of 16 means a failed test. 12 to 16 out of 16 means a passed test. There is no such thing as degrees of failure or degrees of passing on ABX.

The papers you are looking for are at http://www.pcabx.com and you will find them an enlightening read.
ErikS
QUOTE
Originally posted by tangent
0 to 11 out of 16 means a failed test. 12 to 16 out of 16 means a passed test. There is no such thing as degrees of failure or degrees of passing on ABX.


Wouldn't a score of 0 indicate that you hear a difference but you're not conscious of it yet? ;)
tangent
If you hear a difference but you are not conscious of it, you shouldn't be doing an ABX test. You should only start ABXing if you think you can hear a difference without the ABX test. The ABX test ensures that you are really able to hear the difference and you're not imagining it.

But if you do an ABX test anyway, subconsciously hear a difference but are not conscious of it, your result should be random too and be near 8/16.

0/16 most likely means you didn't read the instructions carefully and selected the better instead of the worse (or vice versa). Or you are deliberately trying to be funny :)
johanfo
QUOTE
Originally posted by ErikS


Wouldn't a score of 0 indicate that you hear a difference but you're not conscious of it yet? ;)


A score of 0 means that you have no luck whatsoever, even when guessing :D
Garf
QUOTE
Originally posted by ErikS


Wouldn't a score of 0 indicate that you hear a difference but you're not conscious of it yet? ;)


Yes. So would 4/16 for that matter.

--
GCP
johanfo
QUOTE
Originally posted by tangent
Like I said, you have just shown that you don't understand how ABX is supposed to be used, because you used the results from the ABX of 100 people in the wrong manner.

I claim that each time you have to choose A or B in ABX, your choice is affected by white noise (random noise) with a probability that is proportional to the quality of the track. With higher quality, the chance that you choose wrong increases, which leads to a lower confidence percentage.

Have you ever taken multiple choice math tests? It's the exact same concept. Your points could be interpreted as the probability 'that you know math'! 15/15 would mean you know this math! But it ALSO says that the test was quite simple for you. If the test becomes harder, pushing your knowledge, you will probably score lower, thus a correlation between difficulty and points, and in music, quality and points.

QUOTE

Each person's ABX test is his own test and there is no purpose in averaging across everybody's results.



Sure you can! Just because ABX wasn't made explicitly for that purpose, the data say just the same. Example: if the average of 100 persons doing a 16-trial ABX test is 8, then it is a VERY good indicator that the objective quality measurement is excellent.

We won't agree on this issue. I think that's the bottom line here.
JohnV
johanfo: I don't quite agree with you. Mainly because the rating of the quality is not so simple and is always very subjective.

For example: sample #1 has a single, slightly more audible artifact leading to a 16/16 ABX score, but is otherwise very good. Then there might be a sample #2 with constant distortion which might be a bit more difficult to hear sometimes, leading to a 14/16 score. I would still consider it perfectly possible that I rate the 16/16 sample #1 much worse than sample #2.

Also it's perfectly possible that you would get 16/16 the second time you do ABX with sample #2.

I personally always do a separate rating after ABXing, when I rate which sounds better. I usually load both encoded samples into WinAmp and do blind test rounds, and see which one I like more.
timcupery
JohnV-

I do the same thing in Winamp, listening to files back-to-back blindly to figure out what (if any) difference I can fixate upon... then I ABX.

johanfo-

you should take some more statistics. Valid claims can't always be drawn from averages, depending on what the structure of the test is. this is where sociologists have something to contribute to the forum, as sociologists :rolleyes:

almost finished packing for Yosemite...
Garf
QUOTE
Originally posted by johanfo

I claim that each time you have to choose A or B in ABX, your choice is affected by white noise (random noise) with a probability that is proportional to the quality of the track. With higher quality, the chance that you choose wrong increases, which leads to a lower confidence percentage.


No. Quality is a subjective measure which does not necessarily correlate to the scores on an ABX test.

QUOTE

Have you ever taken multiple choice math tests?  It's the exact same concept. 


No. Multiple choice tests penalize wrong answers. They need to, because, as you should know by now, too many students would pass simply with random guessing otherwise. There is no simple relation to ABX tests.

QUOTE

15/15 would mean you know this math!  But it ALSO says that the test was quite simple for you.


From experience, I can assure you one does not follow from the other. Effort expended on the test could play a major role, and exactly the same thing is true for an ABX test.

_Perhaps_ if you could do an ABX test with fully controlled circumstances, there might be value to the idea. In this environment, that is not possible. Moreover, you would need to establish a relation between those scores and the subjective quality first. Just assuming there is one isn't very scientific.

--
GCP
tangent
QUOTE
Originally posted by johanfo
I claim that each time you have to choose A or B in ABX, your choice is affected by white noise (random noise) with a probability that is proportional to the quality of the track. With higher quality, the chance that you choose wrong increases, which leads to a lower confidence percentage.

This 'noise' is not random but is usually caused by fatigue and environment, where the standard deviation and mean are definitely not constant across different tests.

QUOTE

Have you ever taken multiple choice math tests? It's the exact same concept. Your points could be interpreted as the probability 'that you know math'! 15/15 would mean you know this math! But it ALSO says that the test was quite simple for you. If the test becomes harder, pushing your knowledge, you will probably score lower, thus a correlation between difficulty and points, and in music, quality and points.

15/15 gives a good probability that you know your maths. But it doesn't tell you how good your maths is. If I do the test and score 15/15 and Einstein does and scores 15/15, does it mean that we are both as good as each other in maths? Definitely not. Similarly, if you can ABX two samples 16/16, does that mean they are both as bad?

A maths test isn't a good analogy either. A maths test can test several different fields of maths. Let's say half the paper is trig and the other half is calc, you know only trig, so you score full marks for the trig section and half marks for the calc section, total 12/16. See the problem? An ABX listening test is a constant thing. The same 2 samples are tested over and over again 16 times, the same question being asked 16 times.
QUOTE

Sure you can! Just because ABX wasn't made explicitly for that purpose, the data say just the same. Example: if the average of 100 persons doing a 16-trial ABX test is 8, then it is a VERY good indicator that the objective quality measurement is excellent.

Sure, there can be some correlation between average ABX results and quality, but it will be a very weak correlation, which does not work all the time. Imagine a low bitrate listening test where everyone can hear differences. Every sample will score near 16. Are you going to pick the best quality sample by the result nearest to 16? A low bitrate test becomes a more subjective thing. Now, in a high bitrate listening test, you will probably get a few people able to ABX while the majority fail the ABX. You can get some correlation to quality by averaging everyone's result, but this would be very silly when there is a much better way of correlating quality: simply counting the number of people who successfully ABX each sample.

In any case, there is completely no use in using average ABX results as any indicator of quality. Even if you can get some correlation, there are better things you can do with the results to get a more accurate measure.


But I would still like to hear your reason why you want to continue to defend the claim that "(Garf's) ABX suggests that the MP3 track is slightly better. Maybe little contrary to JohnV rating", which is a completely wrong and misleading conclusion.
johanfo
QUOTE
Originally posted by Garf


No. Multiple choice tests penalize wrong answers. They need to, because, as you should know by now, too many students would pass simply with random guessing otherwise. There is no simple relation to ABX tests.
-- 
GCP


The penalizing is just scaling the scores in various ways and has nothing to do with the "information" given by a test. Wrong: -1 and Right: 1, or Wrong: 0 and Right: 1, really doesn't matter.
johanfo
QUOTE
Originally posted by tangent


Sure, there can be some correlation between average ABX results and quality, but it will be a very weak correlation, which does not work all the time.

"but there is definitely NO correlation between the confidence and the quality of a track"
--tangent previous post
....

But I would still like to hear your reason why you want to continue to defend the claim that "(Garf's) ABX suggests that the MP3 track is slightly better. Maybe little contrary to JohnV rating", which is a completely wrong and misleading conclusion.


First I'll comment on the former section.
- Do you see your slightly conflicting posts?

On the second paragraph:
I have not actively defended _that_ statement. Instead I admit now that I put too much weight on the "confidence - quality" correlation and that my claim that "this track is better than that" is ambiguous given the data. Again, this doesn't mean I say there is no correlation; I just say that the numbers given are not sufficient to back up the statements.

I also emphasized that the comments were mine and that people must make up their own minds from the data... Something I think most people have done here now.
shday
What is wrong with pooling or averaging multiple ABX scores? I think if it's done thoughtfully it is very useful. (I learned about ABX just now btw... I guess that makes me a "common reader").

For example:

One person scores 12/16, giving ~96% probability that it wasn't a fluke.

Ten people score 10/16, each giving a ~77% probability that their test wasn't a fluke.

Ten people at ~77% each gives an overall confidence of more than 99.9% (much more I think) that they, as a group, didn't just get lucky.

Obviously you would be missing something if you concluded from the ten person test that there was no audible difference (at 95% confidence) between the samples.

It seems to me there *is* clearly such a thing as degree of failure or degree of passing on ABX. Picking 12/16 and above as pass and 11/16 and below as fail is potentially throwing away a lot of information!

... johanfo has some good arguments.

Steve
SometimesWarrior
I wish our resident statistics expert, ff123, could put a rest to all of this :) (Although IMHO tangent, Garf, and JohnV have already given good enough arguments to settle the matter, there are others in this forum that apparently disagree.)
QUOTE
Originally posted by shday
Picking 12/16 and above as pass and 11/16 and below as fail is potentially throwing away a lot of information!

12/16 is chosen as a cutoff because it indicates a >95% probability that the ABXer chose the sample because she heard a difference, not by random chance. 95% probability is a standard cutoff in statistics; 12/16 gives a 96% probability, while 11/16 gives a 90% probability.

The 95% cutoff is an arbitrary cutoff, but it works well. It still means that there is a 1 in 20 chance that the listener doesn't hear a difference and simply guessed well, but usually the listener does hear a difference and simply uses the ABX to prove it. Preferably, the listener will score even better than 12/16 (for example, 13/16 gives 99% and 16/16 gives >99.9%), but most people on this forum will accept 12/16, along with a good description of the problem, as proof that the listener hears a difference. Also, there is no reason why the listener should perform exactly 16 tests, although 16 does give a good balance between statistical significance and listener fatigue.
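
Since the number of trials does not have to be 16, a quick way to find the matching pass mark for other test lengths is sketched below in Python. It assumes the same one-sided binomial test against guessing; guess_tail and pass_mark are illustrative names, not functions from any ABX program.

```python
from math import comb
from typing import Optional

def guess_tail(correct: int, trials: int) -> float:
    """Chance of `correct` or more right answers from pure guessing."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

def pass_mark(trials: int, alpha: float = 0.05) -> Optional[int]:
    """Smallest score whose guessing probability is below `alpha`,
    or None if even a perfect score is not convincing enough."""
    for correct in range(trials + 1):
        if guess_tail(correct, trials) < alpha:
            return correct
    return None

for trials in (4, 8, 10, 16, 20):
    print(trials, "trials -> pass mark:", pass_mark(trials))
```

With very short tests the function returns None: four trials, for example, can never reach 95% because even 4/4 happens by luck one time in 16.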

QUOTE
Originally posted by shday
Ten people score 10/16, each giving a ~77% probability that their test wasn't a fluke. 

Ten people at ~77% each gives an overall confidence of more than 99.9% (much more I think) that they, as a group, didn't just get lucky.

Honestly, I don't mean to pick on shday at all :) His conclusions do at first seem to have good logic behind them, but I believe the math and statistics contradict these claims. If I weren't so rusty with my statistics, or if I had a statistics book handy, I could put the math right here.

Here's (if I remember correctly) the correct conclusion.

If a listener scores 10/16 on an ABX test, then there is about a 1 in 4 chance that the guesses were totally random. However, if the next two ABX tests are also 10/16, that gives a total ABX value of 30/48, which gives a probability of about 94%, close to the 95% cutoff. Note that this addition does not apply if the listener gets 10/16, 3/16, 13/16, 10/16 on four sequential tests. Even with a 13/16 on an individual test, the total ABX up through the third test is 26/48 (65% probability) and through the fourth test is 36/64 (80% probability). By picking and choosing ABX tests, one invalidates his results. All ABX tests performed must be included in the statistical probability.
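
The pooling described above can be sketched in Python as follows. This assumes every run by the same listener is included and that each trial is an independent coin flip under guessing; pooled_confidence is a made-up name for this example.

```python
from math import comb

def guess_tail(correct: int, trials: int) -> float:
    """Chance of `correct` or more right answers from pure guessing."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

def pooled_confidence(scores, trials_per_test: int = 16) -> float:
    """Add up ALL runs by one listener (no cherry-picking) and return
    1 minus the chance of doing at least that well by guessing."""
    total_correct = sum(scores)
    total_trials = trials_per_test * len(scores)
    return 1 - guess_tail(total_correct, total_trials)

print(f"{pooled_confidence([10, 10, 10]):.3f}")     # 30/48
print(f"{pooled_confidence([10, 3, 13]):.3f}")      # 26/48
print(f"{pooled_confidence([10, 3, 13, 10]):.3f}")  # 36/64
```

Leaving the 3/16 run out of the second or third list would inflate the result, which is exactly the picking and choosing warned against.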

This logic can be extended to an ideal listening group. Assuming 20 identical listeners in 20 identical environments (and this is not the case on this forum), there is a certain statistical probability that one will get 12/16 correct on an ABX test purely by guessing. There is some math involved that I can't recall, but I do know that there is significance if one individual can get 12/16 ABX, but if that person is part of a group of many identical people, then that single 12/16 ABX test carries much less significance. This makes sense; flipping a coin five times in a row will probably not give five heads in a row on the first try, but if one keeps flipping, there is a good chance that five sequential heads will eventually occur.
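
The math for that ideal group is short enough to sketch in Python. It assumes every listener is purely guessing and that listeners are independent of each other, which, as noted, is not true of a real forum; chance_of_lucky_pass is a hypothetical name.

```python
from math import comb

def guess_tail(correct: int, trials: int) -> float:
    """Chance of `correct` or more right answers from pure guessing."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

def chance_of_lucky_pass(listeners: int, trials: int = 16, pass_mark: int = 12) -> float:
    """Probability that at least one of `listeners` independent guessers
    reaches `pass_mark` out of `trials`."""
    p_single = guess_tail(pass_mark, trials)
    return 1 - (1 - p_single) ** listeners

for n in (1, 5, 20):
    print(f"{n:2d} guessing listeners: {chance_of_lucky_pass(n):.1%}")
```

Under these assumptions a single guesser reaches 12/16 about 3.8% of the time, but in a group of 20 guessers the chance that at least one of them does is a bit over 50%.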

This does not mean that if no one but JohnV can hear a difference, then JohnV is wrong; no two listeners are alike, so we do not have an ideal listening group, and the logic in the previous paragraph does not apply to this forum.

If ten people each score 10/16 ABX, it does not mean that the aggregate score was 100/160; it means that no one person could reliably hear the difference between the two samples. However, if a single person takes ten sequential ABX tests and gets a 10/16 on each, that does constitute a 100/160 ABX, which is very significant.

Now, another important concept to understand is that the ABX does not test the relative quality of the audio files; it only says whether or not a person can hear any difference at all. ABX gives a "yes or no" response, not a "more so or less so" response. This is why ff123 conducts preference tests by having listeners rate each sample on a scale from 1 to 5 in 0.1 increments.

What is true is that you can conclude from a group of ABX results whether one audio file has a better chance of being audibly different from the original than another. If, from a group of 30 people, 20 of them reliably ABX sample A but only 10 of them reliably ABX sample B, then you can conclude that sample A is more likely to be detected than sample B. You cannot, however, get a user's preference in this way; that is what user comments and preference tests are for.
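
Counting passes per sample, rather than averaging scores, could look like the Python sketch below. The listener scores are invented purely for illustration and are not results from this test; count_passes is a hypothetical helper.

```python
from math import comb

def guess_tail(correct: int, trials: int) -> float:
    """Chance of `correct` or more right answers from pure guessing."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

def count_passes(scores, trials: int = 16, alpha: float = 0.05) -> int:
    """Number of listeners whose individual score beats guessing
    at the given significance level."""
    return sum(1 for s in scores if guess_tail(s, trials) < alpha)

# Hypothetical per-listener scores out of 16 (made up for this example).
sample_a = [15, 12, 9, 14, 8, 13]
sample_b = [9, 12, 7, 10, 8, 11]

print("listeners who ABXed A:", count_passes(sample_a))
print("listeners who ABXed B:", count_passes(sample_b))
```

This only says which sample is detected by more listeners; it still says nothing about which one they would prefer.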

I think you can make the following statement though: if you tell 30 listeners to only listen for high-frequency roll-off in the sample files A and B (and this isn't realistic, but assume that roll-off is the only artifact the listeners can detect), and 20 of them ABX sample A with a high significance but only 10 ABX sample B with a high significance, then you can say that sample A has a more audible high-frequency roll-off.

Phew, this is quite a post! But hopefully it clears up some misconceptions, rather than adding new ones.
tangent
QUOTE
Originally posted by johanfo
First I'll comment on the former section.
- Do you see your slightly conflicting posts?

Yes, they conflict. But what does it matter? The fact is, using confidence as the measurement of the quality of the track is wrong. Whatever correlation there is, it is very weak and not usable.

QUOTE
On the second paragraph:
I have not actively defended _that_ statement. Instead I admit now that I put too much weight on the "confidence - quality" correlation and that my claim that "this track is better than that" is ambiguous given the data. Again, this doesn't mean I say there is no correlation, I just say that those numbers given are not sufficient to back up the statements.

Actually, the claim that "this track is better than that" is not ambiguous, but simply incorrect. All the data says is that Garf can distinguish the encodings from the original; it says nothing about the quality of the encodings, as he made no mention of quality.

QUOTE
I also emphasized that the comments were mine and that people must make up their own mind from the data... Something I think most people have done here now.

I shall therefore make my own comments about the test and the data.

1. The test was not designed to meet its objective (to compare RC3-4.99 and CBR192), simply because you can never draw any significant conclusions from testing just 2 different samples. Therefore nothing significant about the relative quality of the codecs can be drawn from the results.

2. The conduct of the test was originally flawed because the original sample was not made known. Fortunately this was easily fixed. The other, more serious flaw was that the test did not ask for a subjective assessment of the encoding. This defeats the purpose of the test.

3. Analysis was flawed with one bad conclusion.

4. Note that it does not matter that much that very few people participated in the test. However, if that is the case, it has to be more of a subjective type of test, where an individual rates the quality on the 1-5 scale. Since such a subjective rating method was not used, you can't expect to get very useful results from a small test group.

Did you really read http://ff123.net and http://sjeng.sourceforge.net/audio/codecs.html ? Conducting a good listening test is not something which just anyone can do, and it takes some experience and knowledge.
tangent
QUOTE
Originally posted by shday
Ten people score 10/16, each giving a ~77% probability that their test wasn't a fluke. 

Ten people at ~77% each gives an overall confidence of more than 99.9% (much more I think) that they, as a group, didn't just get lucky.


You need to ask yourself first, under what circumstances will 10 people all score 10/16? I would say it is a fluke: none of them can actually distinguish the encoding from the original, and it just so happens that all 10 of them guessed correctly 10 times. Unlikely as it seems, this is the only possible explanation, because either a person can really distinguish the encoding from the original or the person can't. If he/she can, then he/she should easily score at least 12/16. If he/she can't, then the score will be a random variable following a binomial distribution. The fact that none of them scored 12/16 shows that none of them can distinguish the encoding from the original, and the fact that all scored exactly 10/16 simply shows that a freakish fluke result has taken place.
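
As a rough sketch of how freakish that would be, the binomial probabilities can be computed directly in Python. The assumptions here are that every listener is purely guessing (50/50 per trial) and that the ten listeners are independent; `p_exact` is a hypothetical helper name.

```python
from math import comb

def p_exact(correct, trials):
    """Chance of exactly `correct` hits in `trials` 50/50 guesses."""
    return comb(trials, correct) / 2**trials

p_one = p_exact(10, 16)    # one guesser lands on exactly 10/16
p_all_ten = p_one**10      # ten independent guessers all do so
print(f"exactly 10/16 once: {p_one:.4f}, ten times over: {p_all_ten:.2e}")
```

Under these assumptions a single exact 10/16 is about a 12% event, and ten of them together come out at well under one in a billion.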
2Bdecided
Well, for all your arguing, I think it was a useful test. Next time I want to encode Jumpin Jumpin to a lossy format, and the only two choices available to me are lame 192 CBR and OGG Q4.99, I know I should choose lame 192 CBR!

I will be wary of extrapolating the results beyond this conclusion.

Cheers,
David.
http://www.David.Robinson.org/
tangent
Especially since it's made on just one test sample? :)
Garf
QUOTE
Originally posted by tangent
Especially since it's made on just one test sample? :)


Note that he specifically pointed out 'Next time I want to encode Jumpin Jumpin...'

But really, there were only 4 listeners, so even this is tricky.

--
GCP
Garf
QUOTE
Originally posted by johanfo


The penalizing is just scaling the scores in various ways and has nothing to do with the "information" given by a test. Wrong: -1 and Right: 1 or Wrong: 0 and Right: 1 really doesn't matter.


You can't pass on an ABX test, but it still holds, so you're correct.

--
GCP
Garf
QUOTE
Originally posted by johanfo

First I'll comment on the former section.
- Do you see your slightly conflicting posts?


They don't conflict, one is the limit case of the other (no vs. perhaps sometimes who knows really small etc etc).

It's like saying Newton was all wrong the moment Einstein comes along :)

QUOTE

Instead I admit now that I put too much weight on the "confidence - quality" correlation and that my claim that "this track is better than that" is ambiguous given the data. Again, this doesn't mean I say there is no correlation, I just say that those numbers given are not sufficient to back up the statements.


Unless you scientifically establish a relation between ABX scores and subjective quality that holds in the circumstances of this test (*), _no amount_ of ABX scores will be sufficient to 'back up' the statements.

(*) Which I don't think is possible given you have no control over how the ABX tests are performed.

--
GCP
shday
QUOTE
Originally posted by tangent


The fact that none of them scored 12/16 shows that none of them can distinguish the encoding from the original, and the fact that all scored exactly 10/16 simply shows that a freakish fluke result has taken place.


A more realistic scenario would have been 10 people scoring 10/16 *or better*.

The point is that you can pool these tests, as I've done, and get (very) useful information because the statistical power increases.

QUOTE
Originally posted by SometimesWarrior


If ten people each score 10/16 ABX, it does not mean that the aggregate score was 100/160; it means that no one person could reliably hear the difference between the two samples.


I wasn't suggesting that the probability of one person doing the test 10 times and getting 10/16 (or better) purely by chance is the same as the probability of that person doing one big test with a result of 100/160 (or better) by chance. The confidence of the pooled tests is actually even higher!

My statistics is rusty also... but I'm pretty sure I'm right about this.
erdius
So, what about Ogg 4.99, 5 or even 6 vs mp3 at VBR 192?

Which has the better quality? I am struggling to choose a format for my music. I would choose mpc, but it's not streamable, so I am left with mp3 and ogg...

Right now I have 192 VBR mp3, but if I can get away from the mp3 format, I will...
ff123
Regarding pooling of ABX results:

I think the interpretation depends on what the objective of the experiment is. Here are two different objectives:

1. Can any of the listeners distinguish between the two samples?
2. Can a group of people, the characteristics of which are represented by the listening panel, distinguish between the two samples?

If the objective is the former, just picking the best ABX results from one person will suffice. If the objective is the latter, I believe the ABX results can be pooled (assuming equal listening conditions and equal number of ABX trials).


Regarding using ABX to establish preference: If the choices are between 80% confidence and 99% confidence, then yes, I think it is valid to say that the ABX result with 80% confidence was preferred, because one couldn't even reliably detect a flaw. However, if the choices are between 95% and 99% confidence, then one had better be very careful. If one is comparing the same encoder, but at different bitrates, and one is sure that listener fatigue and listening conditions weren't a large factor, it might be valid to conclude that the 95% sample was preferred. But far better to use a test which directly asks for preference. And I wouldn't be so bold if different encoders were involved.

ff123
JohnV
One thing I think we can all agree on is that the discussion about the results of this test is only theoretical speculation.

There's no way this test can indicate this or that regarding quality between Ogg-4.99 vs lame192cbr.
2 samples were tested and only 4 people attended, so one really can't talk about "results" here.
johanfo
QUOTE
Originally posted by tangent

I shall therefore make my own comments about the test and the data.

1. [etc... 
2. [etc... 
3. [etc... 
4. [etc... 

Conducting a good listening test is not something which just anyone can do, and it takes some experience and knowledge.

I think you are painting too black a picture here. I'm not trying to win the Nobel Prize with this test, nor to perform the kind of thorough tests done by ff123. It was a simple test to see if there is any difference between CBR 192 and -q4.99. Without any confidence intervals, it seemed to me that CBR 192 had the edge. End of story.

I have even read that some people here found the test interesting. :)
johanfo
QUOTE
Originally posted by JohnV
One thing I think we can all agree on is that the discussion about the results of this test is only theoretical speculation.
There's no way this test can indicate this or that regarding quality between Ogg-4.99 vs lame192cbr.
2 samples were tested and only 4 people attended, so one really can't talk about "results" here.


People are now talking about a lot of things:

The test and its conclusion.
I agree with you. The test gives little data, from few people listening to 2 tracks of 25 seconds each. However, people might find it interesting to see that there was no BIG difference. Nobody found any extreme flaws in the encodings, so there is some information here.

Second.
A correlation between quality and ABX results (not necessarily a strong relation). Read ff123's previous post here, which I feel backs my core statements.