AAC @ 128kbps listening test discussion |
Mar 1 2004, 18:18
Post
#301
|
|
Rarewares admin Group: Members Posts: 7515 Joined: 30-September 01 From: Brazil Member No.: 81 |
QUOTE (rjamorim @ Feb 29 2004, 11:44 PM)
Can someone enlighten me on the origins of Velvet? http://lame.sourceforge.net/download/samples/velvet.wav All I know is that it was submitted by Roel (r3mix). Does anybody know artist (Velvet Underground?), title and album of this song? Also, what would be the style (no way to figure out from just the introduction)
ff123 already enlightened me about it. Thank-you very much. Details are available at the listening test results page.
-------------------- Get up-to-date binaries of Lame, AAC, Vorbis and much more at RareWares:
http://www.rarewares.org |
|
|
|
Mar 1 2004, 18:18
Post
#302
|
|
|
Group: Members Posts: 881 Joined: 11-October 02 Member No.: 3523 |
QUOTE (rjamorim @ Mar 1 2004, 06:12 PM)
Nope. I couldn't decrypt your sample 09 results. It's the only result file that gave me problems in the entire test. I sent it to schnofler so that he can investigate. Sorry about that.
Damn, I shouldn't have tried to manipulate the result files.
-------------------- I know, that I know nothing (Socrates)
|
|
|
|
Mar 5 2004, 07:49
Post
#303
|
|
Rarewares admin Group: Members Posts: 7515 Joined: 30-September 01 From: Brazil Member No.: 81 |
A VERY IMPORTANT STATEMENT
OK. It seems I f-ed up very badly this time.
First, let me specify what ISN'T wrong: The ranking values are absolutely correct, as well as the screening methodology and the statistical calculations.
What is wrong: The error bars. I didn't check how the error bars were being drawn in the Excel spreadsheet I got from ff123. I thought the plots were getting values from a certain cell, but actually the values were hard-coded in the plot building routines. So, the error bars are to this day the same as the ones used in his 64kbps listening test. And it affects all my listening tests. Both the overall plots and the individual ones.
I can't express how sorry I am. Tomorrow I'll start fixing all the test results pages. Until I announce the results have been fixed, please disregard them.
In case someone is in a hurry to check the corrected zoomed result plot for the AAC test: http://pessoal.onda.com.br/rjamorim/screen2.png
The only thing that changed is that iTunes is now clearly first place and Nero is second place.
Again, I'm terribly sorry. I can already feel my credibility going down the drain.
Kind regards; Roberto Amorim. |
|
|
|
Mar 5 2004, 08:00
Post
#304
|
|
ABC/HR developer, ff123.net admin Group: Developer (Donating) Posts: 1396 Joined: 24-September 01 Member No.: 12 |
QUOTE (rjamorim @ Mar 4 2004, 10:49 PM)
What is wrong: The error bars. I didn't check how the error bars were being drawn in the excel spreadsheet I got from ff123. I thought the plots were getting values from a certain cell, but actually the values were hard-coded in the plot building routines.
The fault is also mine for not making it perfectly clear how I was drawing the error bars. Plus I violated an Excel/software rule by not using a spreadsheet as a spreadsheet should be used, instead hard-coding in the error bar values.
QUOTE
Again, I'm terribly sorry. I can already feel my credibility going down the drain.
Your integrity is intact. Credibility is a matter of trust. If you own up to your mistakes, correct them, and prevent future ones, that goes a long way towards enhancing your credibility.
I suggest keeping both the old (incorrect) overall graphs and showing the new, corrected overall graphs side by side, to show the before and after. I think the individual sample graphs can just be replaced.
ff123
Edit: You should probably rename the old overall graph and then use the original name of the graph for the corrected one. That way, websites which link to your overall graphs will be automatically updated.
This post has been edited by ff123: Mar 5 2004, 08:23 |
|
|
|
Mar 5 2004, 08:12
Post
#305
|
|
|
Group: Super Moderator Posts: 332 Joined: 20-May 03 From: Pittsburgh, USA Member No.: 6718 |
QUOTE (ff123 @ Mar 5 2004, 03:00 AM)
Your integrity is intact. Credibility is a matter of trust. If you own up to your mistakes, correct them, and prevent future ones, that goes a long way towards enhancing your credibility.
Your integrity is, indeed, intact. I've seen a few other listening tests online, and discussion of their results always stops soon after the tests, with the page receding into internet history. Updating these tests now goes a long way toward proving their reliability will be maintained in the future.
-------------------- [url=http://noveo.net/ph34r.htm]Happiness[/url] - The agreeable sensation of contemplating the misery of others.
|
|
|
|
Mar 5 2004, 09:01
Post
#306
|
|
Server Admin Group: Admin Posts: 4810 Joined: 24-September 01 Member No.: 13 |
QUOTE (rjamorim @ Mar 5 2004, 08:49 AM)
In case someone is in a hurry to check the corrected zoomed result plot for the AAC test: http://pessoal.onda.com.br/rjamorim/screen2.png The only thing that changed is that iTunes is now clearly first place and Nero is second place.
Aaaaaah, this explains my previous complaint that the graph didn't seem to align with your written statement about the test significance. Now it does. iTunes indeed almost beats Nero by a significant margin.
As far as the moral winner is concerned, though: |
|
|
|
Mar 5 2004, 09:20
Post
#307
|
|
Group: Members Posts: 473 Joined: 7-June 02 Member No.: 2244 |
QUOTE (Garf @ Mar 5 2004, 09:01 AM)
As far as the moral winner is concerned, though:
"Moral winner"? |
|
|
|
Mar 5 2004, 09:27
Post
#308
|
|
Rarewares admin Group: Members Posts: 7515 Joined: 30-September 01 From: Brazil Member No.: 81 |
QUOTE (Garf @ Mar 5 2004, 05:01 AM)
Now it does. iTunes indeed almost beats Nero by a significant margin.
Erm.. I use Darryl's method to evaluate ranking positions. Check, for instance, thear1 in his 64kbps test results: http://ff123.net/64test/results.html
Oggs are ranked second, according to him, although they overlap a little with MP3pro.
To put it short, I (and ff123, it seems) only consider codecs tied when one's confidence margin overlaps with the other's actual ranking. Or, to make things simpler, when more than half of the entire margins overlap. |
|
|
|
Mar 5 2004, 09:28
Post
#309
|
|
LAME developer Group: Developer Posts: 2950 Joined: 1-October 01 From: Nanterre, France Member No.: 138 |
QUOTE
I can already feel my credibility going down the drain
Finding, admitting, correcting your own errors only increases credibility I think. |
|
|
|
Mar 5 2004, 13:29
Post
#310
|
|
Group: Members (Donating) Posts: 3474 Joined: 7-November 01 From: Strasbourg (France) Member No.: 420 |
Your credibility, your honesty and your honor are now stronger. Thank you.
|
|
|
|
Mar 5 2004, 16:12
Post
#311
|
|
Group: Banned Posts: 769 Joined: 1-July 03 Member No.: 7495 |
You have nothing to worry about, Roberto...your credibility is quite secure. Anyone who conducts tests like this will occasionally make a mistake. It's inevitable. You took the best approach in resolving it. Our trust in you is only higher now.
QUOTE (rjamorim @ Mar 5 2004, 03:27 AM)
QUOTE (Garf @ Mar 5 2004, 05:01 AM)
Now it does. iTunes indeed almost beats Nero by a significant margin.
...To put it short, I (and ff123, it seems) only consider codecs tied when one's confidence margin overlaps with the other's actual ranking. Or, to make things simpler, when more than half of the entire margins overlap.
That's what I had always thought was the case, but it was just an assumption on my part (that I never communicated). Glad to know it was correct. |
|
|
|
Mar 5 2004, 16:34
Post
#312
|
|
ABC/HR developer, ff123.net admin Group: Developer (Donating) Posts: 1396 Joined: 24-September 01 Member No.: 12 |
QUOTE (ScorLibran @ Mar 5 2004, 07:12 AM)
QUOTE
...To put it short, I (and ff123, it seems) only consider codecs tied when one's confidence margin overlaps with the other's actual ranking. Or, to make things simpler, when more than half of the entire margins overlap.
That's what I had always thought was the case, but it was just an assumption on my part (that I never communicated). Glad to know it was correct.
To be absolutely correct, a codec wins with 95% confidence, for that group of listeners and set of samples, when the bars do not overlap. Or to put it another way, 19 times out of 20, those results would not occur by chance.
Any overlap reduces that confidence. If the bars just barely overlap, there is still quite a high likelihood that that result did not occur by chance. A reasonable way to describe this situation would be to say that the results are suggestive (if not significant).
Actually, in an ideal world, the graphs would speak for themselves, and there would be no "interpretation" to cause controversy. If this were a drug test or something else where there is a lot at stake for making the right decision, everything below 95% confidence (or whatever threshold is chosen) would not be considered to be significant. Also, the test would be corrected for comparing multiple samples, which would make the error bars overlap more.
I personally don't think it's a real big deal if the type I errors in this sort of test (falsely identifying a codec as being better than another) are higher than they would be in a more conservative analysis. But others, for example on slashdot, can (and do) complain about this sort of thing.
ff123 |
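As a rough illustration of why an uncorrected analysis lets more type I errors through, here is a small Python sketch. It assumes the pairwise comparisons are independent, which is not really true for Fisher LSD, so the figure is only indicative and is not the actual error rate of this test. The function name is made up for the example, and the count of 5 codecs is taken from the codecs named in this thread (iTunes, Nero, FAAC, Real, Compaact).

```python
def familywise_error_rate(num_codecs, alpha=0.05):
    """Chance of at least one false 'A beats B' across all codec pairs.

    Simplifying assumption: every pairwise comparison is independent.
    """
    num_pairs = num_codecs * (num_codecs - 1) // 2
    return 1 - (1 - alpha) ** num_pairs


# 5 codecs give 10 pairwise comparisons at alpha = 0.05
print(round(familywise_error_rate(5), 3))  # prints 0.401
```

A corrected analysis would use a stricter per-comparison threshold to pull that figure back toward 5%, which is why the error bars would get longer and overlap more.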
|
|
|
Mar 5 2004, 16:47
Post
#313
|
|
Server Admin Group: Admin Posts: 4810 Joined: 24-September 01 Member No.: 13 |
I take it from the previous comment by rjamorim that 'bars' should be interpreted as 'error bars' and 'mean score marker' and not 2x 'error bars'?
|
|
|
|
Mar 5 2004, 17:05
Post
#314
|
|
ABC/HR developer, ff123.net admin Group: Developer (Donating) Posts: 1396 Joined: 24-September 01 Member No.: 12 |
QUOTE (rjamorim @ Mar 5 2004, 12:27 AM)
Check, for instance, thear1 in his 64kbps test results http://ff123.net/64test/results.html Oggs are ranked second, according to him, although they overlap a little with MP3pro.
In that test I used an "eyeball" method to rank the codecs when trying to determine an appropriate overall ranking. People (including me) didn't like the subjectivity involved in that method, so I changed to the method used now, which is to perform another ANOVA/Fisher LSD once the means for each music sample are determined.
The assumption this method makes is that each sample is equally important to the final overall results. This may not actually be true if, for example, there are lots of people listening to some samples and only a few listening to others. Also, the choice of samples greatly affects the overall results. But at least it seems to produce reasonable results, and it's removed the subjectivity involved in the earlier method.
QUOTE
I take it from the previous comment by rjamorim that 'bars' should be interpreted as 'error bars' and 'mean score marker' and not 2x 'error bars'?
The length of each error bar from top to bottom (mean in the middle) is equal to the Fisher LSD.
ff123 |
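For readers who want to see the arithmetic, below is a minimal Python sketch of an overall Fisher LSD computed from per-sample means. It is not ff123's spreadsheet or tool. It assumes a blocked two-way layout (samples as blocks, codecs as treatments, one mean score per cell), which is one common way to set up this kind of analysis. The function name and the scores are hypothetical, and the critical t value has to be looked up by the caller for the error degrees of freedom, (samples - 1) * (codecs - 1).

```python
from statistics import mean


def fisher_lsd(scores, t_crit):
    """scores[i][j] is the mean score of codec j on music sample i.

    t_crit is the two-sided critical t value for the error degrees of
    freedom, supplied by the caller (for example from a t table).
    Returns (codec_means, lsd).
    """
    n = len(scores)      # number of samples (blocks)
    k = len(scores[0])   # number of codecs (treatments)
    grand = mean(x for row in scores for x in row)
    codec_means = [mean(row[j] for row in scores) for j in range(k)]
    sample_means = [mean(row) for row in scores]
    # Residual sum of squares after removing codec and sample effects
    sse = sum(
        (scores[i][j] - codec_means[j] - sample_means[i] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    mse = sse / ((n - 1) * (k - 1))
    lsd = t_crit * (2 * mse / n) ** 0.5
    return codec_means, lsd


# Made-up scores: 3 samples x 3 codecs, so 4 error degrees of freedom.
# 2.776 is the two-sided 95% critical t value for 4 degrees of freedom.
made_up = [[4.5, 4.0, 3.0], [4.2, 4.1, 2.8], [4.8, 3.9, 3.1]]
means, lsd = fisher_lsd(made_up, t_crit=2.776)
print(means, lsd)
```

Each codec's error bar would then be drawn from mean - lsd / 2 to mean + lsd / 2, matching the "top to bottom equals the Fisher LSD" description above. Every sample row counts equally here, which is the assumption mentioned in the post.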
|
|
|
Mar 5 2004, 17:16
Post
#315
|
|
Group: Developer Posts: 2797 Joined: 22-September 01 Member No.: 6 |
QUOTE (ff123 @ Mar 5 2004, 05:34 PM)
To be absolutely correct, a codec wins with 95% confidence, for that group of listeners and set of samples, when the bars do not overlap. Or to put it another way, 19 times out of 20, those results would not occur by chance. Any overlap reduces that confidence. If the bars just barely overlap, there is still quite a high likelihood that that result did not occur by chance. A reasonable way to describe this situation would be to say that the results are suggestive (if not significant). Actually, in an ideal world, the graphs would speak for themselves, and there would be no "interpretation" to cause controversy. If this were a drug test or something else where there is a lot at stake for making the right decision, everything below 95% confidence (or whatever threshold is chosen) would not be considered to be significant. Also, the test would be corrected for comparing multiple samples, which would make the error bars overlap more. I personally don't think it's a real big deal if the type I errors in this sort of test (falsely identifying a codec as being better than another) are higher than they would be in a more conservative analysis. But others, for example on slashdot, can (and do) complain about this sort of thing. ff123
Right, well, with 95% confidence for the tested 12 samples:
iTunes is better than Real, FAAC and Compaact
Nero is better than Real and Compaact
With lower confidence for the tested 12 samples:
Nero is better than FAAC (small overlap)
With even lower confidence for the tested 12 samples:
iTunes is better than Nero (a bit bigger overlap than with Nero-FAAC)
Correct?
-------------------- Juha Laaksonheimo
|
|
|
|
Mar 5 2004, 17:36
Post
#316
|
|
Server Admin Group: Admin Posts: 4810 Joined: 24-September 01 Member No.: 13 |
QUOTE (ff123 @ Mar 5 2004, 06:05 PM)
The length of each error bar from top to bottom (mean in the middle) is equal to the Fisher LSD.
So there shouldn't be any overlap between error bars at all, if I get that correctly, since no overlap between error bar and mean is only half the error length. (And hence my original comment was right). |
|
|
|
Mar 5 2004, 17:53
Post
#317
|
|
|
Group: Banned Posts: 6 Joined: 5-March 04 Member No.: 12484 |
QUOTE (rjamorim @ Mar 5 2004, 12:27 AM)
QUOTE (Garf @ Mar 5 2004, 05:01 AM)
Now it does. iTunes indeed almost beats Nero by a significant margin.
Erm.. I use Darryl's method to evaluate ranking positions. Check, for instance, thear1 in his 64kbps test results http://ff123.net/64test/results.html Oggs are ranked second, according to him, although they overlap a little with MP3pro. To put it short, I (and ff123, it seems) only consider codecs tied when one's confidence margin overlaps with the other's actual ranking. Or, to make things simpler, when more than half of the entire margins overlap.
how about this one? where is the truth? |
|
|
|
Mar 5 2004, 18:05
Post
#318
|
|
ABC/HR developer, ff123.net admin Group: Developer (Donating) Posts: 1396 Joined: 24-September 01 Member No.: 12 |
QUOTE (Garf @ Mar 5 2004, 08:36 AM)
QUOTE (ff123 @ Mar 5 2004, 06:05 PM)
The length of each error bar from top to bottom (mean in the middle) is equal to the Fisher LSD.
So there shouldn't be any overlap between error bars at all, if I get that correctly, since no overlap between error bar and mean is only half the error length. (And hence my original comment was right).
Yes. If the error bars do not overlap, that is a difference to 95% confidence. And yes, iTunes almost beats Nero with 95% confidence. |
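The two readings discussed in the thread can be put side by side in a short Python sketch. It assumes each bar extends half an LSD above and below its mean, as ff123 describes, so bars fail to overlap exactly when the means differ by more than one LSD. The function name, the three labels, and the numbers are hypothetical and only serve the illustration.

```python
def compare(mean_a, mean_b, lsd):
    """Classify the difference between two codec means.

    Each error bar is assumed to span mean - lsd/2 .. mean + lsd/2.
    """
    diff = abs(mean_a - mean_b)
    if diff > lsd:
        # Bars do not overlap at all: the strict 95% confidence reading
        return "significant"
    if diff > lsd / 2:
        # Bars overlap, but neither bar reaches the other codec's mean
        return "suggestive"
    # One bar covers the other mean: tied under the looser reading too
    return "tied"


print(compare(4.0, 3.0, 0.8))  # prints significant
print(compare(4.0, 3.5, 0.8))  # prints suggestive
print(compare(4.0, 3.8, 0.8))  # prints tied
```

Under rjamorim's rule the middle case counts as a ranking difference, while under the strict reading it is only "suggestive", which is the situation described for iTunes versus Nero.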
|
|
|
Mar 5 2004, 18:12
Post
#319
|
|
Group: Members Posts: 265 Joined: 15-December 03 Member No.: 10452 |
Is there anything in the testing methodology to ensure that iTunes does not sound "better" than the original CD through the addition of some audio "sugar"?
I hope the experts around here do not think this is too off the wall. For that matter I don't know if there is a way to make any recording sound "better" than the original. |
|
|
|
Mar 5 2004, 18:16
Post
#320
|
|
ABC/HR developer, ff123.net admin Group: Developer (Donating) Posts: 1396 Joined: 24-September 01 Member No.: 12 |
QUOTE (Zed @ Mar 5 2004, 08:53 AM)
The biggest weakness of this test IMO is that there were only 3 samples tested, and they made it even worse by combining them into one medley.
Other problems: IIRC, people were asked to rank the codecs from best to worst, not to compare and rate against a known reference. I believe the reference was hidden as one of the samples to be ranked.
But the 3 sample medley is really the killer. They would have been much better off distributing lots of different samples (with that amount of listeners they could have distributed 50 different samples with ease) to determine which codec is better overall.
ff123 |
|
|
|
Mar 5 2004, 18:17
Post
#321
|
|
Rarewares admin Group: Members Posts: 7515 Joined: 30-September 01 From: Brazil Member No.: 81 |
Hello.
Thank-you very much for your support.
I have been correcting the plots (will upload them later) and so far, it seems very few will change:
-At the first AAC@128kbps test, it only becomes more clear that QuickTime is the winner.
-At the Extension test, it seems Vorbis and WMAPro are no longer tied to AAC and MPC, and now share second place. I'll leave it to others to discuss.
-The 64kbps test results stay the same: Lame wins, followed by HE AAC, then MP3pro, then Vorbis. LC AAC, Real and WMA are still tied at fifth place, and FhG MP3 is still way down the graph.
-The MP3 test stays the same as well.
Regards; Roberto.
This post has been edited by rjamorim: Mar 5 2004, 18:24 |
|
|
|
Mar 5 2004, 18:17
Post
#322
|
|
ABC/HR developer, ff123.net admin Group: Developer (Donating) Posts: 1396 Joined: 24-September 01 Member No.: 12 |
QUOTE (eagleray @ Mar 5 2004, 09:12 AM)
Is there anything in the testig methodology to assure that iTunes does not sound "better" than the original CD through the addition of some audio "sugar"? I hope the experts around here do not think this is too off the wall. For that matter I don't know if there is a way to make any recording sound "better" than the original.
Yes, the listener is asked to rate the sample against the reference. The reference is 5.0 by default, so any difference, even if it "sounds better" than the reference, must be rated less than 5.0.
ff123 |
|
|
|
Mar 5 2004, 18:28
Post
#323
|
|
|
Group: Banned Posts: 6 Joined: 5-March 04 Member No.: 12484 |
QUOTE (ff123 @ Mar 5 2004, 09:16 AM)
But the 3 sample medley is really the killer. They would have been much better off distributing lots of different samples (with that amount of listeners they could have distributed 50 different samples with ease) to determine which codec is better overall.
But a small number of ears is also a killer, I guess... |
|
|
|
Mar 5 2004, 18:57
Post
#324
|
|
ABC/HR developer, ff123.net admin Group: Developer (Donating) Posts: 1396 Joined: 24-September 01 Member No.: 12 |
QUOTE (Zed @ Mar 5 2004, 09:28 AM)
QUOTE (ff123 @ Mar 5 2004, 09:16 AM)
But the 3 sample medley is really the killer. They would have been much better off distributing lots of different samples (with that amount of listeners they could have distributed 50 different samples with ease) to determine which codec is better overall.
but small number of the ears is also the killer i guess...
They had about 3000 listeners for both the 64 kbit/s and 128 kbit/s tests. If they had distributed 50 separate samples instead of the one medley, they could have gotten more than 50 listeners per sample. That's more than enough to make a statistical inference. In fact, one can do quite well with far fewer.
ff123 |
|
|
|
Mar 5 2004, 19:07
Post
#325
|
|
Server Admin Group: Admin Posts: 4810 Joined: 24-September 01 Member No.: 13 |
The test also seems at least 1.5 years old. Lots has happened in that time with AAC.
This post has been edited by Garf: Mar 5 2004, 19:08 |
|
|
|