Public MP3 Listening Test @ 128 kbps - FINISHED |
Nov 28 2008, 21:02
Post
#176
|
|
|
Group: Members Posts: 365 Joined: 21-November 02 Member No.: 3830 |
You should have brought in your peers (yourself too) to inflate the sample size (no. of participants), so that the magical black bars decrease in length. That's what I thought happens too, but it seems not to have had an effect: If you look at the first sample which had 39 listeners, the bars are about as long as the second sample which had 26 listeners, and definitely longer than the third sample which also had 26 listeners. It does have an effect. I never said it is the only thing that influences the error margins. What are the other factors? I mean all this time members of this forum have strongly recommended against using the mp3 encoder in itunes, Probably based on the last mp3 listening test where it SUCKED. That still seems an overly strong interpretation to me. iTunes did lose to Lame 3.95 and AudioActive, but in the results, Roberto says at the beginning: "I would like to point out two very serious issues with this test: not using the latest version of Xing, bundled with Real Player, that has been reportedly extensively tuned since version 1.5; and forcing VBR on codecs that shouldn't be using them. I'm confident iTunes MP3 would perform better if it was featured at CBR 128, and the same might apply to FhG. I take full responsibility on those mistakes, and for them, I apologize." To me, "sucks" means not only losing the test (without caveats of choosing the wrong setting) but also a score below 3. |
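On the question of what else influences the error margins: for a plain confidence interval around a mean, the half-width depends on how much the ratings are spread out as well as on the number of listeners. Below is a minimal Python sketch of that idea. It assumes a normal-approximation interval, which may not be the method used to draw the bars in the test's graphs, and the ratings are made up purely for illustration; only the listener counts (39 and 26) come from the post above.

```python
import math
import statistics

def ci_half_width(ratings, confidence=0.95):
    """Half-width of a normal-approximation confidence interval for the mean."""
    n = len(ratings)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * statistics.stdev(ratings) / math.sqrt(n)

# Hypothetical ratings, NOT data from the listening test.
# 39 listeners who disagree a lot (scores spread from 1 to 5).
many_but_spread = [1.0, 2.0, 3.0, 4.0, 5.0] * 7 + [2.0, 3.0, 4.0, 3.0]
# 26 listeners who mostly agree (scores between 3.5 and 4.5).
fewer_but_tight = [3.5, 4.0, 4.5, 4.0] * 6 + [4.0, 4.0]

wide = ci_half_width(many_but_spread)
narrow = ci_half_width(fewer_but_tight)
print(len(many_but_spread), "listeners, half-width", round(wide, 2))
print(len(fewer_but_tight), "listeners, half-width", round(narrow, 2))
```

With these invented numbers the larger group ends up with the longer bar, because its ratings disagree more. That would be one possible explanation for a 39-listener sample showing bars as long as a 26-listener one.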
|
|
|
Nov 28 2008, 21:11
Post
#177
|
|
|
Winamp Developer Group: Developer Posts: 662 Joined: 17-July 05 From: Ashburn, VA Member No.: 23375 |
That still seems an overly strong interpretation to me. iTunes did lose to Lame 3.95 and AudioActive, but in the results, Roberto says at the beginning: "I would like to point out two very serious issues with this test: not using the latest version of Xing, bundled with Real Player, that has been reportedly extensively tuned since version 1.5; and forcing VBR on codecs that shouldn't be using them. I'm confident iTunes MP3 would perform better if it was featured at CBR 128, and the same might apply to FhG. I take full responsibility on those mistakes, and for them, I apologize." To me, "sucks" means not only losing the test (without caveats of choosing the wrong setting) but also a score below 3. Helix (somewhat-formerly Xing), iTunes and FhG were all tested with VBR this go-around. This makes the comparison to that previous test all the more relevant. |
|
|
|
Nov 28 2008, 21:20
Post
#178
|
|
|
Group: Members Posts: 365 Joined: 21-November 02 Member No.: 3830 |
Good point. I still think "sucks" lines up more with a score of 1, otherwise the wording on ABC/HR rankings should be revised
|
|
|
|
Nov 28 2008, 21:35
Post
#179
|
|
![]() Group: Developer Posts: 224 Joined: 14-September 04 Member No.: 17002 |
...the results, ... The interpretation of that result surprises me. Roberto said "Although iTunes is a little tied with Gogo, it's safe to say it lost." But actually iTunes was tied with FhG, Gogo and Xing, so it wasn't safe to say "it lost". Also Lame was tied with AActive so "Lame wins, followed by AudioActive" is also not valid. Or am I misinterpreting the graph? |
|
|
|
Nov 28 2008, 21:38
Post
#180
|
|
![]() Group: Super Moderator Posts: 9365 Joined: 1-April 04 Member No.: 13167 |
I think you're right, Raiden.
The iTunes mp3 bashing based on the results of that test has always annoyed me. This post has been edited by greynol: Nov 28 2008, 22:26 -------------------- Everything sounds the same until it is proven otherwise.
|
|
|
|
Nov 28 2008, 23:07
Post
#181
|
|
|
Group: Developer (Donating) Posts: 2332 Joined: 28-June 02 From: Argentina Member No.: 2425 |
It was at 3.04, against LAME's 3.74. LAME won that one, and iTunes did not offer anything better, so it sucked. -------------------- MAREO: http://www.webearce.com.ar
|
|
|
|
Nov 29 2008, 00:32
Post
#182
|
|
![]() Group: Super Moderator Posts: 9365 Joined: 1-April 04 Member No.: 13167 |
Not you too, kwanbis!
The difference between Lame and iTunes in that test could be smaller than 0.3; less than half the amount you're suggesting. Not to mention the bitrate on samples created with iTunes was consistently and significantly less than Lame. By the same token, the difference could be greater than 1.1. Even so, the conclusions people make from that test are annoying. This post has been edited by greynol: Nov 29 2008, 01:07 -------------------- Everything sounds the same until it is proven otherwise.
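The 0.3 and 1.1 figures are consistent with the two mean scores quoted earlier (3.74 and 3.04) if each score carries an error margin of roughly ±0.2. That ±0.2 is an assumption worked backwards from the numbers in this post, not a value taken from the test report, and stacking the margins this way is a crude worst-case bound rather than a proper confidence interval for the difference. A rough sketch of the arithmetic in Python:

```python
def difference_range(mean_a, mean_b, margin_a, margin_b):
    """Smallest and largest difference (a - b) allowed by the two error margins."""
    low = (mean_a - margin_a) - (mean_b + margin_b)
    high = (mean_a + margin_a) - (mean_b - margin_b)
    return low, high

# Means as quoted in the thread; the 0.2 margins are assumed for illustration.
low, high = difference_range(3.74, 3.04, 0.2, 0.2)
print(round(low, 2), round(high, 2))  # prints: 0.3 1.1
```

The point estimate of the gap is 0.70, but under these assumed margins anything between about 0.3 and 1.1 would be compatible with the same graph.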
|
|
|
|
Nov 29 2008, 03:55
Post
#183
|
|
![]() Group: Members Posts: 9 Joined: 11-October 08 Member No.: 59933 |
that's why i'm thinking something like http://www.hydrogenaudio.org/forums/index....mp;#entry601863 could be beneficial. i think it could be useful if there was some kind of an instruction like "how to do the test properly".
i know this has already been partially discussed, but i felt like it was a good moment to mention it. also, everyone participating in such a test should look up information on how to interpret the results (maybe provide a faq or something like that?) edit: trying to correct errors This post has been edited by null-null-pi: Nov 29 2008, 04:38 -------------------- 10 FOR I=1 TO 3:PRINT"DAMN":NEXT
|
|
|
|
Nov 29 2008, 04:28
Post
#184
|
|
|
Group: Developer (Donating) Posts: 2332 Joined: 28-June 02 From: Argentina Member No.: 2425 |
Not you too, kwanbis! The difference between Lame and iTunes in that test could be smaller than 0.3; less than half the amount you're suggesting. Not to mention the bitrate on samples created with iTunes was consistently and significantly less than Lame. By the same token, the difference could be greater than 1.1. Even so, the conclusions people make from that test are annoying. Point is at that time, even LAME sucked, but it sucked the least. So with all encoders being "ho-hum", what was the point of recommending the worst one? LAME has statistically won, whether by 0.3 or by 1.1. So, what was the point of recommending iTunes? -------------------- MAREO: http://www.webearce.com.ar
|
|
|
|
Nov 29 2008, 04:41
Post
#185
|
|
|
Group: Members Posts: 2089 Joined: 18-December 03 Member No.: 10538 |
I think what this thread is telling us is that HA needs a wiki page on statistics....and it should be required reading before posting about public listening test results, just as we required ABX results for audio claims.
(No, I won't be writing it; I'm not a stats expert. I have a few biostatistics books on my reference shelf, and I know just enough to be suspicious of claims being made here about some codecs being better than others, based on these results. In other words, I'm with greynol: there is no statistically significant difference I can see here, for general guidance). Btw, who here does have a solid background in statistical analysis? Just curious. |
|
|
|
Nov 29 2008, 15:42
Post
#186
|
|
|
Group: Members Posts: 2297 Joined: 9-October 05 From: Dormagen, Germany Member No.: 25015 |
A more profound knowledge of statistics can help in figuring out that it's necessary to respect the confidence intervals when judging the overall average outcome.
The deeper problem is: what is the worth of this overall average outcome? We all want to have life easy but IMO the average outcome is considered so important by many people only because it's such a simple scheme. I personally prefer an outcome that shows that a certain candidate is good or at least not bad on all of the samples tested (or at least those samples that have a personal meaning). I'm well aware that this (or similar schemes) brings some amount of subjectivity (but does not at all make things arbitrary) and brings no general consensus, but in the end deciding on an encoder to use is an individual decision. This post has been edited by halb27: Nov 29 2008, 23:04 -------------------- lame3100k -V0 --cvbr 9
|
|
|
|
Nov 30 2008, 11:18
Post
#187
|
|
|
Group: Members Posts: 2297 Joined: 9-October 05 From: Dormagen, Germany Member No.: 25015 |
I wanted to give FhG a try because of this test's outcome, and I wanted to concentrate on the most serious tonal problems I know (herding_calls, trumpet, trumpet_myPrince), which I wanted to bring to a level where the issues are non-obvious and not at all annoying when listening carefully. And I wanted to stay at around 200 kbps maximum (having to care about file size in the near future).
What I learnt from the test is that I don't have to care about the extreme HF region and can be content with HF up to 16 kHz which is very favorable when using mp3. As a reference I used Lame 3.98.2. I figured out that -V0.5 --lowpass 16.0 is the setting which brings the desired quality for me. With my test set of typical regular music the average bitrate for this setting is 205 kbps. Then I tried FhG surround, but didn't manage to find a competitive setting (even with lowpassing before encoding with one of the higher quality settings). So FhG isn't an alternative to me. Though I didn't want to pick up again Helix as I decided some time ago not to use it I was curious how Helix behaved. At relatively low bitrate Helix has serious problems with all the samples (especially the 'tremolo' effect with trumpet_myPrince). But from a certain point on there is a strong quality increase. With '-X2 -U2 -SBT500 -TX0 -V110' my quality demands are met. With my typical regular music test set the average bitrate for this setting is 179 kbps. As this is a good result I looked up those samples where in the past I found a subtle issue concerning HF behavior and 'vividness'. I couldn't hear an issue, probably due to my reduced demands (and limited time I wanted to spend for this small test). Struggling for lower bitrate with Lame I arrived at --abr 200 --lowpass 16.0 which gives an average bitrate of 191 kbps with my typical regular test set and meets my quality demands for these bad tonal problems. But in this bitrate range and for general usage I prefer VBR, now that I'm happy with Lame's VBR behavior. I will use Lame -V0.5 --lowpass 16.0 in the future. I have no reason to prefer it over Helix. It's just personal - my emotions are more with Lame than with Helix. This post has been edited by halb27: Nov 30 2008, 11:20 -------------------- lame3100k -V0 --cvbr 9
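For anyone wanting to repeat the two LAME settings from this post on their own files, here is a small Python sketch that only assembles the command lines. The switches (`-V0.5`, `--abr 200`, `--lowpass 16.0`) are taken verbatim from the post; the file names are placeholders, and it is assumed that a `lame` binary is on the PATH and accepts these switches the way Lame 3.98.2 did. The Helix line is left out because the post does not name the encoder executable.

```python
import shlex

def lame_command(input_wav, output_mp3, quality_args):
    """Build a LAME argument list with the 16 kHz lowpass used in the post."""
    return ["lame", *quality_args, "--lowpass", "16.0", input_wav, output_mp3]

# Placeholder file names, chosen only for this illustration.
vbr = lame_command("in.wav", "out_vbr.mp3", ["-V0.5"])         # VBR setting
abr = lame_command("in.wav", "out_abr.mp3", ["--abr", "200"])  # ABR setting

print(shlex.join(vbr))
print(shlex.join(abr))
```

To actually encode, each list could be passed to `subprocess.run(cmd, check=True)`. The average bitrates reported above (205 kbps and 191 kbps) depend on the poster's own music set and should not be expected to match on other material.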
|
|
|
|
Nov 30 2008, 18:04
Post
#188
|
|
![]() Group: Members Posts: 304 Joined: 9-August 02 From: SoFo Member No.: 3002 |
QUOTE '-X2 -U2 -SBT500 -TX0 -V110' What do these switches do? I tried to compare the command line used in the test, '-X2 -U2 -V60' with the one suggested for metal, '-X2 -HF2 -SBT500 -TX0 -C0 -V60'. I found the latter to be harder to ABX even though I could ABX both (I only tried it on two tracks). Halb27, have you tried that command line? |
|
|
|
Nov 30 2008, 21:22
Post
#189
|
|
|
Group: Members Posts: 2297 Joined: 9-October 05 From: Dormagen, Germany Member No.: 25015 |
non-audio related switches:
-X2: MPEG compatible Xing Header
-U2: encoding speedup (uses SSE)
-C0: clear copyright bit
audio related switches:
-Vx: VBR quality (range 0-150)
-HF2: makes use of HF > 16 kHz
-SBTy: short block threshold
-TXz: nobody seems to be able to tell about this switch.
level is the one who studied these switches most intensively at VBR settings ~ -V120. As I'm in this bitrate range I simply follow his settings. I did some tests on my own with TX0...TX8 at -V110, and though I think I can hear differences they are so subtle that I personally can't say which one is best. Same goes for -SBT500 or default setting. I don't use -HF2 as I don't need frequencies beyond 16 kHz. This post has been edited by halb27: Nov 30 2008, 21:27 -------------------- lame3100k -V0 --cvbr 9
|
|
|
|
Dec 1 2008, 15:18
Post
#190
|
|
|
Group: Members Posts: 3 Joined: 1-December 08 Member No.: 63603 |
Rather than moving to even lower bit rates, I suggest getting a better sound system for evaluation. I'm afraid the conclusion about evaluating lower bit rates sounds as if 128 kbps would provide satisfactory sound quality!?
|
|
|
|
Dec 1 2008, 15:56
Post
#191
|
|
|
Group: Members Posts: 2297 Joined: 9-October 05 From: Dormagen, Germany Member No.: 25015 |
Rather than moving to even lower bit rates, I suggest getting a better sound system for evaluation. I'm afraid the conclusion about evaluating lower bit rates sounds as if 128 kbps would provide satisfactory sound quality!? I understand your concern about a lower bitrate test, as this is not attractive to me either (though there may be codecs like HE-AAC which may provide for satisfactory sound quality at 96 kbps). But as for your remark on sound system I am convinced that this is not the problem. I guess many participants are happy with their sound system. At least I am. BTW artifacts can often be heard even with a bad sound system. Why not accept the fact that 128 kbps provides satisfactory sound quality for most users most of the time? -------------------- lame3100k -V0 --cvbr 9
|
|
|
|
Dec 1 2008, 16:34
Post
#192
|
|
|
Group: Members Posts: 3 Joined: 1-December 08 Member No.: 63603 |
Rather than moving to even lower bit rates, I suggest getting a better sound system for evaluation. I'm afraid the conclusion about evaluating lower bit rates sounds as if 128 kbps would provide satisfactory sound quality!? I understand your concern about a lower bitrate test, as this is not attractive to me either (though there may be codecs like HE-AAC which may provide for satisfactory sound quality at 96 kbps). But as for your remark on sound system I am convinced that this is not the problem. I guess many participants are happy with their sound system. At least I am. BTW artifacts can often be heard even with a bad sound system. Why not accept the fact that 128 kbps provides satisfactory sound quality for most users most of the time? Well, ok, let me rephrase the question then: What is the point of comparing codecs to each other rather than comparing to the uncompressed reference? Correct me if I'm wrong, but wasn't the test a comparison between encoders? (With the reference still being a low bit rate compressed format.) Sure, some encoders are better than others for certain things. But certain other things are not even being considered. (Such as the pretty much non-existing ambient reflected sound field.) Without using a proper reference and a high quality audio system, tests will just be nit-picking about the lesser evil influence on this or that musical instrument, but missing entire huge aspects of sound quality! It would be a good thing to put low bitrate audio quality into some kind of overall quality context. 128 kbps is not CD quality, it's not LP quality, and it's not even close to good Philips compact cassette. It sounds nice in a non-offensive way, which is good. Even on my Nano 128 kbps is clearly a lot worse than 224 kbps. BTW, I'm not trying to start a flame war, just stating what I think are some truths that seem to be forgotten all the time. |
|
|
|
Dec 1 2008, 16:52
Post
#193
|
|
|
Group: Members Posts: 64 Joined: 18-September 08 From: Sparta, Ontario Member No.: 58419 |
Well, ok, let me rephrase the question then: What is the point of comparing codecs to each other rather than comparing to the uncompressed reference? Correct me if I'm wrong, but wasn't the test a comparison between encoders? (With the reference still being a low bit rate compressed format.) Sure, some encoders are better than others for certain things. But certain other things are not even being considered. (Such as the pretty much non-existing ambient reflected sound field.) Without using a proper reference and a high quality audio system, tests will just be nit-picking about the lesser evil influence on this or that musical instrument, but missing entire huge aspects of sound quality! I take it you didn't do the test. You should. Even though it's already over. There is a reference lossless (I believe) sample for you to compare each codec against. Even on my Nano 128 kbps is clearly a lot worse than 224 kbps. For many (if not most) people, that is simply not true. Personally I can't tell the difference on 99% of material, with careful listening on a decent set of earphones. Even for people who can spot the differences, I don't think it is 'clearly a lot worse.' This post has been edited by gerwen: Dec 1 2008, 16:54 |
|
|
|
Dec 1 2008, 16:55
Post
#194
|
|
|
Group: Members Posts: 2340 Joined: 28-August 02 Member No.: 3218 |
I wanted to give Helix a try for my portable. After a quick test, artifacts on a test track (Classical, RVW "Fantasia on Christmas Carols", Hickox (rip), first seconds) disappeared at -V120 - 186 kbps av. Lame performed better even at V4 (I usually use V3 for portable listening). --> 162 kbps av. - still no artifacts.
However, I was surprised: Helix encodes at 32x, Lame at 16x on my 3 GHz Intel P4 on one core. Only twice as fast as Lame? For Helix: Are there optimized compiles out there? |
|
|
|
Dec 1 2008, 17:16
Post
#195
|
|
![]() Group: Super Moderator Posts: 4887 Joined: 12-August 04 From: Exeter, UK Member No.: 16217 |
Rather than moving to even lower bit rates, I suggest getting a better sound system for evaluation. Better than what? You can't possibly know the equipment used by each participant. As per gerwen's suggestion, it would be great to see some ABX results from you, using the samples tested. -------------------- I'm on a horse.
|
|
|
|
Dec 1 2008, 17:28
Post
#196
|
|
|
Group: Members Posts: 3 Joined: 1-December 08 Member No.: 63603 |
Rather than moving to even lower bit rates, I suggest getting a better sound system for evaluation. Better than what? You can't possibly know the equipment used by each participant. As per gerwen's suggestion, it would be great to see some ABX results from you, using the samples tested. It's true that the systems used are complete unknowns. I had a look around for the samples, but couldn't locate them at: http://www.listening-tests.info/mp3-128-1/ |
|
|
|
Dec 1 2008, 17:59
Post
#197
|
|
![]() Group: Developer Posts: 3036 Joined: 2-December 07 Member No.: 49183 |
It's true that the systems used are complete unknowns. I had a look around for the samples, but couldn't locate them at: http://www.listening-tests.info/mp3-128-1/ You can find links to all samples on previous page http://www.hydrogenaudio.org/forums/index....st&p=601657 |
|
|
|
Dec 2 2008, 04:09
Post
#198
|
|
|
Group: Members Posts: 6 Joined: 23-October 07 Member No.: 48128 |
So what we can say (as conservative scientists) is: We can't be sure that there's a difference in quality between the different encoders. Nothing more. Exactly. Or at least: "We can't be sure that there's a difference in quality between the different encoders for this set of samples and for the participants etc..." To put the debate on statistical difference and on the practical side of the graph, I created a fake one in which I add as competitor a lossless encoding. It's not quite perfect as the confidence error margin would change a bit but I don't think a true graph would really look different: ![]() LAME 3.98 and Helix are statistically tied to any lossless format. That's a great thought experiment to help understand the statistical meaning of the test. I looked at some of the test results other people submitted, and was impressed that some people were able to ABX nearly every encoder on most samples. I (pcordes in the results) was able to ABX at least some of the encodes on every sample I had time to test, but it was hard and took a lot of time. (It was my first blind listening test, though. I definitely got better at noticing things as I found more and more kinds of artifacts...) Maybe I need new headphones; my Koss TD/60 phones are pretty old, but they're pretty good. Sometimes I used my Logitech Z-5500 speaker system instead of the phones, since it's awesome, but I had to crank the volume high enough that it got hard on my ears. I looked at some other results and was disappointed that some people only ABXed the low anchor for most samples. Has anyone tried excluding results from people with tin ears (or bad headphones or not enough time/dedication to find the subtle differences)? Could that increase the significance any? Or at least drag things away from 5.0? Because I certainly didn't think all the encodes were that close to transparent. (edit, I see Alex B has done just that on the sample threads: http://www.hydrogenaudio.org/forums/index....showtopic=67561.
There, he excluded people who rated anything worse than l3enc, among other criteria, which is exactly the sort of thing I'm getting at. There are also statistical techniques like Jackknifing http://en.wikipedia.org/wiki/Resampling_(s...tics)#Jackknife, although that's looking at randomly chosen subsets, or all subsets, not carefully chosen subsets.) So my point is that this listening test could have been more (statistically) significant if it was limited to people who could tell the difference between the encoders. Even if I wasn't able to find an artifact on which I could ABX a sample from the reference, I'd like to know if other people with better equipment and/or better/more practiced hearing were able to. I don't remember enough statistics, and I haven't looked at the submitted scores closely enough, to see if some of the submitted results add more noise than signal. And I'm not trying to slight the efforts of people who weren't able to hear differences; it's interesting to know that all the tested encoders are close to transparent for a lot of people. I can't help thinking that there must be some people who didn't spend a long time hunting for artifacts, and submitted results with a lot of perfect scores. Which again is valid, since it means it sounded ok to them. I am _also_ (not exclusively) interested in the subset of results submitted by people who didn't rate any of the encoders perfect much of the time. Also, wouldn't it be possible to keep accepting results? You'd maybe have to generate new .erf files with a new key, since the old one is published, but even knowing how previous people rated things shouldn't be too much bias. The test is still sufficiently blind that people can't bias towards LAME on purpose (just for example). Ok, that's enough ways to tread on statistical thin ice for now. This post has been edited by llama peter: Dec 2 2008, 05:27 |
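As a rough illustration of the jackknife idea mentioned in this post, the Python sketch below recomputes one encoder's mean score with each listener left out in turn and derives a standard error from how much those leave-one-out means vary. The scores are invented for the example and are not results from the test, and the function name is hypothetical. This is the textbook leave-one-out jackknife, not the hand-picked listener exclusion described for the sample threads.

```python
import math
import statistics

def jackknife_mean(scores):
    """Leave-one-out jackknife estimate of the mean and its standard error."""
    n = len(scores)
    total = sum(scores)
    # Mean score with each listener removed in turn.
    leave_one_out = [(total - s) / (n - 1) for s in scores]
    centre = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((m - centre) ** 2 for m in leave_one_out)
    return centre, math.sqrt(variance)

# Hypothetical ratings from ten listeners for a single encoder.
scores = [5.0, 4.2, 3.8, 5.0, 4.5, 3.1, 4.0, 5.0, 3.6, 4.4]
estimate, std_error = jackknife_mean(scores)
print(round(estimate, 3), round(std_error, 3))
```

For a plain mean the jackknife standard error works out to the familiar sample standard deviation divided by the square root of the number of listeners, so it mainly becomes useful for statistics without such a simple formula. It also does not by itself settle whether dropping "tin ear" listeners is justified, since that is a choice about who belongs in the sample.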
|
|
|
Dec 8 2008, 19:06
Post
#199
|
|
![]() Group: Members Posts: 218 Joined: 12-October 01 Member No.: 278 |
Since the results are surprising, I wish Sebastian could moderate a discussion with three authorities.
In other words, a panel discussion with people like Roberto, a Nero rep, Menno, a LAME team member, whatever... Maybe it could be done by Skype. I think it would be better to allow participants to jump in and interrupt... {ed add'n: I'm particularly interested to know if HELIX tweaks may come} This post has been edited by ckjnigel: Dec 8 2008, 19:10 |
|
|
|
Dec 12 2008, 23:25
Post
#200
|
|
|
Group: Developer (Donating) Posts: 2332 Joined: 28-June 02 From: Argentina Member No.: 2425 |
The much awaited results of the Public, MP3 Listening Test @ 128 kbps are ready - partially. Sebastian, would you be adding more info? Or is the test finished already? -------------------- MAREO: http://www.webearce.com.ar
|
|
|
|