Public MP3 Listening Test @ 128 kbps - FINISHED
singaiya
post Nov 28 2008, 21:02
Post #176





Group: Members
Posts: 365
Joined: 21-November 02
Member No.: 3830



QUOTE (Sebastian Mares @ Nov 26 2008, 00:06) *
QUOTE (singaiya @ Nov 26 2008, 05:46) *

QUOTE (sld @ Nov 25 2008, 20:14) *

You should have brought in your peers (yourself too) to inflate the sample size (no. of participants), so that the magical black bars decrease in length


That's what I thought would happen too, but it seems not to have had an effect: if you look at the first sample, which had 39 listeners, the bars are about as long as for the second sample, which had 26 listeners, and definitely longer than for the third sample, which also had 26 listeners.


It does have an effect. I never said it is the only thing that influences the error margins. ;)


What are the other factors?

QUOTE (kwanbis @ Nov 28 2008, 03:42) *
QUOTE (jimmy69 @ Nov 28 2008, 01:48) *

I mean all this time members of this forum have strongly recommended against using the mp3 encoder in iTunes,

Probably based on the last mp3 listening test where it SUCKED.


That still seems an overly strong interpretation to me. iTunes did lose to Lame 3.95 and AudioActive, but in the results, Roberto says at the beginning: "I would like to point out two very serious issues with this test: not using the latest version of Xing, bundled with Real Player, that has been reportedly extensively tuned since version 1.5; and forcing VBR on codecs that shouldn't be using them. I'm confident iTunes MP3 would perform better if it was featured at CBR 128, and the same might apply to FhG. I take full responsibility on those mistakes, and for them, I apologize."

To me, "sucks" means not only losing the test (without the caveat that the wrong setting was chosen) but also a score below 3.
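Regarding the error-bar question above: a minimal sketch of the arithmetic behind those "magical black bars", with made-up ratings, assuming the usual t-style interval (mean ± t·s/√n), which is roughly what these plots show. The listener count n is only one factor; the spread of the ratings and the chosen confidence level set the bar length too:

CODE
import math
from statistics import stdev

def margin_of_error(ratings, t_crit=2.0):
    # Half-width of the interval: t * s / sqrt(n).  t_crit ~ 2.0 roughly
    # matches a 95% two-sided t critical value for 20-40 listeners.
    return t_crit * stdev(ratings) / math.sqrt(len(ratings))

# Made-up scores: same listener count, different spread.
agreeing    = [4.0, 4.1, 3.9, 4.0, 4.2, 3.8, 4.1, 3.9]
disagreeing = [5.0, 3.0, 4.5, 2.5, 4.8, 3.2, 5.0, 4.0]
print(margin_of_error(agreeing))     # short bar
print(margin_of_error(disagreeing))  # much longer bar, identical n

So a sample with 39 listeners can still show longer bars than one with 26 if the listeners disagreed more about it.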
benski
post Nov 28 2008, 21:11
Post #177


Winamp Developer


Group: Developer
Posts: 669
Joined: 17-July 05
From: Ashburn, VA
Member No.: 23375



QUOTE (singaiya @ Nov 28 2008, 15:02) *
That still seems an overly strong interpretation to me. iTunes did lose to Lame 3.95 and AudioActive, but in the results, Roberto says at the beginning: "I would like to point out two very serious issues with this test: not using the latest version of Xing, bundled with Real Player, that has been reportedly extensively tuned since version 1.5; and forcing VBR on codecs that shouldn't be using them. I'm confident iTunes MP3 would perform better if it was featured at CBR 128, and the same might apply to FhG. I take full responsibility on those mistakes, and for them, I apologize."

To me, "sucks" means not only losing the test (without the caveat that the wrong setting was chosen) but also a score below 3.


Helix (somewhat-formerly Xing), iTunes and FhG were all tested with VBR this go-around. This makes the comparison to that previous test all the more relevant.
singaiya
post Nov 28 2008, 21:20
Post #178





Group: Members
Posts: 365
Joined: 21-November 02
Member No.: 3830



Good point. I still think "sucks" lines up more with a score of 1; otherwise the wording of the ABC/HR rankings should be revised. :)
Raiden
post Nov 28 2008, 21:35
Post #179





Group: Developer
Posts: 224
Joined: 14-September 04
Member No.: 17002



QUOTE (singaiya @ Nov 28 2008, 22:02) *
...the results, ...

The interpretation of that result surprises me. Roberto said "Although iTunes is a little tied with Gogo, it's safe to say it lost." But actually iTunes was tied with FhG, Gogo and Xing, so it wasn't safe to say "it lost".
Also, Lame was tied with AudioActive, so "Lame wins, followed by AudioActive" is not valid either.
Or am I misinterpreting the graph?
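A sketch of the "tied versus lost" reading Raiden describes, with hypothetical scores and margins (not the actual 2003 numbers): by the usual bar-chart reading, two codecs are called tied when their intervals overlap. The rigorous check would be a confidence interval on the paired difference, but the overlap rule is how these graphs get read:

CODE
def overlaps(mean_a, half_a, mean_b, half_b):
    # Tied, by the bar-chart reading, when the intervals intersect.
    return abs(mean_a - mean_b) <= half_a + half_b

codecs = {"LAME": (3.74, 0.20), "AudioActive": (3.60, 0.20),
          "Gogo": (3.15, 0.20), "iTunes": (3.04, 0.20)}
names = sorted(codecs)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        verdict = "tied" if overlaps(*codecs[a], *codecs[b]) else "separated"
        print(a, "vs", b, "->", verdict)

With these invented margins, LAME/AudioActive and Gogo/iTunes come out tied while LAME/iTunes comes out separated, which is the shape of argument being made about the old graph.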
greynol
post Nov 28 2008, 21:38
Post #180





Group: Super Moderator
Posts: 10000
Joined: 1-April 04
From: San Francisco
Member No.: 13167



I think you're right, Raiden.

The iTunes mp3 bashing based on the results of that test has always annoyed me.

This post has been edited by greynol: Nov 28 2008, 22:26


--------------------
Your eyes cannot hear.
kwanbis
post Nov 28 2008, 23:07
Post #181





Group: Developer (Donating)
Posts: 2353
Joined: 28-June 02
From: Argentina
Member No.: 2425




It was at 3.04, against LAME's 3.74. LAME won that one, and iTunes did not offer anything better, so it sucked.


--------------------
MAREO: http://www.webearce.com.ar
greynol
post Nov 29 2008, 00:32
Post #182





Group: Super Moderator
Posts: 10000
Joined: 1-April 04
From: San Francisco
Member No.: 13167



Not you too, kwanbis!

The difference between Lame and iTunes in that test could be smaller than 0.3, less than half the amount you're suggesting. Not to mention that the bitrate of samples created with iTunes was consistently and significantly lower than Lame's.

By the same token, the difference could be greater than 1.1. Even so, the conclusions people draw from that test are annoying.

This post has been edited by greynol: Nov 29 2008, 01:07


--------------------
Your eyes cannot hear.
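greynol's 0.3-and-1.1 arithmetic reconstructed, assuming hypothetical error-bar half-widths of ±0.2 on each score (the actual margins from that test may differ):

CODE
lame, itunes = 3.74, 3.04      # mean scores from the 2003 test
m_lame, m_itunes = 0.20, 0.20  # hypothetical half-widths of the error bars

print(f"plotted gap: {lame - itunes:.2f}")                          # 0.70
print(f"smallest:    {(lame - m_lame) - (itunes + m_itunes):.2f}")  # 0.30
print(f"largest:     {(lame + m_lame) - (itunes - m_itunes):.2f}")  # 1.10

The plotted 0.70 gap is only the midpoint of a plausible range; with bars this wide the true difference could be anywhere from 0.30 to 1.10.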
null-null-pi
post Nov 29 2008, 03:55
Post #183





Group: Members
Posts: 9
Joined: 11-October 08
Member No.: 59933



That's why I'm thinking something like http://www.hydrogenaudio.org/forums/index....mp;#entry601863 could be beneficial. I think it could be useful if there were some kind of instruction like "how to do the test properly".
I know this has already been partially discussed, but I felt like it was a good moment to mention it.
Also, everyone participating in such a test should look up information on how to interpret the results (maybe provide a FAQ or something like that?)

edit: trying to correct errors

This post has been edited by null-null-pi: Nov 29 2008, 04:38


--------------------
10 FOR I=1 TO 3:PRINT"DAMN":NEXT
kwanbis
post Nov 29 2008, 04:28
Post #184





Group: Developer (Donating)
Posts: 2353
Joined: 28-June 02
From: Argentina
Member No.: 2425



QUOTE (greynol @ Nov 28 2008, 23:32) *
Not you too, kwanbis!

The difference between Lame and iTunes in that test could be smaller than 0.3, less than half the amount you're suggesting. Not to mention that the bitrate of samples created with iTunes was consistently and significantly lower than Lame's.

By the same token, the difference could be greater than 1.1. Even so, the conclusions people draw from that test are annoying.

The point is, at that time even LAME sucked, but it sucked the least. So with all encoders being "ho-hum", what was the point of recommending the worst one? LAME had statistically won, whether by 0.3 or by 1.1. So what was the point of recommending iTunes?


--------------------
MAREO: http://www.webearce.com.ar
krabapple
post Nov 29 2008, 04:41
Post #185





Group: Members
Posts: 2159
Joined: 18-December 03
Member No.: 10538



I think what this thread is telling us is that HA needs a wiki page on statistics... and it should be required reading before posting about public listening test results, just as we require ABX results for audio claims.

(No, I won't be writing it; I'm not a stats expert. I have a few biostatistics books on my reference shelf, and I know just enough to be suspicious of claims being made here about some codecs being better than others, based on these results. In other words, I'm with greynol: there is no statistically significant difference I can see here, for general guidance.)

Btw, who here does have a solid background in statistical analysis? Just curious.
halb27
post Nov 29 2008, 15:42
Post #186





Group: Members
Posts: 2414
Joined: 9-October 05
From: Dormagen, Germany
Member No.: 25015



A more profound knowledge of statistics can help one see that it's necessary to respect the confidence intervals when judging the overall average outcome.
The deeper problem is: what is this overall average outcome worth? We all want life to be easy, but IMO the average outcome is considered so important by many people only because it's such a simple scheme.
I personally prefer an outcome that shows that a certain candidate is good, or at least not bad, on all of the samples tested (or at least on those samples that have a personal meaning). I'm well aware that this (or any similar scheme) brings some amount of subjectivity (though it does not at all make things arbitrary) and brings no general consensus, but in the end deciding on an encoder to use is an individual decision.

This post has been edited by halb27: Nov 29 2008, 23:04


--------------------
lame3100m --bCVBR 300
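A sketch of the two decision rules halb27 contrasts, with invented per-sample scores: encoder B edges out A on the overall mean yet fails one sample badly, which a worst-sample rule catches:

CODE
# Invented per-sample scores for two encoders over five samples:
scores = {
    "A": [4.6, 4.6, 4.7, 4.5, 4.6],  # consistent everywhere
    "B": [5.0, 5.0, 5.0, 5.0, 3.2],  # mostly transparent, one bad miss
}
for name, s in scores.items():
    print(f"{name}: mean {sum(s) / len(s):.2f}, worst sample {min(s):.1f}")
# B wins on the mean (4.64 vs 4.60); a worst-sample rule prefers A (4.5 vs 3.2).

Which rule is "right" depends on whether you care about the typical case or about never being annoyed, which is exactly the subjectivity being acknowledged above.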
halb27
post Nov 30 2008, 11:18
Post #187





Group: Members
Posts: 2414
Joined: 9-October 05
From: Dormagen, Germany
Member No.: 25015



I wanted to give FhG a try because of this test's outcome, and I wanted to concentrate on the most serious tonal problems I know (herding_calls, trumpet, trumpet_myPrince), which I wanted kept at a non-obvious issue level that is not even slightly annoying when listening carefully. And I wanted to stay at around 200 kbps maximum (as I will have to care about file size in the near future).

What I learnt from the test is that I don't have to care about the extreme HF region and can be content with HF up to 16 kHz, which is very favorable when using mp3.

As a reference I used Lame 3.98.2. I figured out that -V0.5 --lowpass 16.0 is the setting that brings the desired quality for me. With my test set of typical regular music, the average bitrate for this setting is 205 kbps.

Then I tried FhG surround, but didn't manage to find a competitive setting (even with lowpassing before encoding with one of the higher quality settings). So FhG isn't an alternative for me.

Though I had decided some time ago not to use Helix and didn't want to pick it up again, I was curious how it behaved. At relatively low bitrates Helix has serious problems with all the samples (especially the 'tremolo' effect with trumpet_myPrince). But from a certain point on there is a strong quality increase. With '-X2 -U2 -SBT500 -TX0 -V110' my quality demands are met. With my typical regular music test set, the average bitrate for this setting is 179 kbps. As this is a good result, I looked up those samples where in the past I had found a subtle issue concerning HF behavior and 'vividness'. I couldn't hear an issue, probably due to my reduced demands (and the limited time I wanted to spend on this small test).

Struggling for a lower bitrate with Lame, I arrived at --abr 200 --lowpass 16.0, which gives an average bitrate of 191 kbps with my typical regular test set and meets my quality demands for these bad tonal problems.
But in this bitrate range and for general usage I prefer VBR, now that I'm happy with Lame's VBR behavior.

I will use Lame -V0.5 --lowpass 16.0 in the future. I have no technical reason to prefer it over Helix. It's just personal - my emotions are more with Lame than with Helix.

This post has been edited by halb27: Nov 30 2008, 11:20


--------------------
lame3100m --bCVBR 300
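For anyone wanting to reproduce "average bitrate over a test set" figures like the 205, 191 and 179 kbps numbers above, a minimal sketch using the third-party mutagen library (the folder name is made up; weighting by duration is one reasonable choice, a plain per-file mean is another):

CODE
from pathlib import Path
from mutagen.mp3 import MP3   # third-party: pip install mutagen

def average_bitrate_kbps(folder):
    # Duration-weighted mean bitrate over all MP3s in the folder.
    total_bits, total_seconds = 0.0, 0.0
    for path in Path(folder).glob("*.mp3"):
        info = MP3(path).info
        total_bits += info.bitrate * info.length   # info.bitrate is bits/s
        total_seconds += info.length
    return total_bits / total_seconds / 1000.0

print(average_bitrate_kbps("testset-lame-v0.5-lowpass16"))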
DigitalDictator
post Nov 30 2008, 18:04
Post #188





Group: Members
Posts: 313
Joined: 9-August 02
From: SoFo
Member No.: 3002



QUOTE
'-X2 -U2 -SBT500 -TX0 -V110'

What do these switches do?

I tried to compare the command line used in the test, '-X2 -U2 -V60', with the one suggested for metal,

'-X2 -HF2 -SBT500 -TX0 -C0 -V60'.

I found the latter harder to ABX, even though I could ABX both (I only tried it on two tracks). Halb27, have you tried that command line?
halb27
post Nov 30 2008, 21:22
Post #189





Group: Members
Posts: 2414
Joined: 9-October 05
From: Dormagen, Germany
Member No.: 25015



non-audio related switches:
-X2: MPEG compatible Xing Header
-U2: encoding speedup (uses SSE)
-C0: clear copyright bit

audio related switches:
-Vx: VBR quality (range 0-150)
-HF2: makes use of HF > 16 kHz
-SBTy: short block threshold
-TXz: nobody seems to be able to say what this switch does.

level is the one who has studied these switches most intensively, at VBR settings around -V120. As I'm in this bitrate range, I simply follow his settings. I did some tests of my own with TX0...TX8 at -V110, and though I think I can hear differences, they are so subtle that I personally can't say which one is best. The same goes for -SBT500 versus the default setting.
I don't use -HF2 as I don't need frequencies beyond 16 kHz.

This post has been edited by halb27: Nov 30 2008, 21:27


--------------------
lame3100m --bCVBR 300
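For context, these switches are passed to the Helix command-line encoder, so a full invocation matching halb27's setting would look something like 'hmp3 input.wav output.mp3 -X2 -U2 -SBT500 -TX0 -V110' (assuming the stock hmp3 binary name, which is not stated in the post; check your build).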
pfloding
post Dec 1 2008, 15:18
Post #190





Group: Members
Posts: 3
Joined: 1-December 08
Member No.: 63603



Rather than moving to even lower bit rates, I suggest getting a better sound system for evaluation. I'm afraid the conclusion about evaluating lower bit rates sounds as if 128 kbps would provide satisfactory sound quality!?
halb27
post Dec 1 2008, 15:56
Post #191





Group: Members
Posts: 2414
Joined: 9-October 05
From: Dormagen, Germany
Member No.: 25015



QUOTE (pfloding @ Dec 1 2008, 16:18) *
Rather than moving to even lower bit rates, I suggest getting a better sound system for evaluation. I'm afraid the conclusion about evaluating lower bit rates sounds as if 128 kbps would provide satisfactory sound quality!?

I understand your concern about a lower bitrate test, as this is not attractive to me either (though there may be codecs like HE-AAC which may provide satisfactory sound quality at 96 kbps).
But as for your remark on the sound system, I am convinced that this is not the problem. I guess many participants are happy with their sound systems; at least I am. BTW, artifacts can often be heard even on a bad sound system. Why not accept the fact that 128 kbps provides satisfactory sound quality for most users most of the time?


--------------------
lame3100m --bCVBR 300
pfloding
post Dec 1 2008, 16:34
Post #192





Group: Members
Posts: 3
Joined: 1-December 08
Member No.: 63603



QUOTE (halb27 @ Dec 1 2008, 14:56) *
QUOTE (pfloding @ Dec 1 2008, 16:18) *

Rather than moving to even lower bit rates, I suggest getting a better sound system for evaluation. I'm afraid the conclusion about evaluating lower bit rates sounds as if 128 kbps would provide satisfactory sound quality!?

I understand your concern about a lower bitrate test, as this is not attractive to me either (though there may be codecs like HE-AAC which may provide satisfactory sound quality at 96 kbps).
But as for your remark on the sound system, I am convinced that this is not the problem. I guess many participants are happy with their sound systems; at least I am. BTW, artifacts can often be heard even on a bad sound system. Why not accept the fact that 128 kbps provides satisfactory sound quality for most users most of the time?


Well, OK, let me rephrase the question then: what is the point of comparing codecs to each other rather than to the uncompressed reference? Correct me if I'm wrong, but wasn't the test a comparison between encoders? (With the reference still being a low-bitrate compressed format.)

Sure, some encoders are better than others for certain things. But certain other things are not even being considered (such as the pretty much non-existent ambient reflected sound field). Without a proper reference and a high-quality audio system, tests will just be nit-picking about the lesser evil's influence on this or that musical instrument, while missing entire huge aspects of sound quality!

It would be a good thing to put low-bitrate audio quality into some kind of overall quality context. 128 kbps is not CD quality, it's not LP quality, and it's not even close to good Philips compact cassette. It sounds nice in a non-offensive way, which is good. Even on my Nano, 128 kbps is clearly a lot worse than 224 kbps.

BTW, I'm not trying to start a flame war, just stating what I think are some truths that seem to be forgotten all the time.
gerwen
post Dec 1 2008, 16:52
Post #193





Group: Members
Posts: 64
Joined: 18-September 08
From: Sparta, Ontario
Member No.: 58419



QUOTE (pfloding @ Dec 1 2008, 10:34) *
Well, OK, let me rephrase the question then: what is the point of comparing codecs to each other rather than to the uncompressed reference? Correct me if I'm wrong, but wasn't the test a comparison between encoders? (With the reference still being a low-bitrate compressed format.)

Sure, some encoders are better than others for certain things. But certain other things are not even being considered (such as the pretty much non-existent ambient reflected sound field). Without a proper reference and a high-quality audio system, tests will just be nit-picking about the lesser evil's influence on this or that musical instrument, while missing entire huge aspects of sound quality!

I take it you didn't do the test. You should, even though it's already over. There is a lossless (I believe) reference sample for you to compare each codec against.

QUOTE (pfloding @ Dec 1 2008, 10:34) *
Even on my Nano, 128 kbps is clearly a lot worse than 224 kbps.

For many (if not most) people, that is simply not true. Personally, I can't tell the difference on 99% of material, even with careful listening on a decent set of earphones. And even for people who can spot the differences, I don't think it is 'clearly a lot worse'.

This post has been edited by gerwen: Dec 1 2008, 16:54
Squeller
post Dec 1 2008, 16:55
Post #194





Group: Members
Posts: 2351
Joined: 28-August 02
Member No.: 3218



I wanted to give Helix a try for my portable. After a quick test, artifacts in a test track (classical: RVW, "Fantasia on Christmas Carols", Hickox (rip), first seconds) disappeared at -V120 - 186 kbps average. Lame performed better even at V4 (I usually use V3 for portable listening) --> 162 kbps average - still no artifacts.

However, I was surprised: Helix encodes at 32x, Lame at 16x, on one core of my 3 GHz Intel P4. Only twice as fast as Lame? For Helix: are there optimized compiles out there?
Synthetic Soul
post Dec 1 2008, 17:16
Post #195





Group: Super Moderator
Posts: 4887
Joined: 12-August 04
From: Exeter, UK
Member No.: 16217



QUOTE (pfloding @ Dec 1 2008, 14:18) *
Rather than moving to even lower bit rates, I suggest getting a better sound system for evaluation.
Better than what? You can't possibly know the equipment used by each participant.

As per gerwen's suggestion, it would be great to see some ABX results from you, using the samples tested.


--------------------
I'm on a horse.
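For anyone posting the requested ABX results, the usual way to read a run: the chance of scoring at least that well by pure guessing is a one-sided binomial tail. A minimal sketch (math.comb needs Python 3.8+):

CODE
from math import comb

def abx_p_value(correct, trials):
    # P(at least `correct` right when guessing, p = 0.5):
    # one-sided binomial tail probability.
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

print(abx_p_value(14, 16))  # ~0.002: very unlikely to be guessing
print(abx_p_value(9, 16))   # ~0.40: consistent with guessing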
pfloding
post Dec 1 2008, 17:28
Post #196





Group: Members
Posts: 3
Joined: 1-December 08
Member No.: 63603



QUOTE (Synthetic Soul @ Dec 1 2008, 16:16) *
QUOTE (pfloding @ Dec 1 2008, 14:18) *
Rather than moving to even lower bit rates, I suggest getting a better sound system for evaluation.
Better than what? You can't possibly know the equipment used by each participant.

As per gerwen's suggestion, it would be great to see some ABX results from you, using the samples tested.


It's true that the systems used are complete unknowns.

I had a look around for the samples, but couldn't locate them at:

http://www.listening-tests.info/mp3-128-1/
lvqcl
post Dec 1 2008, 17:59
Post #197





Group: Developer
Posts: 3209
Joined: 2-December 07
Member No.: 49183



QUOTE (pfloding @ Dec 1 2008, 19:28) *
It's true that the systems used are complete unknowns.

I had a look around for the samples, but couldn't locate them at:

http://www.listening-tests.info/mp3-128-1/


You can find links to all the samples on the previous page:
http://www.hydrogenaudio.org/forums/index....st&p=601657
llama peter
post Dec 2 2008, 04:09
Post #198





Group: Members
Posts: 6
Joined: 23-October 07
Member No.: 48128



QUOTE (guruboolez @ Nov 26 2008, 14:39) *
QUOTE (Big_Berny @ Nov 26 2008, 19:58) *

So what we can say (as conservative scientists) is: We can't be sure that there's a difference in quality between the different encoders. Nothing more.

Exactly. Or at least: "We can't be sure that there's a difference in quality between the different encoders for this set of samples, for these participants, etc."

To put the debate about statistical difference on the practical side of the graph, I created a fake one in which I added a lossless encoding as a competitor. It's not quite perfect, as the confidence error margins would change a bit, but I don't think a true graph would really look different:

[fake results graph: the test's bar chart with a lossless encoding added as a competitor]

LAME 3.98 and Helix are statistically tied to any lossless format.


That's a great thought experiment to help understand the statistical meaning of the test.

I looked at some of the test results other people submitted, and was impressed that some people were able to ABX nearly every encoder on most samples. I (pcordes in the results) was able to ABX at least some of the encodes on every sample I had time to test, but it was hard and took a lot of time. (It was my first blind listening test, though; I definitely got better at noticing things as I found more and more kinds of artifacts...) Maybe I need new headphones; my Koss TD/60 phones are pretty old, but they're pretty good. Sometimes I used my Logitech Z-5500 speaker system instead of the phones, since it's awesome, but I had to crank the volume high enough that it got hard on my ears.

I looked at some other results and was disappointed that some people only ABXed the low anchor for most samples. Has anyone tried excluding results from people with tin ears (or bad headphones, or not enough time/dedication to find the subtle differences)? Could that increase the significance any? Or at least drag things away from 5.0? Because I certainly didn't think all the encodes were that close to transparent. (Edit: I see Alex B has done just that in the sample threads: http://www.hydrogenaudio.org/forums/index....showtopic=67561. There, he excluded people who rated anything worse than l3enc, among other criteria, which is exactly the sort of thing I'm getting at. There are also statistical techniques like jackknifing http://en.wikipedia.org/wiki/Resampling_(s...tics)#Jackknife, although that's looking at randomly chosen subsets, or all subsets, not carefully chosen subsets.)

So my point is that this listening test could have been more (statistically) significant if it were limited to people who could tell the difference between the encoders. Even if I wasn't able to find an artifact on which I could ABX a sample from the reference, I'd like to know if other people with better equipment and/or better or more practiced hearing were able to. I don't remember enough statistics, and I haven't looked at the submitted scores closely enough, to tell whether some of the submitted results add more noise than signal. And I'm not trying to slight the efforts of people who weren't able to hear differences; it's interesting to know that all the tested encoders are close to transparent for a lot of people. I can't help thinking that there must be some people who didn't spend a long time hunting for artifacts and submitted results with a lot of perfect scores. Which, again, is valid, since it means it sounded OK to them. I am _also_ (not exclusively) interested in the subset of results submitted by people who didn't rate any of the encoders perfect much of the time.

Also, wouldn't it be possible to keep accepting results? You'd maybe have to generate new .erf files with a new key, since the old one is published, but even knowing how previous people rated things shouldn't be too much bias. The test is still sufficiently blind that people can't bias towards LAME on purpose (just for example :P). Err, I guess they could if they looked at people's comments about what sounded bad in each sample. That could tip them off to which sample was which, if they can hear the same thing; I guess that answers my question. Although you could just ask people not to look at others' detailed results while doing the test. If people wanted to bias themselves, they could have used a Java debugger, or used strace -efile to see what .wav file ABC/HR was opening, so you're always depending on people's honesty anyway.

OK, that's enough ways to tread on statistical thin ice for now. :P

This post has been edited by llama peter: Dec 2 2008, 05:27
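A minimal leave-one-listener-out sketch of the jackknife idea mentioned above, with invented per-listener means (a real reanalysis would use the full listener-by-sample score matrix):

CODE
from statistics import mean

# Invented mean score each listener gave one encoder:
listeners = {"ann": 4.9, "bob": 3.1, "cho": 4.8, "dee": 5.0, "eli": 2.9}

overall = mean(listeners.values())
for name in listeners:
    rest = [v for who, v in listeners.items() if who != name]
    print(f"without {name}: {mean(rest):.2f} (overall {overall:.2f})")
# Large swings flag listeners whose scores dominate the average;
# deliberately excluding "tin ears" is the non-random variant of this idea.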
ckjnigel
post Dec 8 2008, 19:06
Post #199





Group: Members
Posts: 218
Joined: 12-October 01
Member No.: 278



Since the results are surprising, I wish Sebastian could moderate a discussion with three authorities.
In other words, a panel discussion with people like Roberto, a Nero rep, Menno, a LAME team member, whatever...
Maybe it could be done via Skype. I think it would be better to allow participants to jump in and interrupt...
{ed add'n: I'm particularly interested to know whether Helix tweaks may be coming}

This post has been edited by ckjnigel: Dec 8 2008, 19:10
kwanbis
post Dec 12 2008, 23:25
Post #200





Group: Developer (Donating)
Posts: 2353
Joined: 28-June 02
From: Argentina
Member No.: 2425



QUOTE (Sebastian Mares @ Nov 24 2008, 21:30) *
The much awaited results of the Public, MP3 Listening Test @ 128 kbps are ready - partially.

Sebastian, would you be adding more info? Or is the test finished already?


--------------------
MAREO: http://www.webearce.com.ar
