Sebastian Mares
Nov 24 2008, 22:30
The much-awaited results of the Public MP3 Listening Test @ 128 kbps are ready - partially. So far, I have only uploaded an overall plot along with a zoomed version. The details will be available in the next few days. You can also download the encryption key along with the submitted results on the results page, which is located here:
http://www.listening-tests.info/mp3-128-1/results.htm
The results show that all encoders are tied for first place, except l3enc, which of course comes out last, being the low anchor.
What is interesting to see is how the MP3 codec actually evolved since its first days (l3enc was the first MP3 software encoder back in 1994 when it was released) and how it is still competitive with newer formats like AAC or Ogg Vorbis.
Another very interesting thing, which was also one of the goals of this test, is that Fraunhofer and especially Helix, which both outperform LAME in terms of encoding speed, are still very competitive. While statistically tied with LAME for first place, Helix actually received an even higher rating than LAME 3.98.2 - and this at 90x encoding speed! Even FhG received a slightly higher score, at least against LAME 3.97, which was the encoder recommended by the Hydrogenaudio community for a long time. But again, statistically, they are all tied, so there is no quality winner.

The quality at 128 kbps is very good and MP3 encoders improved a lot since the last test. This was the last test conducted by me at this bitrate. It's time to move to bitrates like 96 kbps or 80 kbps.
Here is a zoomed version of the plot showing the competitors only and leaving out the low anchor l3enc.

Finally, I would like to thank everyone who participated!
EDIT: Whoops, the link to the results was pointing to the 64 kbps multiformat test by mistake. Corrected now.
Wow, I am really shocked; Helix (Xing) has performed well. Not only did GNR's new album come out; Xing (Helix) has outperformed LAME. Hell must be pretty cold now.
Sebastian Mares
Nov 24 2008, 22:56
I kept telling you guys that the results will be quite surprising.

If you analyze the decrypted results, you will see that for at least one sample, Helix is even statistically better than all other encoders.
Neasden
Nov 24 2008, 23:00
Does that make Helix the new recommended MP3 encoder, or has it to be LAME because it's open source?
Edit: Both are open source.
QUOTE (Neasden @ Nov 24 2008, 23:00)

Does that make Helix the new recommended MP3 encoder, or has it to be LAME because it's open source?
Helix is open source as well.
Neasden
Nov 24 2008, 23:14
Yes it is, I just noticed it!
I can't believe how fast it is... here it encodes at 33x realtime. LAME's fastest speed here is 12x.
Foobar2000 parameters: -X2 -U2 -V150 -HF - %d
Would that be equivalent to LAME -V0 ?
greynol
Nov 24 2008, 23:14
I don't think open source has anything to do with it.
Sebastian Mares
Nov 24 2008, 23:14
If you submitted results, I recommend you look at them and choose your encoder of choice based on that.
halb27
Nov 24 2008, 23:22
I am curious about the detailed results as my interest is in worst case behavior in the first place.
I guess Lame 3.98.2 and Helix will be the winners in this respect too, maybe quality difference towards the contenders will be more remarkable in this field.
Neasden
Nov 24 2008, 23:25
They are all technically tied, but Helix outperformed all of them. Also, its encoding speed is absurdly faster than LAME's. Could these two arguments qualify Helix as the new recommended MP3 encoder? (LAME being the second recommended)
Thank you very much Sebastian. We have some things to carefully consider, I see.
DigitalDictator
Nov 24 2008, 23:56
This is indeed surprising. I'm sure I've seen smaller, recent, ABX-tests where Lame has outperformed Helix quite clearly. I think Guruboolez and maybe also Halb27 have done a couple, but I might be mistaken.
IIRC Helix has a very simple code. If it is open source, can it be tuned further by third party? The latest version is from 2005, or no?
These are only 128k tests, so Helix is the winner in this niche. More sets needed.

Helix can be recommended for 128k encoding.
Neasden
Nov 25 2008, 00:07
I encoded a few tracks using -V150 (VBR range 0-150) with -HF (high frequencies encoding) enabled. The average bitrate goes up to 270 kbps. That would be equivalent to LAME -V0. People find -V0 and -V2 already excessive, and there is this ghost-case about the sfb21 bloat... many are -V3 advocates. Helix is a simpler encoder indeed with fewer switch options, but isn't its quality and speed an outstanding alert?
greynol
Nov 25 2008, 00:10
>Helix can be recommended for 128k encoding.
How can you say this when all the contenders were tied within the 95% margin of confidence?
Some other things to consider before leaping to such a conclusion:
- How many participants were there?
- Did Helix consistently score higher amongst all participants?
halb27
Nov 25 2008, 00:11
QUOTE (DigitalDictator @ Nov 25 2008, 00:56)

I'm sure I've seen smaller, recent, ABX-tests where Lame has outperformed Helix quite clearly. I think Guruboolez and maybe also Halb27 have done a couple, but I might be mistaken. ...
Sorry: As far as I am concerned I didn't do a recent ABX test Helix vs. Lame 3.98.
A few years back, after level and others reported Helix's remarkable quality in the 200 kbps area, I ABXed Helix and some other encoders, and I rated Helix's robustness against artifacts in this bitrate area very highly. There were, however, some HF issues (more or less negligible to me) which Wombat found, and I also found some cases with a subtle but ABXable lack of 'vividness' (I don't know how else to describe it). All these things are relevant only in the very high bitrate range, if at all. But as I don't care much about bitrate, and as I was happy with Lame ABR 270 (3.90.3 then), I stuck with Lame for my mp3 encodings.
Pio2001
Nov 25 2008, 00:34
AAAAAAAAAAAARRRRRRRRGGGGHGHHHHHH !!
They're all tied
How can it be ? I ABXed nearly all the samples that I submitted, and there were important differences in quality between them. Could it be that for every sample, the best to worst order was different ? Or were there too many tied submissions ?
I must check my personal results right now !
Oh, and Greynol is right. Helix is not the winner. It is tied. The differences are within the confidence intervals, which means that they are just random. If you redid the test with the same samples and same listeners, the simple fact that ABC/HR presents them in a different order every time would probably lead Lame, or Fraunhofer, or iTunes to get a slightly, but not significantly, superior score.
We must consider this to be chance, unless we have more information to back up further claims.
kwanbis
Nov 25 2008, 00:42
Wow (even if the difference between LAME 3.98.2 and Helix is 0.08 and knowing that both (all) are statistically tied.)
EDIT: Congratulations to Sebastian for conducting the test.
halb27
Nov 25 2008, 00:43
QUOTE (Neasden @ Nov 25 2008, 00:25)

... Could these two arguments qualify Helix for the new recommended MP3 encoder? ...
I've never been too happy with recommendations especially when it's about just one encoder.
I was especially unhappy with recommending Lame 3.97. There was also a listening test where Lame 3.97 came out great, with a bigger quality difference against the contenders compared to the more or less equal scores in this test as far as average score is concerned. It was after the test that 3.97's 'sandpaper problem' became known. The question is how to weigh it, the question is: how annoying is it for the person who reads the recommendation? It may be negligible, it may be a big issue.
The problem is that we can't test encoders on the universe of music. We can get significant experience with encoders, that's why Sebastian's test is important. But we should always take the results with a grain of salt.
There's also the question what kind of a result you have in focus. Usually people concentrate on the average result of an encoder averaged over all samples. But is this really the real thing which is most important? That's a very personal question. You can look at worst case behavior which is what I do in the first place. To me it's more important that my favorite encoder has a low number of scores below 4.0, and - at best - there is no sample with a score below 3.0. But this too has to be taken with a grain of salt. A bad score on an (to me) exotic sample doesn't count much to me, but it has a very high impact if it happens with music of my favorite genre. So evaluating an encoder is more than just looking at the average scores of a listening test.
Instead of giving a rather strong recommendation as was done so far I'd prefer if we had a weaker suggestion, kind of:
When targeting at a quality which can be achieved with ~128 kbps on average the most recent mp3 listening test has shown that the current versions of Lame, Helix, Fraunhofer, iTunes all do an excellent job. Quality differences between them were negligible within this test as far as the average outcome was concerned, with XXX and YYY having the best consistency in high quality (in case it turns out that such a statement can be made).
Pio2001
Nov 25 2008, 00:48
Excuse me, but what is the correspondence between the files and the encoders ?
krabapple
Nov 25 2008, 01:02
Sorry, it's not clear to me how many subjects participated. Can you point me to that in the graph?
QUOTE (Pio2001 @ Nov 25 2008, 02:48)

Excuse me, but what is the correspondence between the files and the encoders ?
As shown on the pictures above -- samplexx_1.mp3 is encoded by iTunes, samplexx_2.mp3 - lame 3.98.2 etc. Of course, ABC/HR tool randomizes order of samples every time you load abc/hr config file (Samplexx.ecf)
QUOTE (krabapple @ Nov 25 2008, 03:02)

Sorry, it's not clear to me how many subjects participated. Can you point me to that in the graph?
Downloaded results.rar:
39 - 26 - 26 - 27 - 30 - 26 - 26 - 26 - 26 - 26 - 27 - 26 - 29 - 30.
Pio2001
Nov 25 2008, 01:37
Thanks, I analyzed my own results. That's what I feared. The ranking of the encoders is different for each sample. A Tukey HSD analysis on my ratings gives them all tied too, except the low anchor.
Thanks again to Sebastian !
Raiden
Nov 25 2008, 02:12
QUOTE (Pio2001 @ Nov 25 2008, 01:34)

Oh, and Greynol is right. Helix is not the winner. It is tied. The differences are within the confidence intervals, which means that they are just random. If you redid the test with the same samples and same listeners, the simple fact that ABC/HR presents them in a different order every time would probably lead Lame, or Fraunhofer, or iTunes to get a slightly, but not significantly, superior score.
We must consider this to be chance, unless we have more information to back up further claims.
agreed. Probably the zoomed plot window should be removed, since it's quite misleading and doesn't show any useful information.
QUOTE (Raiden @ Nov 24 2008, 17:12)

agreed. Probably the zoomed plot window should be removed, since it's quite misleading and doesn't show any useful information.
Not really. It's just a zoomed version of the previous data. It is not misleading, it just represents the same data as the other graph and text in a slightly different form.
Regarding statistics... the confidence intervals will decrease in size if there are more participants?
Great to have updated results for 2008; thanks Sebastian!
Squeller
Nov 25 2008, 08:46
Is this claim correct? There has been no improvement on the Helix encoder since after 2005?
Sebastian Mares
Nov 25 2008, 09:02
QUOTE (sld @ Nov 25 2008, 04:51)

Regarding statistics... the confidence intervals will decrease in size if there are more participants?
Great to have updated results for 2008; thanks Sebastian!
Yes, that is correct. The more people post, the shorter the confidence intervals.
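To illustrate the point about participants and interval width, here is a minimal sketch. It uses a plain normal-approximation interval for a mean, not the Tukey HSD analysis used for the test plots, and the standard deviation is a hypothetical value chosen only for illustration. The listener counts 26 and 39 are the smallest and largest per-sample counts quoted from results.rar earlier in the thread; 104 is a made-up larger panel.

```python
import math
from statistics import NormalDist


def ci_half_width(std_dev: float, n: int, confidence: float = 0.95) -> float:
    """Half-width of a normal-approximation confidence interval for a mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return z * std_dev / math.sqrt(n)


# Hypothetical spread of ratings on the 1-5 scale; NOT taken from the test.
ASSUMED_STD_DEV = 1.0

for listeners in (26, 39, 104):
    half_width = ci_half_width(ASSUMED_STD_DEV, listeners)
    print(f"{listeners:3d} listeners: +/- {half_width:.3f}")
```

The half-width falls with the square root of the number of ratings, so under these assumptions it takes four times as many listeners to halve the interval.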
melomaniac
Nov 25 2008, 09:19
I analyzed my results and the ranking of the encoders is different for each sample. So indeed there's no undisputed winner here.
Though I have Helix at the first place in some samples. I would have never expected that! Nice surprise

Another surprise to me is that, on some samples, I found LAME 3.97 worse than Fhg or iTunes.
And finally, I don't have any results where LAME 3.97 is better than 3.98.2.
QUOTE (DigitalDictator @ Nov 24 2008, 23:56)

This is indeed surprising. I'm sure I've seen smaller, recent, ABX-tests where Lame has outperformed Helix quite clearly. I think Guruboolez and maybe also Halb27 have done a couple, but I might be mistaken.
The last time I saw Francis doing an MP3 listening evaluation with LAME and Helix was in this post.
QUOTE (Squeller @ Nov 25 2008, 08:46)

Is this claim correct? There has been no improvement on the Helix encoder since after 2005?
It's correct.
Here's the latest compile (v5.1 2005.08.09) used in this test.
EDIT: LAME 3.97 comments
halb27
Nov 25 2008, 09:19
Zoomed view is formally correct, but has a tendency to have an incorrect emotional impact on the reader as it emphasizes differences. In its extreme form it can give the picture of extreme differences where in fact differences are not worth mentioning.
In case the confidence interval were not given in this test's zoomed view, only the averages, we would have this extreme form here.
Information at a glance, that's what graphs are for. They easily give a wrong impression if they are not 'ground-based' but have a basis high in the air just a small step below the lowest results.
That's why I would prefer if there was no 'zoomed' view.
Squeller
Nov 25 2008, 09:28
QUOTE (halb27 @ Nov 25 2008, 10:19)

Zoomed view is formally correct, but has a tendency to have an incorrect emotional impact on the reader as it emphasizes differences. In its extreme form it can give the picture of extreme differences where in fact differences are not worth mentioning.
In case the confidence interval were not given in this test's zoomed view, only the averages, we would have this extreme form here.
Information at a glance, that's what graphs are for. They easily give a wrong impression if they are not 'ground-based' but have a basis high in the air just a small step below the lowest results.
That's why I would prefer if there was no 'zoomed' view.
Basically you are right, but: People with dysfunctional brains who don't find this out themselves aren't the target audience of HA I guess

About Helix: Let's not forget Guru's listening test from 2007 where Helix clearly failed on classical music.
halb27
Nov 25 2008, 09:44
QUOTE (Squeller @ Nov 25 2008, 10:28)

Basically you are right, but: People with dysfunctional brains who don't find this out themselves aren't the target audience of HA I guess

....
Sure, but blind people aren't the target audience either. The non-zoomed view gives all the information we need.
melomaniac
Nov 25 2008, 09:46
QUOTE (Squeller @ Nov 25 2008, 09:28)

About Helix: Let's not forget Guru's listening test from 2007 where Helix clearly failed on classical music.
I've already posted the link Squeller.
memomai
Nov 25 2008, 10:08
Just confused. Helix worse than lame, Helix better than lame, Fraunhofer better than 3.97??
I'm only waiting that someone says "lossless is lossy", then my confusion is completed.
halb27
Nov 25 2008, 10:30
QUOTE (memomai @ Nov 25 2008, 11:08)

Just confused. Helix worse than lame, Helix better than lame, Fraunhofer better than 3.97??
I'm only waiting that someone says "lossless is lossy", then my confusion is completed.
There has always been a tendency at HA that Lame is expected to be seriously superior as compared with other encoders. And listening tests have always been taken too much of a 'proof' for this whereas they contribute experience with encoders in a pretty objective way but only within the restrictions of the samples tested and the listening abilities of the participants. It's the best we can do, but has its restrictions.
Why worry? Isn't it a good thing that all the encoders perform very well on the samples?
As for Lame 3.98.2: isn't it a good thing that it scores so well? All we have known so far is that it brings improvement over 3.97 for certain classes of problems where 3.97 had rather weak quality. We did not have a lot of experience showing that there is no serious regression with 3.98, which is always possible. Now we have reason to believe that this is not the case, and we can expect with good reason that 3.98 is real progress.
Alexxander
Nov 25 2008, 10:44
Before anything I have to thank Sebastian again for having conducted this nice MP3 Listening Test @ 128 kbps!
I think I was too hard rating the samples but my results are very similar to the overall results:
CODE
Encoder         My Average   Test Results
iTunes 8.0.1    2,45         4,26
Lame 3.98.2     2,94         4,51
l3enc 0.99a     1,00         1,56
Fraunhofer      2,84         4,44
Lame 3.97       2,77         4,28
Helix v5.1      3,20         4,59
"Test Results" are the results of all participants. "My Average" is a simple linear average of the results, as I don't remember how to do other types of analysis (too long ago). Taking out the highest and lowest result of all encoders produces a similar result to the one presented above. If anyone can tell me which formula to use in MS Excel to get the error margin, please do.
I'm really surprised an encoder that hasn't been tuned since 2005 gets such good results. I have more samples where Helix does better than Lame 3.98.2 than the other way around, although the differences are small. When doing the test I clearly noticed that 2 encoders were better than the rest, and I thought they were the Lame ones.
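On the error-margin question, one common approach is a t-based margin for a mean: the sample standard deviation divided by the square root of the number of ratings, multiplied by a t critical value. The sketch below shows that calculation. The ratings are hypothetical placeholders (they are not Alexxander's numbers), and this is a simpler per-encoder interval than the Tukey HSD analysis used for the overall plots. In a spreadsheet the same quantity can be built from the standard deviation, count and square root functions.

```python
import math
from statistics import mean, stdev


def mean_with_margin(ratings, t_critical):
    """Return (mean, margin) where margin = t_critical * s / sqrt(n)."""
    n = len(ratings)
    standard_error = stdev(ratings) / math.sqrt(n)
    return mean(ratings), t_critical * standard_error


# Hypothetical ratings for one encoder over 14 samples (NOT real test data).
ratings = [3.0, 3.5, 2.5, 4.0, 3.0, 3.5, 3.0, 2.0, 3.5, 4.0, 3.0, 3.5, 2.5, 4.0]

# Two-sided 95% t critical value for 14 - 1 = 13 degrees of freedom.
T_CRIT_DF13 = 2.160

avg, margin = mean_with_margin(ratings, T_CRIT_DF13)
print(f"{avg:.2f} +/- {margin:.2f}")
```

With a different number of samples the critical value changes, so it has to be looked up for the matching degrees of freedom.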
Alexxander
Nov 25 2008, 10:55
QUOTE (halb27 @ Nov 25 2008, 10:30)

...
Why worry? Isn't it a good thing that all the encoders perform very well on the samples?
...
I'm worried now not because of Helix being very competitive with Lame 3.98.2 with respect to quality, but because Helix encodes so much faster, and that's very useful when I encode albums from my lossless archive to take them on the road.
I wonder why Lame doesn't do better compared to Helix having 3 years more of development on its back. I just have included Helix in my foobar2000 Converters list and will play with this one in my preferred bitrange (160-220kbps).
muaddib
Nov 25 2008, 11:15
It is not good to conclude, from the results of this test, that Helix will be the best option in 160-220 kbps range. You should check quality after encoding to this bitrate.
Jan S.
Nov 25 2008, 12:33
Wouldn't it be possible to compare the variance within each encoder to get an idea of the robustness of each encoder?
Alexxander
Nov 25 2008, 12:37
QUOTE (muaddib @ Nov 25 2008, 11:15)

It is not good to conclude, from the results of this test, that Helix will be the best option in 160-220 kbps range. You should check quality after encoding to this bitrate.
This is very obvious. I added Helix to foobar2000 to do just this: compare quality with some songs and samples
Sebastian Mares
Nov 25 2008, 12:44
QUOTE (Jan S. @ Nov 25 2008, 12:33)

Wouldn't it be possible to compare the variance within each encoder to get an idea of the robustness of each encoder?
I am not quite sure I understand what you mean.
robert
Nov 25 2008, 13:01
I would be more interested in quartiles than in variance.
http://de.wikipedia.org/wiki/Quantil
halb27
Nov 25 2008, 13:04
QUOTE (Alexxander @ Nov 25 2008, 11:55)

...I just have included Helix in my foobar2000 Converters list and will play with this one in my preferred bitrange (160-220kbps).
You may want to try level's finding about his kind of quality improvement in this bitrate range which you can find in the Helix thread. Quality improvement was not confirmed though by other people.
kwanbis
Nov 25 2008, 13:06
QUOTE (robert @ Nov 25 2008, 12:01)

I would be more interested in quartiles than in variance.
http://de.wikipedia.org/wiki/Quantil
http://en.wikipedia.org/wiki/Quantile (I think more people know English)
westgroveg
Nov 25 2008, 13:55
If anything the test shows samples where LAME needs improvement at 128 kbps.
Pio2001
Nov 25 2008, 14:04
QUOTE (melomaniac @ Nov 25 2008, 09:19)

And finally, I don't have any results where LAME 3.97 is better than 3.98.2.
I do. If Lame 3.98.2 is file 2 and 3.97 is file 5, then sample 8 sounds near-transparent to me with Lame 3.97, not with 3.98.2. I also find sample 11 better with Lame 3.97.
Sebastian Mares
Nov 25 2008, 14:47
QUOTE (robert @ Nov 25 2008, 13:01)

I would be more interested in quartiles than in variance.
http://de.wikipedia.org/wiki/Quantil
All results are available for download already, so you can calculate whatever you wish. Tukey HSD is something around 0.5 IIRC (I'm at work right now and don't have access to the exact value), so the tolerance bars are around 0.25 in each direction.
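As a rough illustration of what "tied" means with these numbers, the sketch below compares every pair of overall means against the HSD value. The means are the "Test Results" figures quoted in Alexxander's post, and the 0.5 threshold is the approximate value given from memory above, so the output should be read as an assumption-laden sketch rather than the official analysis.

```python
from itertools import combinations

# Overall mean ratings as quoted in Alexxander's "Test Results" column.
overall = {
    "iTunes 8.0.1": 4.26,
    "Lame 3.98.2": 4.51,
    "l3enc 0.99a": 1.56,
    "Fraunhofer": 4.44,
    "Lame 3.97": 4.28,
    "Helix v5.1": 4.59,
}

# Approximate value quoted from memory in the thread; treat as an assumption.
HSD = 0.5


def significant_pairs(means, hsd):
    """Return the pairs whose mean difference exceeds the HSD threshold."""
    return [
        (a, b)
        for a, b in combinations(means, 2)
        if abs(means[a] - means[b]) > hsd
    ]


pairs = significant_pairs(overall, HSD)
for a, b in pairs:
    print(f"{a} vs {b}: difference {abs(overall[a] - overall[b]):.2f}")
```

With these inputs, the only pairs that exceed the threshold are the ones involving the low anchor; the largest gap among the five contenders (Helix vs iTunes, 0.33) stays below it.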
Alex B
Nov 25 2008, 14:51
QUOTE (westgroveg @ Nov 25 2008, 14:55)

If anything the test shows samples where LAME needs improvement at 128 kbps.
I think we should analyze the results sample by sample and discuss the severity of the problems found. It would be useful to find out if certain obvious problems with certain encoders were confirmed by the majority of the testers.
In general, I found the choice of the low anchor a bit problematic. The encoder is clearly badly broken. Obviously the 0.99 alpha version is not the version that was involved when the 128 kbps MP3 = CD quality myth was created. In my experience the release version was already a lot better.
A low anchor that is too bad can have an adverse effect on the rating scale the testers choose to use. It can make the differences between the contenders appear to be less significant.
For comparison, here are my results:
CODE
% Result file produced by chunky-0.8.4-beta
% ..\chunky.exe --codec-file=..\codecs.txt -n --ratings=results --warn -p 0.05
% Sample Averages:
% iTunes L398 Anchor Fhg L397 Helix
01. 3.80 2.60 1.00 3.40 1.80 4.30
02. 3.80 2.60 1.40 2.80 2.20 3.10
03. 3.90 3.10 1.00 4.30 3.70 2.90
04. 4.20 4.40 1.00 4.40 3.30 4.40
05. 2.70 3.50 1.00 3.00 3.00 3.80
06. 2.20 3.50 1.00 3.00 3.90 4.00
07. 3.70 4.00 1.00 3.80 2.00 4.00
08. 2.40 4.00 1.00 3.00 4.30 3.00
09. 3.00 3.40 1.00 3.60 3.40 2.50
10. 4.50 4.50 1.00 4.20 4.00 4.50
11. 3.90 2.70 1.00 3.50 3.90 2.40
12. 2.00 3.70 1.00 2.60 3.30 3.60
13. 3.70 3.20 1.00 3.80 2.50 4.00
14. 2.00 3.60 1.00 3.10 3.30 4.00
% Codec averages:
%%% 3.27 3.49 1.03 3.46 3.19 3.61
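The earlier questions about variance, quartiles and worst-case behaviour can be explored on a table like this one. The sketch below uses Alex B's per-sample ratings from the table above, with the anchor column left out, and reports mean, sample variance, quartiles, the lowest score and the number of samples rated below 3.0 for each encoder. The 3.0 cut-off is an illustrative choice, and these are one listener's ratings, so nothing here is a conclusion about the encoders in general.

```python
from statistics import mean, quantiles, variance

# Alex B's per-sample ratings from the table above (anchor column omitted).
scores = {
    "iTunes": [3.8, 3.8, 3.9, 4.2, 2.7, 2.2, 3.7, 2.4, 3.0, 4.5, 3.9, 2.0, 3.7, 2.0],
    "L398":   [2.6, 2.6, 3.1, 4.4, 3.5, 3.5, 4.0, 4.0, 3.4, 4.5, 2.7, 3.7, 3.2, 3.6],
    "Fhg":    [3.4, 2.8, 4.3, 4.4, 3.0, 3.0, 3.8, 3.0, 3.6, 4.2, 3.5, 2.6, 3.8, 3.1],
    "L397":   [1.8, 2.2, 3.7, 3.3, 3.0, 3.9, 2.0, 4.3, 3.4, 4.0, 3.9, 3.3, 2.5, 3.3],
    "Helix":  [4.3, 3.1, 2.9, 4.4, 3.8, 4.0, 4.0, 3.0, 2.5, 4.5, 2.4, 3.6, 4.0, 4.0],
}


def summarize(values, threshold=3.0):
    """Summary statistics for one encoder's per-sample ratings."""
    q1, median, q3 = quantiles(values, n=4)  # quartile cut points
    return {
        "mean": round(mean(values), 2),
        "variance": round(variance(values), 2),  # sample variance
        "q1": round(q1, 2),
        "median": round(median, 2),
        "q3": round(q3, 2),
        "min": min(values),
        "below": sum(v < threshold for v in values),  # worst-case count
    }


summary = {name: summarize(values) for name, values in scores.items()}
for name, stats in summary.items():
    print(name, stats)
```

The means reproduce the "Codec averages" row, which is a quick check that the table was transcribed correctly.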
Just try some Metal tracks on Helix at V60, I guarantee it will struggle.
Neasden
Nov 25 2008, 16:07
/mnt told me that Helix is not gapless, which is to me a serious shortcoming. Another thing is that Helix is not as robust as LAME is. But what is stunning people here is the encoding speed of an encoder that hasn't been worked on for 3 years, while the latest fresh LAME is so much slower to encode!