Public AAC Listening Test @ ~96 kbps [July 2011]: Results, Results and post-test discussion
IgorC
post Aug 23 2011, 19:56
Post #1





Group: Members
Posts: 1506
Joined: 3-January 05
From: Argentina, Bs As
Member No.: 18803



After a long period of preparation, discussion, and test-taking, the results are finally here.

http://listening-tests.hydrogenaudio.org/i...-a/results.html

Summary: Apple won, FhG came second, Coding Technologies third, and Nero last.

My thanks to everyone who supported the test and participated in it.

This post has been edited by IgorC: Aug 23 2011, 20:12
benski
post Aug 23 2011, 20:18
Post #2


Winamp Developer


Group: Developer
Posts: 669
Joined: 17-July 05
From: Ashburn, VA
Member No.: 23375



It would be interesting to do a rank-sum analysis comparing each pair of encoders. Although the numeric values assigned by the listener seem like legitimate statistical data, the only real value is whether or not a listener ranked one encoder higher or lower than another.
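For what it's worth, benski's pairwise idea can be sketched with a simple paired sign test (a close relative of the rank-sum approach for paired data): for each listener, count how often one encoder is scored above the other and ignore the magnitudes. The listener scores below are invented for illustration, not taken from the test.

```python
from math import comb

def sign_test(scores_a, scores_b):
    """Two-sided sign test: does one encoder beat the other more often
    than chance across paired listener scores? Ties are dropped."""
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    losses = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # exact two-sided binomial p-value under H0: P(win) = 0.5
    p = sum(comb(n, i) for i in range(k + 1)) / 2 ** n * 2
    return min(p, 1.0)

# hypothetical scores from eight listeners for two encoders
enc_x = [4.5, 3.8, 4.9, 4.2, 3.5, 4.7, 4.1, 4.4]
enc_y = [4.0, 3.9, 4.5, 3.9, 3.1, 4.2, 4.0, 4.0]
print(sign_test(enc_x, enc_y))
```

With 7 wins out of 8 non-tied pairs this gives p ≈ 0.07, illustrating how purely ordinal information can still support a significance statement.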
Garf
post Aug 23 2011, 20:27
Post #3


Server Admin


Group: Admin
Posts: 4853
Joined: 24-September 01
Member No.: 13



QUOTE (benski @ Aug 23 2011, 21:18) *
It would be interesting to do a rank-sum analysis comparing each pair of encoders. Although the numeric values assigned by the listener seem like legitimate statistical data, the only real value is whether or not a listener ranked one encoder higher or lower than another.


Completely and utterly false. We're asking listeners to grade on a reference scale, compare against a low anchor, and judge the severity of distortions, not merely to say whether one codec is better than another.

If you're going to claim the data only "seems like legitimate" statistics, you'd better back that statement up. Specifically, why the interval scale used here (and in each and every previous test) suddenly has to be abandoned for an ordinal scale, or why we're dropping the ITU-R BS.1116-1 methodology generally followed in these tests. Are you saying the ITU methodology only "seems like legitimate"?
C.R.Helmrich
post Aug 23 2011, 20:42
Post #4





Group: Developer
Posts: 681
Joined: 6-December 08
From: Erlangen Germany
Member No.: 64012



QUOTE (benski @ Aug 23 2011, 21:18) *
... whether or not a listener ranked one encoder higher or lower than another.

I thought that (Garf, correct me if necessary) this information is reflected in the p-value tables and whether or not the confidence intervals of two coders overlap.

Interesting results. I guess I have to add Sample20 to my standard test set at work...

Chris


--------------------
If I don't reply to your reply, it means I agree with you.
Garf
post Aug 23 2011, 20:47
Post #5


Server Admin


Group: Admin
Posts: 4853
Joined: 24-September 01
Member No.: 13



QUOTE (C.R.Helmrich @ Aug 23 2011, 21:42) *
QUOTE (benski @ Aug 23 2011, 21:18) *
... whether or not a listener ranked one encoder higher or lower than another.

I thought that (Garf, correct me if necessary) this information is reflected in the p-value tables and whether or not the confidence intervals of two coders overlap.


Yes (aggregate over all listeners). Note that the graphics are simplified plots, and don't have the correct confidence intervals for the bootstrap (because the tool doesn't support generating them) nor for ANOVA (IIRC, the plots don't consider the blocking).

This is why you'll see overlap in the graphics but not in the bootstrap nor blocked ANOVA results.

Basically, the graphics suck, but they look cute ;)
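For anyone wanting to check an overlap properly rather than by eyeballing the plots, a percentile-bootstrap interval for the mean per-listener score difference between two codecs is easy to sketch. The differences below are invented for illustration; this is only the textbook percentile method, not a reimplementation of the tool used for the test.

```python
import random

def bootstrap_ci(diffs, reps=10000, alpha=0.05, seed=1):
    """Percentile bootstrap CI for the mean per-listener score
    difference between two codecs."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n
        for _ in range(reps)
    )
    lo = means[int(reps * alpha / 2)]
    hi = means[int(reps * (1 - alpha / 2))]
    return lo, hi

# hypothetical per-listener differences (codec A score - codec B score)
diffs = [0.5, -0.1, 0.4, 0.3, 0.4, 0.5, 0.1, 0.4, 0.2, 0.3]
lo, hi = bootstrap_ci(diffs)
print(lo, hi)  # if 0 lies outside the interval, the difference is significant
```

This directly tests the *difference*, which is the quantity of interest; two per-codec intervals can overlap even when the difference interval excludes zero.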

This post has been edited by Garf: Aug 23 2011, 20:49
benski
post Aug 23 2011, 20:52
Post #6


Winamp Developer


Group: Developer
Posts: 669
Joined: 17-July 05
From: Ashburn, VA
Member No.: 23375



QUOTE (C.R.Helmrich @ Aug 23 2011, 15:42) *
QUOTE (benski @ Aug 23 2011, 21:18) *
... whether or not a listener ranked one encoder higher or lower than another.

I thought that (Garf, correct me if necessary) this information is reflected in the p-value tables and whether or not the confidence intervals of two coders overlap.

Interesting results. I guess I have to add Sample20 to my standard test set at work...

Chris


Yes, the ANOVA test uses Friedman, which ranks the codecs. The graphs seem to be built from a parametric analysis of the test results, as if they were normally distributed data.
benski
post Aug 23 2011, 20:54
Post #7


Winamp Developer


Group: Developer
Posts: 669
Joined: 17-July 05
From: Ashburn, VA
Member No.: 23375



QUOTE (Garf @ Aug 23 2011, 15:27) *
QUOTE (benski @ Aug 23 2011, 21:18) *
It would be interesting to do a rank-sum analysis comparing each pair of encoders. Although the numeric values assigned by the listener seem like legitimate statistical data, the only real value is whether or not a listener ranked one encoder higher or lower than another.


Completely and utterly false. We're asking to grade on a reference scale, compare to a low anchor, and judge the severity of distortions, not whether codecs are better than others.

If you're going to claim this only "seems like legitimate", you better back up that statement. Specifically, why the interval scale here (used in each and every previous test) suddenly has to be abandoned for an ordinal scale, or why we're dropping the tracking of ITU-R BS.1116-1 methodology that's generally done in these tests. Are you saying the ITU methodology only "seems like legitimate"?


Sorry, I only now read the caveat on the results page: "The graphs are a simple ANOVA analysis over all submitted and valid results. This is compatible with the graphs of previous listening tests, but should only be considered as a visual support for the real analysis." My initial reaction was to the box-plot graphs, not to the analysis at the bottom of the page.

The Friedman ANOVA analyses (bootstrap or not) use rank-based testing.

This post has been edited by benski: Aug 23 2011, 21:00
IgorC
post Aug 23 2011, 21:01
Post #8





Group: Members
Posts: 1506
Joined: 3-January 05
From: Argentina, Bs As
Member No.: 18803



QUOTE (benski @ Aug 23 2011, 16:18) *
It would be interesting to do a rank-sum analysis comparing each pair of encoders. Although the numeric values assigned by the listener seem like legitimate statistical data, the only real value is whether or not a listener ranked one encoder higher or lower than another.

A rank-sum analysis would be even more unfavorable for the FhG encoder.
benski
post Aug 23 2011, 21:07
Post #9


Winamp Developer


Group: Developer
Posts: 669
Joined: 17-July 05
From: Ashburn, VA
Member No.: 23375



QUOTE (IgorC @ Aug 23 2011, 16:01) *
QUOTE (benski @ Aug 23 2011, 16:18) *
It would be interesting to do a rank-sum analysis comparing each pair of encoders. Although the numeric values assigned by the listener seem like legitimate statistical data, the only real value is whether or not a listener ranked one encoder higher or lower than another.

ranksum analysis would be even more unfavorable for FhG encoder.


Actually, it gives the same order. CVBR > TVBR > FhG > CT > Nero
IgorC
post Aug 23 2011, 21:11
Post #10





Group: Members
Posts: 1506
Joined: 3-January 05
From: Argentina, Bs As
Member No.: 18803



QUOTE (benski @ Aug 23 2011, 17:07) *
Actually, it gives the same order. CVBR > TVBR > FhG > CT > Nero


Yes, but TVBR > FhG with p = 0.00 for rank-sum, though not for bootstrap.
Garf
post Aug 23 2011, 21:11
Post #11


Server Admin


Group: Admin
Posts: 4853
Joined: 24-September 01
Member No.: 13



QUOTE (benski @ Aug 23 2011, 21:54) *
The Friedman ANOVA analysis (bootstrap or not) are using rank-based testing.


(Blocked) ANOVA is a parametric, means-based test. FRIEDMAN is the name of the utility (which, unsurprisingly, also supports Friedman analysis). The result posted is means-based, not rank-based. It's there mostly to allow cross-referencing with older tests and with other statistical packages, which are more likely to support normal blocked ANOVA than the nonparametric variants. Friedman wasn't developed further because it doesn't allow p-value step-down without losing a significant amount of power over many comparisons, and because for high-bitrate tests it is no longer clear that the results are normally distributed. That's exactly what led to the bootstrap.
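For reference, the textbook Friedman statistic Garf is contrasting with blocked ANOVA ranks the codecs within each listener and then compares mean ranks. A minimal sketch (midranks for ties, but without the tie-correction factor; the scores are invented):

```python
def friedman_statistic(scores):
    """Friedman chi-square on a listeners x codecs score matrix:
    ranks codecs within each listener, then tests whether the
    rank sums differ more than chance would allow."""
    n = len(scores)        # listeners (blocks)
    k = len(scores[0])     # codecs (treatments)
    rank_sums = [0.0] * k
    for row in scores:
        # assign midranks within this listener's row (ties share a rank)
        order = sorted(range(k), key=lambda j: row[j])
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            mid = (i + j) / 2 + 1
            for m in range(i, j + 1):
                ranks[order[m]] = mid
            i = j + 1
        for j in range(k):
            rank_sums[j] += ranks[j]
    # classic Friedman chi-square formula
    return 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)

# hypothetical 4 listeners x 3 codecs, all listeners agreeing on the order
scores = [
    [4.8, 4.2, 3.0],
    [4.6, 4.4, 3.2],
    [4.9, 4.1, 3.5],
    [4.7, 4.5, 2.9],
]
print(friedman_statistic(scores))  # 8.0, the maximum for perfectly consistent rankings
```

Note how the statistic depends only on within-listener orderings, which is precisely why it loses power relative to a means-based test when the numeric grades do carry interval information.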

This post has been edited by Garf: Aug 23 2011, 21:48
IgorC
post Aug 23 2011, 21:30
Post #12





Group: Members
Posts: 1506
Joined: 3-January 05
From: Argentina, Bs As
Member No.: 18803



I should also mention that I've participated in this test too. Steve Forte Rio prepared the ABC/HR sessions and a new key for me, and he checked my results. (You can find this key in results.zip.)
A big thank-you to him for that.

For anyone interested in analysing the results:

SampleXX - original
SampleXX_1 - Nero
SampleXX_2 - Apple CVBR
SampleXX_3 - Apple TVBR
SampleXX_4 - FhG (Winamp 5.62)
SampleXX_5 - Coding Technologies (Winamp 5.61)
SampleXX_6 - ffmpeg AAC (low anchor)
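Since the suffix scheme above is regular, a tiny helper can map a file back to its encoder when scripting an analysis. The filenames in the example are hypothetical; only the suffix-to-encoder mapping comes from the list above.

```python
# suffix -> encoder, per the naming scheme in the post above
ENCODERS = {
    "1": "Nero",
    "2": "Apple CVBR",
    "3": "Apple TVBR",
    "4": "FhG (Winamp 5.62)",
    "5": "Coding Technologies (Winamp 5.61)",
    "6": "ffmpeg AAC (low anchor)",
}

def encoder_for(filename):
    """Return the encoder name for a file like 'Sample07_4.wav',
    or 'original' for an unsuffixed 'Sample07.wav'."""
    stem = filename.rsplit(".", 1)[0]
    if "_" in stem:
        return ENCODERS[stem.rsplit("_", 1)[1]]
    return "original"

print(encoder_for("Sample07_4.wav"))  # FhG (Winamp 5.62)
```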


This post has been edited by IgorC: Aug 23 2011, 21:37
zima
post Aug 23 2011, 21:37
Post #13





Group: Members
Posts: 128
Joined: 3-July 03
From: Pomerania
Member No.: 7541



Maybe there could be a legend for the X-axis abbreviations, at least under the first graph?

FhG, low_anchor* and Nero are almost clear enough (*though "wait, what was it again?" ;p ), but making sense of CT, CVBR, and TVBR might require going back to the test page, which I think should be unnecessary.


--------------------
http://last.fm/user/zima
lvqcl
post Aug 23 2011, 22:05
Post #14





Group: Developer
Posts: 3212
Joined: 2-December 07
Member No.: 49183



It is interesting that the QT TVBR- and CVBR-encoded files are identical for samples 7, 10, 13 and 14 (foobar2000 comparator: "No differences in decoded data found").
IgorC
post Aug 23 2011, 22:11
Post #15





Group: Members
Posts: 1506
Joined: 3-January 05
From: Argentina, Bs As
Member No.: 18803



zima,

will fix it later.

QUOTE (lvqcl @ Aug 23 2011, 18:05) *
It is interesting that QT tvbr and cvbr encoded files are identical for samples # 7, 10, 13, 14. (foobar2000 comparator: "No differences in decoded data found")

Yes, I've noticed that too. It's interesting that listeners still rated them differently (even though the files are bit-identical). Though that's normal.

This post has been edited by IgorC: Aug 23 2011, 22:15
Alexxander
post Aug 23 2011, 22:31
Post #16





Group: Members
Posts: 456
Joined: 15-November 04
Member No.: 18143



Thanks to all who participated in this test and to those who made this test possible, especially to IgorC.

My findings are in line with the general results, and I am actually surprised by Nero ending up rather low. It's curious that the CVBR mean is a bit higher than TVBR's, but I suppose that has little meaning, since each falls within the other's confidence interval.

In some personal testing about a year ago with Apple CVBR at around 128 kbps, I found it stunningly good, but I never really compared it to Nero (which I've been using for two years now). Is it safe to conclude that a codec that is better at about 100 kbps is also better at 128 kbps? Or might the quality of the tuning differ between quality settings (and therefore bitrates)?

Many thanks again!

Garf
post Aug 23 2011, 23:02
Post #17


Server Admin


Group: Admin
Posts: 4853
Joined: 24-September 01
Member No.: 13



QUOTE (Alexxander @ Aug 23 2011, 23:31) *
Is it safe to conclude that if a codec is better at about 100kbps, it also is at 128kbps? Or might the quality of tuning be different for different quality settings (therefore different bitrates)?


This is a tough question. The quality of the tuning can make a difference. But barring further information, I'd bet that the codec that is better tuned/performing at 100 kbps will perform better at 128 kbps too.

You could say that a codec's performance at 100 kbps is a hint of, but not proof of, how it will do at 128 kbps.
Dakeryas
post Aug 23 2011, 23:11
Post #18





Group: Members
Posts: 58
Joined: 11-July 09
From: Lorraine, France
Member No.: 71375



Many thanks for the test!

Interesting to see Nero's lesser performance. Even though I encode at much higher bitrates, I should definitely have a look at that qtaacenc thing (huh, Apple stuff).
IgorC
post Aug 23 2011, 23:51
Post #19





Group: Members
Posts: 1506
Joined: 3-January 05
From: Argentina, Bs As
Member No.: 18803



I've noticed that the previous Nero version, 1.0.7.0, produces much better quality than the latest, 1.5.4.0, at 96-100 kbps. (I did blind tests, though, so don't hit me with TOS8.)

The only explanation that comes to mind is that tuning for some bitrates can produce regressions at others.

This post has been edited by IgorC: Aug 23 2011, 23:52
Gornot
post Aug 24 2011, 00:29
Post #20





Group: Members
Posts: 6
Joined: 25-December 09
Member No.: 76334



To be perfectly honest, I am surprised that FhG did so well against Coding Technologies. Since Winamp introduced it, some of my songs seemed to retain more quality when encoded with the Coding Technologies encoder rather than FhG.
Too bad I found out about the test two days after it had already closed. I'd been anxious to see the results; interesting how Nero did the worst. Great information for future reference :D
/mnt
post Aug 24 2011, 01:23
Post #21





Group: Members
Posts: 696
Joined: 22-April 06
Member No.: 29877



Interesting results. I've got to see whether QuickTime's pre-echo handling on sharp attacks has improved, though.

Sadly, I'm not surprised that Nero lost. It still has trouble with hi-hats, and with certain regressions that were introduced after 1.0.0.7.

This post has been edited by /mnt: Aug 24 2011, 01:25


--------------------
"I never thought I'd see this much candy in one mission!"
kennedyb4
post Aug 24 2011, 01:42
Post #22





Group: Members
Posts: 772
Joined: 3-October 01
Member No.: 180



If it is fair to say that many of the samples were "killer" samples, CVBR's performance is quite good. I will still stick with TVBR, as it saves substantial bits on "easier" samples.

Thanks again for this learning experience, and for the hard work of all concerned.
Sebastian Mares
post Aug 24 2011, 07:29
Post #23





Group: Members
Posts: 3629
Joined: 14-May 03
From: Bad Herrenalb
Member No.: 6613



It appears to me that the low anchor was way too bad. Shouldn't the low anchor be at around the same quality as the contenders, but "slightly" worse than all of them?


--------------------
http://listening-tests.hydrogenaudio.org/sebastian/
greynol
post Aug 24 2011, 07:49
Post #24





Group: Super Moderator
Posts: 10000
Joined: 1-April 04
From: San Francisco
Member No.: 13167



I was wondering the same thing.


--------------------
Your eyes cannot hear.
Garf
post Aug 24 2011, 09:35
Post #25


Server Admin


Group: Admin
Posts: 4853
Joined: 24-September 01
Member No.: 13



QUOTE (Sebastian Mares @ Aug 24 2011, 08:29) *
It appears to me that the low anchor was way too bad. Shouldn't the low anchor be at around the same quality as the contenders, but "slightly" worse than all of them?


Not sure about this one; I thought it should "calibrate the scale". (Because the overall quality is so high, it's less needed at the upper end.)

If you don't use an anchor, listeners tend to slam the slider down for a minor distortion. The anchor serves as a reminder of what really bad actually sounds like.

It would be more useful if the anchor stayed the same across tests, I guess. Probably the chance to test ffmpeg in one swoop was appealing. No idea if it was understood that it is *this* bad.

FWIW, this is a somewhat relevant and interesting paper I hadn't seen before:
http://www.acourate.com/Download/BiasesInM...teningTests.pdf