Public AAC Listening Test @ ~96 kbps [July 2011]: Results, Results and post-test discussion
Nezmer
post Aug 24 2011, 11:35
Post #26





Group: Members
Posts: 8
Joined: 6-May 11
Member No.: 90410



QUOTE (Garf @ Aug 24 2011, 10:35) *
Probably the opportunity to test ffmpeg in one swoop was interesting. No idea if it was understood it is *this* bad.


The AAC and Vorbis encoders in FFmpeg/libav were written to produce valid bitstreams without implementing any sophisticated optimisations. So, the results here shouldn't be a surprise.
Garf
post Aug 24 2011, 12:46
Post #27


Server Admin


Group: Admin
Posts: 4853
Joined: 24-September 01
Member No.: 13



QUOTE (Nezmer @ Aug 24 2011, 12:35) *
QUOTE (Garf @ Aug 24 2011, 10:35) *
Probably the opportunity to test ffmpeg in one swoop was interesting. No idea if it was understood it is *this* bad.


The AAC and Vorbis encoders in FFmpeg/libav were written to produce valid bitstreams without implementing any sophisticated optimisations. So, the results here shouldn't be a surprise.


Sorry, but this just isn't true for the ffmpeg AAC encoder. Have you actually looked at it? It's reasonably sophisticated, more sophisticated than FAAC for example. It has a real psymodel, 3 different quantization loop algorithms, proper short block switching, etc.

Even so, there's no particular reason to believe a non-sophisticated AAC encoder must suck terribly. Again, FAAC is a good reference.
http://listeningtests.t35.me/html/AAC_at_1...est_results.htm

As far as I can tell, the problem is that it is utterly riddled with bugs and was probably never properly tested and debugged. It might be misdesigned too, but I feel like I'm sticking out my neck here because I could be wrong on that - maybe the current design works fine if you fix the bugs.

The ffmpeg AAC encoder is crap because it's buggy and insufficiently tested. Not because it's missing sophisticated algorithms.
Nezmer
post Aug 24 2011, 18:26
Post #28





Group: Members
Posts: 8
Joined: 6-May 11
Member No.: 90410



QUOTE (Garf @ Aug 24 2011, 13:46) *
QUOTE (Nezmer @ Aug 24 2011, 12:35) *
QUOTE (Garf @ Aug 24 2011, 10:35) *
Probably the opportunity to test ffmpeg in one swoop was interesting. No idea if it was understood it is *this* bad.


The AAC and Vorbis encoders in FFmpeg/libav were written to produce valid bitstreams without implementing any sophisticated optimisations. So, the results here shouldn't be a surprise.


Sorry, but this just isn't true for the ffmpeg AAC encoder. Have you actually looked at it? It's reasonably sophisticated, more sophisticated than FAAC for example. It has a real psymodel, 3 different quantization loop algorithms, proper short block switching, etc.

Even so, there's no particular reason to believe a non-sophisticated AAC encoder must suck terribly. Again, FAAC is a good reference.
http://listeningtests.t35.me/html/AAC_at_1...est_results.htm

As far as I can tell, the problem is that it is utterly riddled with bugs and was probably never properly tested and debugged. It might be misdesigned too, but I feel like I'm sticking out my neck here because I could be wrong on that - maybe the current design works fine if you fix the bugs.

The ffmpeg AAC encoder is crap because it's buggy and insufficiently tested. Not because it's missing sophisticated algorithms.


I stand corrected.

The AAC encoder still needs `-strict experimental` to be enabled, and I assumed they would distribute a basic encoder first and then gradually implement optimisations.

Looking at the git log of 'aacenc.c', the last four commits contain three fixes and one library change. Before that, the psymodel seems to have been the focus of the work done earlier this year.

How does all this affect the quality of the encoder? I don't know.
C.R.Helmrich
post Aug 24 2011, 18:27
Post #29





Group: Developer
Posts: 681
Joined: 6-December 08
From: Erlangen Germany
Member No.: 64012



Some bit-rate statistics, like those presented with previous test results. Feel free to double-check them and add them to the results page. All data were obtained using foobar2000 v1.1.8 beta 5. Where a mean bit-rate is given as a range, my own calculations differ from the ones reported by foobar2000.

CODE
Sample   Length[s]  nero     QT CVBR  QT TVBR  FhG      CT CBR   Anchor
--------------------------------------------------------------------------
01       30         109      108      119      120      100      102
02        9          75       94       67       77      100       76
03       13          93      112      102       97      100      101
04       28         102       99       98      113      100      103
05       30          95       97       95       99      100       98
06       20          81       98       84       90      100      105
07       22         109      107      107      125      100      103
08       28          94      105       82       95      100       97
09        9          96       98       95      106      100      104
10       30          98      106      106      101      100       99
11       20          96       97       87      104      100      100
12       15         100      110      101      100      100      100
13       10         101      101      101       95      100       99
14       10          89       97       97      105      100      104
15       19         105      109      113      117      100      101
16       28          90       96       84       91      100      101
17       20         104       97       90      105      100      104
18       18          65       93       67       84      100      102
19       16         106       98       91      101      100       96
20       30          90       96       83       83      100       97
--------------------------------------------------------------------------
Mean     20.3        95-96   101       93-94   100-101  100      100
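One plausible explanation for such ranges (my assumption, not something stated above) is the difference between a plain per-sample mean and a length-weighted mean. A quick sketch using the first four rows of the table:

```python
# Two ways of averaging per-sample bitrates. A plain mean treats every
# sample equally; a length-weighted mean (total bits / total duration)
# weights longer clips more, which can easily shift the overall figure
# by a kbps or two.

lengths = [30, 9, 13, 28]       # seconds (samples 01-04 above)
bitrates = [109, 75, 93, 102]   # kbps (nero column)

plain_mean = sum(bitrates) / len(bitrates)
weighted_mean = sum(l * b for l, b in zip(lengths, bitrates)) / sum(lengths)

print(plain_mean, weighted_mean)
```

Whether foobar2000 reports the plain or the length-weighted figure is not documented here, so treat this only as one candidate explanation for the discrepancy.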

Chris


--------------------
If I don't reply to your reply, it means I agree with you.
Garf
post Aug 25 2011, 07:29
Post #30


Server Admin


Group: Admin
Posts: 4853
Joined: 24-September 01
Member No.: 13



QUOTE (C.R.Helmrich @ Aug 24 2011, 19:27) *
Some bit-rate statistics, like those presented with previous test results. Feel free to double-check them and add them to the results page. All data were obtained using foobar2000 v1.1.8 beta 5. Where a mean bit-rate is given as a range, my own calculations differ from the ones reported by foobar2000.


Thanks, added to the results page.
Zarggg
post Aug 25 2011, 18:06
Post #31





Group: Members
Posts: 533
Joined: 18-January 04
From: bethlehem.pa.us
Member No.: 11318



Just looking for a quick verification on whether I'm interpreting the results properly:

Am I correct in concluding that QuickTime Constrained VBR performed slightly better than QuickTime True VBR, but not by enough to make an obvious difference?
greynol
post Aug 25 2011, 18:12
Post #32





Group: Super Moderator
Posts: 10000
Joined: 1-April 04
From: San Francisco
Member No.: 13167



CVBR and TVBR are statistically tied. One did not do better than the other, not even slightly.


--------------------
Your eyes cannot hear.
IgorC
post Aug 25 2011, 18:28
Post #33





Group: Members
Posts: 1506
Joined: 3-January 05
From: Argentina, Bs As
Member No.: 18803



QUOTE (Zarggg @ Aug 25 2011, 14:06) *
Am I correct in concluding that QuickTime Constrained VBR performed slightly better than QuickTime True VBR, but not by enough to make an obvious difference?


The reason is that CVBR was at 100 kbps while TVBR was at 95 kbps; that was a limitation of the bitrate scale.


This post has been edited by IgorC: Aug 25 2011, 18:30
greynol
post Aug 25 2011, 19:04
Post #34





Group: Super Moderator
Posts: 10000
Joined: 1-April 04
From: San Francisco
Member No.: 13167



That assumes facts not in evidence.


--------------------
Your eyes cannot hear.
Gecko
post Aug 25 2011, 20:24
Post #35





Group: Members
Posts: 934
Joined: 15-December 01
From: Germany
Member No.: 662



First, thank you IgorC and everyone involved!

How do I perform and interpret the analysis on a different set of data (e.g. only my personal results)? Here's what I've got so far:
1. From the provided results.zip copy the "Sorted by sample" folder to a new location and delete all unwanted test results (e.g. keep only 34_GECKO_test??.txt).
2. Use chunky to gather the ratings: chunky.py --codecs=1,Nero;2,CVBR;3,TVBR;4,FhG;5,CT;6,ffmpeg -n --ratings=results --warn -p 0.05 --directory="d:\foo"
3. Take chunky's output "results.txt" and feed it to bootstrap: bootstrap.py --blocked --compare-all -p 100000 -s 100000 results.txt > bootstrapped.txt

a) Do I need to look at "Unadjusted p-values:" or "p-values adjusted for multiple comparison:" if I am just checking my own results? In other words: does the "multiple comparisons" refer to multiple listeners or multiple samples (or something else)?
b) Can step 1. be done more efficiently?
c) How do I run chunky over all results to get one merged results file like "results_AAC_2011.txt" in results.zip? Right now I get per sample results averaged over all listeners (and results for individual samples which could be merged by hand)
Zarggg
post Aug 25 2011, 22:47
Post #36





Group: Members
Posts: 533
Joined: 18-January 04
From: bethlehem.pa.us
Member No.: 11318



QUOTE (greynol @ Aug 25 2011, 13:12) *
CVBR and TVBR are statistically tied. One did not do better than the other, not even slightly.

That answer is just as good for my own edification. Thanks.
mjb2006
post Aug 25 2011, 22:50
Post #37





Group: Members
Posts: 706
Joined: 12-May 06
From: Colorado, USA
Member No.: 30694



Even though I sent in results, they didn't get included at all, neither accepted nor rejected. I sent them to the correct address on the last day (Aug. 20). Anyway, going through them now, I found that I kind of botched #04, which was the clip from the intro to OMD's "Enola Gay".

The second half of the clip has the hi-hats panned slightly to the right of center. All the encoders except ffmpeg(!) seem to put the hi-hats pretty much dead-center. The panning is so slight that I didn't notice the issue at all until the 5th comparison, which for me was Apple CVBR (Sample04_2). In that comparison, the panning was the only difference I noticed.

At that point, I should've checked the reference clip, but instead I checked my previous answers, and found that on one comparison (which turns out to be ffmpeg), the encoded version sounded really bad, but the hi-hats were panned slightly to the right, so I incorrectly guessed that the panning was an artifact. Oops. So for comparison #5, I guessed that CVBR was the original and I rated the actual original as inferior.

Now I feel like I should've listened to the reference clip at that point and spotted it there, then gone back and changed my answers on the previous comparisons. But instead I just left them as they were:
  • comparison #1: no differences noticed between original & Sample04_4 (FhG)
  • comparison #2: Sample04_6 (ffmpeg) rated very inferior due to ringing synths, syrupy hi-hats
  • comparison #3: Sample04_1 (Nero) rated somewhat inferior due to ringing synths
  • comparison #4: no differences noticed between original & Sample04_3 (TVBR)

If I were to go back after noticing the panning issue and listen for it in the comparisons I had already completed, I would've noticed it in #1 and #4, and would've correctly spotted and downgraded my ratings for the encoded clips. But doesn't going back like that, once I've told myself what to listen for, invalidate those results? If I only "naturally" notice the panning some of the time, shouldn't that just be accepted?

This post has been edited by mjb2006: Aug 25 2011, 22:54
IgorC
post Aug 26 2011, 03:30
Post #38





Group: Members
Posts: 1506
Joined: 3-January 05
From: Argentina, Bs As
Member No.: 18803



QUOTE (Gecko @ Aug 25 2011, 16:24) *
First, thank you IgorC and everyone involved!

Thank you too for your complete set of 20 results.

QUOTE (Gecko @ Aug 25 2011, 16:24) *
How do I perform and interpret the analysis on a different set of data (e.g. only my personal results)? Here's what I've got so far:
1. From the provided results.zip copy the "Sorted by sample" folder to a new location and delete all unwanted test results (e.g. keep only 34_GECKO_test??.txt).
2. Use chunky to gather the ratings: chunky.py --codecs=1,Nero;2,CVBR;3,TVBR;4,FhG;5,CT;6,ffmpeg -n --ratings=results --warn -p 0.05 --directory="d:\foo"
3. Take chunky's output "results.txt" and feed it to bootstrap: bootstrap.py --blocked --compare-all -p 100000 -s 100000 results.txt > bootstrapped.txt

a) Do I need to look at "Unadjusted p-values:" or "p-values adjusted for multiple comparison:" if I am just checking my own results? In other words: does the "multiple comparisons" refer to multiple listeners or multiple samples (or something else)?
b) Can step 1. be done more efficiently?
c) How do I run chunky over all results to get one merged results file like "results_AAC_2011.txt" in results.zip? Right now I get per sample results averaged over all listeners (and results for individual samples which could be merged by hand)

a) Both are fine. Though I'm also interested to hear Garf on this subject.
b) Yes, there is an easier way. There is a "Sorted by listener" folder. Find the folder with your results ("34_GECKO"), rename it to "Sample01" and run chunky on it.
c) You should copy-paste all the results (results01, results02, ..., results20) into results_AAC_2011.txt, without spaces or comments. You will have 280 results in total: sample01 has 21 results, sample02 has 20 results, and so on. If you run into any issues, compare against "results_AAC_2011.txt".



mjb2006
I do not accept results after the closure of the test (the evening of 20 Aug), so your results would have been discarded anyway.
Also, your results for samples 03 and 04 are invalid. Two invalid results out of your total of five (01, 02, 03, 04, 05) means a complete discard; read rules.txt.
I've repeated many times that single results should be sent as soon as possible so they can be re-done in case of errors. And your results are dated 26 July. There is nobody to blame.

This post has been edited by IgorC: Aug 26 2011, 03:47
mjb2006
post Aug 26 2011, 05:18
Post #39





Group: Members
Posts: 706
Joined: 12-May 06
From: Colorado, USA
Member No.: 30694



QUOTE (IgorC @ Aug 25 2011, 20:30) *
I do not accept the results after the closure of the test (evening 20 Aug). Your results would be discarded anyway.
I'm not upset, and I did not wish to imply that I was arguing about whether my results should have been considered valid. Clearly they are not.

Besides, I see now what happened. On 20 Aug I realized I would not have time to do more tests, so I checked the thread, and you had not yet made your post saying the test was closed, so I RARed my old results (file modification time 16:04:09-0600) and sent them (email time 16:05:34). I see now that you posted in that very short interval (post time 16:04:xx).

And I didn't realize that you would be contacting people about errors and offering them the chance to re-do those tests. That meaning is not at all obvious from the statement that sending results early "helps to prevent some simple errors related to ABC-HR application or any other at early stage," which sounds like a reference to logistical issues, and it also seems to be the only time you mentioned it in the test thread, not something you "repeated many times."

Anyway, is it normal for ~27% of listeners to have their results discarded?

IgorC
post Aug 26 2011, 05:24
Post #40





Group: Members
Posts: 1506
Joined: 3-January 05
From: Argentina, Bs As
Member No.: 18803



QUOTE (mjb2006 @ Aug 26 2011, 01:18) *
Anyway, is it normal for ~27% of listeners to have their results discarded?

Yes, it is normal. The quality bar is pretty high.


QUOTE (mjb2006 @ Aug 26 2011, 01:18) *
And I didn't realize that you would be contacting people about errors and offering them the chance to re-do those tests

http://www.hydrogenaudio.org/forums/index....st&p=764051
http://www.hydrogenaudio.org/forums/index....st&p=763480



This post has been edited by IgorC: Aug 26 2011, 05:43
Garf
post Aug 26 2011, 07:16
Post #41


Server Admin


Group: Admin
Posts: 4853
Joined: 24-September 01
Member No.: 13



QUOTE (Gecko @ Aug 25 2011, 21:24) *
a) Do I need to look at "Unadjusted p-values:" or "p-values adjusted for multiple comparison:" if I am just checking my own results? In other words: does the "multiple comparisons" refer to multiple listeners or multiple samples (or something else)?


Always look at the adjusted p-values. "Multiple comparisons" doesn't refer to listeners or samples, but simply to the fact that every codec is compared to every other codec. Bootstrap shows "15 comparisons" for this test, so the p-values must be adjusted for that.

This is a cartoon explaining what happens if we WOULDN'T do that:
http://www.xkcd.com/882/
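For illustration only (this is not bootstrap.py's code, and the raw p-values below are made up): with 6 codecs there are 6*5/2 = 15 pairwise comparisons to correct for, and the two standard corrections can be sketched like this:

```python
# Illustrative only (not the actual bootstrap.py code): adjusting a set
# of raw p-values for multiple comparisons. Bonferroni multiplies each
# p-value by the number of comparisons; Holm is a step-down variant
# that is never more conservative than Bonferroni.

def bonferroni(pvals):
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def holm(pvals):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Step-down: multiply the k-th smallest p by (m - k), clip at 1,
        # and enforce monotonicity with a running maximum.
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted

raw = [0.001, 0.01, 0.03, 0.04]   # made-up raw p-values
print(bonferroni(raw))            # 0.03 is no longer below 0.05 after adjustment
print(holm(raw))
```

Holm gives adjusted values no larger than Bonferroni's, so it flags at least as many differences while controlling the same error rate.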

QUOTE
c) How do I run chunky over all results to get one merged results file like "results_AAC_2011.txt" in results.zip? Right now I get per sample results averaged over all listeners (and results for individual samples which could be merged by hand)


This must be done by hand. Chunky has a bug where by default it slams all listeners together per sample in its final result (so you end up with a result as if only a single person had taken the test).
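The by-hand merge can be sketched roughly like this; the file names and the "#" comment convention are my assumptions for illustration, not chunky's actual output format:

```python
# Sketch of merging per-sample chunky outputs into one results file,
# dropping blank and comment lines. File names and the "#" comment
# convention are illustrative assumptions.

import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())

# Fake per-sample outputs standing in for results01.txt ... results20.txt.
(workdir / "results01.txt").write_text("# sample 01\n4.2 3.9 4.5\n\n")
(workdir / "results02.txt").write_text("# sample 02\n3.8 4.1 4.0\n")

merged = []
for path in sorted(workdir.glob("results*.txt")):
    for line in path.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            merged.append(line)

(workdir / "results_AAC_2011.txt").write_text("\n".join(merged) + "\n")
print(merged)  # ['4.2 3.9 4.5', '3.8 4.1 4.0']
```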
Garf
post Aug 26 2011, 07:20
Post #42


Server Admin


Group: Admin
Posts: 4853
Joined: 24-September 01
Member No.: 13



QUOTE (greynol @ Aug 25 2011, 19:12) *
CVBR and TVBR are statistically tied. One did not do better than the other, not even slightly.


More correctly:

One did do better than the other (the sample means are not equal). But there is still a high enough probability that the result was due to random chance rather than to one codec being better than the other, so we don't want to draw any conclusions from it.
IgorC
post Aug 26 2011, 08:51
Post #43





Group: Members
Posts: 1506
Joined: 3-January 05
From: Argentina, Bs As
Member No.: 18803



QUOTE (Garf @ Aug 26 2011, 03:16) *
Always look at the adjusted p-values. "Multiple comparisons" doesn't refer to listeners or samples, but simply to the fact that every codec is compared to every other codec. Bootstrap shows "15 comparisons" for this test, so the p-values must be adjusted for that.

This is a cartoon explaining what happens if we WOULDN'T do that:
http://www.xkcd.com/882/

Then I completely misunderstood it and I apologize for my wrong answer to Gecko.
Gecko
post Aug 26 2011, 10:28
Post #44





Group: Members
Posts: 934
Joined: 15-December 01
From: Germany
Member No.: 662



Thank you IgorC and Garf for answering my questions!

In the output of bootstrap, is there any semantic difference between "a is better than b" and "b is worse than a"?

During the test I had the feeling that one (regular) encoder was often worse and one often a tad better. Only the former impression seems to be backed by my data (using the adjusted p-values): ffmpeg < Nero < all others.

Given the high level of quality, I'm surprised that such a strong ranking could be established. I would have expected more ties.
IgorC
post Aug 26 2011, 11:20
Post #45





Group: Members
Posts: 1506
Joined: 3-January 05
From: Argentina, Bs As
Member No.: 18803



QUOTE (Gecko @ Aug 26 2011, 06:28) *
Given the high level of quality, I'm surprised that such a strong ranking could be established. I would have expected more ties.

That's actually how bootstrap works.
For chunky it's the same whether a particular sample has 1 or 100 results: it throws away all the individual data and works only with the average score per sample (just 20 average scores, because there were 20 samples).

Bootstrap, by contrast, performs the analysis on the whole set of data (280 results for this test). That makes it possible to detect more statistically significant differences. Literally every single result is useful.
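A toy sketch (not bootstrap.py itself; the numbers are made up) of resampling every individual result: draw the per-result score differences between two codecs with replacement many times and see how often the resampled mean difference drops to zero or below.

```python
# Bootstrap resampling over individual results rather than per-sample
# averages: every single submitted result contributes to the estimate.

import random

random.seed(1)  # deterministic for the example

# Hypothetical per-result differences (codec A score minus codec B score).
diffs = [0.3, -0.1, 0.5, 0.2, 0.0, 0.4, -0.2, 0.3, 0.1, 0.2]

n_boot = 10000
crossings = 0
for _ in range(n_boot):
    resample = [random.choice(diffs) for _ in diffs]
    if sum(resample) / len(resample) <= 0:
        crossings += 1

p_est = crossings / n_boot  # rough one-sided p-value estimate
print(p_est)
```

With only the 20 per-sample averages, an estimate like this would be far noisier; keeping all 280 results tightens it considerably.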

This post has been edited by IgorC: Aug 26 2011, 11:35
Garf
post Aug 27 2011, 17:39
Post #46


Server Admin


Group: Admin
Posts: 4853
Joined: 24-September 01
Member No.: 13



QUOTE (Gecko @ Aug 26 2011, 11:28) *
In the output of bootstrap, is there any semantic difference between "a is better than b" and "b is worse than a"?


No.

Garf
post Aug 27 2011, 17:48
Post #47


Server Admin


Group: Admin
Posts: 4853
Joined: 24-September 01
Member No.: 13



QUOTE (IgorC @ Aug 26 2011, 12:20) *
QUOTE (Gecko @ Aug 26 2011, 06:28) *
Given the high level of quality, I'm surprised that such a strong ranking could be established. I would have expected more ties.

That's actually how bootstraps works.


The result has nothing in particular to do with bootstrap; the "classic" ANOVA from the Friedman tool shows the same results. The problem is that chunky has a bug that was throwing away most listeners, and this wasn't realized for a while.

With 280 submissions, quite a few conclusions can be drawn.

What is bothersome is that the first samples were (once again) tested more often and hence weighted more. Maybe next time we should add a small DB and offer sample downloads one by one, after calculating from the DB which samples are the least tested.
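The selection scheme proposed above could look roughly like this, with a plain counter standing in for the small DB (sample names and counts are illustrative):

```python
# Track how many submissions each sample has received and hand out the
# least-tested one next, so coverage evens out instead of the first
# samples being tested most.

from collections import Counter

submissions = Counter({f"sample{n:02d}": 0 for n in range(1, 21)})
submissions.update(["sample01"] * 5 + ["sample02"] * 3)  # early samples over-tested

def next_sample(counts):
    # Pick the sample with the fewest submissions so far.
    return min(counts, key=counts.get)

chosen = next_sample(submissions)
print(chosen)  # sample03
```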
lvqcl
post Aug 27 2011, 20:22
Post #48





Group: Developer
Posts: 3208
Joined: 2-December 07
Member No.: 49183



QUOTE (Garf @ Aug 23 2011, 23:47) *
Basically, the graphics suck, but they look cute

Exactly. So, the graphs:

Ratings for all 20 samples (without low anchor):
[graph]

Sorted by Nero rating:
[graph]

Without Nero, sorted by CT rating:
[graph]

CVBR, TVBR and FhG, sorted by FhG rating:
[graph]

All encoders, sorted by their ratings independently(!):
[graph]
IgorC
post Aug 27 2011, 21:46
Post #49





Group: Members
Posts: 1506
Joined: 3-January 05
From: Argentina, Bs As
Member No.: 18803



I found the first and the last graphs particularly informative.

From the first graph it's easy to see that half of the AAC encoders did excellently on male English speech (Sample 18). The same holds for female English speech (Sample 06). So it's fair to say that modern high-quality AAC encoders perform very well on speech at 96-100 kbps.

I think we missed a category of sample such as "speech with some background sounds"; it's different from a song with vocals (Sample 17). Though I haven't seen such samples in the usual repositories of test samples.





lvqcl
post Aug 27 2011, 22:20
Post #50





Group: Developer
Posts: 3208
Joined: 2-December 07
Member No.: 49183



QUOTE (IgorC @ Aug 28 2011, 00:46) *
I think we missed a category of sample as "speech with some background sounds"

Such as French_ad? Or maybe something like rawhide is enough?
