Public AAC Listening Test @ ~96 kbps [July 2011]: Results, Results and post-test discussion |
Aug 24 2011, 11:35
Post
#26
|
|
|
Group: Members Posts: 8 Joined: 6-May 11 Member No.: 90410 |
Probably the opportunity to test ffmpeg in one swoop was interesting. No idea if it was understood it is *this* bad. The AAC and Vorbis encoders in FFmpeg/libav were written to produce valid bitstreams without implementing any sophisticated optimisations. So, the results here shouldn't be a surprise. |
|
|
|
Aug 24 2011, 12:46
Post
#27
|
|
![]() Server Admin Group: Admin Posts: 4808 Joined: 24-September 01 Member No.: 13 |
QUOTE Probably the opportunity to test ffmpeg in one swoop was interesting. No idea if it was understood it is *this* bad. The AAC and Vorbis encoders in FFmpeg/libav were written to produce valid bitstreams without implementing any sophisticated optimisations. So, the results here shouldn't be a surprise.
Sorry, but this just isn't true for the ffmpeg AAC encoder. Have you actually looked at it? It's reasonably sophisticated, more sophisticated than FAAC for example. It has a real psymodel, 3 different quantization loop algorithms, proper short block switching, etc.
Even so, there's no particular reason to believe a non-sophisticated AAC encoder must terribly suck. Again, FAAC is a good reference. http://listeningtests.t35.me/html/AAC_at_1...est_results.htm
As far as I can tell, the problem is that it is utterly riddled with bugs and was probably never properly tested and debugged. It might be misdesigned too, but I feel like I'm sticking out my neck here because I could be wrong on that - maybe the current design works fine if you fix the bugs.
The ffmpeg AAC encoder is crap because it's buggy and insufficiently tested. Not because it's missing sophisticated algorithms. |
|
|
|
Aug 24 2011, 18:26
Post
#28
|
|
|
Group: Members Posts: 8 Joined: 6-May 11 Member No.: 90410 |
QUOTE Probably the opportunity to test ffmpeg in one swoop was interesting. No idea if it was understood it is *this* bad. The AAC and Vorbis encoders in FFmpeg/libav were written to produce valid bitstreams without implementing any sophisticated optimisations. So, the results here shouldn't be a surprise.
QUOTE Sorry, but this just isn't true for the ffmpeg AAC encoder. Have you actually looked at it? It's reasonably sophisticated, more sophisticated than FAAC for example. It has a real psymodel, 3 different quantization loop algorithms, proper short block switching, etc. Even so, there's no particular reason to believe a non-sophisticated AAC encoder must terribly suck. Again, FAAC is a good reference. http://listeningtests.t35.me/html/AAC_at_1...est_results.htm As far as I can tell, the problem is that it is utterly riddled with bugs and was probably never properly tested and debugged. It might be misdesigned too, but I feel like I'm sticking out my neck here because I could be wrong on that - maybe the current design works fine if you fix the bugs. The ffmpeg AAC encoder is crap because it's buggy and insufficiently tested. Not because it's missing sophisticated algorithms.
I stand corrected. The AAC encoder still needs `-strict experimental` to be enabled, and I assumed they would distribute a basic encoder first and then gradually implement optimisations.
Looking at the git log of 'aacenc.c', the last four commits contain three fixes and one library change. But before that, the psymodel seems to have been the focus of the work accomplished earlier this year. How does all this affect the quality of the encoder? I don't know. |
|
|
|
Aug 24 2011, 18:27
Post
#29
|
|
|
Group: Developer Posts: 618 Joined: 6-December 08 From: Erlangen Germany Member No.: 64012 |
Some bit-rate statistics which were presented in previous test results. Feel free to double-check and add them to the results page. All data were obtained using foobar v1.1.8 beta 5. If a mean bit-rate is given as a range, it means my own calculations differ from the ones reported by foobar.
CODE
Sample  Length[s]  nero   QT CVBR  QT TVBR  FhG      CT CBR  Anchor
--------------------------------------------------------------------------
01      30         109    108      119      120      100     102
02      9          75     94       67       77       100     76
03      13         93     112      102      97       100     101
04      28         102    99       98       113      100     103
05      30         95     97       95       99       100     98
06      20         81     98       84       90       100     105
07      22         109    107      107      125      100     103
08      28         94     105      82       95       100     97
09      9          96     98       95       106      100     104
10      30         98     106      106      101      100     99
11      20         96     97       87       104      100     100
12      15         100    110      101      100      100     100
13      10         101    101      101      95       100     99
14      10         89     97       97       105      100     104
15      19         105    109      113      117      100     101
16      28         90     96       84       91       100     101
17      20         104    97       90       105      100     104
18      18         65     93       67       84       100     102
19      16         106    98       91       101      100     96
20      30         90     96       83       83       100     97
--------------------------------------------------------------------------
Mean    20.3       95-96  101      93-94    100-101  100     100

Chris
-------------------- If I don't reply to your reply, it means I agree with you.
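For anyone who wants to double-check the means, here is a minimal Python sketch using the nero column and the (rounded) lengths from the table. It compares a plain average over the 20 samples with a length-weighted average. For nero the two come out near 95 and near 96, which happens to bracket the 95-96 shown above; whether that is really why a range appears is only my assumption, since the post does not say how either figure was calculated.

```python
# Values copied from the table above (lengths are rounded to whole seconds).
lengths = [30, 9, 13, 28, 30, 20, 22, 28, 9, 30,
           20, 15, 10, 10, 19, 28, 20, 18, 16, 30]
nero = [109, 75, 93, 102, 95, 81, 109, 94, 96, 98,
        96, 100, 101, 89, 105, 90, 104, 65, 106, 90]


def plain_mean(rates):
    """Every sample counts equally, regardless of its duration."""
    return sum(rates) / len(rates)


def weighted_mean(rates, durations):
    """Each sample counts in proportion to its duration in seconds."""
    total = sum(rate * sec for rate, sec in zip(rates, durations))
    return total / sum(durations)


print(f"plain mean:    {plain_mean(nero):.1f} kbps")
print(f"weighted mean: {weighted_mean(nero, lengths):.1f} kbps")
```

Swapping in another column's values gives the same comparison for the other encoders.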
|
|
|
|
Aug 25 2011, 07:29
Post
#30
|
|
![]() Server Admin Group: Admin Posts: 4808 Joined: 24-September 01 Member No.: 13 |
QUOTE Some bit-rate statistics which were presented in previous test results. Feel free to double-check and add them to the results page. All data were obtained using foobar v1.1.8 beta 5. If a mean bit-rate is given as a range, it means my own calculations differ from the ones reported by foobar.
Thanks, added to the results page. |
|
|
|
Aug 25 2011, 18:06
Post
#31
|
|
![]() Group: Members Posts: 512 Joined: 18-January 04 From: bethlehem.pa.us Member No.: 11318 |
Just looking for a quick verification on whether I'm interpreting the results properly:
Am I correct in concluding that QuickTime Constrained VBR performed slightly better than QuickTime True VBR, but not by enough to make an obvious difference? |
|
|
|
Aug 25 2011, 18:12
Post
#32
|
|
![]() Group: Super Moderator Posts: 9265 Joined: 1-April 04 Member No.: 13167 |
CVBR and TVBR are statistically tied. One did not do better than the other, not even slightly.
-------------------- Everything sounds the same until it is proven otherwise.
|
|
|
|
Aug 25 2011, 18:28
Post
#33
|
|
|
Group: Members Posts: 1315 Joined: 3-January 05 From: Argentina, Bs As Member No.: 18803 |
QUOTE Am I correct in concluding that QuickTime Constrained VBR performed slightly better than QuickTime True VBR, but not by enough to make an obvious difference?
The reason is that CVBR was at 100 kbps and TVBR at 95 kbps. That was a limitation of the bitrate scale.
This post has been edited by IgorC: Aug 25 2011, 18:30 |
|
|
|
Aug 25 2011, 19:04
Post
#34
|
|
![]() Group: Super Moderator Posts: 9265 Joined: 1-April 04 Member No.: 13167 |
That assumes facts not in evidence.
-------------------- Everything sounds the same until it is proven otherwise.
|
|
|
|
Aug 25 2011, 20:24
Post
#35
|
|
![]() Group: Members Posts: 913 Joined: 15-December 01 From: Germany Member No.: 662 |
First, thank you IgorC and everyone involved!
How do I perform and interpret the analysis on a different set of data (e.g. only my personal results)? Here's what I got so far:
1. From the provided results.zip copy the "Sorted by sample" folder to a new location and delete all unwanted test results (e.g. keep only 34_GECKO_test??.txt).
2. Use chunky to gather the ratings:
chunky.py --codecs=1,Nero;2,CVBR;3,TVBR;4,FhG;5,CT;6,ffmpeg -n --ratings=results --warn -p 0.05 --directory="d:\foo"
3. Take chunky's output "results.txt" and feed it to bootstrap:
bootstrap.py --blocked --compare-all -p 100000 -s 100000 results.txt > bootstrapped.txt
a) Do I need to look at "Unadjusted p-values:" or "p-values adjusted for multiple comparison:" if I am just checking my own results? In other words: does the "multiple comparisons" refer to multiple listeners or multiple samples (or something else)?
b) Can step 1. be done more efficiently?
c) How do I run chunky over all results to get one merged results file like "results_AAC_2011.txt" in results.zip? Right now I get per sample results averaged over all listeners (and results for individual samples which could be merged by hand) |
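On question b), step 1 could also be scripted instead of deleting files by hand. The following is only a sketch: the function name and the idea of copying into a fresh folder are mine, and it assumes the result files sit somewhere below the "Sorted by sample" folder with names like 34_GECKO_test??.txt. The destination should be a folder outside the source tree.

```python
import shutil
from pathlib import Path


def copy_listener_results(src_root, dst_root, pattern="34_GECKO_test??.txt"):
    """Copy only the files matching `pattern`, keeping the sub-folder layout.

    `src_root` would be the unpacked "Sorted by sample" folder; `dst_root`
    is a new, separate folder that chunky can then be pointed at.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    copied = []
    # rglob searches all sub-folders; "?" matches exactly one character.
    for src in sorted(src_root.rglob(pattern)):
        if not src.is_file():
            continue
        dst = dst_root / src.relative_to(src_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        copied.append(dst)
    return copied
```

A call such as `copy_listener_results(r"d:\results\Sorted by sample", r"d:\foo")` (both paths are placeholders) would leave only one listener's files in the target folder.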
|
|
|
Aug 25 2011, 22:47
Post
#36
|
|
![]() Group: Members Posts: 512 Joined: 18-January 04 From: bethlehem.pa.us Member No.: 11318 |
|
|
|
|
Aug 25 2011, 22:50
Post
#37
|
|
|
Group: Members Posts: 582 Joined: 12-May 06 From: Colorado, USA Member No.: 30694 |
Even though I sent in results, they didn't get included at all, neither accepted nor rejected. I sent them to the correct address on the last day (Aug. 20). Anyway, going through them now, I found that I kind of botched #04, which was the clip from the intro to OMD's "Enola Gay".
The second half of the clip has the hi-hats panned slightly to the right of center. All the encoders except ffmpeg(!) seem to put the hi-hats pretty much dead-center. The panning is so slight that I didn't notice the issue at all until the 5th comparison, which for me was Apple CVBR (Sample04_2). In that comparison, the panning was the only difference I noticed. At that point, I should've checked the reference clip, but instead I checked my previous answers, and found that on one comparison (which turns out to be ffmpeg), the encoded version sounded really bad, but the hi-hats were panned slightly to the right, so I incorrectly guessed that the panning was an artifact. Oops. So for comparison #5, I guessed that CVBR was the original and I rated the actual original as inferior. Now I feel like I should've listened to the reference clip at that point and spotted it there, then gone back and changed my answers on the previous comparisons. But instead I just left them as they were:
If I were to go back after noticing the panning issue and listen for it in the comparisons I had already completed, I would've noticed it in #1 and #4, and would've correctly spotted and downgraded my ratings for the encoded clips. But doesn't going back like that, once I've told myself what to listen for, invalidate those results? If I only "naturally" notice the panning some of the time, shouldn't that just be accepted? This post has been edited by mjb2006: Aug 25 2011, 22:54 |
|
|
|
Aug 26 2011, 03:30
Post
#38
|
|
|
Group: Members Posts: 1315 Joined: 3-January 05 From: Argentina, Bs As Member No.: 18803 |
QUOTE First, thank you IgorC and everyone involved!
Thank you too for your complete 20 results.
QUOTE How do I perform and interpret the analysis on a different set of data (e.g. only my personal results)? Here's what I got so far: 1. From the provided results.zip copy the "Sorted by sample" folder to a new location and delete all unwanted test results (e.g. keep only 34_GECKO_test??.txt). 2. Use chunky to gather the ratings: chunky.py --codecs=1,Nero;2,CVBR;3,TVBR;4,FhG;5,CT;6,ffmpeg -n --ratings=results --warn -p 0.05 --directory="d:\foo" 3. Take chunky's output "results.txt" and feed it to bootstrap: bootstrap.py --blocked --compare-all -p 100000 -s 100000 results.txt > bootstrapped.txt a) Do I need to look at "Unadjusted p-values:" or "p-values adjusted for multiple comparison:" if I am just checking my own results? In other words: does the "multiple comparisons" refer to multiple listeners or multiple samples (or something else)? b) Can step 1. be done more efficiently? c) How do I run chunky over all results to get one merged results file like "results_AAC_2011.txt" in results.zip? Right now I get per sample results averaged over all listeners (and results for individual samples which could be merged by hand)
a) Both are fine. Though I'm also interested to hear Garf on this subject.
b) Yes, there is an easier way. There is a "Sorted by listener" folder. Find the folder with your results ("34_GECKO"), rename it to "Sample01" and run chunky on it.
c) You should copy-paste all results (results01, results02, ... , results20) into results_AAC_2011.txt, without spaces or comments. You will have 280 results in total: sample01 - 21 results, sample02 - 20 results, etc. If you have any issues, see "results_AAC_2011.txt".
mjb2006, I do not accept results after the closure of the test (evening of 20 Aug). Your results would be discarded anyway. Your results for samples 03 and 04 are invalid.
Two invalid results out of your 5 results in total (01, 02, 03, 04, 05) means a complete discard. Read rules.txt. I've repeated many times that single results should be sent as soon as possible, so that they can be redone in case of errors. And your results are dated 26 July. There is nobody to blame. This post has been edited by IgorC: Aug 26 2011, 03:47 |
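The copy-paste in c) could be scripted roughly as below. This is a sketch under assumptions: the per-sample file names, the "#" comment marker, and reading "without spaces" as "without blank lines" are all my guesses, so the output should be compared against the results_AAC_2011.txt shipped in results.zip.

```python
from pathlib import Path


def merge_results(result_files, merged_path, comment_prefix="#"):
    """Concatenate per-sample result files into one file.

    Blank lines and lines starting with `comment_prefix` are dropped.
    The "#" comment marker is an assumption, not taken from chunky.
    Returns the number of lines written.
    """
    kept = []
    for path in result_files:
        for line in Path(path).read_text().splitlines():
            stripped = line.strip()
            if not stripped or stripped.startswith(comment_prefix):
                continue
            kept.append(stripped)
    Path(merged_path).write_text("\n".join(kept) + "\n")
    return len(kept)
```

For example, `merge_results(sorted(Path("d:/foo").glob("results??.txt")), "merged.txt")` (hypothetical names) should report 280 lines for this test if each line holds one listener's ratings. If every per-sample file starts with its own header row, the repeated headers would still need to be removed.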
|
|
|
Aug 26 2011, 05:18
Post
#39
|
|
|
Group: Members Posts: 582 Joined: 12-May 06 From: Colorado, USA Member No.: 30694 |
QUOTE I do not accept the results after the closure of the test (evening 20 Aug). Your results would be discarded anyway.
I'm not upset, and I did not wish to imply that I was arguing about whether my results should have been considered valid. Clearly they are not.
Besides, I see now what happened. On 20 Aug I realized I would not have time to do more tests, so I checked the thread, and you had not yet made your post saying the test was closed, so I RARed my old results (file modification time 16:04:09-0600) and sent them (email time 16:05:34). I see now that you posted in that very short interval (post time 16:04:xx).
And I didn't realize that you would be contacting people about errors and offering them the chance to re-do those tests. That meaning is not at all obvious from your statement that sending results early "helps to prevent some simple errors related to ABC-HR application or any other at early stage," which sounds like you're referring to logistical issues. It also seems to be the only time you mentioned it in the test thread, not something you "repeated many times."
Anyway, is it normal for ~27% of listeners to have their results discarded? |
|
|
|
Aug 26 2011, 05:24
Post
#40
|
|
|
Group: Members Posts: 1315 Joined: 3-January 05 From: Argentina, Bs As Member No.: 18803 |
QUOTE Anyway, is it normal for ~27% of listeners to have their results discarded?
Yes, it is normal. The quality is pretty high.
QUOTE And I didn't realize that you would be contacting people about errors and offering them the chance to re-do those tests
http://www.hydrogenaudio.org/forums/index....st&p=764051 http://www.hydrogenaudio.org/forums/index....st&p=763480
This post has been edited by IgorC: Aug 26 2011, 05:43 |
|
|
|
Aug 26 2011, 07:16
Post
#41
|
|
![]() Server Admin Group: Admin Posts: 4808 Joined: 24-September 01 Member No.: 13 |
QUOTE a) Do I need to look at "Unadjusted p-values:" or "p-values adjusted for multiple comparison:" if I am just checking my own results? In other words: does the "multiple comparisons" refer to multiple listeners or multiple samples (or something else)?
Always look at the adjusted p-values. Multiple comparisons doesn't refer to listeners or samples, but simply to the fact that every codec is compared to every other codec. Bootstrap shows "15 comparisons" for this test, so the p-values must be adjusted for this. This is a cartoon explaining what happens if we DIDN'T do that: http://www.xkcd.com/882/
QUOTE c) How do I run chunky over all results to get one merged results file like "results_AAC_2011.txt" in results.zip? Right now I get per sample results averaged over all listeners (and results for individual samples which could be merged by hand)
This must be done by hand. Chunky has a bug where by default it slams all listeners together per sample in its final result (so you end up with a result as if only a single person had taken the test). |
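To make the "15 comparisons" concrete, here is a small Python sketch. It counts the codec pairs, estimates how likely at least one false positive would be across 15 unadjusted tests at p = 0.05 (treating the tests as independent, which is a simplification), and applies a Holm-Bonferroni adjustment. Holm is used purely as an illustration; I am not claiming it is the adjustment bootstrap.py applies.

```python
from itertools import combinations

# The six codecs as named in the chunky command line earlier in the thread.
codecs = ["Nero", "CVBR", "TVBR", "FhG", "CT", "ffmpeg"]
pairs = list(combinations(codecs, 2))  # every codec against every other


def familywise_error(alpha, m):
    """Chance of at least one false positive in m independent tests."""
    return 1.0 - (1.0 - alpha) ** m


def holm_adjust(p_values):
    """Holm-Bonferroni adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep values monotone
        adjusted[idx] = running_max
    return adjusted


print(len(pairs), "comparisons")
print(f"unadjusted family-wise error: {familywise_error(0.05, len(pairs)):.2f}")
```

With 15 comparisons the unadjusted chance of a spurious "significant" pair is better than even, which is the point of the cartoon.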
|
|
|
Aug 26 2011, 07:20
Post
#42
|
|
![]() Server Admin Group: Admin Posts: 4808 Joined: 24-September 01 Member No.: 13 |
QUOTE CVBR and TVBR are statistically tied. One did not do better than the other, not even slightly.
More correctly: One did do better than the other (the means are not equal). But there's still a high enough probability that that result was due to random chance instead of one codec being better than another, so we don't want to make any conclusions based on it. |
|
|
|
Aug 26 2011, 08:51
Post
#43
|
|
|
Group: Members Posts: 1315 Joined: 3-January 05 From: Argentina, Bs As Member No.: 18803 |
QUOTE Always look at the adjusted p-values. Multiple comparisons doesn't refer to listeners or samples, but simply the fact that every codec is compared to every other codec. Bootstrap show "15 comparisons" for this test, so the p-values must be adjusted for this. This is a cartoon explaining what happens if we WOULDN'T do that: http://www.xkcd.com/882/
Then I completely misunderstood it and I apologize for my wrong answer to Gecko. |
|
|
|
Aug 26 2011, 10:28
Post
#44
|
|
![]() Group: Members Posts: 913 Joined: 15-December 01 From: Germany Member No.: 662 |
Thank you IgorC and Garf for answering my questions!
In the output of bootstrap, is there any semantic difference between "a is better than b" and "b is worse than a"? During the test I had the feeling that one (regular) sample was often worse and one often a tad better. Only the former assumption seems to be backed by my data (using the adjusted p-values): ffmpeg < Nero < All other Given the high level of quality, I'm surprised that such a strong ranking could be established. I would have expected more ties. |
|
|
|
Aug 26 2011, 11:20
Post
#45
|
|
|
Group: Members Posts: 1315 Joined: 3-January 05 From: Argentina, Bs As Member No.: 18803 |
QUOTE Given the high level of quality, I'm surprised that such a strong ranking could be established. I would have expected more ties.
That's actually how bootstrap works. For chunky it makes no difference whether a particular sample has 1 or 100 results: it throws away the individual data and works only with the average score per sample (only 20 average scores, because there were 20 samples). Bootstrap, in contrast, performs its analysis on the whole set of data (280 results for this test). That makes it possible to detect more statistically significant differences. Literally every single result is useful. This post has been edited by IgorC: Aug 26 2011, 11:35 |
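The difference between the two views of the data can be sketched like this. The ratings are made-up toy numbers and the function is hypothetical; it is not how chunky is implemented, it only shows what collapsing to one average per sample does to the amount of data.

```python
from collections import defaultdict
from statistics import mean


def per_sample_means(results):
    """Collapse (sample, score) pairs to one average score per sample."""
    by_sample = defaultdict(list)
    for sample, score in results:
        by_sample[sample].append(score)
    return {sample: mean(scores) for sample, scores in sorted(by_sample.items())}


# Toy data: three listeners rated sample 01, one listener rated sample 02.
toy = [("01", 4.0), ("01", 3.0), ("01", 5.0), ("02", 2.0)]
collapsed = per_sample_means(toy)

print(len(toy), "individual results")
print(len(collapsed), "per-sample averages:", collapsed)
```

After collapsing, a sample rated by three listeners contributes exactly as many data points as one rated by a single listener, and the spread between listeners is no longer visible to the analysis.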
|
|
|
Aug 27 2011, 17:39
Post
#46
|
|
![]() Server Admin Group: Admin Posts: 4808 Joined: 24-September 01 Member No.: 13 |
|
|
|
|
Aug 27 2011, 17:48
Post
#47
|
|
![]() Server Admin Group: Admin Posts: 4808 Joined: 24-September 01 Member No.: 13 |
QUOTE Given the high level of quality, I'm surprised that such a strong ranking could be established. I would have expected more ties.
QUOTE That's actually how bootstrap works.
The result has nothing in particular to do with bootstrap. The "classic" ANOVA from the Friedman tool shows the same results. The problem is that Chunky has a bug that was throwing away most listeners, and this wasn't realized for a while. With 280 submissions, quite a few conclusions can be drawn.
What is bothersome is that the first samples were (once again) tested more and hence weighted more. Maybe for the next time we should add a small DB and offer sample downloads one-by-one, after calculating from the DB what the least tested samples are. |
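The "offer the least tested sample next" idea could look roughly like this in Python. It is only a sketch: the counts dictionary stands in for whatever the small DB would store, and apart from the 21 and 20 results for samples 01 and 02 mentioned earlier in the thread, the numbers are invented.

```python
def least_tested(counts):
    """Return the sample ids that currently have the fewest results."""
    fewest = min(counts.values())
    return sorted(s for s, n in counts.items() if n == fewest)


def next_sample(counts):
    """Pick the sample to offer next; ties go to the lowest sample id."""
    return least_tested(counts)[0]


def record_result(counts, sample):
    """Update the tally once a listener submits a result for `sample`."""
    counts[sample] = counts.get(sample, 0) + 1


# Hypothetical tally; only the first two counts come from the thread.
counts = {"01": 21, "02": 20, "03": 12, "04": 12, "05": 15}
print("offer next:", next_sample(counts))
```

Breaking ties at random instead of by sample id would avoid favouring low-numbered samples again.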
|
|
|
Aug 27 2011, 20:22
Post
#48
|
|
![]() Group: Developer Posts: 2986 Joined: 2-December 07 Member No.: 49183 |
QUOTE Basically, the graphics suck, but they look cute
Exactly. So, the graphs:
Ratings for all 20 samples (without low anchor): [graph]
Sorted by Nero rating: [graph]
Without Nero, sorted by CT rating: [graph]
CVBR, TVBR and FhG, sorted by FhG rating: [graph]
All encoders, sorted by their ratings independently(!): [graph]
|
|
|
|
Aug 27 2011, 21:46
Post
#49
|
|
|
Group: Members Posts: 1315 Joined: 3-January 05 From: Argentina, Bs As Member No.: 18803 |
I found the first and the last graphs to be particularly informative.
From the first graph it's easy to see that half of the AAC encoders did excellently on male English speech (Sample 18). The same holds for female English speech (Sample 06). So it's correct to say that modern high-quality AAC encoders perform very well on speech at 96-100 kbps. I think we missed a category of sample: "speech with some background sounds". It's different from a song with voice (Sample 17). Though I haven't seen such samples in the repositories of test samples. |
|
|
|
Aug 27 2011, 22:20
Post
#50
|
|
![]() Group: Developer Posts: 2986 Joined: 2-December 07 Member No.: 49183 |
|
|
|
|