Multiformat listening test @ ~64kbps: Results, Results and post-test discussion |
Apr 12 2011, 00:40
Post
#1
|
|
|
Group: Members Posts: 1315 Joined: 3-January 05 From: Argentina, Bs As Member No.: 18803 |
The test is finished; the results are available here:
http://listening-tests.hydrogenaudio.org/igorc/results.html
Summary: CELT/Opus won, Apple HE-AAC is better than Nero HE-AAC, and Vorbis has caught up with Nero HE-AAC.
|
|
|
Apr 12 2011, 01:02
Post
#2
|
|
![]() Server Admin Group: Admin Posts: 4808 Joined: 24-September 01 Member No.: 13 |
If someone can assist with a bitrate table or per-sample results, that would be nice...
|
|
|
|
Apr 12 2011, 01:06
Post
#3
|
|
![]() Server Admin Group: Admin Posts: 4808 Joined: 24-September 01 Member No.: 13 |
Oh, and given that Opus is open source, if one of the developers could give our audience a technical explanation of which codec features and design decisions enabled it to win this test, that would be pretty damn interesting, too.
|
|
|
|
Apr 12 2011, 01:14
Post
#4
|
|
|
Group: Members Posts: 8 Joined: 28-May 08 Member No.: 53862 |
I just wonder one thing: when the Vorbis encoder was tested, how was it lowpassed? Was it tested with the default 14 kHz lowpass?
-------------------- 256 kbps Apple AAC bought iTunes music
|
|
|
|
Apr 12 2011, 01:15
Post
#5
|
|
![]() Server Admin Group: Admin Posts: 4808 Joined: 24-September 01 Member No.: 13 |
QUOTE I just wonder one thing, when the Vorbis encoder was tested how was it lowpassed. Was it tested with the default 14 kHz lowpass?
You can see the exact settings used for each codec here: http://listening-tests.hydrogenaudio.org/igorc/index.htm
|
|
|
Apr 12 2011, 01:22
Post
#6
|
|
|
Group: Members Posts: 8 Joined: 28-May 08 Member No.: 53862 |
QUOTE You can see the exact settings used for each codec here: http://listening-tests.hydrogenaudio.org/igorc/index.htm
Ah, thanks; sorry, I did not see it. It says -q 0.1, so I assume it was the default 14 kHz lowpass.
-------------------- 256 kbps Apple AAC bought iTunes music
|
|
|
|
Apr 12 2011, 03:08
Post
#7
|
|
![]() Group: Members Posts: 607 Joined: 16-January 09 Member No.: 65630 |
Congratulations to CELT/Opus!
I wanted to compare the testers' ratings per sample, but it seems that every tester gets a random testing sequence. Is there any way I can get such data and produce the plot I want? In case it's not clear: I want to know the source sample formats (for the 5 rating bins) for each tester. Thanks.
Edit: never mind, I found a way. It seems that the sample name suffixes are the same (those describing the 5 bins in the header of each test result).
This post has been edited by romor: Apr 12 2011, 03:24
-------------------- Scripts (mainly foobar2000 related): http://goo.gl/yje3h
|
|
|
|
Apr 12 2011, 03:50
Post
#8
|
|
|
Group: Members Posts: 1315 Joined: 3-January 05 From: Argentina, Bs As Member No.: 18803 |
I think the results of lessthanjoey and AlexB are also anonymous. That will be changed.
If anyone is interested in his/her own results, there is a key, or email me and I will send the results.
Oh, I participated in this test too. Garf had the key for my results and checked them.
It's also good to keep strong words like "thank you, great job". But this time I want to say a big Thank You to all the participants and to the people who helped to conduct this test.
Sebastian Mares - for his previous public tests. This test benefited much from them.
AlexB - for providing pre-decoded packages and being here.
Especially, Garf.
And many other people who were around here. Your time is valuable and highly appreciated.
This post has been edited by IgorC: Apr 12 2011, 04:20
|
|
|
Apr 12 2011, 08:06
Post
#9
|
|
|
Group: Members Posts: 698 Joined: 6-March 10 Member No.: 78779 |
I'm stunned by the CELT/Opus results! I would have assumed that your toolbox is smaller than usual when you are targeting low delay. And now CELT even beats the others by a wide margin.
Thanks for the great work, guys! |
|
|
|
Apr 12 2011, 12:59
Post
#10
|
|
![]() Group: Members Posts: 1303 Joined: 14-September 05 From: Helsinki, Finland Member No.: 24472 |
Thanks guys! Interesting results.
One note though:
CODE
Read 5 treatments, 531 samples => 10 comparisons
Means:
  Vorbis  Nero_HE-AAC  Apple_HE-AAC  Opus   AAC-LC@48k
  3.513   3.547        3.817         3.999  1.656

For processing the result .txt files with chunky I organized them to sample folders. I removed the results that were marked "invalid" and results that apparently had a fixed newer version (marked as such). I had a duplicate problem with romor's results (a couple of duplicates in a subfolder), but I decided to keep the newer result files.
I got 566 remaining result files. Assuming I did not make lots of mistakes, I wonder what can cause the difference. Did you disqualify more results after creating the rar package or does "531 samples" mean something else than the total number of result files?
Here's how chunky parses the 566 result files I have:
CODE
% Result file produced by chunky-0.8.4-beta
% ..\chunky.exe --codec-file=..\codecs.txt -n --ratings=results --warn -p 0.05
%
% Sample Averages:
Vorbis  Nero  Apple  CELT  Anchor
2.56    4.28  4.19   2.67  1.87
2.95    4.20  4.03   2.36  1.68
3.42    3.51  3.98   4.73  2.51
4.12    3.84  4.49   4.64  2.18
4.18    3.59  3.87   4.52  1.95
3.35    3.68  3.34   4.00  1.56
3.86    2.98  2.96   3.50  1.85
4.03    3.78  4.09   4.49  2.02
3.60    3.71  3.89   3.94  1.51
4.28    2.78  2.19   4.12  1.44
4.12    3.93  4.17   4.39  1.70
3.25    3.18  3.20   4.14  1.77
3.83    3.63  3.86   4.56  1.41
3.49    3.81  4.01   4.27  1.37
4.15    3.84  4.08   4.76  2.04
3.97    2.74  3.09   4.38  1.74
3.35    3.24  4.15   4.44  1.56
2.68    2.96  3.63   4.10  1.51
3.58    4.37  4.88   3.73  1.76
3.40    4.10  4.68   4.26  1.61
3.80    3.49  3.55   4.43  1.38
3.81    3.30  4.27   4.26  1.13
3.59    3.14  3.51   4.09  1.18
3.29    3.61  3.88   4.16  1.36
3.66    3.84  4.37   3.86  1.55
2.78    3.99  4.18   2.82  1.57
3.62    3.88  3.92   3.93  1.34
3.39    4.03  4.39   3.96  1.46
3.61    4.12  4.36   4.09  1.54
4.42    3.48  4.29   4.68  1.82
% Codec averages:
% 3.60  3.63  3.92   4.08  1.65

-------------------- http://listening-tests.freetzi.com
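The "Codec averages" line in the chunky output above can be cross-checked by hand: it looks like the plain column mean of the 30 per-sample averages. Here is a minimal Python sketch under that assumption. The names `SAMPLE_AVERAGES` and `column_means` are made up for illustration and are not part of chunky.

```python
# Per-sample averages copied from the chunky output in the post above.
# Column order: Vorbis, Nero, Apple, CELT, Anchor.
SAMPLE_AVERAGES = [
    (2.56, 4.28, 4.19, 2.67, 1.87), (2.95, 4.20, 4.03, 2.36, 1.68),
    (3.42, 3.51, 3.98, 4.73, 2.51), (4.12, 3.84, 4.49, 4.64, 2.18),
    (4.18, 3.59, 3.87, 4.52, 1.95), (3.35, 3.68, 3.34, 4.00, 1.56),
    (3.86, 2.98, 2.96, 3.50, 1.85), (4.03, 3.78, 4.09, 4.49, 2.02),
    (3.60, 3.71, 3.89, 3.94, 1.51), (4.28, 2.78, 2.19, 4.12, 1.44),
    (4.12, 3.93, 4.17, 4.39, 1.70), (3.25, 3.18, 3.20, 4.14, 1.77),
    (3.83, 3.63, 3.86, 4.56, 1.41), (3.49, 3.81, 4.01, 4.27, 1.37),
    (4.15, 3.84, 4.08, 4.76, 2.04), (3.97, 2.74, 3.09, 4.38, 1.74),
    (3.35, 3.24, 4.15, 4.44, 1.56), (2.68, 2.96, 3.63, 4.10, 1.51),
    (3.58, 4.37, 4.88, 3.73, 1.76), (3.40, 4.10, 4.68, 4.26, 1.61),
    (3.80, 3.49, 3.55, 4.43, 1.38), (3.81, 3.30, 4.27, 4.26, 1.13),
    (3.59, 3.14, 3.51, 4.09, 1.18), (3.29, 3.61, 3.88, 4.16, 1.36),
    (3.66, 3.84, 4.37, 3.86, 1.55), (2.78, 3.99, 4.18, 2.82, 1.57),
    (3.62, 3.88, 3.92, 3.93, 1.34), (3.39, 4.03, 4.39, 3.96, 1.46),
    (3.61, 4.12, 4.36, 4.09, 1.54), (4.42, 3.48, 4.29, 4.68, 1.82),
]


def column_means(rows):
    """Return the mean of each column, rounded to two decimals."""
    n = len(rows)
    return [round(sum(column) / n, 2) for column in zip(*rows)]


if __name__ == "__main__":
    for name, value in zip(("Vorbis", "Nero", "Apple", "CELT", "Anchor"),
                           column_means(SAMPLE_AVERAGES)):
        print(f"{name:7s} {value:.2f}")
```

The column means should agree with the reported 3.60 3.63 3.92 4.08 1.65 to within rounding, since the table values are themselves rounded to two decimals. This says nothing about why the 531-result means on the results page differ; it only shows how the chunky summary row relates to its own table.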
|
|
|
|
Apr 12 2011, 13:53
Post
#11
|
|
![]() Server Admin Group: Admin Posts: 4808 Joined: 24-September 01 Member No.: 13 |
QUOTE I got 566 remaining result files. Assuming I did not make lots of mistakes, I wonder what can cause the difference.
Edit: See below.
This post has been edited by Garf: Apr 12 2011, 15:28
|
|
|
Apr 12 2011, 14:14
Post
#12
|
|
![]() Group: Members Posts: 1303 Joined: 14-September 05 From: Helsinki, Finland Member No.: 24472 |
For comparison, I uploaded a rar package of my "chunky" folder. It contains the reorganized result files and phong's chunky (Windows version). The command line I used is in the instructions.txt file.
I had to partially rename the result files to reorganize them into the sample folders. In addition, I needed to change all r.wav strings inside the result files to .wav before chunky could work. I batch processed the files with Notepad++. I believe it was a "safe" edit.
The package is here: http://www.hydrogenaudio.org/forums/index....showtopic=88033
This post has been edited by Alex B: Apr 12 2011, 14:45
-------------------- http://listening-tests.freetzi.com
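For anyone who wants to repeat the r.wav to .wav edit without Notepad++, here is a rough Python sketch. It assumes the result files are plain .txt files somewhere under one folder. The function names and the folder layout are hypothetical, not taken from the package.

```python
from pathlib import Path


def fix_wav_suffix(data: bytes) -> bytes:
    """Naive replace of every 'r.wav' with '.wav', as described in the post."""
    return data.replace(b"r.wav", b".wav")


def fix_result_files(root: Path) -> int:
    """Rewrite all .txt files under root in place; return how many changed."""
    changed = 0
    for path in sorted(root.rglob("*.txt")):
        original = path.read_bytes()  # bytes, so the encoding is left untouched
        fixed = fix_wav_suffix(original)
        if fixed != original:
            path.write_bytes(fixed)
            changed += 1
    return changed
```

A call such as `fix_result_files(Path("chunky/results"))` would process a whole tree (the path is made up). Note that a blind replace like this would also shorten any name that legitimately ends in "r.wav", so it is only "safe" if no such names occur in the files; work on a copy.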
|
|
|
|
Apr 12 2011, 14:48
Post
#13
|
|
![]() Group: Developer Posts: 191 Joined: 8-July 03 Member No.: 7653 |
QUOTE For processing the result .txt files with chunky I organized them to sample folders. I removed the results that were marked "invalid" and results that apparently had a fixed newer version (marked as such). I had a duplicate problem with romor's results (a couple of duplicates in a subfolder), but I decided to keep the newer result files. I got 566 remaining result files. Assuming I did not make lots of mistakes, I wonder what can cause the difference. Did you disqualify more results after creating the rar package or does "531 samples" mean something else than the total number of result files?
Sounds like you didn't eliminate the listeners with more than 4 invalid results.
The filtering rules on the page are:
* If the listener ranked the reference worse than 4.5 on a sample, the listener's results for that sample were discarded.
* If the listener ranked the low anchor at 5.0 on a sample, the listener's results for that sample were discarded.
* If the listener ranked the reference below 5.0 on more than 4 samples, all of that listener's results were discarded.
You'll have to modify chunky to get that behavior.
This post has been edited by NullC: Apr 12 2011, 15:12
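The three filtering rules quoted above could be sketched roughly as follows. This is only an illustration in Python, not the script used for the published results and not part of chunky. The `Result` record and the `screen` function are hypothetical, and it assumes an unrated reference is stored as 5.0.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    listener: str
    sample: str
    reference: float   # score given to the hidden reference (5.0 = not rated)
    low_anchor: float  # score given to the low anchor


def screen(results, max_rated_references=4):
    """Apply the three quoted rules and return the surviving results."""
    # Rule 3: count, per listener, the samples with the reference below 5.0.
    rated = Counter(r.listener for r in results if r.reference < 5.0)
    dropped = {name for name, n in rated.items() if n > max_rated_references}

    kept = []
    for r in results:
        if r.listener in dropped:
            continue  # rule 3: the whole listener is discarded
        if r.reference < 4.5:
            continue  # rule 1: reference ranked worse than 4.5
        if r.low_anchor >= 5.0:
            continue  # rule 2: low anchor ranked at 5.0
        kept.append(r)
    return kept
```

As written, a reference score between 4.5 and 5.0 keeps the individual result but still counts toward the more-than-4 limit; that follows the wording of the rules literally and may not match how borderline cases were actually handled.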
|
|
|
Apr 12 2011, 14:49
Post
#14
|
|
![]() Server Admin Group: Admin Posts: 4808 Joined: 24-September 01 Member No.: 13 |
QUOTE For comparison I uploaded a rar package of my "chunky" folder. it contains the reorganized result files and phong's chunky (Windows version). The command line I used is in the instructions.txt file I had to partially rename the result files to reorganize them into the sample folders. In addition I needed to change all r.wav strings in filenames to .wav before chunky could work. I batch processed the files with Notepad++. I believe it was a "safe" edit. The package is here: http://www.hydrogenaudio.org/forums/index....showtopic=88033
Thanks, I didn't have the triaged results here, so this was welcome.
By the way, chunky has quite dangerous behavior: by default, it squashes all listeners together per sample for the overall results. In other words, it's discarding most of the information in the test, as if only a single listener did all samples! The per-sample results don't suffer from that, so those should be fine.
Edit: Whoops, I indeed missed some results that should have been discarded.
This post has been edited by Garf: Apr 12 2011, 15:18
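One way to picture the "squashing" is a toy comparison of two averaging schemes: collapsing each sample to a single mean first, versus pooling every individual rating. This is a hedged Python sketch with invented numbers. It is not chunky's actual code, and it does not reproduce the blocked analysis used for the published results.

```python
from statistics import mean


def mean_of_sample_means(ratings):
    """Collapse each sample to one mean, then average those (one row per sample)."""
    return mean(mean(scores) for scores in ratings.values())


def pooled_mean(ratings):
    """Average over every individual rating, whichever sample it belongs to."""
    return mean(score for scores in ratings.values() for score in scores)


# Invented toy data: sample name -> individual listener ratings for one codec.
toy_ratings = {
    "sample_a": [4.0, 4.0, 4.0, 4.0],  # four listeners
    "sample_b": [2.0],                 # one listener
}

if __name__ == "__main__":
    n_ratings = sum(len(scores) for scores in toy_ratings.values())
    print("individual ratings:", n_ratings)         # 5 data points
    print("rows after collapsing:", len(toy_ratings))  # 2 data points
    print("mean of sample means:", mean_of_sample_means(toy_ratings))
    print("pooled mean:", pooled_mean(toy_ratings))
```

With unequal numbers of listeners per sample the two averages differ, and after collapsing, any significance test sees one row per sample instead of one per listener result.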
|
|
|
Apr 12 2011, 14:59
Post
#15
|
|
![]() Server Admin Group: Admin Posts: 4808 Joined: 24-September 01 Member No.: 13 |
QUOTE Sounds like you didn't eliminate the listeners with more than 4 invalid results.
The filtering rules on the page are:
* If the listener ranked the reference worse than 4.5 on a sample, the listener's results for that sample were discarded.
* If the listener ranked the low anchor at 5.0 on a sample, the listener's results for that sample were discarded.
* If the listener ranked the reference below 5.0 on more than 4 samples, all of that listener's results were discarded.
You'll have to modify chunky to get that behavior.
Ah, good point. There were two discarded listeners, I got those. I saw one result with a rated reference that didn't cause an invalidation, so got that correctly. But there are a few results with 5.0's for the reference. After discarding those, I'm at 559 samples now.
|
|
|
Apr 12 2011, 15:02
Post
#16
|
|
![]() Group: Members Posts: 1303 Joined: 14-September 05 From: Helsinki, Finland Member No.: 24472 |
QUOTE Sounds like you didn't eliminate the listeners with more than 4 invalid results.
I removed two folders (= listeners) before doing the tasks I mentioned:
- 09 (too many invalid results. The listener has never answered any email)
- 27 (something gone wrong or cheater)
I trusted the comments in the folder and file names. I did not look inside each and every result file.
-------------------- http://listening-tests.freetzi.com
|
|
|
|
Apr 12 2011, 15:09
Post
#17
|
|
![]() Group: Members Posts: 1303 Joined: 14-September 05 From: Helsinki, Finland Member No.: 24472 |
QUOTE But there are a few results with 5.0's for the reference. After discarding those, I'm at 559 samples now.
Perhaps "low anchor" would be more accurate.
-------------------- http://listening-tests.freetzi.com
|
|
|
|
Apr 12 2011, 15:17
Post
#18
|
|
![]() Group: Developer Posts: 191 Joined: 8-July 03 Member No.: 7653 |
QUOTE Sounds like you didn't eliminate the listeners with more than 4 invalid results.
QUOTE I removed two folders (= listeners) before doing the tasks I mentioned: - 09 (too many invalid results. The listener has never answered any email) - 27 (something gone wrong or cheater) I trusted the comments in the folder and file names. I did not look inside each and every result file.
Ah, okay!
(Moving and amending from my edited post, since others already replied. Sorry.)
The users who should have been excluded according to that rule are 09, 27, and 22, but IgorC decided to keep 22 (because 22 didn't understand the procedure at first but got better later), and I expected 21 to be filtered too (because he only rated the low anchor on almost all the samples: 23/30 are either low anchor only or invalid, including many of the really obvious ones).
This post has been edited by NullC: Apr 12 2011, 15:19
|
|
|
Apr 12 2011, 15:35
Post
#19
|
|
![]() Group: Members Posts: 1303 Joined: 14-September 05 From: Helsinki, Finland Member No.: 24472 |
QUOTE But there are a few results with 5.0's for the reference. After discarding those, I'm at 559 samples now.
I found six "low anchor = 5.0" instances (I outputted a csv file from chunky and sorted the data by the low anchor column in Excel).
My math says 560 (or did you actually remove the "rated but accepted reference" instance?).
-------------------- http://listening-tests.freetzi.com
|
|
|
|
Apr 12 2011, 15:42
Post
#20
|
|
![]() Server Admin Group: Admin Posts: 4808 Joined: 24-September 01 Member No.: 13 |
QUOTE But there are a few results with 5.0's for the reference. After discarding those, I'm at 559 samples now.
QUOTE I found six "low anchor = 5.0" instances (I outputted a csv file from chunky and sorted the data by the low anchor column in Excel) My math says 560. (or did you actually remove the "rated but accepted reference" instance?)
No. But after running chunky I only had 565, not 566 files. It appears to reject one input file for some reason (this is on Linux).
A lesson here is that the post-screened data set should be published, too, because it's easy to make mistakes there and it makes it easier for people wanting to do other/further analysis. But considering the comment from NullC, the results on the site are probably correct.
|
|
|
Apr 12 2011, 16:14
Post
#21
|
|
![]() Group: Members Posts: 1303 Joined: 14-September 05 From: Helsinki, Finland Member No.: 24472 |
Regarding the bitrate table,
I guess that CELT/Opus is not supported in any program that can display and/or export accurate bitrate data.
If the bitrate needs to be calculated from the file size, should the size of the Ogg container data be subtracted from the file size before performing the calculation? What would be the correct amount? Would the bitrate value then be comparable with the values that foobar shows for the other contenders? (It is quite simple to export bitrate data from foobar.)
-------------------- http://listening-tests.freetzi.com
|
|
|
|
Apr 12 2011, 16:15
Post
#22
|
|
|
Group: Members Posts: 13 Joined: 8-March 11 Member No.: 88816 |
QUOTE Sounds like you didn't eliminate the listeners with more than 4 invalid results.
QUOTE I removed two folders (= listeners) before doing the tasks I mentioned: - 09 (too many invalid results. The listener has never answered any email) - 27 (something gone wrong or cheater) I trusted the comments in the folder and file names. I did not look inside each and every result file.
#27 are my results. I do not know if something went wrong, but I am definitely not a cheater. Over a week ago, I sent Igor some wave files he asked for, but he has not answered my email yet.
|
|
|
Apr 12 2011, 17:54
Post
#23
|
|
![]() Group: Developer Posts: 191 Joined: 8-July 03 Member No.: 7653 |
QUOTE #27 are my results. I do not know if something went wrong, but I am definitely not a cheater. Over a week ago, I sent Igor some wave files he asked for, but he has not answered my email yet.
I think it's really unfortunate that Igor released a file with the word cheater in it. There are so many ways for a result to go weird which have nothing to do with "cheating". Your results can be excluded purely based on the previously published confused-reference criteria (2, 4, 9, 22, 30 invalid), so that should close the question on the correctness of excluding those results, and it should have been left at that. Even with good and careful listeners this can happen, and it's nothing anyone should take too personally.
Though, your results are pretty weird: you ranked the reference fairly low (e.g. 3) on a couple of comparisons where many people found the reference and codec indistinguishable. I think you also failed to reverse your preference on some samples where the other listeners changed their preference (behavior characteristic of a non-blind test?).
I don't mean to cause offense, but were you listening via speakers, or could you have far less HF sensitivity than most of the other listeners (if you are male and older than most participants, then the answer to that might be yes)? Any other ideas why your results might be very different overall and also on specific samples?
This post has been edited by NullC: Apr 12 2011, 18:23
|
|
|
Apr 12 2011, 18:10
Post
#24
|
|
![]() Group: Developer Posts: 191 Joined: 8-July 03 Member No.: 7653 |
QUOTE Regarding the bitrate table, I guess that CELT/Opus is not supported in any program that can display and/or export accurate bit rate data. If the bitrate needs to be calculated from the file size should the size of the ogg container data be reduced from the file size before performing the calculation? What would be the correct amount? Would the bitrate value then be comparable with the values that foobar shows for the other contenders? (It is quite simple to export bitrate data from foobar.)
If you wish to remove container overhead for the Vorbis and Opus files, you can use a tool like oggz-dump from oggz-tools to extract all the packet sizes.
On a few samples Vorbis suffers a bit because the Vorbis headers are fairly large compared to an 8-second 64 kbit/s file (e.g. Sample01), but I don't think the container overhead is all that considerable.
This post has been edited by NullC: Apr 12 2011, 18:15
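To make the file-size question concrete, here is a small Python sketch of the arithmetic: average bitrate from file size and duration, with an optional overhead to subtract. The function name is made up, and the 4000-byte overhead is an invented figure for illustration only, not a measured Ogg or Vorbis header size.

```python
def bitrate_kbps(size_bytes: int, duration_s: float, overhead_bytes: int = 0) -> float:
    """Average bitrate in kbit/s (1 kbit = 1000 bits) after removing overhead."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    if overhead_bytes > size_bytes:
        raise ValueError("overhead cannot exceed the file size")
    payload_bits = (size_bytes - overhead_bytes) * 8
    return payload_bits / duration_s / 1000


if __name__ == "__main__":
    # An 8-second clip whose audio payload is exactly 64 kbit/s is 64000 bytes.
    # Hypothetical: add 4000 bytes of headers and container framing on top.
    print(bitrate_kbps(68000, 8.0))        # whole file, overhead included
    print(bitrate_kbps(68000, 8.0, 4000))  # payload only
```

With these invented numbers the whole-file figure comes out 4 kbit/s higher than the payload figure, which shows why a fixed-size header matters more on a short clip than on a long one. The real overhead has to come from the actual packet sizes.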
|
|
|
Apr 12 2011, 18:13
Post
#25
|
|
|
Group: Members Posts: 1315 Joined: 3-January 05 From: Argentina, Bs As Member No.: 18803 |
Yes, I was too strict. Sorry about it.
Some of the listeners prefer Nero over Vorbis or vice versa. Some of them rated Vorbis higher than the HE-AAC codecs. Others preferred Apple HE-AAC over CELT on the second half of the samples. These variations are all fine. In the end, on average Opus/CELT was better for all listeners with enough results.
It was very strange that you ranked Opus as low as the low anchor (on sample 10 and many others) where ALL other listeners scored it very well.
Your average scores (including the 5 invalid samples):
Vorbis - 3.53
Nero - 3.15
Apple - 3.51
CELT - 2.34
Maybe your hardware has some issues.
Earlier I also wrote to you asking you to rerun the whole test, because there were 5 invalid results and the whole test was discarded.
This post has been edited by IgorC: Apr 12 2011, 18:18
|
|
|