Personal evaluation at ~130..135 kbps, 200 samples, AAC (iTunes, Nero) - MP3 - Vorbis aoTuV |
Nov 15 2005, 09:14
Post
#1
|
|
![]() Group: Members (Donating) Posts: 3474 Joined: 7-November 01 From: Strasbourg (France) Member No.: 420 |
Preliminary notes
Two years ago I performed and published my first two listening tests. Both included different formats and encoders at ~130 kbps and involved a dozen samples of classical music only. My purpose was to see which encoder produced the best encoding at a friendly bitrate (friendly for portable players) and for a specific kind of music. iTunes AAC and WMAPro appeared to be the best encoders (for me), and the absolute quality of both encoders at such a bitrate surprised me.

Last year (December 2004) I performed two similar tests: the first was dedicated to AAC (Nero, Apple, old and new encoders) and the second was a match between the best AAC encoder (Nero Digital “fast” VBR) and the most advanced Vorbis one (aoTuV beta 3). Quality and enjoyment were even higher!

This year I performed a fresh multiformat listening test at 130 kbps. This new test is very different from its predecessors from a methodological point of view. I progressively improved my approach to listening tests and tried to answer all the criticism addressed in the past to previous tests (and not necessarily mine). Consequently, my “personal evaluations”, which started as a friendly exercise feasible in one rainy autumn afternoon, now look like a gigantic task which took me approximately 10 days (shared with family, friends, job, and discouragement) to complete.

I improved several points of the methodology. To sum them up:

• diversity: the following test is not based only on “classical” music; it also includes several (fifty!) samples of “modern” music.

• grading: my grading was once described as “temperamental”, so I decided to keep all marks between two anchors, a low and a high one. This decreases the contrast between encoders and at the same time increases the difficulty of the whole exercise, but it should also ensure more accurate grading.
The low anchor is vital to prevent excessively harsh grading; the high anchor is essential to temper enthusiasm: a very good encoding at 130 kbps should be marked relative to an excellent, high-bitrate one. The presence of both anchors should guarantee a fair grading: not too low, not too high.

• complexity: some listening tests have been reproached for focusing only on “critical” or “complex” samples. This may be a problem with some VBR implementations, which sometimes lower the bitrate too much on “non-complex” samples. In my opinion, a listening test should include both types of samples, at least to verify that non-complex/low-bitrate parts are as well encoded as complex/high-bitrate ones. Usually, VBR encoders handle non-complex parts very well. Usually… The complexity range of my gallery of samples is wide enough to represent all situations (from ultra-low bitrates to ultra-high ones) and to check the strength of VBR implementations.

• abundance: a bunch of 12…15 samples is maybe not enough to give an accurate idea of the strengths and weaknesses of different encoders. I experienced it myself in the past: my previous tests didn’t reveal some problems I only noticed afterwards in real usage and, more importantly, they were unable to expose how often the detected problems recur. Detecting one problem (like rumbling or ringing) is one thing; measuring how frequently it occurs is another. My test is based on 200 samples; this number should be enough to expose all common problems plus several uncommon ones, and is also sufficient to get an idea of their recurrence. This is in my opinion the biggest advantage of my personal listening test over collective ones (which must stay friendly to avoid discouragement and to attract a lot of testers).

• statistical analysis: it might appear trivial to mention this, but statistical analysis of the results and confidence bars are present (they were not used last year or the year before).
• “Apples and Oranges”: no need to recall the problem. This test only involves VBR encoders. No debate this time.

THE TEST: CHOICE OF ENCODERS

The market of audio encoders is ruled by a Darwinian process: only the strongest survive. Between my first test (October 2003) and this one (November 2005), only a few encoders really progressed. Most others (some of them still in use) are unchanged or changed only once: MPC, WMA (Standard and Pro), faac, all MP3 encoders (except LAME). Another one appeared and disappeared in the meantime (Compaact!).

On the hardware side, the situation is now very different from the one I knew two years ago. Back then, with the exception of one or two devices, AAC and Vorbis support in hardware players was more a dream than a reality. Testing different audio formats was useful for a virtual and open future, rich in dreams and promises. Now, the concrete situation is more interesting than dreams. MP3 and WMA (Std) are still the two well-established formats, but Vorbis now benefits from the growing interest of several manufacturers, and if AAC still looks like an Apple monopoly, the iPod market has at least mutated into several forms (flash memory players, Microdrive™-based jukeboxes). One victim of reality is WMAPro, still not supported; and the growing popularity of WMA labeled as PlaysForSure (based on WMA Std) seems to sentence WMAPro to a long exile.

For all these reasons, I restricted the test to the most usable and interesting encoders: AAC-LC (highly developed by Apple and Nero Digital), MP3 (vigorous as ever, thanks to the LAME devs), Vorbis (saved from inertia by Aoyumi). Besides these four encoders, I added two anchors. More precisely:

• Apple AAC: I used iTunes 6.0.0.18 (based on QuickTime 7.03), at 128 kbps and with the recently added VBR mode. This is the first time I have tested Apple AAC in VBR.
I sadly discovered that this encoder uses the same trick as the MP3 encoder included in iTunes: the minimal size of the frames is not lower than the targeted bitrate (apart maybe from digital silence). In other words, for 128 VBR encodings the bitrate starts at 128 kbps and increases with complexity. Needless to say, if the average bitrate stays close to the target, the variations are necessarily limited. One advantage: this restricted mode prevents the VBR engine from using inadequately low bitrate frames, and should protect quality from bad surprises compared to a CBR encoding.

• Nero Digital AAC: I used the very new encoder released two weeks ago (aac.dll v.3 and aacenc32.dll v.4.2.1.0), in VBR mode too. The -internet profile is the closest to 128 kbps (slightly lower with classical music, but higher with non-classical). I didn’t use the “fast” mode, which is now pretty similar to but probably inferior to the “high” one.

• LAME MP3: I used the latest alpha of 3.98 (alpha 2) in order to add the --athaa-sensitivity 1 switch to the -V5 --vbr-new mode. For the second group of samples, and to slightly lower the bitrate, I simply used -V5 --vbr-new.

• Vorbis: I used aoTuV beta 4 (4.5 was released during the testing phase) instead of the official 1.1.1, which corresponds to the 18-month-old aoTuV beta 2. I used -q4,25 for the first group and -q4,00 for the second.

• Low anchor: I looked for something really low and also usable in batch mode. I found a very old AAC encoder on ReallyRareWares called mbaacencoder version 0.3: it’s awfully slow, the quality is terrible, and it is incidentally ideal for getting an idea of all the progress made with AAC between 1999 (release date of mbaacencoder) and 2005 (Apple and Nero Digital). I tried to get joint stereo and LC profile in batch mode, but the encoder apparently stayed in its default mode (Main Profile, 128 kbps and dual stereo).
• High anchor: I didn’t hesitate and used LAME 3.97 beta 1 -V2 --vbr-new (or --preset standard), which is a reference for efficient, high-quality and universal encodings. Furthermore, it would be interesting to evaluate the remaining gap between modern implementations of AAC and Vorbis at ~128 kbps and HQ MP3 at ~192 kbps.

SAMPLES

The test hinges on two big groups of samples: 150 for the “classical” music group and 50 for the “non-classical” (or “various”, or “modern”, or “popular”… choose your own) group. I already used the first group in three different tests in the past (80 kbps, 96 kbps, and LAME -V5). The complete collection is available for download. The second group consists of all (35) non-classical samples used in previous collective listening tests; they’re all available on rarewares. To decrease the gap between the first and the second group I’ve added 15 other samples, all recently submitted for the postponed 64 kbps listening test of Sebastian Mares. Most of these last files may still be available.

THE BITRATE

The bitrate comparison is more accurate for the first group: it’s based on full tracks (6 min 30 sec per file on average) instead of short samples (10 sec on average), and, last but not least, the complete collection is very representative of my entire library. For the second group of samples I proceeded differently: I based the bitrate calculation on the 50 samples themselves (which are longer: 24 sec on average) and on external data (a bitrate table for LAME posted by someone else). This way of evaluating the bitrate is not very precise, but I don’t have enough material to build a more accurate bitrate table. That’s why I tried to minimize the difference in bitrate between all settings, and changed the command line for Vorbis (from -q4,25 to -q4,00) and LAME (--athaa-sensitivity 1 was removed).
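The per-group LAME and Vorbis settings described above can be collected in one place. What follows is only a sketch: the switches are the ones quoted in this post, while the executable names (`lame`, `oggenc`), the file names and the helper `build_command` are assumptions made for illustration. The Vorbis quality values are written with a decimal point, which is also an assumption about the local encoder build. Nothing is executed; the helper only returns the argument list.

```python
# Switches quoted in the post, grouped by sample group.
# Executable names are assumptions; adjust them to your own binaries.
SETTINGS = {
    "classical": {
        "lame": ["-V5", "--vbr-new", "--athaa-sensitivity", "1"],
        "oggenc": ["-q", "4.25"],
    },
    "non-classical": {
        "lame": ["-V5", "--vbr-new"],
        "oggenc": ["-q", "4.00"],
    },
}


def build_command(encoder, group, src, dst):
    """Return the argument list for one encode (hypothetical helper)."""
    if encoder not in ("lame", "oggenc"):
        raise ValueError(f"unknown encoder: {encoder}")
    switches = SETTINGS[group][encoder]
    if encoder == "lame":
        # lame takes its options first, then input file and output file.
        return ["lame", *switches, src, dst]
    # oggenc receives the output file through -o.
    return ["oggenc", *switches, src, "-o", dst]


if __name__ == "__main__":
    for group in SETTINGS:
        for encoder in SETTINGS[group]:
            cmd = build_command(encoder, group, "in.wav", "out")
            print(f"{group:14s} {' '.join(cmd)}")
```

Keeping the two groups in one table makes the difference between them explicit: only the LAME ATH switch and the Vorbis quality value change.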
To sum up the data (a complete bitrate table will follow in the next days):

CODE
CLASSICAL (full tracks)
low anchor      128,00 kbps (estimated)
AAC iTunes      133,33 kbps [+4,16 %]
AAC Nero        125,71 kbps [-1,79 %]
MP3 LAME        130,81 kbps [+2,20 %]
Vorbis aoTuV    131,69 kbps [+2,88 %]
high anchor     181,46 kbps [+41,77 %]

NON-CLASSICAL (short samples)
low anchor      128,00 kbps (estimated)
AAC iTunes      137,31 kbps [+7,27 %]
AAC Nero        134,10 kbps [+4,76 %]
MP3 LAME¹       137,82 kbps [+7,67 %]
Vorbis aoTuV²   133,42 kbps [+4,23 %]
high anchor     196,28 kbps [+53,34 %]

¹ with --athaa-sensitivity 1 bitrate reaches 139,38 kbps
² with -q4,25 bitrate reaches 140,21 kbps

TESTING CONDITIONS

The full test consists of pure ABC notation. Double-blind test conditions are ensured by schnofler’s ABC/HR 0.5 beta (2005.08.31) software. All samples were decoded by CLI decoders within ABC/HR; offsets were removed each time and minor differences in gain were systematically corrected (the highest difference reached 1.2 dB). A small note for Vorbis: all files were decoded with foobar2000 (I still can’t make ABC/HR decode Vorbis files).

There are no ABX comparisons: it’s a luxury I can’t afford with 1200 files awaiting evaluation (200 x 6). If a difference is really unsure, I don’t rank the file. In the end I ranked the reference instead of the encoded file 16 times (and 6 of these mistakes concern the high anchor). The error rate is below 1.5%. I didn’t discard the errors from the final results (they don’t have a significant impact).

My hardware setup: Beyerdynamic DT-531 headphones; Audigy2 soundcard; Onkyo A-5 amp.

DREAM AND REALITY…

Last words before posting the results: I planned to write a complete review, including a complete synthesis of the most common problems encountered in this test. Different encoders have different problems, and some of them are recurrent.
For example, LAME often produces a weird kind of rumbling (noise in low frequencies) and smearing; Vorbis still has issues with what I call “microdetails” (blurred and replaced by noise) and sometimes coarseness; iTunes sometimes suffers from a form of ringing I can’t define; Nero Digital has serious trouble with tonal passages and poor pre-echo performance. I haven’t compiled this memento yet; it should interest developers more than users. But I’m publishing the results now, because I feel that it’s time for me to close this test (honestly, seeing ABC/HR running somewhere drives me mad or sick).

Results are published as big png files; file size is not an issue (only 111 kb) but the image size may cause problems at small display resolutions (800x600). I apologize for the inconvenience. Short comments follow the graphs. Here again, I planned to write more detailed comments, but I fear that the weekend and maybe the month will be over before I achieve what I planned to do. I postponed several activities during the last two weeks to perform and present this test, but I can’t continue anymore. If I remember correctly there’s a life outside ABC/HR

This post has been edited by guruboolez: Nov 15 2005, 09:38 |
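The percentages in the bitrate table of this post appear to be deviations from the 128 kbps target, and the quoted error rate follows from 16 mistakes over 1200 evaluations. The sketch below reproduces that arithmetic. The bitrates are copied from the table (decimal commas written as points); the function names are hypothetical, and reading the percentages as "relative to 128 kbps" is an assumption that happens to match the published figures.

```python
TARGET_KBPS = 128.0  # nominal target, also the estimated low-anchor bitrate

# Average bitrates in kbps, copied from the table in the post.
BITRATES_KBPS = {
    "classical": {
        "AAC iTunes": 133.33,
        "AAC Nero": 125.71,
        "MP3 LAME": 130.81,
        "Vorbis aoTuV": 131.69,
        "high anchor": 181.46,
    },
    "non-classical": {
        "AAC iTunes": 137.31,
        "AAC Nero": 134.10,
        "MP3 LAME": 137.82,
        "Vorbis aoTuV": 133.42,
        "high anchor": 196.28,
    },
}


def deviation_percent(bitrate_kbps, target_kbps=TARGET_KBPS):
    """Relative deviation from the target bitrate, in percent."""
    return (bitrate_kbps / target_kbps - 1.0) * 100.0


def error_rate_percent(mistakes, evaluations):
    """Share of evaluations where the reference was ranked by mistake."""
    return 100.0 * mistakes / evaluations


if __name__ == "__main__":
    for group, table in BITRATES_KBPS.items():
        print(group)
        for encoder, kbps in table.items():
            print(f"  {encoder:14s} {kbps:7.2f} kbps [{deviation_percent(kbps):+.2f} %]")
    # 200 samples x 6 contenders = 1200 evaluations, 16 of them mistaken.
    print(f"error rate: {error_rate_percent(16, 200 * 6):.2f} %")
```

A last-digit difference against the table is possible for individual entries, presumably because the published bitrates are themselves rounded.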
|
|
|
Nov 15 2005, 09:14
Post
#2
|
|
![]() Group: Members (Donating) Posts: 3474 Joined: 7-November 01 From: Strasbourg (France) Member No.: 420 |
RESULTS
I. CLASSICAL: 5 electronic/artificial samples micro-group
II. CLASSICAL: 60 orchestral & chamber samples macro-group
III. CLASSICAL: 55 solo instruments samples macro-group
IV. CLASSICAL: 30 samples macro-group
V. NON-CLASSICAL or MODERN or VARIOUS: 50 samples macro-group
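The plots for these groups show an average mark per encoder with confidence bars. As a rough illustration of what such a bar represents, here is a minimal sketch that computes a mean and an approximate 95% interval from a list of marks. It uses a plain normal approximation, which is an assumption and not necessarily the analysis used for the plots above; the function name and the example marks are hypothetical.

```python
import math
import statistics


def mean_with_confidence(scores, z=1.96):
    """Return (mean, low, high) for a list of marks.

    Uses mean +/- z * s / sqrt(n), a normal approximation that is only
    reasonable when the number of samples is fairly large.
    """
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - half_width, mean + half_width


if __name__ == "__main__":
    # Hypothetical marks on the 1.0 to 5.0 ABC/HR scale, not test data.
    marks = [4.0, 4.5, 5.0, 4.5, 3.5, 4.8, 4.2, 5.0, 4.6, 3.9]
    mean, low, high = mean_with_confidence(marks)
    print(f"mean {mean:.2f}, interval [{low:.2f}, {high:.2f}]")
```

Two encoders whose intervals overlap widely cannot be separated with much confidence, which is one way to read a tie on such plots.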
This post has been edited by guruboolez: Jan 29 2006, 13:58 |
|
|
|
Nov 15 2005, 09:15
Post
#3
|
|
![]() Group: Members (Donating) Posts: 3474 Joined: 7-November 01 From: Strasbourg (France) Member No.: 420 |
Few words to conclude the test…
It’s pretty clear that all encoders tested here deliver good or even very good output quality. There is currently no winner between AAC (iTunes) and Vorbis. It’s funny to see that the results are so close at the finish line when the problems are so different. Encodings are not fully transparent, but quality is in my opinion excellent most of the time (but not always).

LAME gives MP3 the chance to stay competitive against AAC and Vorbis. Not fully competitive, but the efficiency of this format commands respect.

Nero Digital’s implementation of AAC is slightly disappointing, especially with classical music, which is still a weak point of this encoder. But the quality is far from a disaster (which wasn’t the case two years ago), is on average really good, gets even better with “non-classical” music and should satisfy several users.

Last but not least, the difference among all these encoders is really small (don't look too much at the "zoomed" plots).

But the average mark is somewhat misleading. LAME’s quality is ~0.5 point lower than that of iTunes or Vorbis, but that doesn’t mean, for example, that the quality of encoded albums is 0.5 lower. This lower ranking is the expression of higher fragility rather than of lower quality. LAME and Nero Digital are more prone to serious distortions than Vorbis or iTunes AAC at the same bitrate. With such encoders, the concept of quality may be replaced by the concept of strength or robustness. To illustrate this I made a histogram (sorry for the poor quality, I’ll change it later).

In it, Vorbis and iTunes both get a mark between 4.5 and 5.0 for 50% of the tested samples, whereas Nero only achieves this level (near-transparency or full transparency) for 20% of the same samples. With the classical group of samples, 30% of them were ranked below 3.0 with Nero, whereas iTunes and Vorbis got such a mark for less than 10% of the samples. The two winners are stronger, and can handle more situations than LAME and Nero Digital AAC.
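The robustness argument rests on the share of samples that fall in a given range of marks, not on the average. A minimal sketch of that count is given below; the function names and the list of marks are hypothetical and are not data from this test.

```python
def share_in_range(scores, low, high):
    """Percentage of marks s with low <= s <= high."""
    hits = sum(1 for s in scores if low <= s <= high)
    return 100.0 * hits / len(scores)


def share_below(scores, threshold):
    """Percentage of marks strictly below the threshold."""
    hits = sum(1 for s in scores if s < threshold)
    return 100.0 * hits / len(scores)


if __name__ == "__main__":
    # Hypothetical marks for one encoder on the 1.0 to 5.0 scale.
    marks = [5.0, 4.8, 4.5, 4.2, 3.9, 3.5, 3.0, 2.8, 4.6, 4.9]
    print(f"near-transparent (4.5 to 5.0): {share_in_range(marks, 4.5, 5.0):.0f} %")
    print(f"seriously flawed (< 3.0):      {share_below(marks, 3.0):.0f} %")
```

Two encoders with a similar average can differ a lot on the second figure, which is the fragility described above.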
This post has been edited by guruboolez: Dec 29 2005, 22:57 |
|
|
|
Nov 15 2005, 09:30
Post
#4
|
|
![]() Group: Members (Donating) Posts: 678 Joined: 10-December 01 From: Belgium Member No.: 622 |
QUOTE (guruboolez @ Nov 15 2005, 10:14 AM)
I can imagine that. Boy, performing this test must have been such a huge task... I'm extremely impressed! Thanks a lot for sharing this with us, it's very interesting (especially now that you also included non-classical music). My hat's off to you, Sir!
--------------------
Over thinking, over analyzing separates the body from the mind.
|
|
|
|
Nov 15 2005, 09:34
Post
#5
|
|
|
Group: Members Posts: 471 Joined: 6-March 03 Member No.: 5360 |
Bravo! You are much braver and more patient than I am! Going by your test, it would seem that buying from the iTunes Store isn't such a bad quality sacrifice.
Thanks again, your blind tests are one of the top attractions around here. |
|
|
|
Nov 15 2005, 09:42
Post
#6
|
|
![]() Group: Members (Donating) Posts: 3474 Joined: 7-November 01 From: Strasbourg (France) Member No.: 420 |
The changes in aoTuV beta 4.5 are for lower settings (up to -q3,00). Fortunately, I would (exceptionally) say
|
|
|
|
Nov 15 2005, 09:45
Post
#7
|
|
![]() Group: Members Posts: 493 Joined: 3-June 03 Member No.: 6981 |
Once again guruboolez, thank you for your amazingly informative tests! And thanks for subjecting your ears to the rigours of modern music..
It's also nice to see that Aoyumi's work on Vorbis is keeping it at the forefront of modern audio compression. |
|
|
|
Nov 15 2005, 09:53
Post
#8
|
|
![]() Group: Super Moderator Posts: 4887 Joined: 12-August 04 From: Exeter, UK Member No.: 16217 |
Thank you guruboolez.
These tests are so important to the community.
--------------------
I'm on a horse.
|
|
|
|
Nov 15 2005, 10:13
Post
#9
|
|
|
Group: Members Posts: 111 Joined: 11-December 01 Member No.: 625 |
QUOTE (guruboolez)
Consequently, my “personal evaluations” which were first a friendly exercise feasible in one rainy, autumnal afternoon now looks as a gigantic task which took me approximately 10 days (shared with family, friends, job, and discouragement) to complete. I improved several point of the methodology

I am in awe... I always do my own personal ABX tests for my personal usage, but they are nothing compared to the enormous amount of work you do. Your tests and public results are very much appreciated, thank you.

edit: I just finished reading the test results twice (to go through all the details), and I find it interesting that Nero still does not match iTunes, even though it uses a true VBR mode, whereas iTunes does not. I have been testing the new Nero codec in VBR LC mode at lower bitrates for my W800i, and have been disappointed by it. What I did not do is compare it to iTunes. I will now.

This post has been edited by arman68: Nov 15 2005, 10:24 |
|
|
|
Nov 15 2005, 10:32
Post
#10
|
|
![]() Group: Members Posts: 98 Joined: 22-February 03 From: Quebec, Montreal Member No.: 5117 |
That must have been a heap load of data to compile
Did your nose bleed? |
|
|
|
Nov 15 2005, 10:40
Post
#11
|
|
|
Group: Members Posts: 345 Joined: 5-August 03 Member No.: 8183 |
Invaluable tests again. Thank you so much. Vorbis aoTuV is the leading codec at medium bitrates (tied with iTunes AAC). And from other tests you did, Vorbis also shines at low and high bitrates. Nice to confirm that LAME -V2 --vbr-new is still superior to iTunes AAC at medium bitrates (and pretty tied with AAC ~180kbps, I guess).
|
|
|
|
Nov 15 2005, 10:41
Post
#12
|
|
|
Group: Members (Donating) Posts: 90 Joined: 30-July 03 From: New Zealand Member No.: 8083 |
As we say "down under", "Good on ya, mate!"
|
|
|
|
Nov 15 2005, 11:18
Post
#13
|
|
![]() Group: Members Posts: 250 Joined: 27-December 02 From: ROMA, Italy Member No.: 4269 |
Thanks Guru!
-------------------- Vital papers will demonstrate their vitality by spontaneously moving from where you left them to where you can't find them.
|
|
|
|
Nov 15 2005, 11:41
Post
#14
|
|
|
Group: Members Posts: 261 Joined: 8-July 04 Member No.: 15184 |
Cheers Guru, fascinating stuff.
|
|
|
|
Nov 15 2005, 11:45
Post
#15
|
|
|
Group: Members Posts: 470 Joined: 26-October 01 From: Germany Member No.: 352 |
Thanks a lot, very interesting test again!
|
|
|
|
Nov 15 2005, 11:46
Post
#16
|
|
![]() LAME developer Group: Developer Posts: 761 Joined: 22-September 01 Member No.: 5 |
Thanks Guruboolez, very informative.
About the LAME encoder not being well balanced:

QUOTE
LAME MP3: I used latest alpha of 3.98 (alpha 2) in order to add the --athaa-sensitivity 1 command to the -V5 --vbr-new mode. For the second group of samples and to slightly lower the bitrate I simply used -V5 --vbr-new.

I'm wondering, would your results be different if the encoder settings had been the same for the classical and non-classical groups? |
|
|
|
Nov 15 2005, 12:07
Post
#17
|
|
|
Group: Members Posts: 207 Joined: 11-April 02 Member No.: 1749 |
Oh. I don't think this test can be ignored just because it was done by one person. Nero really needs to do some work to improve their AAC implementation (maybe it's already in Ivan's brain).
Thank you guruboolez for your great work. |
|
|
|
Nov 15 2005, 12:09
Post
#18
|
|
|
Group: Banned Posts: 149 Joined: 1-September 05 Member No.: 24248 |
Thanx!
guruboolez, how about a low-bitrate comparison (64 kbps and below)? |
|
|
|
Nov 15 2005, 12:15
Post
#19
|
|
|
Group: Members Posts: 208 Joined: 12-March 04 From: Germany Member No.: 12686 |
You must be crazy. Impressive work! Thanks a lot, guruboolez |
|
|
|
Nov 15 2005, 12:38
Post
#20
|
|
|
Group: Members Posts: 238 Joined: 22-February 04 Member No.: 12193 |
Nice listening test, as always
|
|
|
|
Nov 15 2005, 13:11
Post
#21
|
|
![]() Rarewares admin Group: Members Posts: 7515 Joined: 30-September 01 From: Brazil Member No.: 81 |
Awesome, awesome, awesome.
Very big thanks, Francis. You're a legend.
--------------------
Get up-to-date binaries of Lame, AAC, Vorbis and much more at RareWares:
http://www.rarewares.org |
|
|
|
Nov 15 2005, 13:23
Post
#22
|
|
![]() Group: Members (Donating) Posts: 3474 Joined: 7-November 01 From: Strasbourg (France) Member No.: 420 |
QUOTE (robert @ Nov 15 2005, 11:46 AM)
QUOTE
LAME MP3: I used latest alpha of 3.98 (alpha 2) in order to add the --athaa-sensitivity 1 command to the -V5 --vbr-new mode. For the second group of samples and to slightly lower the bitrate I simply used -V5 --vbr-new.
I'm wondering, would your result be different if the encoder settings would have been the same for classical and none classical groups?

I don't think so. The --athaa-sensitivity switch prevents a specific kind of ringing (I usually call it "background ringing"), and I don't remember any sample of the second group suffering from this problem (there are maybe one or two of them).

I already noticed this disparity in performance between the classical group and the "various" samples during my summer listening tests performed at 80 kbps and 96 kbps. The difference is also not very large. And as you can see on the distribution histograms, the main difference occurs in the last part (ranking > 4.5). ~40% of the tested samples (classical) were ranked below 4.5 with LAME, but the proportion falls to 20% for the second category. It seems that, for LAME, there are more "easy" situations to handle in my sample gallery than among the 50 samples I collected from various listening tests. (I don't know if I'm really clear...). |
|
|
|
Nov 15 2005, 13:28
Post
#23
|
|
![]() Group: Members (Donating) Posts: 3474 Joined: 7-November 01 From: Strasbourg (France) Member No.: 420 |
|
|
|
|
Nov 15 2005, 13:36
Post
#24
|
|
![]() Group: Members Posts: 730 Joined: 5-January 04 Member No.: 10970 |
Surprising to see how close Vorbis and iTunes are to the high anchor. I guess one could safely use 160kbps VBR for transparency with iTunes now (I previously used 192kbps).
|
|
|
|
Nov 15 2005, 14:27
Post
#25
|
|
![]() Group: Developer Posts: 1245 Joined: 16-December 02 From: Australia Member No.: 4097 |
To guruboolez, thank you for yet another incredibly fascinating and informative listening test.
I am again very pleased to see Vorbis doing so well. Full credit to Aoyumi for his wonderful work. I'm also very pleased to see iTunes AAC doing so well too. It seems we do get value for money with these two encoders (i.e. they're free!!! Even better) |
|
|
|