MPC vs OGG VORBIS vs MP3 at 175 kbps, listening test on non-killer samples
Jul 12 2004, 00:50 – guruboolez
• My access to the internet is now very limited. Therefore, the encoders I'm using for my tests are not necessarily the most recent ones available on the web. These tests were done after Vorbis 1.1 RC1 was released, but I didn't have access to this information…
• This test is something of a work in progress. I plan to add more results over time.
I. PURPOSE OF THE TEST
Like many people on this board, my principal motivation for audio encoding lies in the possibility of listening to and enjoying music in high quality directly from the computer, which allows very fast browsing and access to an entire record collection. High-quality encoding is a requirement, security a need. I used successively LAME MP3, Musepack, and now lossless, which offers the security of digital data identical to the CD.
Nevertheless, lossy encoding is still interesting: modern hard disks are not necessarily big enough for a whole collection, and I think there are benefits to feeding an expensive digital jukebox with "better than just good" quality encodings – AAC/Vorbis at 128 kbps is fine, but perfectible.
The choice of the best lossy encoder doesn't look problematic at first. Musepack (MPC) still wins most approval, and is considered fully transparent with the --standard preset. Several things nevertheless encouraged me to seriously question MPC's leading position.
• 1/ By occasionally testing the --standard preset of MPC, I discovered that small differences are sometimes audible with ordinary music. Now, if MPC isn't fully transparent at 175 kbps, the format is definitely comparable (which doesn't mean "equal") to other lossy solutions, which suffer from the same reproach.
• 2/ MPC's leading position was established a long time ago. It was crowned "best lossy format" when the challengers were not very strong: beta versions of Vorbis, LAME < 3.90, suboptimal AAC encoders. But now there are powerful Vorbis encoders (the recent "megamix" merge looks like a serious challenger), optimized AAC encoders (QuickTime CBR and Nero VBR), and mature MP3 solutions (the VBR presets of LAME). The leading position must therefore be questioned again, at least by people able to detect differences.
• 3/ This challenge becomes necessary with the growing number of devices supporting newer audio formats like AAC and Vorbis. MPC is still confined to the computer, or at best to PDAs – and is maybe doomed to this limited usage.
Consequently, I've tried to pit other serious encoding solutions against MPC --standard, in order to form a better, up-to-date and personal idea of this encoder's quality relative to modern and convenient challengers.
Against Musepack --standard, I decided to field two formats: MP3 with LAME 3.97a3, and Ogg Vorbis with the recent combined encoder named "megamix". Some explanations follow.
• First, there is no AAC encoder in the arena. I was tempted to use Nero AAC, but the last version I have (126.96.36.199) has some recognized quality problems and is promised to an imminent conceptual death with the third version of Ivan Dimkovic's encoder. No need to test something outdated… I was also tempted to take QuickTime AAC, though it's not VBR and not very flexible (nothing between 160 and 192 kbps: annoying for a fair comparison with MPC --standard). But this encoder is not really suitable, in my opinion, for HQ listening, at least when the user is fond of opera and most of his CDs absolutely need true gapless playback. AAC will be added later, but for now it's absent from this test.
• The choice of the LAME MP3 version is highly problematic too. Three choices are possible: the last "tested" release (3.90.3), the last gold release (3.96), or the last alpha release (3.97 alpha 3). I've decided not to use 3.90.3. I know that for some people this encoder is the best MP3 codec ever released; I also know that for historical reasons 3.90.3 is probably the safest choice. But the difference between the dead 3.90.x branch and the active 3.9x one is not only about quality: 3.9x is much faster (not a luxury, considering the slowness of the 3.90.x presets), more complete (a full and redesigned VBR preset scale: the nice -V 5 used in Roberto's listening test is, for example, a new feature inaccessible to 3.90.x), and, last but not least, in perpetual evolution. There's nobody left to correct flaws in 3.90.x, whereas bugs audible with 3.9x can be corrected or mitigated by Gabriel, Robert, Takehiro and the other developers.
I definitively ruled out 3.90.x for another important reason: it has no VBR preset corresponding to the MPC --standard bitrate. --alt-preset standard is clearly too high, --medium too low, the -Y switch a hack, and ABR is probably not efficient enough. With the 3.9x branch, there's an existing preset between --standard and --medium: -V 3. And the -V 3 average bitrate should be close to that of MPC --standard.
Then: 3.96 "gold" or 3.97 alpha? I've settled on the alpha release. I know the risks (of regression, but also of progress). But I also know that 3.96 is buggy in --fast mode: that decided me in favour of a corrected release, even though this test doesn't concern LAME's fast mode.
• The choice of the Vorbis version is less problematic. Recent tests have been done. CVS/GT3b2 couldn't resist the aoTuV/GT3/QK32 dream team (aka megamix), at least up to -q 5.99. And even higher, GT3b2 (the previous reference encoder for high bitrates) doesn't really sound superior (except maybe for one family of problems: micro-attacks). I also recall that I began this test unaware of the release of 1.1 RC1. That encoder nevertheless seems inferior to "megamix" (the essential, but maybe "excessive", tunings by Garf used at bitrates above -q 5.00 are apparently missing from the RC1 version). The use of "megamix" is therefore pertinent, and my test is probably not made obsolete by this enjoyable pre-release of oggenc 1.1.
• I haven't forgotten the promising WMA Pro: I was really pleased, even enthusiastic, about the quality this format reaches with classical music at mid bitrates. Nevertheless, I didn't include it in the test. First, I had to limit the number of competitors. Then, I'm not familiar with this encoder and don't know which setting is best (which VBR mode? Is the WMA Pro VBR implementation reliable, or is 2-pass ABR preferable, etc.?). Last: there is still no hardware device for WMA Pro (though that's not a reason to exclude a format from a test that includes MPC, it's a disappointing situation).
Mid/high-bitrate tests are, for me at least, especially painful. That doesn't mean I hate them – quite the opposite. ...
The samples only concern "classical music", with one exception. I deliberately limited my choice to the music I like. It's not snobbery, and it's not an egocentric attitude: other music is much harder for me to ABX, and my motivation would quickly disappear with music I don't really like. In other words, the reach of these results is VERY LIMITED: they concern my subjectivity (and only mine), and a particular genre of samples (natural instruments, recordings made according to high-fidelity principles – not to the marketing "loudness" one).
There are solo instruments (organ with Dom Bedos; harpsichord with Fuga; trombones with Orion II), instruments with small accompaniment (cymbals with Krall and Marche Royale, drums with Marche Royale, 2nd part), orchestra (Weihnachts-Oratorium and Platée), chorus (Ich bin der Welt abhanden gekommen) and voice ("Dover, giustizia, amor"). Additional information (artist, performer…) is available in the file tags.
Comparing VBR encoders/settings is problematic. The ideal approach is to fix a target bitrate, and then to find the corresponding preset for each encoder. I followed the usual (and, IMHO, the best) methodology: the setting must be chosen according to the bitrate over a wide selection of music, not over the selected samples.
The target bitrate is the average bitrate of the Musepack --standard preset. This average can't be evaluated precisely: it's somewhere between 170 and 180 kbps – approximately 175 kbps. I have verified this value on my classical music library, and people have reported similar values with completely different music.
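For anyone wanting to verify this on their own library: the average bitrate is simply the total encoded size divided by the total playing time. Here is a minimal Python sketch of that computation; the folder layout and the durations mapping are hypothetical placeholders (the durations can be taken from the source WAVs or a tagging tool).
[code]
# Minimal sketch: average bitrate of an encoded library, computed as
# total encoded bits divided by total playing time in seconds.
# "library/mpc_standard" and the durations mapping are placeholders.
import os
from pathlib import Path

def average_bitrate_kbps(folder, durations):
    """durations: file name -> playing time in seconds."""
    files = list(Path(folder).glob("*.mpc"))
    total_bits = sum(os.path.getsize(f) * 8 for f in files)
    total_seconds = sum(durations[f.name] for f in files)
    return total_bits / total_seconds / 1000.0

# Example (hypothetical values):
# durations = {"01 - Dom Bedos.mpc": 213.4, "02 - Fuga.mpc": 187.0}
# print(average_bitrate_kbps("library/mpc_standard", durations))
[/code]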
The remaining task is now to find the corresponding VBR settings for LAME MP3 and Vorbis “megamix”.
This is where the problems begin…
4.1. VORBIS SETTINGS
• The biggest problem lies in the difference in Vorbis's average bitrate at the same setting, depending on the kind of music encoded. Classical music is bitrate-friendly compared to most other stereo, modern material. With the CVS encoder, I estimated this difference at 10–15 kbps on average for -q 5–6. With "megamix" (or other GT3b2-based encoders), the difference might reach 25–30 kbps at the same setting. I don't know what to do…
- By testing Vorbis with a -q value corresponding to 175 kbps for classical but 200–210 kbps for pop/rock, people may blame me for pitting an advantaged Vorbis challenger against Musepack.
- By testing Vorbis with a -q value corresponding to 175 kbps for pop/rock but ~140 kbps for classical, the test would be pointless for me (the winner between mpc@175 and vorbis@140 isn't very hard to guess…).
- By testing Vorbis with a half-baked -q value, I fear the test would correspond to neither situation.
• The second big problem is related to a break in the linearity of the Vorbis quality scale. Between -q 5.99 and -q 6.00 there's a substantial bitrate difference (~10 kbps), which also corresponds to a serious quality difference, at least with Vorbis 1.00–1.01 (including GT3b2). aoTuV (and therefore "megamix") is based on the same code, but its tuning tried to correct, or at least minimize, the quality gap between the two settings. I discovered that for classical music the fair Vorbis setting is very close to this 5.99 value. 6.00 is slightly too high, and I could disadvantage MPC by comparing it to Vorbis -q 6.00. On the other hand, I have the feeling that -q 6.00 would show the full potential of Vorbis, and that the extra 8–10 kbps could be worth it for daily use. Would anyone renounce the correction of a quality bug at such a low price (a +5% increase in file size), especially with archiving in mind? Seriously, I don't think so.
For all these reasons, I've decided to put Vorbis megamix into the arena at three different settings:
-q 6.00: clearly too "heavy" compared to MPC --standard on non-classical music, but interesting to test against -q 5.99 (to see whether the frontier between these two settings still exists with aoTuV/megamix/1.1)
-q 5.99: the setting whose bitrate matches MPC --standard for classical music (still too heavy for other music), but maybe suboptimal quality for Vorbis
-q 5.50: a more universal setting for an acceptable test against MPC --standard. It will be interesting to compare the quality differences between 5.50/5.99 and 5.99/6.00. I suspect (and fear) a much greater jump across the second pair than across the first.
4.2. LAME SETTINGS
I discovered that the bitrate of the -V 3 preset (LAME 3.97a3) is really close to the average bitrate of MPC --standard. This holds at least for classical music (I don't have enough material to measure the average bitrate for other genres). -V 3 will therefore be tested.
I've also decided to add -V 2 (--preset standard). The bitrate is higher, but I really want to see whether this historically leading LAME preset is competitive against Musepack. It will also be interesting to see how LAME -V 2 performs compared to Vorbis megamix, which is also playable on portable players, though with bad consequences for battery life.
4.3. BITRATE TABLE
Instead of posting a bitrate table for the short samples used in the test, I prefer to post data about more audio material. Average bitrates for ~20 albums (mostly classical), plus additional data for tracks coming from 50 different CDs (+15 more in mono), are available in the following tables:
V. RESULTS AND CONCLUSIONS
First comment: I've added 10 points to each grade. I had to find a way to prevent misinterpretation of grades that could at first sight appear excessively severe. I didn't use a low anchor for this test, and slight flaws can come across as terribly annoying in such tests, lowering the grades a lot. By artificially shifting all the grades, I also had in mind to disconnect my notation from the EBU scale (4 = "perceptible but not annoying"; 3 = "slightly annoying", etc.).
With only 10 results, I can't draw strong conclusions. But some elements of a conclusion are starting to appear:
• MPC --standard has a serious chance of being the best of the three competitors. Eight times in first place, once in second, and never last. A very good performance. Note also that the --standard setting wasn't sufficient to reach "transparency" (except on the organ sample, with negative ABX tests). Nevertheless, I can seriously expect full transparency at a higher setting: none of these samples (except maybe the chorus one) showed severe artifacts, just slight differences. It's typically the kind of "problem" that disappears at a higher bitrate. Anyway, I'm impressed, because I didn't think MPC --standard was so far ahead...
• LAME MP3 has, in my opinion, little chance of competing with Vorbis and Musepack at ~175 kbps. The new -V 3 setting sat in last place eight times: too many… even with a limited set of samples. That doesn't mean -V 3 sounds bad; it's just inferior to the modern lossy formats at a similar bitrate. But with improvements, who knows...
But the -V 2 setting (aka --alt-preset standard) is apparently competitive, and can fight (and sometimes win) against Vorbis "megamix" -q 5.50 and -q 5.99. The only problem: the bitrate is no longer the same (195 kbps vs. something between 162 and 180 kbps, for classical music at least). But it's imperative to point out that LAME -V 2 and -V 3 suffer from big artifacts (the harpsichord and the organ samples are severely wounded, to my ears), whereas the Vorbis artifacts were never that bad (except, maybe, on the Orion II sample – micro-attack problems).
In short, LAME -V 2 (--preset standard) is apparently competitive with Vorbis "megamix" -q 5.99, at least for classical music. It would be interesting to see how both contenders perform with other kinds of music at the same settings, which implies a completely different bitrate range (+10–15% with Vorbis, and maybe –x% with LAME).
• I expected a lot from the Vorbis mixture. The progress of "megamix" over the CVS encoder is really impressive, and I really wondered how it would perform against the other challengers. In the end, I'm disappointed, for several reasons:
- First, the coarse-sounding problem of the format is still audible with "megamix" up to -q 5.99. No need to suspect the GT3b2 or QK32 tunings of ruining the benefits of the original aoTuV in this area: the noise problem is particularly audible in "tonal" passages, which are encoded with pure aoTuV code (the samples are bit-for-bit identical between the aoTuV encoder and the megamix one). This additional noise is probably not too disturbing in daily listening, but in direct comparison with the other challengers the contrast is still annoying. The problem doesn't really lie in the noise itself, but in a coarse rendering of voices and instruments: lack of subtlety, fat texture… I think this problem is a legacy of an internal change that occurred during the RC3 development of Vorbis, in spring 2002. I think I established this at ~128 kbps some months ago (correct me if I'm wrong), and I suppose it's still true at ~160–170 kbps, even with aoTuV (based on the same buggy "final" CVS code).
- Second reason to be disappointed: because this coarseness persists up to -q 5.99, there's still a substantial quality gap between that setting and the rounded -q 6.00. It's my fault: I expected the aoTuV tuning to erase the existing frontier between -q 5.99 and -q 6.00; this encoder only reduced the gap. There's a ~10 kbps difference between 5.50 and 5.99 but little quality improvement; there's also a 10 kbps difference between 5.99 and 6.00, but huge quality progress is audible. For daily use of the Vorbis encoder this difference poses no real problem: the 10 additional kbps of -q 6.00 are obviously worth it for anyone looking for high quality or archiving, and there's no need to hesitate. But for my test, or any similar one, the difference is much more problematic. On one side, I can't pit MPC --standard against megamix -q 6.00 on a fair basis (the average bitrates no longer match). On the other side, it's pointless to compare MPC --standard to a handicapped Vorbis setting (5.99). It's like using Musepack at --quality 4.99, which also suffers from problems (and a bitrate gap) that no longer exist at --quality 5.00. A cruel dilemma…
- Third reason to be disappointed: even at -q 6.00 (with 10 extra kbps), megamix apparently couldn't reach the quality of Musepack --standard. More samples are of course needed to reinforce this tentative conclusion, but I really fear that the solution doesn't lie in the selection of samples, but rather in further development.
As I said at the very beginning, I consider this test a first step. Additional results should, and normally will, complete this first phase. I expect a quick release of version 188.8.131.52 of the Nero AAC encoder to add some spice to the test. External tests pitting Vorbis megamix against the new 1.1 must also be done, in order to be sure that megamix is the best Vorbis encoder at this bitrate.
I'd also like to see this test taken up by other people. It would help to compare different HQ encoders on an empirical basis. Feel free to post your results in this topic, even for a single sample.
APPENDIX. SAMPLE LOCATION
I've uploaded all samples to a temporary link. I can't keep them online for long, so don't wait if you're planning to do personal tests. ABX logs are available in each archive. Samples are in the OptimFROG lossless audio format.
Jul 24 2004, 20:49 – Pio2001 (Super Moderator)
[quote=guruboolez,Jul 24 2004, 03:38 PM][quote=Pio2001,Jul 24 2004, 03:03 PM][quote=guruboolez,Jul 24 2004, 02:37 PM]"well, 39 out of 59, pval = 0.009 for the Dover, Giustizia.mpc sample, it's probably not luck".[/quote]
Keep in mind that if someone says this, we will fight this interpretation, since it is wrong and spreads misinformation.[/quote]
Misinformation? Could you be more precise? What chance would I have of obtaining this result by guessing?[/quote]
In the case of a sequential ABX test, the p-value can't be 0.009 for 39 out of 59, since that is the p-value for a fixed ABX test. Saying p = 0.009 is misinformation. The maximum number of trials must be known, and the corrected p-value table must be extended to that number to get the right value.
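For reference, the "fixed test" p-value mentioned here is a plain binomial tail. A minimal Python sketch of that computation (illustrative only, not a tool used in this thread):
[code]
# P-value of a FIXED-length ABX test: the probability that a pure
# guesser (p = 0.5 per trial) gets at least `correct` of `trials` right.
# Only valid if the number of trials was decided before the test.
from math import comb

def fixed_abx_pvalue(correct: int, trials: int) -> float:
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

print(fixed_abx_pvalue(39, 59))  # ~0.009, the value quoted above
[/code]
This number is only the right p-value when the length of the test was fixed in advance; a listener who watches the score and stops on a good streak needs the corrected table instead.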
[quote=guruboolez,Jul 24 2004, 03:38 PM]What are you trying to prove when
- REF vs CODEC_A = 14/16
- REF vs CODEC_B = 18/22
- REF vs CODEC_C = 8/8
and when the ratings are:
- CODEC_A = 3.5/5
- CODEC_B = 2.0/5
- CODEC_C = 4.5/5
What are your conclusions here? I'm interested.[/quote]
My conclusion is that codec B must be a bit underrated, since an "annoying" difference couldn't be distinguished from the original 4 times (unless the tester states that he hit the wrong button). I don't know how to interpret the ABX scores, since I don't know whether they were run in a sequential or a fixed way. Read as fixed tests, however, the confidence level is high.
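Read as fixed-length tests, the three quoted scores indeed all carry high confidence; the same binomial tail as above gives (again, just an illustrative sketch):
[code]
# Fixed-test p-values for the three ABX scores quoted above.
from math import comb

def fixed_abx_pvalue(correct: int, trials: int) -> float:
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

for name, (correct, trials) in [("CODEC_A", (14, 16)),
                                ("CODEC_B", (18, 22)),
                                ("CODEC_C", (8, 8))]:
    print(name, fixed_abx_pvalue(correct, trials))
# CODEC_A ~0.0021, CODEC_B ~0.0022, CODEC_C ~0.0039
[/code]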
[quote=guruboolez,Jul 24 2004, 03:38 PM]No proof of what? If you take a look at the log files I've posted, I sometimes added comments about the evolution of the score. Apparently you're not taking this into account, because you don't know how to compute this situation.[/quote]
Exactly. I'm not going to spend a whole weekend trying to analyse partly sequential ABX results with additional conditions, with pages of calculation, when we have a binomial table that gives the result at once if the number of trials is fixed in advance – especially after most people on this board have hammered home (though I'm not sure I repeated it in the ABX tutorial) the necessity of fixing the number of trials before the test begins, OR of not looking at the results during the test, for the results to be valid.
[quote=guruboolez,Jul 24 2004, 03:38 PM][quote=Pio2001][quote=guruboolez,Jul 24 2004, 02:37 PM]That's why some people tried to perform long-term ABX tests, to prove that a difference could be audible in other listening conditions.[/quote]
I remember the 24-bit vs 16-bit test, passed after several days, but as far as I remember it was not a sequential test, was it?[/quote]
I'm not talking about 16 vs 24 bit, but about people trying to ABX high-bitrate encodings after listening to the same disc many, many times.[/quote]
So what? Long term or short term doesn't change the methodology... Either the number of trials is fixed, or you don't look at the results until the test is finished, or you fix a maximum number of trials and use the corrected p-value table. All three methods are valid for short- or long-term tests.
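Why peeking needs a corrected table can be shown numerically. A minimal Monte Carlo sketch (illustrative only, not from the original exchange): it simulates pure guessers who check the fixed-test p-value after every trial and stop at the first "significant" score, and finds a false-positive rate well above the nominal 0.009.
[code]
# Simulate a pure guesser who peeks at the fixed-test p-value after
# every trial (up to 59) and stops as soon as it drops below 0.009.
import random
from math import comb

MAX_TRIALS = 59
ALPHA = 0.009  # the fixed-test p-value quoted for 39/59

# PVAL[n][k] = P(X >= k) for n fair-coin trials.
PVAL = [[sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
         for k in range(n + 1)]
        for n in range(MAX_TRIALS + 1)]

def peeking_false_positive_rate(runs: int = 20_000) -> float:
    hits = 0
    for _ in range(runs):
        correct = 0
        for n in range(1, MAX_TRIALS + 1):
            correct += random.random() < 0.5  # guess each trial
            if PVAL[n][correct] < ALPHA:      # peek, stop on "success"
                hits += 1
                break
    return hits / runs

random.seed(1)
print(peeking_false_positive_rate())  # clearly above 0.009
[/code]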
[quote=guruboolez,Jul 24 2004, 03:38 PM]I'd like to see that. The consequences would be funny: most of the listening tests already done would simply be invalid. Roberto's tests should be removed from the news, because for practical reasons they don't respect certain scientific conditions (p-value of 0.01, too few samples, not enough listeners, disparity between critical and casual listeners, etc...[/quote]
Roberto's results are perfectly valid:
- The tests were double-blind.
- The p-value is strictly below 0.05 (< 0.01 is a good thing; < 0.05 is what's required).
[quote=guruboolez,Jul 24 2004, 03:38 PM]ff123 already pointed out those limits).[/quote]
The limits pointed out by ff123 have nothing to do with the validity of the results, but with the scope of the test. In the same way, your test is valid in itself, because you got a success with p < 0.05, but its scope is very narrow, because you were the only listener, and it is not certain that someone else would get the same (valid) results. It's like saying "this man is taller than this woman". The test consists of measuring them. The results are:
Man: 181 cm
Woman: 176 cm
The right conclusion is "this man is taller than this woman". The result is valid, proven by a repeatable experiment on the same pair of people. But the scope is very narrow: we can't conclude that every man is taller than every woman.
[quote=guruboolez,Jul 24 2004, 03:38 PM]All of HA's tacit knowledge should be eradicated, because proof of MPC's superiority against the other contenders was NEVER published (though it's a common and shared idea).[/quote]
You just published it (implicitly) at the top of this thread. Your results are valid, V.A.L.I.D. Can't you read the ANOVA log I posted, and its conclusion?
Here's the first link I found about ANOVA while searching the web: http://www.psychstat.smsu.edu/introbook/sbk27.htm
The column it talks about refers to a different piece of software, but the value discussed is the p-value.
[quote]If the number (or numbers) found in this column is (are) less than the critical value (α) set by the experimenter, then the effect is said to be significant. Since this value is usually set at .05, any value less than this will result in significant effects, while any value greater than this value will result in nonsignificant effects.
If the effects are found to be significant using the above procedure, it implies that the means differ more than would be expected by chance alone. In terms of the above experiment, it would mean that the treatments were not equally effective. This table does not tell the researcher anything about what the effects were, just that there most likely were real effects.
If the effects are found to be nonsignificant, then the differences between the means are not great enough to allow the researcher to say that they are different. In that case, no further interpretation is attempted.[/quote]
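For readers who want to reproduce this kind of analysis with modern tools, here is a minimal sketch of a one-way ANOVA over ABC/HR grades. The numbers are invented for illustration, and ff123's analyzer additionally blocks on samples, which this plain one-way version ignores:
[code]
# One-way ANOVA over ABC/HR grades: does at least one codec's mean
# grade differ from the others more than chance would explain?
# The grades below are invented for illustration only.
from scipy import stats

mpc    = [4.5, 4.8, 4.2, 4.6, 4.9, 4.4, 4.7, 4.3, 4.6, 4.5]
vorbis = [3.9, 4.1, 3.6, 4.0, 4.2, 3.8, 4.0, 3.7, 3.9, 3.8]
mp3    = [3.1, 3.4, 2.9, 3.2, 3.5, 3.0, 3.3, 2.8, 3.1, 3.0]

f_stat, p_value = stats.f_oneway(mpc, vorbis, mp3)
print(f"F = {f_stat:.1f}, p = {p_value:.2g}")
# p < 0.05 means the mean grades differ significantly; a post-hoc
# test is then needed to say WHICH codecs differ.
[/code]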
[quote=guruboolez,Jul 24 2004, 05:50 PM]ABC/HR ratings without ABX confirmation don't amount to much... It's a blind test, OK, but not a double-blind one.[/quote]
First, yes it is. Your computer is hiding the names of the samples, and you have no other way of finding the reference than your ears. Therefore the test IS double-blind.
A single-blind test would be, for example, a listening test between a pressed CD and a copy, with someone putting the CD in the drive for you to listen to. By listening to what he does with the CD he takes out of the drive, you might tell whether he is putting the same one back, or setting it aside and inserting the other. That is a single-blind test. For it to become double-blind, you'd have to use 10 identical CD players, each with a CD hidden inside. You're left alone in the room, and must tell which drives hold an original and which hold a copy. That is a double-blind test, because you can't be influenced by the operator. Fortunately, computers allow us to hide and play samples without any means for us to guess which one is playing.
[quote=guruboolez,Jul 24 2004, 05:50 PM]Such tests won't really and genuinely be accepted. Look at the LAME (3.90.3 vs new release) testing phase, for example:
[quote]4. Your test results have to include the following:
* ABX results for
3.90.3 vs. Original
3.96 vs. Original
3.96 vs. 3.90.3
* ABC/HR results are appreciated especially at lower bitrates, but shouldn't be considered a requirement.
* (Short) descriptions of the artifacts/differences[/quote]
Those conditions are required. Ratings without ABX tests are often considered useless. ABX tests are required, especially those pitting the encoders against each other. So please don't try to say that plain ABC/HR rankings are appreciated, when other threads and people's reactions show that without ABX confirmation these grades are considered as wind...[/quote]
This is because, until now, Roberto was the only one to use ANOVA analysis on ABC/HR tests. Remember your last test: you posted some rankings, and they were discussed. I was on the verge of brandishing rule 8, but instead I asked whether someone could compute the results and post the graph with error bars. No one did.
This time you tested LAME vs Vorbis vs MPC at high bitrates. Since I found this test very important, and I saw that no one was able to compute the results last time, I read Roberto's pages more carefully and found ff123's ANOVA analyzer.
When, having rated MPC above MP3 9 times out of 10, you get p < 0.05 in the ABC/HR ANOVA analysis, it is mathematically equivalent to succeeding in a fixed ABX test with p < 0.05.
The ABC/HR results tell this, not the ABX ones. They show, among other things, that you can consistently hear the difference between MPC and MP3 with the settings you chose, on the samples you chose.
It has not been much pointed out outside Roberto's tests, but ABC/HR can be a substitute for ABX. I think it's time to explain this in a tutorial. Your test proves the great usefulness of this method of testing, even for one person with several samples, instead of several people and several samples.
It should even work with one person and one sample, given multiple ABC/HR sessions. I think it should be considered in future ABC/HR software.
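The "9 times out of 10" intuition can also be checked with simpler arithmetic, via a sign test on the per-sample rankings (a sketch only; the ANOVA actually used is a different and more powerful computation):
[code]
# One-sided sign test: the chance that one codec is ranked above the
# other at least `wins` times out of `trials` if they were truly equal.
from math import comb

def sign_test_pvalue(wins: int, trials: int) -> float:
    return sum(comb(trials, k) for k in range(wins, trials + 1)) / 2 ** trials

print(sign_test_pvalue(9, 10))  # 11/1024 ~ 0.011, below 0.05
[/code]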
Jul 24 2004, 21:43 – ff123 (ABC/HR developer, ff123.net admin)
QUOTE (Pio2001 @ Jul 24 2004, 11:49 AM)
It should even work with one person and one sample, given multiple ABC/HR sessions. I think it should be considered in future ABC/HR software.
You're talking about multiple trials of rating a codec in the ABC/HR module. For example, rate a certain number of codecs in trial 1, then reshuffle them and rate them again in trial 2. At the end of N trials, one could average the ratings. On the face of it, it would seem that the more codecs there are, and the smaller the differences between them, the more benefit one could get from a procedure like this. Imagine testing just two codecs of very different quality: then it doesn't make much sense to repeat the ratings, as they will be rated exactly the same every time.
So I tend to think that rating more music clips is probably better than trying to get the variability out of the ratings for a single music clip.