MPC vs OGG VORBIS vs MP3 at 175 kbps, listening test on non-killer samples |
Jul 24 2004, 23:56
Post
#76
|
|
|
Moderator Group: Super Moderator Posts: 3934 Joined: 29-September 01 Member No.: 73 |
QUOTE (ff123 @ Jul 24 2004, 09:43 PM) You're talking about multiple trials of rating a codec in the abc/hr module. For example, rate a certain number of codecs for trial 1, then reshuffle them and rate them again for trial 2.

Yes, exactly.

QUOTE (ff123 @ Jul 24 2004, 09:43 PM) Imagine testing just two, but very different quality codecs. Then it doesn't make much sense to repeat the ratings: they will be rated exactly the same every time.

But in this case, it would provide both the ratings and the ABX results at once, with hardly more work than two ABX tests. We just have to find the reference in addition. Recognizing the codecs 8 times out of 8, without ever ranking the reference, would replace ABXing the first against the reference and ABXing the second against the reference. And since the ranking is consistent, it would also replace ABXing them against each other, without us having to do it!

I tried this data in your analyzer:

CODE
Reference  Codec1  Codec2
5.00       3.01    4.00
5.00       3.00    4.00
5.00       3.00    4.00
5.00       3.00    4.00
5.00       3.00    4.00
5.00       3.00    4.00
5.00       3.00    4.00
5.00       3.00    4.00

I chose 0.01 as the p limit, and the Anova analysis (by the way, what's the difference with Friedman / non-parametric?). The results were:

Reference is better than Codec2, Codec1
Codec2 is better than Codec1

All this for p < 0.01.

Thus the analyzer recognized that never ranking the reference in place of codec 1, 8 times out of 8, meant that Reference is better than Codec 1 with p < 0.01. It recognized that Reference is better than codec 2 with p < 0.01. So far we have the same information as with two ABX tests. And it also says that Codec 2 is better than codec 1 with p < 0.01. This is right, since the listener obviously distinguished the codecs (rating codec 1 3.00 and codec 2 4.00) 8 times out of 8 without a mistake.

By the way, your analyzer is bugged: it doesn't work if the first rating for codec 1 is 3.00. I had to set 3.01 instead.
I also tested one mistake in the codec choice (which stands for a 7/8 ABX between the codecs, but still 8/8 for each codec against the reference):

CODE
Reference  Codec1  Codec2
5.00       3.01    4.00
5.00       4.00    3.00
5.00       3.00    4.00
5.00       3.00    4.00
5.00       3.00    4.00
5.00       3.00    4.00
5.00       3.00    4.00
5.00       3.00    4.00

The Anova analysis still tells me that codec 2 is better than codec 1 (with p < 0.001!). This is strange. However, the Friedman / non-parametric analysis detects the problem and says that only the reference was recognized as superior to the codecs with p < 0.01.

Hey! What's the problem with the Anova analysis??

CODE
Reference  Codec1  Codec2
5.00       3.01    4.00
5.00       3.00    4.00

From the above, it says that Reference is better than Codec2 and Codec1, and that Codec2 is better than Codec1, all with p < 0.001! It is plain wrong! The above results can happen by chance! The Friedman analysis seems to work well (it says that the above data is not significant).

So I ran Guruboolez's data through the analyzer again, but with the Friedman analysis this time, in case of an Anova computation failure:

CODE
FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/
Friedman Analysis
Number of listeners: 10
Critical significance: 0.05
Significance of data: 1.44E-05 (highly significant)
Fisher's protected LSD for rank sums: 16.398
Ranksums:
MPC-q5    MGX-q6    MP3-V2    MGX-q5.9  MGX-q5.5  MP3-V3
55.00     49.00     31.50     30.50     26.50     17.50
---------------------------- p-value Matrix ---------------------------
          MGX-q6    MP3-V2    MGX-q5.9  MGX-q5.5  MP3-V3
MPC-q5    0.473     0.005*    0.003*    0.001*    0.000*
MGX-q6              0.036*    0.027*    0.007*    0.000*
MP3-V2                        0.905     0.550     0.094
MGX-q5.9                                0.633     0.120
MGX-q5.5                                          0.282
-----------------------------------------------------------------------
MPC-q5 is better than MP3-V2, MGX-q5.99, MGX-q5.5, MP3-V3
MGX-q6 is better than MP3-V2, MGX-q5.99, MGX-q5.5, MP3-V3

Fortunately, it still says that MPC and Megamix q6 are the winners. However, MPC doesn't win over Megamix q6 anymore. This time, it says that there is one chance out of two of getting this result by chance (p = 0.473)!
|
|
|
Jul 25 2004, 01:43
Post
#77
|
|
![]() Group: Members (Donating) Posts: 3474 Joined: 7-November 01 From: Strasbourg (France) Member No.: 420 |
QUOTE (Pio2001 @ Jul 24 2004, 11:56 PM) So I ran again Guruboolez data in the analyzer, but with Friedmann analysis, this time, in case of an Anova computation failure However, MPC doesn't win over Megamix q6 anymore. This time, it tells that there is one chance out of two for getting this result by chance !

I have simulated the addition of more results (i.e. samples) by simply duplicating the scores obtained for the first 10 samples. With 70 results (= 10 x 7), the Friedman conclusion is:

CODE
FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/
Friedman Analysis
Number of listeners: 70
Critical significance: 0.05
Significance of data: 0.00E+00 (highly significant)
Fisher's protected LSD for rank sums: 43.386
Ranksums:
MPC-q5    MGX-q6    MP3-V2    MGX-q5.9  MGX-q5.5  MP3-V3
385.00    343.00    220.50    213.50    185.50    122.50
---------------------------- p-value Matrix ---------------------------
          MGX-q6    MP3-V2    MGX-q5.9  MGX-q5.5  MP3-V3
MPC-q5    0.058     0.000*    0.000*    0.000*    0.000*
MGX-q6              0.000*    0.000*    0.000*    0.000*
MP3-V2                        0.752     0.114     0.000*
MGX-q5.9                                0.206     0.000*
MGX-q5.5                                          0.004*
-----------------------------------------------------------------------
MPC-q5 is better than MP3-V2, MGX-q5.99, MGX-q5.5, MP3-V3
MGX-q6 is better than MP3-V2, MGX-q5.99, MGX-q5.5, MP3-V3
MP3-V2 is better than MP3-V3
MGX-q5.99 is better than MP3-V3
MGX-q5.5 is better than MP3-V3

With 7 times the same bunch of results, MPC still can't be said to be better than Vorbis -q 6 with confidence. Even if 56 samples were superior with MPC and only 14 superior with Vorbis... Weird.

It's only with 8 times the same results that significance is reached:

CODE
FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/
Friedman Analysis
Number of listeners: 80
Critical significance: 0.05
Significance of data: 0.00E+00 (highly significant)
Fisher's protected LSD for rank sums: 46.381
Ranksums:
MPC-q5    MGX-q6    MP3-V2    MGX-q5.9  MGX-q5.5  MP3-V3
440.00    392.00    252.00    244.00    212.00    140.00
---------------------------- p-value Matrix ---------------------------
          MGX-q6    MP3-V2    MGX-q5.9  MGX-q5.5  MP3-V3
MPC-q5    0.043*    0.000*    0.000*    0.000*    0.000*
MGX-q6              0.000*    0.000*    0.000*    0.000*
MP3-V2                        0.735     0.091     0.000*
MGX-q5.9                                0.176     0.000*
MGX-q5.5                                          0.002*
-----------------------------------------------------------------------
MPC-q5 is better than MGX-q6, MP3-V2, MGX-q5.99, MGX-q5.5, MP3-V3
MGX-q6 is better than MP3-V2, MGX-q5.99, MGX-q5.5, MP3-V3
MP3-V2 is better than MP3-V3
MGX-q5.99 is better than MP3-V3
MGX-q5.5 is better than MP3-V3

Now, if I suppose that the scores I initially planned to add to the first bunch of 10 results will not really differ from the first 10, I need to find and test about 70 additional samples before I can claim that MPC is superior to vorbis "megamix" -q 6,00 without risking banishment. Forget guruboolez's test: I've other things to do in my life.

With ANOVA analysis, the situation is less pathetic:

CODE
FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/
Blocked ANOVA analysis
Number of listeners: 20
Critical significance: 0.05
Significance of data: 0.00E+00 (highly significant)
---------------------------------------------------------------
ANOVA Table for Randomized Block Designs Using Ratings
Source of          Degrees      Sum of    Mean
variation          of Freedom   squares   Square    F        p
Total              119          90.75
Testers (blocks)    19           7.35
Codecs eval'd        5          52.03     10.41     31.50    0.00E+00
Error               95          31.38      0.33
---------------------------------------------------------------
Fisher's protected LSD for ANOVA: 0.361
Means:
MPC-q5    MGX-q6    MGX-q5.9  MP3-V2    MGX-q5.5  MP3-V3
3.82      3.15      2.34      2.30      2.23      1.88
---------------------------- p-value Matrix ---------------------------
          MGX-q6    MGX-q5.9  MP3-V2    MGX-q5.5  MP3-V3
MPC-q5    0.000*    0.000*    0.000*    0.000*    0.000*
MGX-q6              0.000*    0.000*    0.000*    0.000*
MGX-q5.9                      0.826     0.546     0.013*
MP3-V2                                  0.701     0.023*
MGX-q5.5                                          0.057
-----------------------------------------------------------------------
MPC-q5 is better than MGX-q6, MGX-q5.99, MP3-V2, MGX-q5.5, MP3-V3
MGX-q6 is better than MGX-q5.99, MP3-V2, MGX-q5.5, MP3-V3
MGX-q5.99 is better than MP3-V3
MP3-V2 is better than MP3-V3

If the next 10 samples I test get the same ratings as the first 10, then I could conclude that mpc is superior.

May I suggest dropping the "Friedman/non-parametric Fisher" analysis for analysing ABC/HR scores? It could be helpful for testers...
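The jump from "not significant at 7 copies" to "significant at 8 copies" can be checked by hand. The sketch below assumes the rank-sum LSD has the large-sample form z * sqrt(N * k * (k + 1) / 6), with N listeners and k codecs. That formula is an inference, not something read from friedman.exe: it is used here only because it reproduces the three LSD values printed in the outputs above (16.398, 43.386, 46.381). The function names are made up for the illustration.

```python
import math
from statistics import NormalDist

def friedman_lsd(n_listeners: int, n_codecs: int, alpha: float = 0.05) -> float:
    """Assumed large-sample LSD between two rank sums (not taken from friedman.exe)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * math.sqrt(n_listeners * n_codecs * (n_codecs + 1) / 6)

def copies_needed(rank_sum_gap: float, n_listeners: int, n_codecs: int) -> int:
    """Smallest number of identical copies of the data that makes the gap exceed the LSD."""
    copies = 1
    # Duplicating the data multiplies the gap by `copies`,
    # but the LSD only grows with the square root of the listener count.
    while rank_sum_gap * copies <= friedman_lsd(n_listeners * copies, n_codecs):
        copies += 1
    return copies

print(round(friedman_lsd(10, 6), 3))        # LSD for the original 10 samples
print(round(friedman_lsd(70, 6), 3))        # LSD for 7 copies
print(copies_needed(55.00 - 49.00, 10, 6))  # MPC-q5 vs MGX-q6 rank-sum gap
```

Under that assumption, the gap of 6 rank points per 10 samples grows linearly with the number of copies while the LSD grows with its square root, which is why eight copies are needed.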
|
|
|
Jul 25 2004, 06:52
Post
#78
|
|
![]() ABC/HR developer, ff123.net admin Group: Developer (Donating) Posts: 1396 Joined: 24-September 01 Member No.: 12 |
QUOTE (Pio2001 @ Jul 24 2004, 02:56 PM) I chose 0.01 as p limit. The Anova analysis (by the way, what's the difference with Friedman / non parametric ?).

Non-parametric means that you're giving each codec a ranking (i.e., first, second, third, etc.) instead of a rating on a scale from 1.0 to 5.0. Ranking can be more robust than rating, but also less sensitive.

QUOTE By the way, your analyzer is bugged : it doesn't work if the first rating for codec 1 is 3.00. I had to set 3.01 instead.

You're running into a divide by 0 problem. If you set any number to be different (not just codec 1 in the first row) it will sidestep this problem. It's not a bug in the program -- that's the way the calculations work. If you use real data, you should never see this kind of behavior.

QUOTE I also tested one mistake in the codec choice (that stands for a 7/8 ABX between the codecs, but still 8/8 for each codec against reference)

CODE
Reference  Codec1  Codec2
5.00       3.01    4.00
5.00       3.00    4.00
5.00       3.00    4.00
5.00       3.00    4.00
5.00       3.00    4.00
5.00       3.00    4.00
5.00       3.00    4.00
5.00       3.00    4.00

The Anova analysis still tells me that codec 2 is better than codec 1 (with p<0.001 ! ). This is strange.

Also not a bug. Set another row to be like row 2 and you'll see the p-value start to creep up.

ff123

This post has been edited by ff123: Jul 25 2004, 06:56
|
|
|
Jul 25 2004, 07:05
Post
#79
|
|
![]() ABC/HR developer, ff123.net admin Group: Developer (Donating) Posts: 1396 Joined: 24-September 01 Member No.: 12 |
QUOTE (guruboolez @ Jul 24 2004, 04:43 PM) May I suggest to forget the "Friedman/non-parametric Fisher" analysis for analysing ABCHR scores? Could be helpful for testers... The Friedman non-parametric analysis makes fewer assumptions about the data, and is therefore more robust, but can also be less powerful than ANOVA. If one wanted to be ultra-conservative, he would do a non-parametric Tukey's analysis, which corrects for the fact that there are multiple codecs being ranked. But for abc/hr, there's little reason to use friedman. I should probably change the default. ff123 Edit: I should also probably add the Tukey's analyses back in to the web page. They're in the actual command line program. This post has been edited by ff123: Jul 25 2004, 07:09 |
|
|
|
Jul 25 2004, 16:45
Post
#80
|
|
|
Moderator Group: Super Moderator Posts: 3934 Joined: 29-September 01 Member No.: 73 |
QUOTE (ff123 @ Jul 25 2004, 06:52 AM) QUOTE By the way, your analyzer is bugged : it doesn't work if the first rating for codec 1 is 3.00. I had to set 3.01 instead. You're running into a divide by 0 problem. If you set any number to be different (not just codec 1 in the first row) it will sidestep this problem. It's not a bug in the program -- that's the way the calculations work. If you use real data, you should never see this kind of behavior.

Why? Is it forbidden to find one codec consistently rated 3 and the other always 4? I didn't set 0 anywhere, just 3.00 and 4.00.

EDIT: and I still don't understand how two people rating codecs 3 and 4 can lead to a confidence higher than 99.9%! I guess that the analyzer finds the coincidence very big: 3.00 and 3.00 again, while it could have been 3.05, or 2.99, so that can't be a coincidence! It should absolutely be avoided! Real people will never rate a codec 2.99. Will the analyzer drop the accuracy if we enter 3 without the dot and digits? Or will it have to be rewritten in order to work with integer precision instead of one hundredth?
|
|
|
Jul 25 2004, 17:10
Post
#81
|
|
|
Group: Members Posts: 273 Joined: 9-August 03 From: MI, USA Member No.: 8257 |
With 3.00 and 3.01 for one codec and two scores of 4.00 for another, there's an incredibly low amount of variation, so, assuming that the data constitutes a representative sample of all scores people could give for each codec, there's practically no chance that the difference is a coincidence. That's how the test works. However, such an assumption is terrible with a sample size that low, so trying to run the test with only two people ranking codecs is a bad idea.
Also, the division by zero came from the fact that all scores were the same (mean - each score = 0). Again, this is either a symptom of the sample size being too low or the probability of a real difference being staggeringly high. |
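To see where the division by zero comes from, here is a minimal sketch of a textbook randomized-block ANOVA (listeners as blocks, codecs as treatments). It is not ff123's code; the function name and the choice to return `None` instead of raising are assumptions made for the illustration. The two data sets are Pio2001's 8-row tables, with and without the 3.01 nudge.

```python
def blocked_anova_f(rows):
    """Return (F, ss_error) for a table with one row per listener and one column per codec."""
    n, k = len(rows), len(rows[0])
    grand = sum(map(sum, rows)) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(r[j] for r in rows) / n for j in range(k)]
    # Variation explained by the codecs.
    ss_codecs = n * sum((m - grand) ** 2 for m in col_means)
    # Residual variation left after removing listener and codec effects.
    ss_error = sum(
        (rows[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    if ss_error == 0:
        return None, ss_error  # F = MS_codecs / MS_error would divide by zero
    ms_codecs = ss_codecs / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return ms_codecs / ms_error, ss_error

identical = [[5.00, 3.00, 4.00] for _ in range(8)]
nudged = [[5.00, 3.01, 4.00]] + [[5.00, 3.00, 4.00] for _ in range(7)]

print(blocked_anova_f(identical))  # no residual variation at all
print(blocked_anova_f(nudged))     # tiny residual variation, enormous F
```

With every row identical the residual sum of squares is exactly zero, so the F ratio has no denominator. Changing a single value by 0.01 leaves a tiny but non-zero denominator, which is why the reported significance becomes extreme rather than undefined.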
|
|
|
Jul 25 2004, 18:38
Post
#82
|
|
|
Moderator Group: Super Moderator Posts: 3934 Joined: 29-September 01 Member No.: 73 |
QUOTE (bleh @ Jul 25 2004, 05:10 PM) With 3.00 and 3.01 for one codec and two scores of 4.00 for another, there's an incredibly low amount of variation, so, assuming that the data constitutes a representative sample of all scores people could give for each codec, there's practically no chance that the difference is a coincidence. That's how the test works. In this case, it should never be applied to ABC/HR tests. We ask people to choose between 1.00, 2.00, 3.00, 4.00, or 5.00. The analyzer will find that people always giving an integer answer can't be a coincidence, and will return insanely high levels of confidence because of this. QUOTE (bleh @ Jul 25 2004, 05:10 PM) However, such an assumption is terrible with a sample size that low, so trying to run the test with only two people ranking codecs is a bad idea. What's the meaning of the p values then ? |
|
|
|
Jul 25 2004, 21:58
Post
#83
|
|
![]() ABC/HR developer, ff123.net admin Group: Developer (Donating) Posts: 1396 Joined: 24-September 01 Member No.: 12 |
QUOTE (Pio2001 @ Jul 25 2004, 09:38 AM) QUOTE (bleh @ Jul 25 2004, 05:10 PM) With 3.00 and 3.01 for one codec and two scores of 4.00 for another, there's an incredibly low amount of variation, so, assuming that the data constitutes a representative sample of all scores people could give for each codec, there's practically no chance that the difference is a coincidence. That's how the test works. In this case, it should never be applied to ABC/HR tests. We ask people to choose between 1.00, 2.00, 3.00, 4.00, or 5.00. The analyzer will find that people always giving an integer answer can't be a coincidence, and will return insanely high levels of confidence because of this. No, not true. One of the assumptions that ANOVA makes is that the scale is continuous. ABC/HR's scale is not continuous, but it is close enough, since it has many intervals in between the major divisions. As I said, in real-world data, you are not likely to see a table of scores like the one you posted. ff123 |
|
|
|
Jul 25 2004, 22:20
Post
#84
|
|
![]() ABC/HR developer, ff123.net admin Group: Developer (Donating) Posts: 1396 Joined: 24-September 01 Member No.: 12 |
For anybody interested in seeing exactly where in the calculations this thing blows up with the sort of data Pio supplied, download this spreadsheet, which shows how things are computed:
http://ff123.net/export/anova.zip ff123 |
|
|
|
Jul 26 2004, 00:53
Post
#85
|
|
|
Moderator Group: Super Moderator Posts: 3934 Joined: 29-September 01 Member No.: 73 |
In the meantime, I found some info on the web:

Anova: http://www.psychstat.smsu.edu/introbook/sbk27.htm
Friedman: http://www.graphpad.com/articles/interpret...A/friedmans.htm

In short, it says that the Friedman analysis only cares about the ranking of the samples in each line. If, in one line, codec 1 is rated 5.00 and codec 2 4.99, for the Friedman analysis it is exactly the same thing as if they were rated 2000 and 1, as long as codec 1 is first and codec 2 second. It doesn't care about the scores at all.

The Anova analysis computes the variance of the results that each codec got. Then it computes the variation between the codecs. If it finds the variation between codecs abnormally high compared to the variance of the ratings within each codec, it says that one codec is superior, or inferior. If it finds that the difference between the codecs is similar to the differences between listeners or samples, it says that the variation was to be expected, and rates the codecs as equal.
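The "only the ranking matters" point can be sketched in a few lines. This is an illustration, not the analyzer's code: the helper names are invented, ties are given the average rank (a common convention, assumed here), and rank 1 is given to the lowest score so that a higher rank sum means a better codec, as in the rank sums posted earlier in the thread.

```python
def rank_row(scores):
    """Rank one listener's scores, 1 = lowest; tied scores share the average rank."""
    ordered = sorted(scores)
    ranks = []
    for s in scores:
        first = ordered.index(s) + 1         # first sorted position holding this score
        last = first + ordered.count(s) - 1  # last sorted position holding this score
        ranks.append((first + last) / 2)
    return ranks

def rank_sums(rows):
    """Sum the per-listener ranks for each codec (one column per codec)."""
    ranked = [rank_row(r) for r in rows]
    return [sum(col) for col in zip(*ranked)]

print(rank_row([5.00, 4.99]))  # codec 1 barely ahead
print(rank_row([2000, 1]))     # codec 1 far ahead: same ranks
print(rank_sums([[5.00, 3.00, 4.00]] * 8))
```

Both rows produce the same ranks, so a rank-based test sees them as identical evidence, whereas an analysis on the ratings themselves would treat a gap of 0.01 and a gap of 1999 very differently.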
|
|
|
Jul 26 2004, 02:38
Post
#86
|
|
|
Group: Members Posts: 51 Joined: 29-June 03 Member No.: 7443 |
It is interesting to see that applying scientific methods makes some fine detail disappear from the results and wastes the effort that was put into the test. Looking at charts of test results, like the latest low-bitrate one, even if the confidence intervals overlap, we still rate one codec better than the other, even though the probability of being wrong increases. Without violating rule #8, we do make comments on it.

It has been discussed several times how to deal with differences in the bitrate of the samples. I have not made an effort to find out much about it, and it is controversial because it is subjective, but may I suggest an XY chart of bitrate vs. rating for this result? A "how much bang for the money" style. Maybe others are aware of some scientific way of calculating mean/std circles for each codec, even though quality is not linear, or even proportional, to size/bitrate. Could that give another perspective on the results?
|
|
|
Jul 26 2004, 03:10
Post
#87
|
|
![]() ABC/HR developer, ff123.net admin Group: Developer (Donating) Posts: 1396 Joined: 24-September 01 Member No.: 12 |
QUOTE (deaf @ Jul 25 2004, 05:38 PM) It is interesting to see, that applying scientific methods, makes some fine detail to disappear from the results and wastes the effor was put in the test. For a group test, to get more sensitive results, decrease the number of codecs being tested. If you only compare 2 codecs, for example, you can get very fine detail. ff123 |
|
|
|
Jul 26 2004, 03:40
Post
#88
|
|
![]() ABC/HR developer, ff123.net admin Group: Developer (Donating) Posts: 1396 Joined: 24-September 01 Member No.: 12 |
QUOTE (ff123 @ Jul 24 2004, 10:05 PM) Edit: I should also probably add the Tukey's analyses back in to the web page. They're in the actual command line program. Added to the web page analyzer. I also made the Parametric Tukey's HSD the default, which is the conservative option, but the most statistically correct, especially with large numbers of codecs being compared. ff123 |
|
|
|
Aug 4 2004, 23:23
Post
#89
|
|
|
Moderator Group: Super Moderator Posts: 3934 Joined: 29-September 01 Member No.: 73 |
FF123, could you explain to us in plain language why, when two codecs are analyzed the Friedman way, the confidence that a difference exists matches the binomial table, whereas after adding completely independent columns, standing for other codecs, the exact same data for our first two codecs becomes insignificant?

Is it because the probability of getting a low guessing probability somewhere among all possible pairs of codecs is taken into account?
|
|
|
Aug 5 2004, 01:04
Post
#90
|
|
![]() ABC/HR developer, ff123.net admin Group: Developer (Donating) Posts: 1396 Joined: 24-September 01 Member No.: 12 |
QUOTE (Pio2001 @ Aug 4 2004, 02:23 PM) FF123, could you explain us in common language why when two codecs are analyzed in a Friedman way, we find the confidence that a difference exists match the binomial table, while adding completely independent columns, standing for other codecs, the exact same data between our two first codecs becomes insignificant ? Is it because of the probability of having a low probability of guessing among all possible pairs of codecs is taken into account ?

The answer to the latter question: the Friedman (non-parametric) method does not do a Bonferroni-type correction for multiple comparisons (like the Tukey methods do).

I don't really know the answer to the first, but I can guess: there would have to be a separate LSD number for each comparison (for 2 codecs there can only be one comparison, for 3 codecs there are 3 comparisons, for 4 codecs 6 comparisons, etc.). Since there is only one LSD number, all of the comparisons would have to be exactly alike to match the binomial table. But that would almost never happen.

The way to get a better match to the binomial table would be to do a comparison like the resampling method used by the bootstrap program here: http://ff123.net/bootstrap/

This method essentially performs many simulations, and produces a separate confidence interval for each comparison. The downside to using this type of method is that you can't really use the nice graphs any more (which we can draw because there is only one size of error bar, which applies to all comparisons), and have to stick to showing the results in tabular format.

ff123
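The comparison counts quoted above follow from k * (k - 1) / 2 for k codecs. The sketch below also shows a plain Bonferroni threshold, purely as the simplest example of a "Bonferroni-type" correction. It is an assumption for illustration only: Tukey's HSD and the bootstrap program adjust in their own ways, and the function names are hypothetical.

```python
from math import comb

def pairwise_comparisons(n_codecs: int) -> int:
    """Number of distinct codec pairs that can be compared."""
    return comb(n_codecs, 2)

def bonferroni_threshold(alpha: float, n_codecs: int) -> float:
    """Per-comparison p limit that keeps the overall error rate at or below alpha."""
    return alpha / pairwise_comparisons(n_codecs)

for k in (2, 3, 4, 6):
    print(k, pairwise_comparisons(k), round(bonferroni_threshold(0.05, k), 4))
```

With two codecs nothing changes, but with six codecs a single pair would have to reach roughly p < 0.0033 to count under this simple correction, which is the sense in which extra codecs make each individual comparison less sensitive.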
|
|
|
Aug 22 2004, 13:53
Post
#91
|
|
![]() Group: Members (Donating) Posts: 3474 Joined: 7-November 01 From: Strasbourg (France) Member No.: 420 |
...::: 8 additional results :::...
I. TESTING CONDITIONS Few changes since last bunch of test: same hardware, same kind of music (classical), same software. I’ve nevertheless drawn the conclusion of past discussion with pio2001, and fixed a number of trials for all ABX test: 12 trials, no more, no less. This drastic condition implies a lot of concentration, many rests, and is therefore very time-consuming. Tests are less enjoying in my opinion (motivation is harder to find). Other consequence of this: there are now 5.0 [transparent] notation. If I failed [EDIT: "completely failed"] to ABX something, I cancelled my ABC/HR notation and gave a nice 5.0 as final note. I nevertheless kept trace of my initial feeling in the "general comment". II. SAMPLES I tried to vary as much as possible the samples (kind of instruments/signal). There aren't known-killers. All samples should be ‘normal’, with no correspondences to typical lossy/perceptual problems (as sharp attacks and micro-attacks signal for exemple). Eight more samples. Two are from harashin: - Liebestod: opera (soprano voice with orchestra) - LadyMacbeth: festive orchestra, with predominant brass and cymbals Six others are mine: - Trumpet Voluntar: trumpet with organ (noisy recording) - Vivaldi RV93: baroque strings, i.e period instruments (small ensemble) - Troisičme Ballet: cousin of bagpipes, playing with a baroque ensemble - Vivaldi – Bassoon [13]: solo bassoon, with light accompaniment - Seminarist: male voice (baritone) with a lot of sibilant consonants and piano accompaniment - ButterflyLovers: solo violin playing alternately with full string orchestra III. RESULTS 3.1. eight new results ![]() 3.2 cumulative results ![]() 3.3. 
comments about results No big differences between the two parts of the test: CODE TEST MP3_V2 MP3_V3 MPC_Q5 MGX5,5 MGX5,99 MGX6,00 NO.1 12,3 11,9 13,8 12,2 12,3 13,2 NO.2 12,7 11,9 13,9 12,1 12,3 13,4 Average notation is very stable, except maybe for lame --preset standard, in slight progress for these eight new samples. Hierarchy is identical. The conclusions are therefore the same as those posted in my first post IV. STATISTICAL ANALYSIS I fed ff123’s friedman.exe application with the following table: CODE LAME_V2 LAME_V3 MPC_Q5 OGG5.5 OGG5.99 OGG6.00 2.00 1.50 3.00 2.00 2.00 3.20 1.50 1.00 4.00 2.90 2.90 3.50 3.00 2.50 2.80 3.00 3.30 4.00 3.00 2.00 4.00 2.00 2.00 2.30 1.50 1.00 4.90 2.50 2.50 3.30 3.00 1.80 3.80 2.20 2.40 3.00 1.50 1.20 3.50 1.80 2.30 3.40 1.50 2.70 4.00 2.00 2.00 2.30 3.00 2.80 4.20 1.60 1.50 3.00 3.00 2.30 4.00 2.30 2.50 3.50 2.00 2.00 4.00 2.50 2.50 3.50 3.50 2.50 5.00 1.50 1.50 4.00 1.50 1.00 4.00 2.00 2.50 3.00 1.40 1.20 3.50 1.70 2.00 2.20 4.00 3.00 5.00 4.00 4.00 4.50 2.50 1.30 3.50 1.70 1.70 2.70 3.00 1.20 3.00 1.40 2.00 2.20 3.50 3.00 3.00 2.00 2.00 5.00 [interesting to note: the conclusions and values computed by the tool are exactly the same if I keep the original notation [e.g. 12.3 and not 2.30]. 
The ANOVA analysis conclusion is: CODE FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/ Blocked ANOVA analysis Number of listeners: 18 Critical significance: 0.05 Significance of data: 0.00E+000 (highly significant) --------------------------------------------------------------- ANOVA Table for Randomized Block Designs Using Ratings Source of Degrees Sum of Mean variation of Freedom squares Square F p Total 107 102.73 Testers (blocks) 17 23.75 Codecs eval'd 5 49.48 9.90 28.53 0.00E+000 Error 85 29.49 0.35 --------------------------------------------------------------- Fisher's protected LSD for ANOVA: 0.390 Means: MPC_Q5 OGG6.00 LAME_V2 OGG5.99 OGG5.5 LAME_V3 3.84 3.26 2.47 2.31 2.17 1.89 ---------------------------- p-value Matrix --------------------------- OGG6.00 LAME_V2 OGG5.99 OGG5.5 LAME_V3 MPC_Q5 0.004* 0.000* 0.000* 0.000* 0.000* OGG6.00 0.000* 0.000* 0.000* 0.000* LAME_V2 0.430 0.137 0.004* OGG5.99 0.481 0.034* OGG5.5 0.153 ----------------------------------------------------------------------- MPC_Q5 is better than OGG6.00, LAME_V2, OGG5.99, OGG5.5, LAME_V3 OGG6.00 is better than LAME_V2, OGG5.99, OGG5.5, LAME_V3 LAME_V2 is better than LAME_V3 OGG5.99 is better than LAME_V3 And now, the “most statistically correct” tukey-parametric analysis one: CODE FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/ Tukey HSD analysis Number of listeners: 18 Critical significance: 0.05 Tukey's HSD: 0.574 Means: MPC_Q5 OGG6.00 LAME_V2 OGG5.99 OGG5.5 LAME_V3 3.84 3.26 2.47 2.31 2.17 1.89 -------------------------- Difference Matrix -------------------------- OGG6.00 LAME_V2 OGG5.99 OGG5.5 LAME_V3 MPC_Q5 0.589* 1.378* 1.533* 1.672* 1.956* OGG6.00 0.789* 0.944* 1.083* 1.367* LAME_V2 0.156 0.294 0.578* OGG5.99 0.139 0.422 OGG5.5 0.283 ----------------------------------------------------------------------- MPC_Q5 is better than OGG6.00, LAME_V2, OGG5.99, OGG5.5, LAME_V3 OGG6.00 is better than LAME_V2, OGG5.99, OGG5.5, LAME_V3 LAME_V2 is better than LAME_V3 According 
to the last analysis, lame –V3 and vorbis megamix1 –q 5,50/5,99 offers comparable performances (they are tied). In other word, I can't say that megamix is at -q 5,99 is superior to lame -V 3, even if 13 samples (72%) are favorable to megamix 5,99, one identical (6%) and four only (22%) favorable to lame V3. If I understand correctly, for me and the set of 18 tested samples, I should admit that lame is tied with vorbis even if this last one is superior on 72% of the tested samples! It’s totally insane in my opinion… There's maybe a problem somewhere, or are 18 samples still not enough? The ANOVA analysis is slightly more acceptable: it concludes on megamix 5,99 superiority for the 18 samples, but still not on megamix 5,50 one (66% of favorable samples). But both analysis concludes on: 1/ full MPC -Q5 superiority (even against Vorbis megamix1 -Q6 2/ megamix1 Q6 superiority on lame -V2 and V3 and on megamix Q5,50 and Q5,99 3/ LAME V2 > LAME V3 More schematically: • ANOVA: MPC_Q5 > OGG_Q6 > OGG_Q5,99/Q5,00/MP3_V2/MP3_V3 • ANOVA: OGG_Q5,99 > LAME V3 • ANOVA: LAME_V2 > LAME V3 • TUKEY_PARAMETRIC: MPC_Q5 > OGG_Q6 > OGG_Q5,99/Q5,00/MP3_V2/MP3_V3 • TUKEY_PARAMETRIC: LAME_V2 > LAME V3 In other words, it means that for me, and after double blind tests on non-critical material: - musepack --standard superiority is not a legend, and isn't infirmed by recent progress made by lame developers and vorbis people. - lame --standard preset is still competitive against vorbis, at least up to q5,99, which still suffers from audible and sometimes irritating coarseness. Nevertheless, quality of lame MP3 quickly drops below this standard preset. It's interesting to note, in case of hardware playback. - vorbis aoTuV/CVS 1.1 begins to be suitable for high quality at q 6,00, but absolutely not below this floor. APPENDIX. 
SAMPLE LOCATION AND ABX LOGS ABX logs are available here: http://audiotests.free.fr/tests/2004.07/hq1/log/ The eight new log files are merged in one single archive Samples are not uploaded. I could do it. Is someone interested? This post has been edited by guruboolez: Dec 29 2005, 21:51 |
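The difference between the two analyses above comes down to the threshold that a gap between two means must exceed: 0.390 for Fisher's protected LSD and 0.574 for Tukey's HSD, both copied from the posted output. The sketch below recomputes the means from the 18-row table and lists which pairs clear each threshold. It does not recompute the thresholds themselves, and the helper name is made up for the illustration.

```python
from itertools import combinations

CODECS = ["LAME_V2", "LAME_V3", "MPC_Q5", "OGG5.5", "OGG5.99", "OGG6.00"]
RATINGS = [
    [2.00, 1.50, 3.00, 2.00, 2.00, 3.20], [1.50, 1.00, 4.00, 2.90, 2.90, 3.50],
    [3.00, 2.50, 2.80, 3.00, 3.30, 4.00], [3.00, 2.00, 4.00, 2.00, 2.00, 2.30],
    [1.50, 1.00, 4.90, 2.50, 2.50, 3.30], [3.00, 1.80, 3.80, 2.20, 2.40, 3.00],
    [1.50, 1.20, 3.50, 1.80, 2.30, 3.40], [1.50, 2.70, 4.00, 2.00, 2.00, 2.30],
    [3.00, 2.80, 4.20, 1.60, 1.50, 3.00], [3.00, 2.30, 4.00, 2.30, 2.50, 3.50],
    [2.00, 2.00, 4.00, 2.50, 2.50, 3.50], [3.50, 2.50, 5.00, 1.50, 1.50, 4.00],
    [1.50, 1.00, 4.00, 2.00, 2.50, 3.00], [1.40, 1.20, 3.50, 1.70, 2.00, 2.20],
    [4.00, 3.00, 5.00, 4.00, 4.00, 4.50], [2.50, 1.30, 3.50, 1.70, 1.70, 2.70],
    [3.00, 1.20, 3.00, 1.40, 2.00, 2.20], [3.50, 3.00, 3.00, 2.00, 2.00, 5.00],
]

def significant_pairs(ratings, names, threshold):
    """Return (better, worse) pairs whose difference in mean rating exceeds the threshold."""
    n = len(ratings)
    means = {name: sum(row[j] for row in ratings) / n for j, name in enumerate(names)}
    pairs = []
    for a, b in combinations(names, 2):
        if abs(means[a] - means[b]) > threshold:
            pairs.append((a, b) if means[a] > means[b] else (b, a))
    return pairs

lsd_pairs = significant_pairs(RATINGS, CODECS, 0.390)  # Fisher's protected LSD
hsd_pairs = significant_pairs(RATINGS, CODECS, 0.574)  # Tukey's HSD
print(len(lsd_pairs), len(hsd_pairs))
print(set(lsd_pairs) - set(hsd_pairs))  # pairs that only the LSD calls significant
```

The only pair that changes status is OGG5.99 against LAME_V3, whose gap of about 0.42 sits between the two thresholds.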
|
|
|
Aug 22 2004, 14:31
Post
#92
|
|
![]() Group: Members Posts: 265 Joined: 15-December 03 Member No.: 10452 |
So many numbers...ouch.
Thank you Guruboolez for all your work. There is definitely a shortage of encoder comparison tests at relatively high bitrates, and no shortage of opinions.

One thing continues to bug me: someone with really good hearing, including the training to listen for artifacts, can do a valid ABX comparison and produce results at a good confidence level. Someone like me cannot. Am I better off using the encoder that the person with good hearing can identify? In other words, even if I cannot objectively identify the differences in ABX testing, is there some subjective additional level of enjoyment of the music, other than a possible placebo effect? Is there any way to verify this?

There is the final unfortunate truth: MP3 hardware support is universal, Ogg Vorbis hardware support is relatively limited along with the battery life issue, and MPC is confined to playback on computers.
|
|
|
Aug 22 2004, 15:16
Post
#93
|
|
![]() ABC/HR developer, ff123.net admin Group: Developer (Donating) Posts: 1396 Joined: 24-September 01 Member No.: 12 |
QUOTE (guruboolez @ Aug 22 2004, 04:53 AM) According to the last analysis, lame -V3 and vorbis megamix1 -q 5,50/5,99 offers comparable performances (they are tied). In other word, I can't say that megamix is at -q 5,99 is superior to lame -V 3, even if 13 samples (72%) are favorable to megamix 5,99, one identical (6%) and four only (22%) favorable to lame V3. If I understand correctly, for me and the set of 18 tested samples, I should admit that lame is tied with vorbis even if this last one is superior on 72% of the tested samples! It's totally insane in my opinion… There's maybe a problem somewhere, or are 18 samples still not enough?

I verified with the bootstrap program: http://ff123.net/bootstrap/ that statistically speaking, if you adjust for the fact that there are actually 15 comparisons with 6 codecs, then ogg5.99 must be considered tied to lamev3. The bootstrap (simulation) program is almost as good as one can do for adjusted p-values.

Nice comparison, guru.

CODE
Adjusted p-values
          OGG6.00   LAME_V2   OGG5.99   OGG5.5    LAME_V3
MPC_Q5    0.021*    0.000*    0.000*    0.000*    0.000*
OGG6.00   -         0.001*    0.000*    0.000*    0.000*
LAME_V2   -         -         0.633     0.367     0.021*
OGG5.99   -         -         -         0.633     0.128
OGG5.5    -         -         -         -         0.367

Means
MPC_Q5    OGG6.00   LAME_V2   OGG5.99   OGG5.5    LAME_V3
3.844     3.256     2.467     2.311     2.172     1.889

ff123
|
|
|
Aug 22 2004, 17:28
Post
#94
|
|
![]() Group: Developer (Donating) Posts: 193 Joined: 9-May 02 From: Emeryville, CA Member No.: 2010 |
Thanks very much for taking all the time to do these comparisons Guru. So much has changed since all the original high-bitrate comparisons were made that it's very useful to get new data. I guess I'll continue using mpc myself.
|
|
|
|
Aug 23 2004, 00:44
Post
#95
|
|
|
Moderator Group: Super Moderator Posts: 3934 Joined: 29-September 01 Member No.: 73 |
Thank you very much for you work and analysises !
QUOTE (guruboolez @ Aug 22 2004, 02:53 PM) In other word, I can't say that megamix is at -q 5,99 is superior to lame -V 3, even if 13 samples (72%) are favorable to megamix 5,99, one identical (6%) and four only (22%) favorable to lame V3. [...]It’s totally insane in my opinion… I don't see what's wrong with it. If you interpret is a an ABX test, you got Megamix superior to Lame with a score of 13/18. The p value is 0.048, which is already very borderline for a valid result. But here, 6 codecs are compared, which gives a total of 15 possible codecs comparisons. If you are answering at random, it is perfectly expectable to get, among 15 possible 1-to-1 codecs comparisons, one of them positive, with p=1/15. This would be considered complete chance, with p clearly higher than 0.5, and not 1/15. In the same way, the 13/18 result that you got has not a probability of being guessed equal to 0.048, but much higher. It says that it is equal to 0.633. So if this can happen more than one time out of two, I should be alble to reproduce it easily with random results. First try with completely random numbers, generated by my calculator : CODE Joke1 Joke2 Joke3 Joke4 Joke5 Joke6 3.40 1.70 4.70 1.30 2.10 1.10 4.70 4.70 1.60 3.60 2.30 1.20 3.70 1.90 2.50 1.10 2.50 4.30 2.50 3.10 3.90 3.40 2.40 4.00 1.30 4.60 4.40 3.40 1.50 2.50 4.00 1.20 2.40 4.90 4.30 1.50 3.40 2.50 4.50 1.40 3.10 2.00 1.20 3.30 4.50 4.10 2.50 1.90 4.50 4.30 4.70 4.70 5.00 4.30 4.50 4.10 3.10 4.50 2.60 3.40 2.60 2.30 1.80 4.80 3.00 1.90 2.40 2.20 2.10 4.00 2.60 2.80 1.20 1.80 1.10 1.10 3.90 3.30 4.90 1.30 2.40 4.60 4.20 2.20 2.50 2.10 4.70 4.00 4.80 1.50 1.90 3.10 3.80 1.50 3.90 2.80 1.30 4.70 3.40 3.10 3.20 2.70 4.30 2.30 3.70 1.80 1.30 4.10 No score as good as 13/18. 
Second try:
CODE
Joke1  Joke2  Joke3  Joke4  Joke5  Joke6
14     31     50     17     34     29
13     22     21     36     23     23
50     48     17     31     14     11
28     49     24     50     43     50
12     48     23     33     22     43
40     28     25     15     47     33
23     13     37     29     38     30
41     40     19     25     33     18
28     48     40     12     13     44
32     25     40     26     49     17
11     29     43     15     36     47
41     18     22     22     24     44
15     13     25     13     39     48
16     17     17     40     37     24
30     29     49     29     12     43
33     40     14     49     42     48
19     47     11     47     40     31
42     34     41     24     25     21
Here, you can see that Joke6 is better than Joke1 13 times out of 18, and with random numbers! This is not an insane result. Two tries were enough for it to happen. |
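The figures in this post can be checked numerically. The sketch below (Python, standard library only) computes the one-sided binomial probability of scoring at least 13/18 by guessing, then estimates by simulation how often at least one of the 15 codec pairs reaches 13/18 in either direction when six columns of ratings are pure noise. The function names, the uniform random ratings, the run count and the seed are assumptions made for illustration; this is not the ABC/HR analyzer's method, and the simulated figure need not match the 0.633 quoted above.

```python
import math
import random
from itertools import combinations


def binom_tail(k, n):
    """P(X >= k) for X ~ Binomial(n, 0.5): one-sided p-value for k wins out of n."""
    return sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n


def any_pair_reaches(k, n_samples, n_codecs, rng):
    """Draw random ratings; True if some codec beats another at least k times."""
    ratings = [[rng.random() for _ in range(n_codecs)] for _ in range(n_samples)]
    for a, b in combinations(range(n_codecs), 2):
        wins_a = sum(row[a] > row[b] for row in ratings)
        # Ties are ignored: with random floats they essentially never occur.
        if wins_a >= k or n_samples - wins_a >= k:
            return True
    return False


def estimate(k=13, n_samples=18, n_codecs=6, runs=5000, seed=0):
    """Monte Carlo estimate of the chance that random data shows a k/n_samples pair."""
    rng = random.Random(seed)
    hits = sum(any_pair_reaches(k, n_samples, n_codecs, rng) for _ in range(runs))
    return hits / runs


p_single = binom_tail(13, 18)
print(f"13/18 for one pre-chosen pair: p = {p_single:.3f}")  # p = 0.048
print(f"some pair among {math.comb(6, 2)} reaching 13/18: about {estimate():.2f}")
```

The first number applies only when the pair and the direction were chosen before listening. The second is the relevant one when the most favourable pair is picked out of six codecs after the fact.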
|
|
|
Aug 23 2004, 09:42
Post
#96
|
|
Group: Members (Donating) Posts: 3474 Joined: 7-November 01 From: Strasbourg (France) Member No.: 420 |
I don't understand what these random numbers are supposed to prove.
I've tested some codecs with 18 samples. By comparing two of these encoders, I saw that one is inferior to the other on 78% of the tested samples, and 'identical' on 6%. It should be very obvious that one is ABSOLUTELY inferior to the second, at least on the 18 tested samples. |
|
|
|
Aug 23 2004, 11:58
Post
#97
|
|
|
Moderator Group: Super Moderator Posts: 3934 Joined: 29-September 01 Member No.: 73 |
If someone runs an ABX test with me as the listener, plays A 18 times, and I say 78% of the time that it is B, is it obvious that B was absolutely played 78% of the time?
But again, this discussion only matters for the interpretation of the ANOVA and Tukey analyses. Here you also have some ABX results, whose meaning goes well beyond what ANOVA and Tukey say. We must consider the ABX results separately from the ABC/HR analyses in order to draw a general conclusion. |
|
|
|
Aug 23 2004, 16:03
Post
#98
|
|
ABC/HR developer, ff123.net admin Group: Developer (Donating) Posts: 1396 Joined: 24-September 01 Member No.: 12 |
QUOTE (guruboolez @ Aug 23 2004, 12:42 AM) I don't understand what these random numbers are supposed to prove. I've tested some codecs with 18 samples. By comparing two of these encoders, I saw that one is inferior to the other on 78% of the tested samples, and 'identical' on 6%. It should be very obvious that one is ABSOLUTELY inferior to the second, at least on the 18 tested samples.
The adjustment for multiple comparisons can be harsh. That's why it's good to keep the number of comparisons down to a minimum. If you had just compared Ogg5.99 against lameV3, it's likely you would have come up with a significant difference. But with so many comparisons, the statistical "noise" gets larger.
ff123 |
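To illustrate how harsh the adjustment can be, here is a minimal sketch using a Bonferroni correction on a one-sided binomial (sign) test. Bonferroni is swapped in only because it is easy to compute by hand; the analyses discussed in this thread are ANOVA and Tukey, which adjust differently. `min_wins` and its parameters are hypothetical names.

```python
import math


def binom_tail(k, n):
    """P(X >= k) for X ~ Binomial(n, 0.5)."""
    return sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n


def min_wins(n, alpha, comparisons=1):
    """Smallest win count out of n that stays significant after dividing
    alpha by the number of comparisons (Bonferroni). None if unreachable."""
    threshold = alpha / comparisons
    for k in range(n + 1):
        if binom_tail(k, n) <= threshold:
            return k
    return None


print(min_wins(18, 0.05))      # one planned comparison -> 13
print(min_wins(18, 0.05, 15))  # all 15 pairs among 6 codecs -> 16
```

With a single planned comparison, 13/18 just clears p < 0.05. Once all 15 pairs are in play, the same rule asks for 16/18 under this (conservative) correction.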
|
|
|
Aug 23 2004, 19:58
Post
#99
|
|
Group: Members (Donating) Posts: 3474 Joined: 7-November 01 From: Strasbourg (France) Member No.: 420 |
QUOTE (ff123 @ Aug 23 2004, 04:03 PM)
But is it really up to the tester to adapt his test to the analysis tool? Or isn't it more logical to ask the analysis tool to deal with the conditions of the test?
It sounds like the methodological problems introduced by VBR tests at a target bitrate: there's sometimes a big temptation to select specific samples (not too high, not too low) in order to match the targeted bitrate, rather than choosing the samples we really want to test, which could be more interesting. If a tester chooses to avoid some samples for this reason, the risk is to limit the impact (and maybe the significance) of the test.
Same thing here. It's probably better to limit the number of comparisons for many reasons. But on the other hand, it'll be harder to have solid ideas about the relative performance of different encoders. With my test for example, I now have solid ideas about:
- the big difference existing between vorbis -q6 and lower profiles, including 5,99
- the very limited difference between vorbis 5,50 and 5,99 (therefore, there is little to expect from increasing the bitrate by 0.2...0.5 level)
- the serious differences between lame --preset standard and -V2
If I had removed three contenders, keeping one lame setting and one vorbis quality level, the three previous conclusions wouldn't be possible. And if I had tested vorbis and lame separately in two different sessions, I couldn't seriously compare the two sets of results to each other (such comparisons need at least the same low and high anchors, which makes two separate tests with three contenders each + 2 anchors much longer than one single test with 6 contenders).
In other words, I don't think that it would be a good idea to adapt the conditions of any test to the conditions of the analysis tool. The analysis must be passive, without any influence on the subject of the analysis. |
|
|
|
Aug 23 2004, 21:35
Post
#100
|
|
ABC/HR developer, ff123.net admin Group: Developer (Donating) Posts: 1396 Joined: 24-September 01 Member No.: 12 |
QUOTE (guruboolez @ Aug 23 2004, 10:58 AM) In other words, I don't think that it would be a good idea to adapt the conditions of any test to the conditions of the analysis tool. The analysis must be passive, without any influence on the subject of the analysis.
That's fine. A tester can set up any test he likes, but the fact is that the test conditions affect the subsequent analysis. So you've got to be aware of this when you set up your test. In this particular case, if you really wanted to be certain that ogg5.99 is really better than lameV3 (for your ears and samples), then you should run another test with just the two codecs to confirm it.
That's the way statistics works. You go into a test with your criteria for significance set prior to running the test (meaning you should choose which analysis you're going to run prior to the test as well; i.e., ANOVA or Tukey's, etc.). And then live with the results.
ff123 |
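As a rough illustration of what the analysis of such a two-codec test could look like, the sketch below applies a one-sided sign test to the 13 / 1 / 4 split quoted earlier in the thread, dropping the tied sample as a sign test conventionally does. The choice of a sign test and the name `sign_test_p` are assumptions; the thread itself only names ANOVA and Tukey. Reusing the existing counts shows the arithmetic only and does not replace the fresh test recommended above, since this pair was singled out after the six-codec results were known.

```python
import math


def sign_test_p(wins, losses):
    """One-sided sign test: chance of at least `wins` successes in
    wins + losses fair coin flips. Tied samples are excluded beforehand."""
    n = wins + losses
    return sum(math.comb(n, i) for i in range(wins, n + 1)) / 2 ** n


# 13 samples favour one codec, 4 favour the other, 1 tie is dropped.
p = sign_test_p(13, 4)
print(f"{p:.4f}")  # 0.0245
```

The significance level and the choice of analysis would have to be fixed before listening for a figure like this to mean what it appears to mean.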
|
|
|