Second in the series of 128 tests |
Jan 14 2002, 10:17
Post
#26
|
|
![]() Group: Members Posts: 136 Joined: 10-November 01 From: AUS Member No.: 433 |
In biological systems, I suppose it is possible to get unusual sensitivities, freak performances and critical failings. I have read of a human hearing defect where a person hears a different pitch in each ear: to use that subject to develop an audio coding system wouldn't be useful.
It is more useful to look at attributes and responses that can be categorised as standard subject response. To do otherwise would be to study atypical human perception and disease. For developing perceptual audio coding systems, one should be able to identify & categorise artifacts that "typical" listeners will recognise and dislike. I think that ff123 has identified that most of his listening group responded in a similar fashion to the artifacts produced by the codecs. This must represent the standard response to artifacts by the human ear/brain system. There will be some that respond differently, but they would be better pulled from the testing group on the basis of outlier performance.
-------------------- Ruse
____________________________ Don't let the uncertainty turn you around, Go out and make a joyful sound. |
|
|
|
Jan 14 2002, 11:12
Post
#27
|
|
![]() Server Admin Group: Admin Posts: 4810 Joined: 24-September 01 Member No.: 13 |
QUOTE Originally posted by Ruse
Why don't you analyse and publish the results without listener 28 for comparison purposes. There must be a statistical validity of some type for excluding "wonky" data. I think the plots you have shown above indicate that listener 28 is an "outlier". Can't you just exclude him on the basis of being more than 2 standard deviations from the mean?

No. The analysis that was used doesn't have a concept of 'standard deviation' anyway, and 'removing' data is always a very tricky thing to do, and not even generally accepted as possible in a statistically valid way. Note that this guy would have passed even if post-screening had been used. He is a valid data point. Us not liking what the data says doesn't change that.
-- GCP |
|
|
|
Jan 21 2002, 21:21
Post
#28
|
|
![]() ABC/HR developer, ff123.net admin Group: Developer (Donating) Posts: 1396 Joined: 24-September 01 Member No.: 12 |
I've been getting some help from Rich Ulrich in sci.stat.math in identifying outliers, and it appears that the statistic to use is the "corrected item-total correlation," or the (Pearson) correlation of each rater with the average for all the other raters.
For example, using this statistic, Monty has a correlation coefficient of 0.86, and Joerg (listener 28) has a value of -0.81. A large, negative value (near -1.0) indicates a preference that runs highly counter to the general trend. I will be performing a sub-analysis in the near future for those listeners (there are 9 of them) who are highly and positively correlated.
ff123 |
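A minimal sketch of the statistic described above, for anyone who wants to try it on their own ratings. The function name `corrected_item_total` and the small ratings matrix are made up for illustration; they are not data from the test. It assumes one row per listener and one column per codec.

```python
import numpy as np

def corrected_item_total(ratings):
    """Pearson r of each listener against the mean of all OTHER listeners.

    ratings: 2-D array-like, rows = listeners, columns = codecs.
    Returns a 1-D array with one correlation per listener.
    """
    ratings = np.asarray(ratings, dtype=float)
    n_listeners = ratings.shape[0]
    totals = ratings.sum(axis=0)
    result = np.empty(n_listeners)
    for i in range(n_listeners):
        # Leave listener i out of the average ("corrected" total).
        others_mean = (totals - ratings[i]) / (n_listeners - 1)
        result[i] = np.corrcoef(ratings[i], others_mean)[0, 1]
    return result

# Hypothetical ratings (not from the 128 kbit/s test): 4 listeners, 4 codecs.
ratings = [
    [5, 4, 3, 2],
    [5, 4, 3, 1],
    [4, 5, 3, 2],
    [1, 2, 4, 5],  # this listener rates against the group trend
]
r = corrected_item_total(ratings)
print(np.round(r, 2))
```

The last listener comes out strongly negative, which is the pattern described for listener 28. Note that a listener who gives every codec the same score has zero variance, so the correlation is undefined (NaN) for that row.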
|
|
|
Jan 22 2002, 05:30
Post
#29
|
|
![]() ABC/HR developer, ff123.net admin Group: Developer (Donating) Posts: 1396 Joined: 24-September 01 Member No.: 12 |
Subanalysis based on the nine listeners who were highly correlated with each other (r > 0.7). These were the following:
CODE
listener  r
1         0.86
2         0.95
6         0.80
10        0.86
14        0.84
18        0.82
19        0.96
23        0.86
27        0.92

Resampling analysis as follows:
CODE
Means:
mpc     ogg     lame    aac     wma8    xing
4.63    4.09    3.61    3.36    2.11    2.04

Unadjusted p-values
        ogg     lame    aac     wma8    xing
mpc     0.022*  0.000*  0.000*  0.000*  0.000*
ogg     -       0.043*  0.003*  0.000*  0.000*
lame    -       -       0.270   0.000*  0.000*
aac     -       -       -       0.000*  0.000*
wma8    -       -       -       -       0.772

Each '.' is 1,000 resamples. Each '+' is 10,000 resamples
.........+

Adjusted p-values
        ogg     lame    aac     wma8    xing
mpc     0.077   0.001*  0.000*  0.000*  0.000*
ogg     -       0.114   0.011*  0.000*  0.000*
lame    -       -       0.465   0.000*  0.000*
aac     -       -       -       0.000*  0.000*
wma8    -       -       -       -       0.773

ff123 |
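For readers wondering how unadjusted and adjusted p-values of this kind can come out of resampling, here is a rough sketch. The thread does not describe the internals of the program that produced the tables, so this is an assumption: it shuffles each listener's ratings across codecs (the null hypothesis of no codec differences) and adjusts with the maximum statistic over all pairs, which is one common approach. The function name and the ratings are hypothetical.

```python
import itertools
import numpy as np

def resampled_p_values(ratings, n_resamples=10_000, seed=0):
    """Pairwise codec comparisons by within-listener permutation.

    ratings: rows = listeners, columns = codecs.
    Returns (pairs, unadjusted, adjusted), where pairs is a list of
    column-index tuples and the other two are arrays of p-values.
    """
    rng = np.random.default_rng(seed)
    ratings = np.asarray(ratings, dtype=float)
    pairs = list(itertools.combinations(range(ratings.shape[1]), 2))

    def pair_stats(data):
        means = data.mean(axis=0)
        return np.array([abs(means[a] - means[b]) for a, b in pairs])

    observed = pair_stats(ratings)
    hits_unadjusted = np.zeros(len(pairs))
    hits_adjusted = np.zeros(len(pairs))
    for _ in range(n_resamples):
        # Shuffle each listener's row independently of the others.
        stat = pair_stats(rng.permuted(ratings, axis=1))
        # Unadjusted: compare each pair with its own resampled statistic.
        hits_unadjusted += stat >= observed - 1e-12
        # Adjusted: compare each pair with the largest resampled statistic.
        hits_adjusted += stat.max() >= observed - 1e-12
    return pairs, hits_unadjusted / n_resamples, hits_adjusted / n_resamples

# Hypothetical ratings: 6 listeners, 3 codecs.
ratings = [
    [5, 3, 1],
    [5, 4, 1],
    [4, 3, 2],
    [5, 3, 2],
    [4, 4, 1],
    [5, 2, 1],
]
pairs, unadjusted, adjusted = resampled_p_values(ratings, n_resamples=2000)
for (a, b), p, q in zip(pairs, unadjusted, adjusted):
    print(f"codec {a} vs codec {b}: unadjusted {p:.3f}, adjusted {q:.3f}")
```

Because the maximum over all pairs is never smaller than any single pair's statistic, the adjusted value is always at least as large as the unadjusted one, which matches the direction of the change between the two tables above.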
|
|
|
Jan 22 2002, 08:39
Post
#30
|
|
![]() ABC/HR developer, ff123.net admin Group: Developer (Donating) Posts: 1396 Joined: 24-September 01 Member No.: 12 |
Going back to dogies.wav, the listener corrected item-total correlations were:
1: 0.63
2: 0.70
3: 0.72
4: 0.71
5: 0.70
6: 0.76
7: 0.69
8: 0.74
9: 0.71
10: 0.70
11: 0.71
12: 0.81
13: 0.73
14: 0.71

All the listeners on this data set were fairly well correlated.
ff123 |
|
|
|
Jan 22 2002, 17:37
Post
#31
|
|
![]() ABC/HR developer, ff123.net admin Group: Developer (Donating) Posts: 1396 Joined: 24-September 01 Member No.: 12 |
Added the subanalysis to the report, maybe not in time for the latest slashdot discussion, though.
http://ff123.net/128test/interim.html ff123 |
|
|
|
Jan 22 2002, 17:39
Post
#32
|
|
![]() Group: Members Posts: 669 Joined: 15-January 02 From: SE Pennsylvania Member No.: 1032 |
QUOTE
CODE
Means:
mpc     ogg     lame    aac     wma8    xing
4.63    4.09    3.61    3.36    2.11    2.04

These results correlate rather closely to my experience with these codecs overall. |
|
|
|
Jan 22 2002, 18:12
Post
#33
|
|
|
Group: Members Posts: 315 Joined: 29-September 01 Member No.: 53 |
This is all very interesting, and this way of outlier removal seems exactly what you would want for developing audio codecs -- what you want to do is to develop something which sounds the best for the normal listener.
FF123, what happens to the significance information when you perform the same procedure on the other samples in your test? |
|
|
|
Jan 22 2002, 23:30
Post
#34
|
|
![]() ABC/HR developer, ff123.net admin Group: Developer (Donating) Posts: 1396 Joined: 24-September 01 Member No.: 12 |
QUOTE
FF123, what happens to the significance information when you perform the same procedure on the other samples in your test?

Unfortunately, this procedure doesn't work for rawhide.wav. This is kind of strange because I know that at one time rawhide.wav had significant results. I'd guess some sort of factor analysis is needed to pull a cluster of like-preferences out of the noise. I'll post the corrected item-total correlations later today for rawhide.wav and fossiles.wav.
ff123 |
|
|
|
Jan 23 2002, 05:58
Post
#35
|
|
![]() ABC/HR developer, ff123.net admin Group: Developer (Donating) Posts: 1396 Joined: 24-September 01 Member No.: 12 |
Oops. It does work for rawhide.wav. I made a mistake when calculating the statistic for that file. The correlation coefficients are listed below. If I use the same standard as wayitis, and choose only those listeners satisfying 0.7 < r < 1.0, that would leave me with only two listeners. To get a decent group of listeners, I would have to change the standard and include weakly correlated listeners as well (0.3 < r < 0.7).
CODE
listener  r
1         -0.33
2         0.36
4         0.75
5         0.61
6         0.49
7         0.38
8         0.94
10        0.54
13        -0.36
14        0.51
16        0.06
17        0.43
18        0.27
19        0.54
20        0.23
21        -0.01
22        0.18
23        -0.40
24        -0.33
25        0.01
26        -0.48

If I include all listeners with 0.3 < r < 1.0, the following analysis follows:
CODE
Read 6 treatments, 10 samples

Unadjusted p-values
        ogg     wma8    mpc     lame    xing
aac     0.679   0.384   0.007*  0.006*  0.000*
ogg     -       0.646   0.020*  0.018*  0.001*
wma8    -       -       0.058   0.053   0.002*
mpc     -       -       -       0.963   0.201
lame    -       -       -       -       0.218

Each '.' is 1,000 resamples. Each '+' is 10,000 resamples
.........+

Adjusted p-values
        ogg     wma8    mpc     lame    xing
aac     0.951   0.791   0.053   0.048*  0.001*
ogg     -       0.951   0.126   0.120   0.004*
wma8    -       -       0.281   0.278   0.018*
mpc     -       -       -       0.960   0.648
lame    -       -       -       -       0.648

ff123 |
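The listener selection described in this post can be checked with a few lines. The correlations are the rawhide.wav values posted above; the helper name `select_listeners` is just an illustrative choice.

```python
# Corrected item-total correlations for rawhide.wav, as posted above.
rawhide_r = {
    1: -0.33, 2: 0.36, 4: 0.75, 5: 0.61, 6: 0.49, 7: 0.38, 8: 0.94,
    10: 0.54, 13: -0.36, 14: 0.51, 16: 0.06, 17: 0.43, 18: 0.27,
    19: 0.54, 20: 0.23, 21: -0.01, 22: 0.18, 23: -0.40, 24: -0.33,
    25: 0.01, 26: -0.48,
}

def select_listeners(correlations, lower, upper=1.0):
    """Return the listener numbers whose r satisfies lower < r < upper."""
    return sorted(n for n, r in correlations.items() if lower < r < upper)

strong = select_listeners(rawhide_r, 0.7)          # strongly correlated only
weak_or_better = select_listeners(rawhide_r, 0.3)  # weakly correlated included

print("0.7 < r < 1.0:", strong)
print("0.3 < r < 1.0:", weak_or_better)
```

The stricter cutoff keeps only listeners 4 and 8, while the looser one keeps ten listeners, which agrees with the "10 samples" line in the resampling output.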
|
|
|
Jan 23 2002, 06:03
Post
#36
|
|
|
Group: Members Posts: 300 Joined: 3-January 02 From: Santa Cruz, CA Member No.: 891 |
ff123: I'm not sure if I'm reading your statistics correctly; do the wayitis results indicate that with a reasonable degree of certainty aac, ogg, and wma all outperformed both mpc and lame on this sample? Seems a lot different than the results for the other samples, but plausible.
|
|
|
|
Jan 23 2002, 06:14
Post
#37
|
|
![]() ABC/HR developer, ff123.net admin Group: Developer (Donating) Posts: 1396 Joined: 24-September 01 Member No.: 12 |
QUOTE
ff123: I'm not sure if I'm reading your statistics correctly; do the wayitis results indicate that with a reasonable degree of certainty aac, ogg, and wma all outperformed both mpc and lame on this sample? Seems a lot different than the results for the other samples, but plausible.

For wayitis, for the nine highly correlated listeners, after adjustment for multiple comparisons, the following hold with 95% confidence:

mpc is better than xing
ogg is better than xing
lame is better than xing
aac is better than xing
mpc is better than wma8
ogg is better than wma8
lame is better than wma8
aac is better than wma8
mpc is better than aac
ogg is better than aac
mpc is better than lame

ff123 |
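For anyone else having trouble reading the p-value tables: each cell compares the row codec with the column codec, and a value below 0.05 (marked with '*') means that pair differs with 95% confidence. A small sketch that turns the adjusted wayitis table posted earlier in the thread into a list like the one above; the dictionary layout is just one way to hold the table.

```python
# Adjusted p-values for wayitis (nine highly correlated listeners),
# copied from the table posted earlier. Keys are (row codec, column codec);
# the row codec has the higher mean rating in every pair.
adjusted = {
    ("mpc", "ogg"): 0.077, ("mpc", "lame"): 0.001, ("mpc", "aac"): 0.000,
    ("mpc", "wma8"): 0.000, ("mpc", "xing"): 0.000,
    ("ogg", "lame"): 0.114, ("ogg", "aac"): 0.011,
    ("ogg", "wma8"): 0.000, ("ogg", "xing"): 0.000,
    ("lame", "aac"): 0.465, ("lame", "wma8"): 0.000, ("lame", "xing"): 0.000,
    ("aac", "wma8"): 0.000, ("aac", "xing"): 0.000,
    ("wma8", "xing"): 0.773,
}

ALPHA = 0.05  # 95% confidence

significant = [pair for pair, p in adjusted.items() if p < ALPHA]
not_significant = [pair for pair, p in adjusted.items() if p >= ALPHA]

for better, worse in significant:
    print(f"{better} is better than {worse}")
```

The pairs that do not reach significance are mpc vs ogg, ogg vs lame, lame vs aac, and wma8 vs xing.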
|
|
|
Jan 23 2002, 10:42
Post
#38
|
|
|
Group: Members Posts: 674 Joined: 29-September 01 Member No.: 63 |
ff123, what happens if you consider only the rawhide results from the 9 listeners who "passed" the wayitis results?
|
|
|
|
Jan 23 2002, 16:34
Post
#39
|
|
![]() ABC/HR developer, ff123.net admin Group: Developer (Donating) Posts: 1396 Joined: 24-September 01 Member No.: 12 |
QUOTE
what happens if you consider only the rawhide results from the 9 listeners who "passed" the wayitis results?

The results wouldn't be as significant as what I posted above. For example, xiphmont has a negative correlation on rawhide. Actually, I'm a bit leery of digging out groups of people this way. Grouping together a bunch of strongly correlated people is one thing (r > 0.7). It's another to pull in weakly correlated people as well.
ff123 |
|
|
|
Jan 28 2002, 06:14
Post
#40
|
|
|
Group: Members Posts: 674 Joined: 29-September 01 Member No.: 63 |
What about using this technique for AQ1 results?
|
|
|
|
Jan 28 2002, 06:47
Post
#41
|
|
![]() ABC/HR developer, ff123.net admin Group: Developer (Donating) Posts: 1396 Joined: 24-September 01 Member No.: 12 |
I thought about that, but I need to automate the process before I apply it to AQ1. I did the others by hand.
ff123 |
|
|
|
Jan 28 2002, 07:53
Post
#42
|
|
![]() ABC/HR developer, ff123.net admin Group: Developer (Donating) Posts: 1396 Joined: 24-September 01 Member No.: 12 |
Ah, what the heck. I was curious.
I found the following correlations by listener, and sorted from most to least correlation (I am listener 6):
CODE
listener  r
6         0.87
20        0.79
17        0.74
1         0.71
34        0.67
13        0.67
7         0.63
30        0.60
15        0.58
37        0.56
11        0.54
41        0.54
35        0.45
9         0.43
16        0.42
10        0.38
4         0.30
18        0.29
39        0.08
2         0.06
14        0.05
38        0.02
25        -0.01
23        -0.07
36        -0.12
29        -0.17
32        -0.56
28        -0.56

If I choose only the 18 listeners with at least weak positive correlation (including listener 18), I get the following results:
CODE
Means:
mpc      dm-std   dm-xtrm  dm-ins   cbr256   abr224   r3mix    cbr192
4.76     4.63     4.49     4.38     4.36     4.29     4.27     3.81

Unadjusted p-values
         dm-std   dm-xtrm  dm-ins   cbr256   abr224   r3mix    cbr192
mpc      0.379    0.068    0.010*   0.007*   0.002*   0.001*   0.000*
dm-std   -        0.339    0.087    0.062    0.021*   0.015*   0.000*
dm-xtrm  -        -        0.444    0.359    0.169    0.137    0.000*
dm-ins   -        -        -        0.878    0.540    0.467    0.000*
cbr256   -        -        -        -        0.646    0.566    0.000*
abr224   -        -        -        -        -        0.908    0.001*
r3mix    -        -        -        -        -        -        0.002*

Each '.' is 1,000 resamples. Each '+' is 10,000 resamples
.........+

Adjusted p-values
         dm-std   dm-xtrm  dm-ins   cbr256   abr224   r3mix    cbr192
mpc      0.924    0.459    0.120    0.087    0.025*   0.020*   0.000*
dm-std   -        0.931    0.522    0.445    0.203    0.166    0.000*
dm-xtrm  -        -        0.922    0.922    0.724    0.660    0.000*
dm-ins   -        -        -        0.985    0.922    0.922    0.003*
cbr256   -        -        -        -        0.941    0.922    0.005*
abr224   -        -        -        -        -        0.985    0.021*
r3mix    -        -        -        -        -        -        0.027*

ff123 |
|
|
|
Jan 28 2002, 09:21
Post
#43
|
|
|
Group: Members Posts: 300 Joined: 3-January 02 From: Santa Cruz, CA Member No.: 891 |
Again I seem to have trouble reading these charts, but would it be correct then to say that this analysis does not show any statistically significant difference between MPC, dm-std, and dm-xtrm (on the high end)? Also interesting that the average for dm-std seems to be higher than that for dm-xtrm, though again there's no statistically significant difference (I think?).
|
|
|
|
Jan 28 2002, 10:39
Post
#44
|
|
|
Group: Members Posts: 315 Joined: 29-September 01 Member No.: 53 |
QUOTE
Again I seem to have trouble reading these charts

The only statistically significant results (after resampling) were:
*everything* is better than cbr192
*mpc* is also better than r3mix and abr224. |
|
|
|