Full Version: Statistical Methods for Listening Tests(splitted R3mix VBR s
Hydrogenaudio Forums > Hydrogenaudio Forum > General Audio
ff123
QUOTE
Hmm, the results of that test are still under discussion (actually, I'm waiting for ff123 to finish his analysis tool with the nonparametric Tukey HSD test )


Well, you don't have to wait for me to finish coding to know what the non-parametric Tukey HSD value is -- I calculated that in Excel. It's 64. The Fisher LSD was 44. So, you can see that Tukey is quite a bit more conservative.

The ranksums (for reference) were:

cbr192 = 151.5
r3mix = 172.0
abr224 = 186.5
dm-ins = 188.0
cbr256 = 185.5
dm-std = 198.0
dm-xtrm = 207.0
mpc = 223.5

So basically all the Tukey HSD says (experiment-wise confidence level is 95%) is that mpc is better than cbr192!

ff123

Edit: I discovered my Excel spreadsheet had a mistake in it. The non-parametric Tukey's HSD should be 68.1. I was debugging my code and had to resolve the discrepancy (the code was correct). The conclusion remains the same.
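As a rough check of the numbers above, here is a short Python sketch that lists which pairs of rank sums differ by more than a given critical difference. The rank sums and the critical values (44 and 68.1) are taken from this post; the function name `significant_pairs` is only illustrative and is not part of ff123's tool.

```python
from itertools import combinations

# Rank sums quoted in the post above.
RANKSUMS = {
    "cbr192": 151.5, "r3mix": 172.0, "abr224": 186.5, "dm-ins": 188.0,
    "cbr256": 185.5, "dm-std": 198.0, "dm-xtrm": 207.0, "mpc": 223.5,
}

def significant_pairs(ranksums, critical_difference):
    """Return (lower, higher, difference) for every pair of rank sums
    that differ by more than the critical difference."""
    pairs = []
    for a, b in combinations(ranksums, 2):
        diff = abs(ranksums[a] - ranksums[b])
        if diff > critical_difference:
            low, high = sorted((a, b), key=ranksums.get)
            pairs.append((low, high, diff))
    return pairs

if __name__ == "__main__":
    for label, crit in (("Fisher LSD", 44.0), ("Tukey HSD", 68.1)):
        print(label, significant_pairs(RANKSUMS, crit))
```

With these figures only mpc vs cbr192 (difference 72.0) clears the HSD value, whether 64 or 68.1 is used, while the LSD value of 44 flags four pairs.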
Garf
QUOTE
Originally posted by ff123

Well, you don't have to wait for me to finish coding to know what the non-parametric Tukey HSD value is -- I calculated that in Excel.  It's 64.  The Fisher LSD was 44.  So, you can see that Tukey is quite a bit more conservative.


But the Fisher LSD isn't simultaneous is it? Or was it based on a normal distribution?

(I remember that we talked about it and I concluded that it wasn't reliable/applicable, but I don't remember why)

I wanted a statistically 'sound' conclusion from this test. I wouldn't call soundness conservative.

For an idea of the individual results the Wilcoxon S-R test was enough. (From a look at the values its sensitivity seems to be even better than the Fisher LSD?) But presenting a result and having to say 'there's a >50% chance one of the things we concluded is incorrect' isn't very nice, is it?

(btw. Wilcoxon+Bonferroni correction gave in the end the same results as the nonparam Tukey HSD!)


QUOTE

So basically all the Tukey HSD says (experiment-wise confidence level is 95%) is that mpc is better than cbr192!


Hmm, in the next test we will have to set in advance what we want to test I guess. And preferably that should only be like 4 or 5 pairs or so.

--
GCP
ff123
QUOTE
But the Fisher LSD isn't simultaneous is it? Or was it based on a normal distribution?


The Fisher LSD I use for the Friedman analysis is a non-parametric version (which doesn't assume normal distribution). There is a different Fisher LSD I use for blocked ANOVA.

Both are one-at-a-time multiple comparison techniques. I guess that seems like an oxymoron, but I believe the reason why it's used (as opposed to the Wilcoxon) is that once you've gone to the trouble of calculating the rank sums for the Friedman test, you might as well use those values to perform the Fisher test. And the reason the Friedman or ANOVA tests are performed first instead of going straight to the Wilcoxon is to make sure that there is at least one significant difference of means somewhere in the experiment. It'd be a waste of time to perform all those Wilcoxons and find out after the fact that ANOVA or Friedman says that the difference in means was just statistical noise.
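For readers who want to see how a critical difference can be derived from the Friedman rank sums, here is a minimal sketch using the usual large-sample textbook formulas. It is an assumption that these match what ff123's spreadsheet and program do. The listener count of 42 is also an assumption, inferred from the eight rank sums listed earlier adding up to 1512 = n * k * (k + 1) / 2 with k = 8, and the studentized-range value 4.286 is a standard table value for 8 groups and infinite degrees of freedom at the 0.05 level.

```python
import math
from statistics import NormalDist

def rank_sum_lsd(n_listeners, k_codecs, alpha=0.05):
    """Large-sample critical difference between two Friedman rank sums
    for a one-at-a-time (Fisher LSD style) comparison."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * math.sqrt(n_listeners * k_codecs * (k_codecs + 1) / 6)

def rank_sum_hsd(n_listeners, k_codecs, q_crit):
    """Large-sample critical difference for a simultaneous (Tukey HSD style)
    comparison; q_crit is a studentized-range table value."""
    return q_crit * math.sqrt(n_listeners * k_codecs * (k_codecs + 1) / 12)

# Assumption: 42 listeners and 8 codecs, inferred from the rank-sum total.
N_LISTENERS, K_CODECS = 42, 8
# Assumption: table value q(0.05; 8 groups, infinite df).
Q_CRIT = 4.286

if __name__ == "__main__":
    print("LSD:", round(rank_sum_lsd(N_LISTENERS, K_CODECS), 1))
    print("HSD:", round(rank_sum_hsd(N_LISTENERS, K_CODECS, Q_CRIT), 1))
```

Under those assumptions both values land close to the 44 and 68.1 quoted earlier in the thread, which suggests (but does not prove) that the same formulas were used.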

So my question would be: For one-at-a-time comparisons, is it preferable to use Wilcoxon or to use the Fisher LSD? If the only rationale for using the Fisher LSD is convenience of calculation, but the Wilcoxon is more sensitive, then I'd rather use the latter -- let the software take care of laborious calculations. And for simultaneous comparisons, is it preferable to use Bonferroni-corrected Wilcoxon, Bonferroni-corrected Fisher LSD, or Tukey's HSD?

I think you're saying, Garf, that the Wilcoxon might be the way to go for one-at-a-time tests, but perhaps the Tukey HSD would be best for simultaneous tests.

Oh, and I agree that the objectives of a test should be clearly stated up front, *before* the test is performed, and that if any relationships are not of interest, they should be excluded. Maybe the best way to do this is to perform two types of experiments: exploratory ones and confirmatory ones. The exploratory ones could give a general idea of what all the relationships look like, and the confirmatory ones would test specific ones, for example dm-ins versus dm-xtrm. The implication is that the finer the distinction is you want to make, the fewer codecs should be involved.

ff123
Garf
QUOTE
Originally posted by ff123

Both are one-at-a-time multiple comparison techniques. I guess that seems like an oxymoron, 


Yes. I didn't understand it at first. (Now I do, thanks to your explanation)

QUOTE

but I believe the reason why it's used (as opposed to the Wilcoxon) is that once you've gone to the trouble of calculating the rank sums for the Friedman test, you might as well use those values to perform the Fisher test.


This seems very plausible, given that most of these methods predate computers :)

QUOTE

And the reason the Friedman or ANOVA tests are performed first instead of going straight to the Wilcoxon is to make sure that there is at least one significant difference of means somewhere in the experiment.  It'd be a waste of time to perform all those Wilcoxons and find out after the fact that ANOVA or Friedman says that the difference in means was just statistical noise.


Actually, I would expect the Wilcoxon+Bonf corr/Fisher+Bonf corr/Tukey tests all to give nothing if the Friedman test fails. (Wouldn't there be a contradiction otherwise?)

QUOTE

So my question would be:  For one-at-a-time comparisons, is it preferable to use Wilcoxon or to use the Fisher LSD?  If the only rationale for using the Fisher LSD is convenience of calculation, but the Wilcoxon is more sensitive, then I'd rather use the latter -- let the software take care of laborious calculations.


I honestly wouldn't know. I'm a bit biased toward the Wilcoxon because the statisticians told me it was good for our purposes, so I know it's good, whereas I don't know the Fisher LSD. I think you might be right that the Fisher LSD is for convenience of calculation.

On the other hand, you've already written the app, so perhaps you can just use the Fisher LSD results vs the Wilcoxon results and check which one is more sensitive? We can just use that one then. The SPSS output is still on my page : http://home.planetinternet.be/~pascutto/AQT/OUTPUT.HTM

Also, there shouldn't be any contradictions between the two.

QUOTE

And for simultaneous comparisons, is it preferable to use Bonferroni-corrected Wilcoxon, Bonferroni-corrected Fisher LSD, or Tukey's HSD?


Tukey HSD, no question. It should _always_ be more sensitive than the other methods. It basically does a smarter 'correction' than the very conservative Bonferroni.

QUOTE

I think you're saying, Garf, that the Wilcoxon might be the way to go for one-at-a-time tests, but perhaps the Tukey HSD would be best for simultaneous tests.


Yes. (But I'm not sure which one of Fisher LSD/Wilcoxon is best for one-at-a-time)

QUOTE

Oh, and I agree that the objectives of a test should be clearly stated up front, *before* the test is performed, and that if any relationships are not of interest, they should be excluded.


Right. Also, if possible, decide which one you expect to do better in the comparison (that also halves the significance needed due to one-tail/two-tail)
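To illustrate the one-tail/two-tail remark, here is a small sketch for a test statistic that is approximately standard normal. The function name and the example z value are made up for illustration, and it assumes the observed effect lies in the direction predicted beforehand.

```python
from statistics import NormalDist

def p_values(z):
    """Return (one_tailed, two_tailed) p-values for a standard-normal
    test statistic z, assuming the effect is in the predicted direction."""
    one_tailed = 1 - NormalDist().cdf(abs(z))
    return one_tailed, 2 * one_tailed

if __name__ == "__main__":
    one, two = p_values(1.8)  # hypothetical test statistic
    print(f"one-tailed p = {one:.4f}, two-tailed p = {two:.4f}")
```

For z = 1.8 the one-tailed p-value falls below 0.05 while the two-tailed one does not, which is the halving effect described above.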

QUOTE

Maybe the best way to do this is to perform two types of experiments:  exploratory ones and confirmatory ones.  The exploratory ones could give a general idea of what all the relationships look like, and the confirmatory ones would test specific ones, for example dm-ins versus dm-xtrm.  The implication is that the finer the distinction is you want to make, the fewer codecs should be involved.


Yep. This is why the first AQ test results are of good use: we know what to test for next time :)

--
GCP
ff123
It seems the worth of Bonferroni adjustments (perhaps even the very idea of simultaneous testing of null hypotheses) is not universally accepted in all statistical circles.

For example, this page:

http://www.bmj.com/cgi/content/full/316/71...ch=&FIRSTINDEX=

with summary points as follows:

Adjusting statistical significance for the number of tests that have been performed on study data -- the Bonferroni method -- creates more problems than it solves.

The Bonferroni method is concerned with the general null hypothesis (that all null hypotheses are true simultaneously), which is rarely of interest or use to researchers.

The main weakness is that the interpretation of a finding depends on the number of other tests performed.

The likelihood of type II errors is also increased, so that truly important differences are deemed non-significant.

Simply describing what tests of significance have been performed, and why, is generally the best way of dealing with multiple comparisons.

ff123
ff123
And another link, this one from SISA, a site where one can perform free statistical tests using a web browser.

http://home.clara.net/sisa/bonhlp.htm

This website writes:

QUOTE
Scenario three concerns the situation when not predefined hypothesis are pursued using many tests, one test for each hypothesis. Basically this concerns the situation of data 'dredging' or 'fishing', many among us will recognize correlation variables=all or t-test groups=sex(2) variables=all. Above all, this should not be done. Bonferroni correction is difficult in this situation as the alpha level should be lowered very considerably in situations of such wealth (potentially with a factor of r*(r-1)/2, whereby r is the number of variables), and most standard statistical packages are not able to provide small enough p-values to do it. SISA's advice is, if you want to go ahead with it anyway, to test at the 0.05 level for each test. After a relationship has been found, and this relationship is theoretically meaningful, the relationship should be confirmed in a separate study. This can be done after new data is collected or in the same study, by using the 'split sample' method. The sample is split in two, one half is used to do the 'dredging', the other half is used to confirm the relationships found. The disadvantage of the split sample method is that you lose power (use the procedure power to estimate how much). A Bayesian method can be used if you want to formally incorporate the result of the original study or dredging in the confirmation process. But don't put too high a value on your original finding.


ff123
ff123
And one more link here:

http://149.170.199.144/resdesgn/multrang.htm

QUOTE
Multiple range tests can be placed into two categories. 

1. Constant LSD. In these a single LSD is found and used to compare all pairs of means. Tests differ in the algorithm used to calculate the LSD. Examples: Fisher's LSD, Tukey's HSD, Scheffé's LSD and Waller-Duncan's LSD. 

2. Variable LSD. In these tests the means are ranked and the magnitude of the LSD is determined by the number of intervening means between the two being compared. Examples: Newman-Keuls test, Duncan's multiple range test. 
The second group appear to be generally less accepted and recommended than the former. The following notes about the first group are based on comments by Swallow (1984). 

a. Tukey's HSD and Scheffé's LSD are too conservative, type II errors are favoured. 

b. Fisher's LSD is prone to type I errors, although this is not too serious when used after rejecting an analysis of variance Null hypothesis (i.e. when it is a protected test).

c. Waller-Duncan's LSD has few faults but the statistic is complex and tables are generally unavailable. 

If you require more information about multiple range tests the following are recommended: Swallow (1984), Chew (1980) and Day and Quinn (1989).


So I am getting the impression that Fisher's LSD (which I am using as a protected test) is a good approach. However, I should remove the option from my program to allow the user to adjust the critical significance of just the LSD. If anything it should adjust *both* the critical significance values of the Friedman/ANOVA and the corresponding LSD tests.

Waller-Duncan's LSD might be interesting as a side study, but Fisher's LSD is very easy to calculate once a Friedman or ANOVA has been performed.

ff123
ff123
Situations in which Fisher's LSD is weak:

from:

http://davidmlane.com/hyperstat/B96288.html

QUOTE
An approach suggested by the statistician R. A. Fisher (called the "least significant difference method" or Fisher's LSD) is to first test the null hypothesis that all the population means are equal (the omnibus null hypothesis) with an analysis of variance. If the analysis of variance is not significant, then neither the omnibus null hypothesis nor any other null hypothesis about differences among means can be rejected. If the analysis of variance is significant, then each mean is compared with each other mean using a t-test. The advantage of this approach is that there is some control over the EER. If the omnibus null hypothesis is true, then the EER is equal to whatever significance level was used in the analysis of variance. In the example with the six groups of subjects given in the section on t-tests, if the .01 level were used in the analysis of variance, then the EER would be .01. The problem with this approach is that it can lead to a high EER if most population means are equal but one or two are different.


next page:

http://davidmlane.com/hyperstat/B94854.html

QUOTE
In the example, if a seventh treatment condition were included and the population mean for the seventh condition were very different from the other six population means, an analysis of variance would be likely to reject the omnibus null hypothesis. So far, so good, since the omnibus null hypothesis is false. However, the probability of a Type I error in one or more of the 15 t-tests computed among the six treatments with equal population means is about 0.10. Therefore, the LSD method provides only minimal protection against a high EER.


ff123
YouriP
ff123, in the future when you want to post amendments to your previous posts before a reply has been made, could you just edit the last post instead of posting 3 or 4 replies? Thanks.
Garf
QUOTE
Originally posted by ff123

Simply describing what tests of significance have been performed, and why, is generally the best way of dealing with multiple comparisons.


This should hardly be a surprise...

The problem is presenting those results in a way that a general public without statistical background can understand what the implication of the multiple tests really is.

For some reason I feel that many people have a problem with: 'these are our results, but keep in mind that there's a 70% chance something here is incorrect'. Doesn't look very scientific, though it's perfectly ok.

Note that the contesting of the Bonferroni correction is due to the conservativeness. For us, this doesn't actually matter so much. But if you are testing if a new medicine has effect, you don't want to take a risk of incorrectly rejecting the hypothesis that it works. The mathematics behind it are sound.

Let it be clear that I prefer a simultaneous test over multiple 2-sample tests + correction. But I don't agree with doing a 2-sample test _without_ correction.

--
GCP
Garf
QUOTE
Originally posted by ff123
And another link, this one from SISA, a site where one can perform free statistical tests using a web browser.

http://home.clara.net/sisa/bonhlp.htm

This website writes:
ff123


Hmm, nothing new here either.

Make a test to see if there are trends.

Do another test to test those trends.

This is what you suggested just earlier.

The comment about Bonferroni is also in line with what we saw. The alpha level in the AQ test gets as low as 0.0017. That's at the limit of accuracy SPSS uses for its results ;)
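The 0.0017 figure can be reproduced with a few lines of Python, assuming all pairwise comparisons among the 8 codecs are tested at an experiment-wise level of 0.05. The function name is illustrative only.

```python
from math import comb

def bonferroni_alpha(k_codecs, experiment_alpha=0.05):
    """Return (number of pairwise comparisons, per-comparison alpha)
    for a Bonferroni correction over all pairs of k_codecs items."""
    m = comb(k_codecs, 2)
    return m, experiment_alpha / m

if __name__ == "__main__":
    m, alpha = bonferroni_alpha(8)
    print(f"{m} comparisons, per-comparison alpha = {alpha:.5f}")
```

With 8 codecs there are 28 pairs, so each individual comparison has to be significant at roughly 0.05 / 28.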

--
GCP
Garf
QUOTE
Originally posted by ff123
And one more link here:

http://149.170.199.144/resdesgn/multrang.htm
So I am getting the impression that Fisher's LSD (which I am using as a protected test) is a good approach. 


Hmm, I'm not convinced. I'd agree if we were talking about a small number of variables, but we've got 8.

I have this doubt because the Friedman test just says ' there is a difference between the samples '. This provides little protection if you are making 28 comparisons, though it obviously helps a lot if you only make 3 or so. It gets too easy to see false differences (aka Type 1 errors)

--
GCP
Garf
QUOTE
Originally posted by ff123

In the example, if a seventh treatment condition were included and the population mean for the seventh condition were very different from the other six population means, an analysis of variance would be likely to reject the omnibus null hypothesis. So far, so good, since the omnibus null hypothesis is false. However, the probability of a Type I error in one or more of the 15 t-tests computed among the six treatments with equal population means is about 0.10. Therefore, the LSD method provides only minimal protection against a high EER.
ff123


(What's EER?)

I think this is basically saying what I said in my prev post, namely that when you make a lot of comparisons the fact that you know that 'there is a difference between samples' is not enough protection to prevent you from seeing differences where there aren't any.

--
GCP
ff123
Youri, I'll modify posts into one if they haven't been replied to yet. What is the purpose of this, though? Am I bumping this thread each time I post a new message, which doesn't happen if I just modify an older one?

Garf,

EER = Experiment-wise error rate.

Basically, the difference between using a simultaneous vs. a one-at-a-time method is the difference between trying to control a type I error (false difference in codecs is identified) vs. a type II error (true difference in codecs is not identified). That's also what I mean by being "conservative" or "aggressive" in how one analyzes the data. If you're looking for an airtight conclusion (mpc is better than cbr192), Tukey's HSD will give you one, but it probably won't be very useful. On the other hand, if you're looking for some insight and are willing to accept some risk of a type I error, Fisher's protected LSD is much more sensitive.

This seems to be an area of controversy in statistics, just like there's a minor controversy over whether one-tailed tests of significance should be used (some conservative statisticians say that a two-tailed test should always be used, even in a confirmatory study, because if you're bothering to perform a test there must be some uncertainty about the outcome).

Perhaps a compromise solution that could accommodate us both would be to use Waller-Duncan's k-ratio t test, which, unlike Tukey's test, doesn't operate on the principle of controlling type I error. Instead, it compares the type I and type II error rates based on Bayesian principles. The only problem, I think, is that with the limited net search I've made so far, I haven't seen whether there is a non-parametric version of this.

ff123
Dibrom
QUOTE
Originally posted by ff123
Youri, I'll modify posts into one if they haven't been replied to yet.  What is the purpose of this, though?  Am I bumping this thread each time I post a new message, which doesn't happen if I just modify an older one?


Well for the record I don't really think there is anything wrong with posting multiple replies as long as it doesn't become redundant. Multiple replies would bump the thread multiple times too, but again I don't see much of a problem. I do see benefit in trying to keep all the posts consolidated if possible, but if the discussion is moving right along then it seems fine to me.

Just my 2 cents.
ff123
Found this powerpoint slideshow on the net:

http://www.css.orst.edu/TEACHING/Graduate/...ures/CHAP5A.PPT

Here are some relevant quotes:

QUOTE
The winner among winner pickers -- Cramer and Swanson (1973) conducted a computer simulation study involving 88,000 differences. They compared LSD, FPLSD, HSD, SNK, BLSD. Both FPLSD and BLSD were better in their ability to protect against type I error and also in their power to detect real differences when they exist. None of the other methods came close.


LSD = Fisher's LSD, without using an F test first
FPLSD = Fisher's protected LSD, only run if F test proves significant
HSD = Tukey's HSD
SNK = Student Newman Keuls test
BLSD = Bayes LSD (also known as Waller-Duncan's protected LSD)

QUOTE
The edge goes to BLSD... -- BLSD is preferred by some because it is a single value and therefore easy to use. It is larger when F indicates that the means are homogeneous and smaller when means appear to be heterogeneous. But the necessary tables may not be available, so FPLSD is quite acceptable.


I'd like to get my hands on the Cramer and Swanson paper and also on the book which has the BLSD tables. I wonder which book has them? If I can get a hold of the tables, I can probably brute force the calculations by table lookup in my program.

ff123
CiTay
ff, maybe you want to mail this person who searched for a similar thing a while ago:

http://groups.google.com/groups?q=BLSD+tab...m.mathforum.com

Maybe she found some info in the meantime.
ff123
Thanks Citay, but I did some digging, and I think the following papers are relevant to the Bayes LSD:

Waller, R.A. and Duncan, D.B. (1969) "A Bayes Rule for the Symmetric Multiple Comparison Problem", Journal of the American Statistical Association 64, pp. 184-199

Waller, R.A. and Kemp, K.E. (1975) "Computations of Bayesian t-Values for Multiple Comparisons", Journal of Statistical Computation and Simulation (Vol 4, no. 3), pp. 169-172

Swallow, W. H. 1984. "Those overworked and oft-misused mean separation procedures - Duncan's, LSD, etc." Plant Disease, 68: 919-921.

And a couple of books:

An Introduction to Statistical Methods and Data Analysis, 5th Ed., 2000, R. Lyman Ott, Duxbury Press, Belmont CA
Amazon link: http://www.amazon.com/exec/obidos/ASIN/053...4136099-6862928

Principles and Procedures of Statistics: A Biometrical Approach, 3rd Ed., 1996, Robert Steel and James Torrie
Amazon link: http://www.amazon.com/exec/obidos/ASIN/007...4136099-6862928

ff123
Garf
Thanks ff123. I'll have a look through the university library and check if they happen to have any of the relevant material.

If you know of anything that discusses the link between the Friedman protection and a high number of comparisons, please let us know. I'm a bit worried about it.

Edit: Hmm, also, aren't most of the methods discussed versions for the normal distribution?

--
GCP
ff123
QUOTE
If you know of anything that discusses the link between the Friedman protection and a high number of comparisons, please let us know. I'm a bit worried about it.


The SAS website has this:

QUOTE
It has been suggested that the experimentwise error rate can be held to the α level by performing the overall ANOVA F-test at the α level and making further comparisons only if the F-test is significant, as in Fisher's protected LSD. This assertion is false if there are more than three means (Einot and Gabriel 1975). Consider again the situation with ten means. Suppose that one population mean differs from the others by such a sufficiently large amount that the power (probability of correctly rejecting the null hypothesis) of the F-test is near 1 but that all the other population means are equal to each other. There will be 9(9 - 1)/2=36 t tests of true null hypotheses, with an upper limit of 0.84 on the probability of at least one type 1 error. Thus, you must distinguish between the experimentwise error rate under the complete null hypothesis, in which all population means are equal, and the experimentwise error rate under a partial null hypothesis, in which some means are equal but others differ.


So this supports the position that Fisher's protected LSD is not so protected for the case where there are a lot of means close to each other but one or two which are very different, as pointed out earlier.
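The 0.84 upper limit in the SAS quote can be checked with a short sketch. It treats the tests of true null hypotheses as independent, each run at the 0.05 level, which is the simplifying assumption behind the bound; tests computed from the same data are not really independent, so the true rate will differ. The function name is illustrative.

```python
from math import comb

def familywise_bound(n_equal_means, alpha=0.05):
    """Return (number of pairwise tests among means that are truly equal,
    chance of at least one type I error if those tests were independent)."""
    m = comb(n_equal_means, 2)
    return m, 1 - (1 - alpha) ** m

if __name__ == "__main__":
    for n in (9, 8):
        m, bound = familywise_bound(n)
        print(f"{n} equal means: {m} tests, bound = {bound:.2f}")
```

Nine equal means give 36 tests and a bound of about 0.84, matching the quote. The same arithmetic for 28 uncorrected comparisons among 8 codecs gives roughly 0.76, in line with the 70% figure mentioned earlier in the thread.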

QUOTE
Edit: Hmm, also, aren't most of the methods discussed versions for the normal distribution?


Yes. I am thinking of looking through a book by Hollander and Wolfe, which concentrates on non-parametric methods, to see if they cover Waller Duncan Bayes LSD.

ff123
YouriP
QUOTE
Well for the record I don't really think there is anything wrong with posting multiple replies as long as it doesn't become redundant. Multiple replies would bump the thread multiple times too, but again I don't see much of a problem. I do see benefit in trying to keep all the posts consolidated if possible, but if the discussion is moving right along then it seems fine to me.
Yeah, the reason it's normally not allowed is to keep people from bumping their own threads all the time, or adding to the reply count just to make their thread look popular (yes, some people worry about that apparently :)). It's probably just a pet peeve of mine I developed from visiting a lot of übermoderated fora. :) Actually, it's mainly meant to prevent posts like:

"Hi, I'm Youri! How are you all doing?"
"Oh, I'm fine btw!"

In this case, the response could simply be edited into the original post. That's why I was only speaking of amendments - if a reply to your post has already been made and you still want to make an amendment, it's usually better to post a reply instead of editing your original post, because otherwise people may not notice it.

But I'm making a bigger problem out of it than it is, so carry on. :)
ff123
I have completed the code to perform an optional Tukey's HSD (either parametric or non-parametric). Version 1.20 of friedman.exe with source is at:

http://ff123.net/friedman/friedman120.zip

This version also outputs an ANOVA table, if that option is specified, and generates a matrix of difference values to show how the means or ranksums are separated.

ff123
Garf
Very cool programming work!

I have some more problems:

a) What can we do with data that is partially normal? For example, in the 128kbps test most data seems normal with the possible exception of the mpc and xing results, which 'bump up' against the ends of the rating scale. Is ANOVA permissible here?

b) What happens if we transform the data relative to mpc? (i.e. subtract mpc score from everything)

b1) does it change any results?

b2) does it make the data 'more' normal?

--
GCP
ff123
QUOTE
a) What can we do with data that is partially normal? For example, in the 128kbps test most data seems normal with the possible exception of the mpc and xing results, who 'bump up' to the ends of the rating scale? Is ANOVA permissible here?


For the dogies test it doesn't matter if you choose ANOVA or Friedman, as long as the Fisher LSD is used.

Here is a good page on how to choose a statistical test:

http://www.graphpad.com/www/book/Choose.htm

A couple of quotes of interest:

"Remember, what matters is the distribution of the overall population, not the distribution of your sample. In deciding whether a population is Gaussian, look at all available data, not just data in the current experiment."

and:

"When in doubt, some people choose a parametric test (because they aren't sure the Gaussian assumption is violated), and others choose a nonparametric test (because they aren't sure the Gaussian assumption is met)."

QUOTE
b) What happens if we transform the data relative to mpc? (i.e. subtract mpc score from everything)


Nonparametric results should remain the same as long as the relative rankings are not changed. I don't know how the ANOVA results change.
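A quick way to convince oneself of the nonparametric part is to rank some made-up ratings within each listener, subtract each listener's own mpc score from that listener's row, and rank again. The data and helper names below are hypothetical and only meant to illustrate the point; the sketch assumes the subtraction is done per listener.

```python
def average_ranks(row):
    """Rank one listener's scores (1 = lowest); tied scores share the average rank."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with the score at position i.
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks

def rank_sums(scores):
    """Sum the within-listener ranks column by column (one column per codec)."""
    ranked = [average_ranks(row) for row in scores]
    return [sum(col) for col in zip(*ranked)]

# Hypothetical ratings: rows are listeners, columns are codecs,
# and the last column plays the role of mpc.
SCORES = [
    [3.0, 4.0, 4.0, 5.0],
    [2.5, 3.5, 3.0, 4.5],
    [4.0, 4.0, 3.0, 5.0],
]
# Subtract each listener's own mpc score from every score in that row.
SHIFTED = [[x - row[-1] for x in row] for row in SCORES]

if __name__ == "__main__":
    print(rank_sums(SCORES))
    print(rank_sums(SHIFTED))
```

Both calls print the same rank sums, because subtracting a constant within a row does not change the ordering inside that row.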

ff123
Garf
QUOTE
Originally posted by ff123


For the dogies test it doesn't matter if you choose ANOVA or Friedman, as long as the Fisher LSD is used.

Here is a good page on how to choose a statistical test:

http://www.graphpad.com/www/book/Choose.htm

A couple of quotes of interest:

"Remember, what matters is the distribution of the overall population, not the distribution of your sample. In deciding whether a population is Gaussian, look at all available data, not just data in the current experiment."


Hmm yeah, but I think the non-normal look of the Xing/MPC results will stay even if we add more listeners.

The distribution we have looks normal but it has a 'pile up' effect on the sides of the hardest and lowest samples. The AQ test has this too as it consists entirely of hard samples.

Although it fails a normality test, your comment above has me in doubt. Is this 'clipping' effect described somewhere?

If it would turn out that although a normality test fails we can still use methods based on a normal distribution, that would be a major help...

Edit: Hmm, interesting link :

Choosing between parametric and nonparametric tests is sometimes easy. You should definitely choose a parametric test if you are sure that your data are sampled from a population that follows a Gaussian distribution (at least approximately). You should definitely select a nonparametric test in three situations:

• The outcome is a rank or a score and the population is clearly not Gaussian. Examples include class ranking of students, the Apgar score for the health of newborn babies (measured on a scale of 0 to 10 and where all scores are integers), the visual analogue score for pain (measured on a continuous scale where 0 is no pain and 10 is unbearable pain), and the star scale commonly used by movie and restaurant critics (* is OK, ***** is fantastic).

'the visual analogue scale for pain' ... doesn't this apply to the Xing scores? ;)

• The data are measurements, and you are sure that the population is not distributed in a Gaussian manner. If the data are not sampled from a Gaussian distribution, consider whether you can transform the values to make the distribution become Gaussian. For example, you might take the logarithm or reciprocal of all values. There are often biological or chemical reasons (as well as statistical ones) for performing a particular transform.

Interesting...I need to think about this.

--
GCP
ff123
The analysis tool I wrote can now be run from the web:

http://ff123.net/friedman/stats.html

Have fun!

ff123
Garf
QUOTE
Originally posted by ff123

yes.  I am thinking of looking through a book by Hollander and Wolfe, which concentrates on non-parametric methods, to see if they cover Waller Duncan Bayes LSD.


Hollander, Myles: Nonparametric statistical methods / Myles Hollander, Douglas A. Wolfe. New York (N.Y.) : Wiley, 1999. XIV, 787 p..

Practical nonparametric and semiparametric Bayesian statistics / Dipak Dey, Peter Muller, Debajyoti Sinha (eds.).. New York (N.Y.) : Springer, 1998. XVI, 369 p. : ill..

I'll try to get them Tuesday or Wednesday. Library opening hours suck for me though :(

--
GCP
Garf
I found something else that may be interesting.

I had been wondering for a while that, since we have a pretty good idea of how the actual distribution looks (+- normal with clipping), if there's no way to make use of that instead of a pure nonparametric test. If you are able to make use of more knowledge about the distribution, you should be able to get more sensitive tests.

Something like this already seems to exist and it's called Bootstrapping (seems to be a fairly new technique too).

Basically, starting from your sample you use a large number of simulations to determine the distribution function of your actual population.

I think that once you are able to determine the distribution function, it should be possible to create an appropriate test for inequality of means. I've only skimmed through the book I have quickly, but one method seems to be via simulations again.
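As a rough illustration of the idea, here is a sketch of one common form of the technique, the percentile bootstrap for a mean. It is not necessarily the variant described in the book mentioned above, and the ratings are made up to mimic scores that pile up at the top of a 1-5 scale.

```python
import random
from statistics import mean

def bootstrap_means(sample, n_resamples=2000, seed=0):
    """Resample the data with replacement many times and return the
    sorted means of the resamples."""
    rng = random.Random(seed)
    n = len(sample)
    return sorted(mean(rng.choices(sample, k=n)) for _ in range(n_resamples))

def percentile_interval(sorted_values, confidence=0.95):
    """Read a two-sided interval straight off the sorted resampled values."""
    tail = (1 - confidence) / 2
    lo_index = int(tail * len(sorted_values))
    hi_index = int((1 - tail) * len(sorted_values)) - 1
    return sorted_values[lo_index], sorted_values[hi_index]

# Hypothetical ratings, clipped at the top of the scale.
RATINGS = [5.0, 5.0, 4.8, 5.0, 4.5, 5.0, 4.9, 3.9, 5.0, 4.7]

if __name__ == "__main__":
    low, high = percentile_interval(bootstrap_means(RATINGS))
    print(f"sample mean {mean(RATINGS):.2f}, bootstrap interval ({low:.2f}, {high:.2f})")
```

Nothing here assumes a normal distribution; the spread of the resampled means stands in for the unknown sampling distribution.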

On a related note, this book says that the Wilcoxon Signed Rank Test needs symmetric distributions. If that's the case, then it's not applicable to the AQ test results I think.

With some luck the Hollander/Wolfe book will answer these questions. I should be able to get it tomorrow afternoon.

--
GCP
ff123
This is a fascinating topic which I had no idea existed. My elementary statistics book is dated 1973, and apparently the field was opened up by Dr. Efron in 1977. The general technique is called resampling, and there are actually four different types of resampling methods:

1. the bootstrap, invented by Bradley Efron;
2. the jackknife, invented by Maurice Quenouille and later developed by John W. Tukey;
3. cross-validation, developed by Seymour Geisser, Mervyn Stone, and Grace G. Wahba;
4. balanced repeated replication, developed by Philip J. McCarthy.

This page was informative, with some references:

http://ericae.net/pare/getvn.asp?v=3&n=5

Here's a conceptually simple example taken from that page:

****
For simplicity, let's assume that a district has 13 voucher students and 39 non-voucher students, and the mean difference is 10 standard score units. To empirically construct the distribution, we'd follow these steps:

1. Create a data base with all the student grades.
2. Randomly sort the data base.
3. Compute the mean for the first 13 students.
4. Compute the mean for the other 39 students.
5. Record the test statistic--the absolute value of the mean difference.

Then repeat steps 2 through 5 many times. That way, we'd get the distribution of mean differences when we randomly select students. The probability of observing a mean difference of 10 when everything is random is the proportion of experimental test statistics in step 5 that are greater than 10.
****

It's very simple conceptually, and does not assume either that data is normal or that the sample is randomly drawn from some population. Very nice.
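
The quoted steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `permutation_p_value` and its parameters are made up here, and the comparison uses "at least as large as" rather than the strictly "greater than" wording of the quoted page, so that ties with the observed difference are counted.

```python
import random

def permutation_p_value(group_a, group_b, trials=10000, seed=0):
    """Estimate how often a random split gives a mean difference
    at least as large as the observed one (hypothetical helper)."""
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)   # step 1: one data base with all scores

    def abs_mean_diff(values):
        first, rest = values[:n_a], values[n_a:]
        return abs(sum(first) / len(first) - sum(rest) / len(rest))

    observed = abs_mean_diff(pooled)         # difference for the real grouping
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)                  # step 2: randomly sort
        # steps 3-5: means of the first n_a and the remaining scores, absolute difference
        if abs_mean_diff(pooled) >= observed - 1e-12:
            hits += 1
    return hits / trials
```

With 13 voucher and 39 non-voucher scores, `group_a` would hold the 13 and `group_b` the 39; the returned proportion plays the role of the p-value.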

ff123

Edit: another web reference lists the four major types of resampling methods as follows:

(http://seamonkey.ed.asu.edu/~alex/teaching...resampling.html)


****
There are four major types of resampling:

Randomization exact test: It is also known as the permutation test. Surprise! It was developed by R. A. Fisher, the founder of classical statistical testing. However, in his later years Fisher lost interest in the permutation method because there were no computers in his days to automate such a laborious method.

Cross-validation: It was developed by Seymour Geisser, Mervyn Stone, and Grace G. Wahba.

Jacknife: It is also known as Jackknife and Quenouille-Tukey Jackknife. It was invented by Maurice Quenouille (1949) and later developed by John W. Tukey (1958). The name "jacknife" was coined by Tukey to imply that the method is an all-purpose statistical tool.

Bootstrap: It was invented by Bradley Efron and further developed by Efron & Tibshirani (1993). It means that one available sample gives rise to many others by resampling (pulling yourself by your own bootstrap).

Among the four methods, the first and the last ones are more useful. The principles of cross-validation, Jacknife, and bootstrap are very similar but bootstrap overshadows the others for it is a more thorough procedure. Indeed, Jacknife is of largely historical interest today (Mooney & Duval, 1993) (Nevertheless, Jacknife is still useful in EDA for assessing how each subsample affects the model).
****
ff123
Here is my untutored guess on how a resampled analysis method would work on, for example, the AQ1 test data (using rank scale):

1. Convert to ranks instead of ratings
2. For each listener, randomize the order of the ranking, then add up each column (codec setting).
3. Calculate the difference between all pairs of columns. For 8 codecs, there will be 28 pairs of columns.
4. Repeat 1000 times (or however many times you want). At the end, one should have 28 distributions of differences.
5. Compare the actual difference in ranksums to the simulated distributions to come up with a p-value for each of the 28 pair comparisons.

Is it really that simple?
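
Steps 1 and 3 of the list above (ranking each listener's ratings, summing the ranks per codec, and taking all pairwise differences) might look like this in Python. The helper names are hypothetical, and giving tied ratings their average rank is an assumption; it is consistent with the half-integer ranksums quoted earlier in the thread, but the thread does not spell out the tie rule.

```python
from itertools import combinations

def to_ranks(ratings):
    """Rank one listener's ratings (1 = lowest); tied ratings share their average rank."""
    order = sorted(range(len(ratings)), key=lambda i: ratings[i])
    ranks = [0.0] * len(ratings)
    start = 0
    while start < len(order):
        end = start
        # extend the run while the next rating equals the current one
        while end + 1 < len(order) and ratings[order[end + 1]] == ratings[order[start]]:
            end += 1
        average = (start + end) / 2 + 1      # mean of the 1-based positions start..end
        for position in range(start, end + 1):
            ranks[order[position]] = average
        start = end + 1
    return ranks

def rank_sums(rank_rows):
    """Add up each column (codec setting) over all listeners."""
    return [sum(column) for column in zip(*rank_rows)]

def pairwise_differences(sums):
    """Difference in rank sums for every pair of columns (28 pairs for 8 codecs)."""
    return {(i, j): sums[i] - sums[j] for i, j in combinations(range(len(sums)), 2)}
```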

ff123
Garf
QUOTE
Originally posted by ff123
4. Repeat 1000 times (or however many times you want).  At the end, one should have 28 distributions of differences.
5. Compare the actual difference in ranksums to the simulated distributions to come up with a p-value for each of the 28 pair comparisons.


You don't even need the distribution. During the simulation, you just count how many times the difference in ranksums in the trials exceeded the one you've got.

At the end, you divide that by the number of trials. Voila.
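
A minimal sketch of that counting approach, assuming the input is one row of ranks per listener and that each trial shuffles the ranks within every listener. The name `pairwise_p_values` is invented for this example, and the test is two-sided (absolute difference), which is a choice made here and not something the thread specifies.

```python
import random

def pairwise_p_values(rank_rows, trials=10000, seed=0):
    """For each pair of codecs, the fraction of shuffled trials whose
    rank-sum difference is at least as large as the observed one."""
    rng = random.Random(seed)
    k = len(rank_rows[0])
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]

    def column_sums(rows):
        return [sum(col) for col in zip(*rows)]

    observed = column_sums(rank_rows)
    obs_diff = {(i, j): abs(observed[i] - observed[j]) for i, j in pairs}
    exceed = dict.fromkeys(pairs, 0)

    for _ in range(trials):
        # shuffle the ranks within each listener, leaving listeners themselves in place
        shuffled = [rng.sample(row, k) for row in rank_rows]
        sums = column_sums(shuffled)
        for i, j in pairs:
            if abs(sums[i] - sums[j]) >= obs_diff[(i, j)] - 1e-9:
                exceed[(i, j)] += 1          # just count; no distribution is stored

    return {pair: exceed[pair] / trials for pair in pairs}
```

No histogram is kept: only one counter per pair, divided by the number of trials at the end.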

--
GCP
ff123
Man, that sounds sweet. It shouldn't be too hard to code up given that my current program has almost all the bits and pieces needed.

So does this sidestep the problem of doing multiple pairwise comparisons? It seems like it, to me.

ff123
Garf
QUOTE
Originally posted by ff123
So does this sidestep the problem of doing multiple pairwise comparisons?  It seems like it, to me.


I don't immediately see how. Any suggestions?

Edit3: Deleted Edit 1 and 2.

If you can get pairwise results with a very high significance then it's interesting anyway as the overall result may still be significant even with 28 comparisons.

I got the Hollander & Wolfe book, but they gave me an edition from 1972. It doesn't have anything about bootstrapping, nor about the Bayesian LSD. It does have a simultaneous comparison method, but I think it's just the nonparametric Tukey HSD (it gives the same results too...)

The second book was lent out...to my stats professor of last year.

If possible, I'll go back tomorrow and try to get my hands on the 1999 edition.

--
GCP
ff123
QUOTE
If you can get pairwise results with a very high significance then it's interesting anyway as the overall result may still be significant even with 28 comparisons.


That's what I meant -- that the overall result may still be significant even with 28 pairwise comparisons. So that we should both be able to agree on the results.
ff123
This page:

http://www.uvm.edu/~dhowell/StatPages/Resa...ndomOneway.html

implies that resampling methods for multiple means are not as simple as outlined in previous messages. It still talks about Bonferroni adjustments, for example.

Book reference:

Westfall, P. H. & Young, S. S. (1993) Resampling-based multiple testing. New York: John Wiley & Sons.

ff123

Edit: another page discussing p-value adjustments and resampling:

http://www.rz.tu-clausthal.de/sashtml/stat/chap43/sect14.htm
Garf
QUOTE
Originally posted by ff123
implies that resampling methods for multiple means are not as simple as outlined in previous messages.  It still talks about Bonferroni adjustments, for example.


You will always have that problem if you use pairwise tests to do a multiple comparison. The issue is that _if_ the bootstrap comparison is strong enough to give very significant pairwise results, there may be more results that are still significant even _after_ the Bonferroni correction.

The problem with Bonferroni is that it's so crude, and throws away a lot of results that may have been correct.

The second link is very interesting because it gives correction methods that are less crude.

Ironically, one of them is bootstrapping. It works by checking how often you would incorrectly (because the bootstrapping uses random data) have concluded that one of the p values was significant when it wasn't. More specifically, for each set of random data it determines the lowest significance level found among all the comparisons. The proportion of random data sets in which that lowest level is below the significance level you got for a given comparison on your real test is the new, corrected significance level for that comparison.

I just _love_ this method. I can actually understand how it works.
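
One way to sketch this "lowest p value" correction in Python is shown below. It is a simplified, single-layer variant that reuses the same set of shuffles both to estimate the pairwise p values and to find the lowest p value each shuffle would have produced; that shortcut is an assumption of this example and is not necessarily what the linked page or any program in this thread does. The name `min_p_adjusted` is hypothetical, and the input is assumed to be one row of ranks per listener.

```python
import random
from bisect import bisect_left, bisect_right

def min_p_adjusted(rank_rows, trials=2000, seed=0):
    """Return {pair: (raw_p, adjusted_p)} using a single-step min-p adjustment."""
    rng = random.Random(seed)
    k = len(rank_rows[0])
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]

    def abs_diffs(rows):
        sums = [sum(col) for col in zip(*rows)]
        return [abs(sums[i] - sums[j]) for i, j in pairs]

    observed = abs_diffs(rank_rows)
    # null[b][c]: statistic for comparison c in the b-th within-listener shuffle
    null = [abs_diffs([rng.sample(row, k) for row in rank_rows]) for _ in range(trials)]
    # sorted null values per comparison, so a tail proportion is one binary search
    columns = [sorted(col) for col in zip(*null)]

    def tail_p(c, value):
        # proportion of shuffles whose statistic for comparison c is >= value
        return (trials - bisect_left(columns[c], value - 1e-9)) / trials

    raw = [tail_p(c, observed[c]) for c in range(len(pairs))]
    # lowest pairwise p value that each random shuffle would have produced
    min_p = sorted(min(tail_p(c, row[c]) for c in range(len(pairs))) for row in null)

    result = {}
    for c, pair in enumerate(pairs):
        # proportion of shuffles whose lowest p value is <= the observed raw p value
        adjusted = bisect_right(min_p, raw[c] + 1e-12) / trials
        result[pair] = (raw[c], adjusted)
    return result
```

The adjusted value can never be smaller than the raw one, since any shuffle that beats the observed statistic for one comparison also has a lowest p value at or below the raw p value.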

--
GCP
PatchWorKs
I'm looking for volunteers to carry out (or possibly translate) tests on the various audio formats.
Contact me.

http://www.patchworks.it
Garf
I've written a crude utility based on your code. It doesn't do the simultaneous comparison correction yet.

(and may have bugs)

After 1 000 000 simulations:

Input file : aq1.txt
Read 8 treatments, 42 samples

[FUBAR formatted table snipped]

Resampling..........................................................................................
..........

cbr192 is worse than abr224 (0.01576)
cbr192 is worse than dm-xtrm (0.00030)
cbr192 is worse than mpc (0.00001)
cbr192 is worse than dm-ins (0.01259)
cbr192 is worse than cbr256 (0.01840)
cbr192 is worse than dm-std (0.00216)
abr224 is worse than mpc (0.01117)
r3mix is worse than dm-xtrm (0.01595)
r3mix is worse than mpc (0.00082)
mpc is better than dm-ins (0.01446)
mpc is better than cbr256 (0.00959)

After Bonferroni correction the alpha level is 0.001831, so that leaves:

mpc > cbr192, r3mix
dm-xtrm > cbr192
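
For reference, the per-comparison threshold for 28 pairs at an experiment-wise 5% level can be worked out as below. A plain Bonferroni division gives roughly 0.00179, while the Šidák form gives roughly 0.00183, so the 0.001831 quoted above appears closer to the latter. Which formula the utility actually uses is not stated in the thread, so treat this as a guess; the three comparisons listed above pass either threshold.

```python
comparisons = 8 * 7 // 2              # 28 pairs among 8 codec settings
alpha = 0.05                          # experiment-wise significance level

bonferroni = alpha / comparisons                      # simple division
sidak = 1 - (1 - alpha) ** (1 / comparisons)          # assumes independent comparisons

print(f"Bonferroni per-comparison alpha: {bonferroni:.6f}")
print(f"Sidak per-comparison alpha:      {sidak:.6f}")
```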

For fun, I'm going to check whether testing vs the means gives more or less power.

--
GCP
Garf
Using means:

cbr192 is worse than abr224 (0.02259)
cbr192 is worse than r3mix (0.04574)
cbr192 is worse than dm-xtrm (0.00050)
cbr192 is worse than mpc (0.00000)
cbr192 is worse than dm-ins (0.00475)
cbr192 is worse than cbr256 (0.01107)
cbr192 is worse than dm-std (0.00016)
abr224 is worse than mpc (0.00149)
abr224 is worse than dm-std (0.03514)
r3mix is worse than dm-xtrm (0.04195)
r3mix is worse than mpc (0.00056)
r3mix is worse than dm-std (0.01679)
dm-xtrm is worse than mpc (0.04614)
mpc is better than dm-ins (0.00782)
mpc is better than cbr256 (0.00324)

So that would give:

mpc > r3mix, abr224, cbr192
dm-std > cbr192
dm-xtrm > cbr192

--
GCP
ff123
Garf, very nice.

I've got Westfall and Young on order from barnesandnoble.com, should be an entertaining read.

ff123
Garf
I've got the simultaneous bootstrap correction implemented too, but darn, this thing needs horsepower!

You need at least 10000 iterations to more or less converge on the alpha values, and then each time 10000 to check the alpha values.

10 000 * 10 000 = 100 000 000 tests!

Yikes!

--
GCP
ff123
lol!

How long does it take to run 100 million trials? What kind of computer do you have and how fast is it?

ff123

Edit: And what were the results?
Garf
It took about 1-2 minutes for a test with 1000 * 1000 trials (and it didn't converge very well for the really small values, so more is definitely needed). One with 10 000 * 10 000 should take about 100-200 minutes.

I have an Athlon 1000. Optimizing my code would probably make things faster though. I didn't exactly code for efficiency.

It's up at http://sjeng.org/ftp/bootstrap.c

If possible, could you proofread it for bugs? Note that I switched it to medians instead of rank scores. If I understand things correctly, you can use whatever gives the most power. Means could be very interesting.

Edit: We will know the results in about 100 minutes.

Edit2: I confused medians and means...I think.

--
GCP
ff123
I took a quick look at the code. I checked the random number generation implementation, which looked correct. If I understand the resampling algorithm correctly, the code randomly chooses a listener, then for that listener, it shuffles the codecs. Then it randomly chooses another listener, but it could choose the same listener (I think). It does that N times, where N is the number of listeners.

Why not just shuffle the codecs for each listener? BTW, this was just a 5 minute glance, so I may have interpreted the code wrong (I have a hard enough time looking over my own code sometimes).

Also, I noticed that your code limits the number of listeners to MAXSAMPLE, which you set to 50. Can it be implemented to not care how many listeners are in the data input?

ff123

Edit: There's at least one other way I can think of to resample, besides the way it appears you did it, and the way I just described: Pool all the rankings together from all listeners, then randomly grab rankings out of that pool one at a time (replacing rankings each time) to reconstruct a new matrix. Which is the correct way to do it?
Garf
QUOTE
Originally posted by ff123
I took a quick look at the code.  I checked the random number generation implementation, which looked correct.


Perhaps a simple improvement is to pick a faster, higher-quality random number generator. I think I have some of those still lying around.

QUOTE
If I understand the resampling algorithm correctly, the code randomly chooses a listener, then for that listener, it shuffles the codecs.  Then it randomly chooses another listener, but it could choose the same listener (I think).  It does that N times, where N is the number of listeners.
Why not just shuffle the codecs for each listener? 


I think you're right on this. I read up a bit more, and sampling with or without replacement is one of the differences between a bootstrap and a randomization method. Since what we want is a randomization method, there should be no replacement. That said, the two are so closely related that I expect no differences in the results.
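
The two schemes being contrasted can be put side by side. This is a sketch with made-up function names, assuming each row holds one listener's scores: the randomization version keeps every listener exactly once and only shuffles scores across codecs, while the bootstrap-style version first draws listeners with replacement, as ff123 read the posted code.

```python
import random

def randomization_resample(rows, rng):
    """Keep every listener exactly once; shuffle that listener's scores across codecs."""
    return [rng.sample(row, len(row)) for row in rows]

def bootstrap_resample(rows, rng):
    """Draw listeners with replacement (some repeat, some drop out), then shuffle each."""
    picked = [rng.choice(rows) for _ in rows]
    return [rng.sample(row, len(row)) for row in picked]
```

Both return a new matrix of the same shape, so either can feed the same rank-sum or mean comparison.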

QUOTE
Also, I noticed that your code limits the number of listeners to MAXSAMPLE, which you set to 50.  Can it be implemented to not care how many listeners are in the data input?


Sure, just set MAXSAMPLE higher.

QUOTE
Edit:  There's at least one other way I can think of to resample, besides the way it appears you did it, and the way I just described:  Pool all the rankings together from all listeners, then randomly grab rankings out of that pool one at a time (replacing rankings each time) to reconstruct a new matrix.  Which is the correct way to do it?


Under the null hypothesis each setting has an equal chance of getting a certain score from the range of values that a certain listener uses. But the scores between different listeners are not comparable measurements. Still under the null hypothesis, it does not seem true that a certain sample has an equal chance of getting a certain score from the full range of values all listeners use.

So, I do not think you can also randomize the listeners. (But if you use ranks, it doesn't matter at all.)

Edit: One of the websites you linked to describes this under 'Repeated measures'

--
GCP
Garf
cbr192 is worse than abr224 (0.02360 vs 0.48990)
cbr192 is worse than r3mix (0.04510 vs 0.69920)
cbr192 is worse than dm-xtrm (0.00050 vs 0.01380)
cbr192 is worse than mpc (0.00010 vs 0.00190)
cbr192 is worse than dm-ins (0.00440 vs 0.12540)
cbr192 is worse than cbr256 (0.01160 vs 0.29520)
cbr192 is worse than dm-std (0.00020 vs 0.00370)
abr224 is worse than mpc (0.00190 vs 0.05520)
abr224 is worse than dm-std (0.03310 vs 0.59550)
r3mix is worse than dm-xtrm (0.03970 vs 0.65770)
r3mix is worse than mpc (0.00040 vs 0.00810)
r3mix is worse than dm-std (0.01550 vs 0.36640)
mpc is better than dm-ins (0.00980 vs 0.26020)
mpc is better than cbr256 (0.00280 vs 0.07850)

Note that these have errors of _at least_ 0.0001, and are based on only 10 000 trials (which amounts to 100M of actual tests).

If you compare the first column (pairwise alphas) with the values after 1M trials, you will see that at least in one case (abr224/mpc) the error is large enough to change the result (simultaneous alpha in second column just above 5%).

Edit: Just to make it clear: the first value between brackets is the pairwise alpha, the second one is the alpha after correction for the simultaneous test. The second one should be smaller than 0.05 for a truly significant result.
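
As a rough guide to how large those errors can be: a p value estimated as hits divided by trials cannot be resolved more finely than 1/trials, and if the trials are independent its standard error follows the usual binomial formula. The sketch below only covers that simple case; the function name is hypothetical, and the extra noise that the nested simulation adds to the corrected values is not modelled.

```python
import math

def monte_carlo_se(p, trials):
    """Binomial standard error of a p value estimated as hits / trials."""
    return math.sqrt(p * (1 - p) / trials)

trials = 10000
print("smallest step:", 1 / trials)   # resolution of the estimate
for p in (0.0019, 0.0552):            # the abr224/mpc pair from the list above
    print(p, "standard error about", round(monte_carlo_se(p, trials), 5))
```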

--
GCP
ff123
So to compare the rank data, the resampling method yields:

cbr192 is worse than dm-xtrm (0.00050 vs 0.01380)
cbr192 is worse than mpc (0.00010 vs 0.00190)
cbr192 is worse than dm-std (0.00020 vs 0.00370)
r3mix is worse than mpc (0.00040 vs 0.00810)

And Friedman/Fisher LSD yields:

mpc is better than r3mix, cbr192
dm-xtrm is better than cbr192
dm-std is better than cbr192

It seems they yield the same results!

ff123
ff123
By the way, on an unrelated note, one can change the results a bit by eliminating listener number 16, who was quite severe with overall ratings (in fact, he is the most severe rater), but who rated dm-std as a 5.0. If you do that, the ranked data yields ranksums which put dm-xtrm before dm-std, just like the ANOVA does.

ff123

Edit: Oops, I meant that the parametric analysis is changed to look like the ranked method, where dm-xtrm is better than dm-std.
Garf
The difference is that the resampling results are guaranteed to hold for all comparisons at the same time with > 95% certainty.

You can add abr224 < mpc to the resampling results, BTW. I checked, and it fell through because of a poor estimate of the alpha value after only 10000 trials; I am running a test with 25000 trials now (will take half a day). It was already confirmed to hold with the Bonferroni correction, which is safe (and even overconservative).

But hey, it's always nice to see things confirm each other.

--
GCP
Garf
QUOTE
Originally posted by ff123
By the way, on an unrelated note, one can change the results a bit by eliminating listener number 16, who was quite severe with overall ratings (in fact, he is the most severe rater), but who rated dm-std as a 5.0.


Hmm, that's not acceptable for doing actual analysis on though.

One thing I think I _can_ do is to simply eliminate everybody who gave all-5's. After resampling those results are not changed anyway, and they do not affect the differences between the means.

That would speed up the analysis quite a bit, but I want to crosscheck it really does not affect any results.

Edit: Hmm, it may make a small difference anyway, so I'm going to keep them in just to be sure.

--
GCP