Statistics For Abx

2002-08-27 16:58:11

Hopefully in the near future, I can implement an indicator of whether or not a listener should continue to perform ABX testing based on certain specified parameters:

alpha: probability of stating that a difference occurs when it does not (this is the parameter we are typically concerned with, which is usually set to 0.05)

beta: probability of stating that no difference occurs when it does

p0: the expected proportion of correct decisions when the samples are identical (0.5 for ABX)

p1: the expected proportion of correct decisions when the odd sample is detected (other than by guess).

We have historically not concerned ourselves with beta and p1, but I think it would be advantageous to do so for tests of very subtle differences.

ff123

Reply #1 – 2002-08-27 18:51:36

Is this related to my question concerning guessing probability?

http://www.audio-illumination.org/forums/i...2ac97d72fa86932

Reply #2 – 2002-08-27 20:57:23

Ideally, I would pop up a graph with the y axis showing the number of correct responses and the x axis showing the number of total trials. There would be two lines, one of which shows how many correct responses there need to be for any particular number of total trials to achieve 0.05 significance for hearing a difference. The other line would show at what point the listener should just give up, because the chance of getting a false negative is below what the chosen beta and p1 indicate. I'll post the formula later tonight.

ff123

Reply #3 – 2002-08-27 21:52:36

ff123,

You may want to consider expressing the running (and final) result of an ABX test using confidence intervals. It is different from the more commonly used “hypothesis testing” method, but it could be a nice complement here.

For example:

In another thread you reported the results of a test where you scored 52/82 giving a p-value of 0.010.

You could also report that, during the test, your probability of choosing the correct sample was p = 0.634 (i.e., 52/82) with a 99% confidence interval of (+/- 0.137) (i.e., p = 0.634 +/- 0.137).

Notice that the lower end of the confidence interval only just reaches p = 0.5, which jibes with what the p-value indicates.

For comparing subtle differences in ABX tests I think this could be quite useful as compared to using a p-value alone, which gives no information about the magnitude of a perceived difference.

The caveat with this method is that the binomial distribution becomes non-normal at the extreme edges (where p is close to 1 in the case of ABX). Calculating the confidence interval becomes less than trivial in this case (otherwise it is quite simple).

… as far as your software is concerned, it sounds like you have some good ideas… but it also looks a little complicated. I’ll have to think about it.

Reply #4 – 2002-08-27 21:54:14

... hmm, I can post when not logged in???

Reply #5 – 2002-08-28 07:28:20

Well, it turns out I don't understand the statistics quite well enough to be confident about adding it to abchr.

The type of analysis is called a sequential test. For example:

http://home.clara.net/sisa/sprthlp.htm
and
http://education.indiana.edu/~frick/decide/intro.html

The formula for the lower line (below which similarity is declared and the test is stopped) is:

d0 = { log(beta) - log(1-alpha) - n*log(1-p1) + n*log(1-p0) } /
{ log(p1) - log(p0) - log(1-p1) + log(1-p0) }

The formula for the upper line (above which a difference is declared and the test is stopped) is:

d1 = { log(1-beta) - log(alpha) - n*log(1-p1) + n*log(1-p0) } /
{ log(p1) - log(p0) - log(1-p1) + log(1-p0) }

alpha, beta, p0, and p1 are as described in my first post.

Basically, alpha, beta, p0, and p1 are decided upon prior to a test, and the test continues until the number of correct trials exceeds the upper line or goes below the lower line.
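A minimal Python sketch of the two stopping lines, for anyone who wants to plot them or try a sequence of answers. The parameter values (alpha = 0.05, beta = 0.2, p1 = 0.7) and the function names are illustrative assumptions; the post does not say which values were used.

```python
import math

def sprt_bounds(n, alpha=0.05, beta=0.2, p0=0.5, p1=0.7):
    """Return (d0, d1): the lower and upper stopping lines after n trials."""
    denom = math.log(p1) - math.log(p0) - math.log(1 - p1) + math.log(1 - p0)
    drift = n * (math.log(1 - p0) - math.log(1 - p1))
    d0 = (math.log(beta) - math.log(1 - alpha) + drift) / denom
    d1 = (math.log(1 - beta) - math.log(alpha) + drift) / denom
    return d0, d1

def sprt_decision(answers, **params):
    """Walk through True/False answers and report where the test stops."""
    correct = 0
    for n, ok in enumerate(answers, start=1):
        correct += ok
        d0, d1 = sprt_bounds(n, **params)
        if correct > d1:
            return ("difference", n)
        if correct < d0:
            return ("no difference", n)
    return ("continue", len(answers))

print(sprt_decision([True] * 9))
print(sprt_decision([True] * 4 + [False] * 6))
```

With these assumed values, 9 straight correct answers stop the test for a difference, and 4 correct followed by 6 wrong stops it at trial 10. Other choices of beta and p1 move both lines.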

When I entered this into an Excel spreadsheet, though, and put in typical values for the test parameters, it invariably resulted in having to get more correct trials per n than what the binomial distribution would give. So I need to understand why that is.

Here is a chart showing what this test is all about. In this example, let's say that someone scored the first 4 trials correct, but then got the next 6 trials incorrect. At that point, the sequential test would tell him to stop any further trials, because it would just be a waste of time. Of course, the counter to this savings in time is that it now takes 9 consecutive correct trials before the test is stopped on the other side. Straight binomial distribution only requires 5 consecutive correct trials to declare a difference at 95% confidence.

ff123

Reply #6 – 2002-08-28 10:24:27

I don't have my statistics books with me and it's a bit late to read a ton on it at 4am anyway, but from a quick perusal (primarily of your 2nd link) it seems like they're taking something additional into account to make sure that "the test is really safe to stop now". I.e. with a normal binomial distribution after n trials you say "p < 0.05, so it's significant", but with this method, they seem to be saying "if we were to stop right now and do a normal analysis, p < 0.05, but are we reasonably confident that p will stay below 0.05 as the test continues?"

Of course my reading of it could be completely wrong, as 4am is not the best time to do statistical analysis. =]

But I'd expect there to be something different about the lines, or else there would be no need for a separate method of sequential analysis -- you'd just do the normal analysis after each test, and stop when p < 0.05.

Reply #7 – 2002-08-28 14:50:35

I guess what I'm having trouble understanding is why wouldn't someone just calculate the alpha and beta risks directly for each situation (like I currently do for alpha), and then stop the test when either alpha or beta dips below some pre-specified level?

Another website:

http://www.uib.no/isf/medseq.htm

As near as I can tell, if one is allowed to see the results of an ABX test in progress, then one needs to set the level of significance stricter than 0.05. See Table 1 of the website reference above.

Also, apparently the method of using the double lines to stop a sequential test is pretty old and hoary. The web page mentions Repeated Statistical Tests, replacing the borderlines with repeated t-tests.

ff123

Edit: using intuitive arguments: basing a decision to stop a test on knowledge of its progress (as is currently the case with abchr) is subject to a bias. I.e., I always stop the test when it's advantageous for me to do so. That would seem to imply that a stricter stopping criterion is needed to be equivalent to a test where the number of trials is pre-determined.

Question has been submitted to sci.math.stat, as this appears to be a rather large issue which needs to be resolved.

Reply #8 – 2002-08-28 16:49:44

One more thing, Table 1 on the page I listed above also seems to imply (just like the lines in the chart I drew) that for a sequential test, one must achieve a "nominal" significance of 0.01 for 9 total trials to be equivalent to a fixed test significance of 0.05. That is, one should perform at least 9 out of 9 trials on a sequential test (results known after every trial) before stopping an ABX test. Current fixed test stopping point is 5 out of 5.

Probably a conservative approach would be to use the double-line method for now until I can figure out how to refine it using simulation.

ff123

Reply #9 – 2002-08-28 16:49:56

Quote

I guess what I'm having trouble understanding is why wouldn't someone just calculate the alpha and beta risks directly for each situation (like I currently do for alpha), and then stop the test when either alpha or beta dips below some pre-specified level?

I think the bottom line is that obtaining a certain p-value is not quite the same as obtaining the corresponding level of confidence in a test where the significance level is chosen a priori. This is what continuum was talking about in his previous thread.

As an example, this means that when you score 12/16 on an ABX test, the probability that you obtained the score by chance is actually greater than what the p-value indicates (by how much I don't know). This may seem at first absurd but it has to be true.

Calculating alpha and beta values for each situation (during the test) undermines the a priori part of significance testing and should be done with caution.

Reply #10 – 2002-08-28 17:15:48

I didn't understand the previous thread then, but I understand the problem now, I think. And there is apparently at least one way to do a rough adjustment (the double line method), and as usual, several ways to refine it, the most accurate way probably being some sort of simulation.

ff123

Reply #11 – 2002-08-28 20:16:32

Quote

... hmm, I can post when not logged in???

ff123,

You may want to consider expressing the running (and final) result of an ABX test using confidence intervals. It is different from the more commonly used “hypothesis testing” method, but it could be a nice complement here.

For example:

In another thread you reported the results of a test where you scored 52/82 giving a p-value of 0.010.

You could also report that, during the test, your probability of choosing the correct sample was p = 0.634 (i.e., 52/82) with a 99% confidence interval of (+/- 0.137) (i.e., p = 0.634 +/- 0.137).

Notice that the lower end of the confidence interval only just reaches p = 0.5, which jibes with what the p-value indicates.

For comparing subtle differences in ABX tests I think this could be quite useful as compared to using a p-value alone, which gives no information about the magnitude of a perceived difference.

The caveat with this method is that the binomial distribution becomes non-normal at the extreme edges (where p is close to 1 in the case of ABX). Calculating the confidence interval becomes less than trivial in this case (otherwise it is quite simple).

… as far as your software is concerned, it sounds like you have some good ideas… but it also looks a little complicated. I’ll have to think about it.

shday,
How did you calculate the confidence interval? With what formula? (I suppose you are using something like: 0.99 = NORMALDIST( (x-np)/sqrt(npq) ) where n=82, p=q=1/2, but that's a wild guess. )

I don't really understand what the interpretation of this interval is. Which probability are we investigating?

For comparison, here's a graph with the traditional p-val calculation: the y-coordinate of a black point is the number of correct ABX trials, out of the total number of trials given by the x-coordinate, required to achieve a confidence (1 minus the p-val) greater than 0.99 (0.95 for the red points).
http://www.freewebz.com/aleph/095-graph.png

Reply #12 – 2002-08-28 20:23:49

Situation 1: the number of trials is fixed in advance. The calculated p-val (the probability of getting the same or a better result by guessing) then depends only on the listener's answers.

Situation 2: a level of confidence (calculated the same as the p-val above) is required, the number of trials used (test length) is irrelevant.

Obviously the two situations are quite different, which is our current problem. The question is: what is the probability of reaching a certain level of confidence by guessing, if one is allowed an unlimited number of trials? It is clear that the probability of achieving 0.99 confidence by guessing is greater than 0.01. But by how much?

(This reminds me of the theory of the "simple random walk", where you reach a certain point with probability=1 but infinite expected moves...)

Reply #13 – 2002-08-28 23:45:51

Here is a web page of the clearest explanation I have seen so far of the problem:

http://www3.mdanderson.org/depts/biostatis...tatmethods.html

----------------
An excerpt:

Consider another sequential design, one of a type of group-sequential designs commonly used in clinical trials. The experimental plan is to stop at 17 tries if 13 or more are successes or 13 or more are failures, and hence the experiment is stopped on target. But if after 17 tries the number of successes is between 5 and 12 then the experiment continues to a total of 44 tries. If at that time, 29 or more are successes or 29 or more are failures then the null hypothesis is rejected. To set the context, suppose the experiment is nonsequential, with sample size fixed at 44 and no possibility of stopping at 17; then the exact significance level is again 0.049. When using a sequential design, one must consider all possible ways of rejecting the null hypothesis in calculating a significance level. In the group-sequential design there are more ways to reject than in the nonsequential design with the sample size fixed at 17 (or fixed at 44). The overall probability of rejecting is greater than 0.049 but is somewhat less than 0.049 + 0.049 because some sample paths that reject the null hypothesis at sample size 17 also reject it at sample size 44. The total probability of rejecting the null hypothesis for this design is actually 0.080. Therefore, even though the results beyond the first 17 observations are never observed, the fact that they might have been observed makes 13 successes of 17 no longer statistically significant (since 0.08 is greater than 0.05).

To preserve a 0.05 significance level in group-sequential or fully sequential designs, investigators must adopt more stringent requirements for stopping and rejecting the null hypothesis. That is, they must include fewer observations in the region where the null hypothesis is rejected. For example, the investigator in the above study might drop 13 successes or failures in 17 tries and 29 successes or failures in 44 tries from the rejection region. The investigator would stop and claim significance only if there are at least 14 successes or at least 14 failures in the first 17 tries, and claim significance after 44 tries only if there are at least 30 successes or at least 30 failures. The nominal significance levels (those appropriate had the experiment been nonsequential) at n=17 and n=44 are 0.013 and 0.027, and the overall (or adjusted) significance level of rejecting the null hypothesis is 0.032. (No symmetric rejection regions containing more observations allow the significance level to be greater than this but still smaller than 0.05.) With this design, 13 successes of 17 is not statistically significant (as indicated above) because this data point is not in the rejection region.
-----------------
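The arithmetic in the excerpt can be checked exactly rather than by simulation. Below is a hedged Python sketch that follows every possible path of a pure guesser (p = 0.5) through a two-sided group-sequential design; the function name and the (n, k) encoding of a look are my own assumptions, not anything from the quoted page.

```python
def group_sequential_alpha(looks):
    """Exact probability that a pure guesser (p = 0.5) rejects the null.

    looks: list of (n, k) pairs in increasing n. At look n the test stops
    and rejects if successes >= k or failures >= k (two-sided).
    """
    paths = {0: 1.0}   # successes -> probability of still being in the test
    done = 0
    total = 0.0
    for n, k in looks:
        for _ in range(n - done):          # run the trials up to this look
            nxt = {}
            for s, pr in paths.items():
                nxt[s] = nxt.get(s, 0.0) + pr / 2
                nxt[s + 1] = nxt.get(s + 1, 0.0) + pr / 2
            paths = nxt
        done = n
        for s in list(paths):              # remove paths that reject here
            if s >= k or n - s >= k:
                total += paths.pop(s)
    return total

print(group_sequential_alpha([(17, 13)]))             # single look at 17
print(group_sequential_alpha([(17, 13), (44, 29)]))   # original design
print(group_sequential_alpha([(17, 14), (44, 30)]))   # adjusted design
```

The excerpt reports 0.049 for the single look, 0.080 for the original two-look design, and 0.032 for the adjusted one, so those are the figures to compare the printed values against.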

The best way to minimize this problem would be to take fewer "looks" at the results in progress. For example, suppose I looked at the results after 7 trials, then 14, then 21, and 28, where 28 would be the maximum allowable trials before I stop the test altogether. 4 looks would mean that the nominal significance at each look would have to be about 0.016 to achieve an overall significance of 0.05 (according to table 1 at the MEDSEQ website).

It's possible that I could move the first look down to trial 6, and still be able to keep the nominal significance at 0.016 (although I'd have to simulate to make sure). That would be the best case, because it would allow a forced ABX test to take place in which the minimum number of trials to achieve significance is set to something as low as reasonably possible.

ff123

Reply #14 – 2002-08-29 00:45:17

Quote

shday,
How did you calculate the confidence interval? With what formula? (I suppose you are using something like: 0.99 = NORMALDIST( (x-np)/sqrt(npq) ) where n=82, p=q=1/2, but that's a wild guess. )

I don't really understand what the interpretation of this interval is. Which probability are we investigating?

The CI was calculated from the standard deviation of the observed proportion of successes:

standard deviation = sigma = sqrt(pq/n) where p = 52/82, q = 1 - p, and n = 82

Then the CI was calculated assuming a normal distribution:

CI = p +/- 2.58 * sigma (the 2.58 comes from the 99% confidence level. In Excel the formula is NORMSINV(0.995) = 2.5758...)

CI = 0.634 +/- 2.58*0.053 = (0.497, 0.771)
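The same calculation as a small Python sketch, using only the standard library. The function name is an assumption for illustration; the formula is the normal approximation described above.

```python
import math
from statistics import NormalDist

def normal_ci(correct, n, confidence=0.99):
    """Normal-approximation confidence interval for the proportion correct."""
    p = correct / n
    sigma = math.sqrt(p * (1 - p) / n)
    # two-sided: 99% confidence leaves 0.5% in each tail
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return p - z * sigma, p + z * sigma

low, high = normal_ci(52, 82)
print(f"{low:.3f}, {high:.3f}")   # 0.497, 0.771
```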

As a rule of thumb, the assumption of a normal distribution can be considered adequate if:

(1/sqrt(n)) * |sqrt(q/p) - sqrt(p/q)| < 0.3

There are tables that give exact CI's for binomial distributions. Unlike the above approximation, they are never centred exactly at p (except when p = 0.5). If one were interested, there should be a way to calculate the exact CI's (no normal distribution assumption).
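One common way to get such an exact interval is the Clopper-Pearson construction, which inverts the binomial tail probabilities directly. The sketch below is a plain-Python illustration with bisection and my own helper names; it is not taken from the book cited in this post.

```python
import math

def binom_tail_ge(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

def _solve(f, target, lo=0.0, hi=1.0, steps=60):
    """Bisection for an increasing function f on [lo, hi]."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_ci(k, n, confidence=0.99):
    """Clopper-Pearson interval for k successes in n trials."""
    a = (1 - confidence) / 2
    # lower bound: the p at which P(X >= k) equals a
    lower = 0.0 if k == 0 else _solve(lambda p: binom_tail_ge(k, n, p), a)
    # upper bound: the p at which P(X <= k) equals a, i.e. P(X >= k+1) = 1 - a
    upper = 1.0 if k == n else _solve(lambda p: binom_tail_ge(k + 1, n, p), 1 - a)
    return lower, upper

print(exact_ci(52, 82))
```

The interval is generally not symmetric about the observed proportion, which is the behaviour described above.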

The 99% confidence interval given for 52/82 should be interpreted as follows: upon repeated tests, 99% of the intervals calculated this way will include the true value of p. This also means that, if the interval does not include p=0.5, guessing can be rejected at the 1% significance level (two-sided), which is roughly what we mean by saying the listener heard a difference.

Most of this stuff is new ground for me. It comes from "Statistics for Experimenters" by Box, Hunter and Hunter (1978).

Reply #15 – 2002-08-29 02:01:08

Quote

The best way to minimize this problem would be to take fewer "looks" at the results in progress. For example, suppose I looked at the results after 7 trials, then 14, then 21, and 28, where 28 would be the maximum allowable trials before I stop the test altogether. 4 looks would mean that the nominal significance at each look would have to be about 0.016 to achieve an overall significance of 0.05 (according to table 1 at the MEDSEQ website).

It's possible that I could move the first look down to trial 6, and still be able to keep the nominal significance at 0.016 (although I'd have to simulate to make sure). That would be the best case, because it would allow a forced ABX test to take place in which the minimum number of trials to achieve significance is set to something as low as reasonably possible.

ff123

IMO this seems like a simple and rigorous way to improve your tool. Also, the results may as well be kept hidden from the listener until the predetermined points. This has the additional advantage of keeping the trials more independent.

Reply #16 – 2002-08-29 08:22:20

I just did a quick calculation using a modified Pascal triangle and the 0.95 confidence points from my chart above. Basically I assumed a simplified version of situation 2: ABX trials are attempted by guessing. The test is stopped when a confidence level of 0.95 is reached (using the traditional p-val method) or when 16 trials are completed.
The result: the probability to pass this ABX test is 0.08755, i.e. significantly more than 0.05. (If there's no mistake on my side, this is an exact value)

Reply #17 – 2002-08-29 08:59:49

I hacked a quick and dirty sequential ABX simulator with the ability to perform up to 6 "looks." Here is the screen shot:

Each of the 5 looks in this example (max of 30 trials are allowed) has a nominal p less than 0.05, but the total p for the test is about 0.05. You can download this simulator (don't increase Num Sims too much or you'll hang your computer!) at:

http://ff123.net/export/seqsim.zip

ff123

Edit: The following looks and number correct distribute the alphas a little more evenly.

look = 6, numcorrect = 6, nomalpha = 0.016
look = 12, numcorrect = 10, nomalpha = 0.019
look = 18, numcorrect = 14, nomalpha = 0.015
look = 23, numcorrect = 17, nomalpha = 0.017
look = 28, numcorrect = 20, nomalpha = 0.018

total alpha = 0.050
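For anyone without the simulator, here is a rough Python sketch of the same idea: let a pure guesser run the test many times and count how often any look point is passed. This is an independent illustration, not the code inside seqsim.zip; the names, the seed, and the default number of simulations are assumptions.

```python
import random

# trial number of each look -> number correct needed at that look
LOOKS = {6: 6, 12: 10, 18: 14, 23: 17, 28: 20}

def simulate_overall_alpha(looks=LOOKS, num_sims=100_000, seed=1):
    """Estimate the chance that a guesser passes at any of the look points."""
    rng = random.Random(seed)
    last = max(looks)
    passes = 0
    for _ in range(num_sims):
        correct = 0
        for trial in range(1, last + 1):
            correct += rng.random() < 0.5      # one guessed trial
            if trial in looks and correct >= looks[trial]:
                passes += 1                     # stop early with a pass
                break
    return passes / num_sims

print(simulate_overall_alpha())
```

The estimate carries sampling noise, so it should land near the total alpha quoted above rather than match it digit for digit.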

Reply #18 – 2002-08-29 12:14:55

I've written a program for evaluating the corrected p-val for situation 2. It takes the maximum allowed number of ABX trials and the required confidence (i.e. 1 minus the traditional p-val) as arguments and returns the probability of passing the test by guessing.

Example:
(4, 0.95) -> 0 (impossible)
(5, 0.95) -> 0.03125 (5/5 -> pval=.96875)
(6, 0.95) -> 0.03125 (either the test is won at 5/5 or lost)
(16, 0.95) -> 0.08755493164
(100, 0.95) -> 0.2020580977 (!!!)
(16, 0.99) -> 0.01422119141

The idea is:
Construct a pascal triangle up to a certain level.

Code: [Select]

    1    
    1    1
    1    2    1
    1    3    3    1
    1    4    6    4    1
    1    5    10    10    5    1
    ....

Now we can read it as follows: A(row=trials+1, column=correct+1) / 2^trials = P(abx=correct/trials | trials used) (e.g. the probability to score 2 times correct out of 3 is A(4, 3)/2^3 = 3/8)
The pval would be the sum of all those probabilities to the left of the chosen item.
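A short Python sketch of this reading of the triangle, with `math.comb` standing in for the triangle entries. The function names are assumptions; "confidence" here is the sum to the left of the chosen item, and the conventional p-value is one minus that.

```python
from math import comb

def prob_exact(correct, trials):
    """P(exactly `correct` right out of `trials`) for a guesser."""
    return comb(trials, correct) / 2**trials

def confidence(correct, trials):
    """Sum of the entries to the left: P(fewer than `correct` right)."""
    return sum(comb(trials, k) for k in range(correct)) / 2**trials

def p_value(correct, trials):
    """P(`correct` or more right by guessing)."""
    return 1 - confidence(correct, trials)

print(prob_exact(2, 3))    # 0.375
print(confidence(5, 5))    # 0.96875
print(p_value(5, 5))       # 0.03125
```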

My program does the following: find the earliest win condition (e.g. 5/5), calculate its probability (e.g. 0.03125), set the corresponding item in the Pascal triangle to 0, and recalculate the rows below it, e.g.:
   1   5   10   10   5   0
   1   6   15   20   15   5   0
   ....
The last step makes sure that nothing is counted twice. Now start again: the next win condition with nonzero probability is 7/8, with a remaining probability of P(abx=7/8 | not 5/5) = 0.0195..., and so on.

Here's the Maple source code, I hope it's readable (# starts a comment):

Code: [Select]

CorrPVal:=proc(n,reqConfidence,Prob)
local k, Trial, LastResult, Result, Confidence:

  Result:=array([1,seq(0,i=1..n)]):  # initialize [1,0,...,0]
  Prob:=0:

  for Trial from 1 to n do           # create new line of triangle
    LastResult:=copy(Result):        # only the last line is required, so nothing
                                     # more is stored / copy to helper variable
    Confidence:=0:
    Result[1]:=1:                    # set first element to 1

    k:=1:                            # now the rest
    Confidence:=Confidence+binomial(Trial,0)*1/2^Trial:

    while Confidence < eval(reqConfidence) and k <= Trial do
# check if target confidence is reached or all trials have been attempted
      Result[k+1]:=LastResult[k]+LastResult[k+1]:
# calculate new element of Pascal triangle
      Confidence:=Confidence+binomial(Trial,k)*1/2^Trial:
# increase Confidence (->more trials were correct)
      k:=k+1:
    end do:

    if k<=Trial then                 # winning condition
      Prob:=evalf(eval(Prob)+(LastResult[k]+LastResult[k+1])/2^Trial):
# add to sum of all winning probabilities
    end if:
  end do:
end proc:

# now follows the execution
# the result is stored in variable 'prob'
CorrPVal(16,0.95,prob):
prob;
# the result is displayed
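For readers without Maple, here is a rough Python rendering of the same idea. It is a sketch rather than a line-by-line port: the names are my own, and it uses `fractions.Fraction` so that the path counts stay exact.

```python
from fractions import Fraction
from math import comb

def corrected_p_value(max_trials, req_confidence):
    """Chance that a guesser reaches req_confidence at some trial <= max_trials."""
    paths = [1] + [0] * max_trials   # paths[k]: surviving paths with k correct
    total = Fraction(0)
    for trial in range(1, max_trials + 1):
        # next row of the modified Pascal triangle
        new = [0] * (max_trials + 1)
        new[0] = paths[0]
        for k in range(1, trial + 1):
            new[k] = paths[k - 1] + paths[k]
        # find the smallest score whose traditional confidence hits the target
        cum = 0                      # unmodified triangle entries left of k
        for k in range(trial + 1):
            if Fraction(cum, 2**trial) >= req_confidence:
                total += Fraction(sum(new[k:trial + 1]), 2**trial)
                for j in range(k, trial + 1):
                    new[j] = 0       # winners leave the test
                break
            cum += comb(trial, k)
        paths = new
    return float(total)

print(corrected_p_value(5, 0.95))    # 0.03125
print(corrected_p_value(16, 0.95))   # 0.087554931640625
```

The second value is 5738/65536, which agrees with the 0.08755493164 listed in the examples above.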

Reply #19 – 2002-08-29 15:43:14

I think I'll probably create several different typical ABX "profiles" for a listener to choose from. For example, one of the profiles will be the 28 max trials case, using 5 looks into the progress (4 of them before the end) as shown in my last edited message. This gives the listener 4 decision points to terminate early, but still meet the overall p.

The only problem I haven't figured out yet is what to do if the listener terminates the test in between the look points.

ff123

Edit: Ok, I believe I can create an entire profile by using the simulator to pick enough points that I can create a sort of look up table, so that if the listener terminates in between look points, I can still tell if the overall p was met. But basically, the in-between termination becomes an extra look point. I'll construct such a table later.

Reply #20 – 2002-08-29 16:35:18

One more thing: I'm not sure if it hurts to look at the progress of the first 5 trials for the 28-trial profile.

For example, outcome 1: all 5 trials are correct. The listener cannot terminate early and still have the overall p be less than 0.05. Outcome 2: all 5 trials are incorrect. The listener still has a chance of getting 17 out of 23 or 20 out of 28 to pass the test.

But the reason I'm not sure is because the listener may form an estimate of his chances of succeeding and decide to terminate because of this information.

Oh, for the 28-trial profile, I probably ought to tell the listener that he's wasting his time if he gets more than 8 trials incorrect.

ff123

Reply #21 – 2002-08-29 17:10:07

Quote

One more thing: I'm not sure if it hurts to look at the progress of the first 5 trials for the 28-trial profile.

If you want the tool to be statistically sound then don’t let the listener see the progress at all, even at the look points. What does it add to the test anyhow? You’ve now taken steps to ensure that the listener isn’t wasting time (very nice solution btw). As far as I can tell, saving wasted time was the only valid reason for allowing the listener to watch the progress in the first place. As I’ve said before, knowing the progress of the test compromises the independence of the trials and should be avoided if possible.

Quote

The only problem I haven't figured out yet is what to do if the listener terminates the test in between the look points.

You probably shouldn’t be too concerned about the listener quitting early. If someone does a test they should be encouraged to stick it out to the end rather than quitting. I see no harm in allowing the test to terminate when the listener makes 9 incorrect choices. This way the listener will have no good reason to quit. (If they do quit, then give them the p-val, with all its caveats, and be done with it!)

Reply #22 – 2002-08-29 18:27:11

Quote

If you want the tool to be statistically sound then don’t let the listener see the progress at all, even at the look points.

That could be another profile: for example, mandatory 16 trials, no quitting early, only get results at the end. BTW, the profile scheme with look points is just as sound statistically as the no-look profile, provided that the overall p comes out at 0.05 or below.

Quote

What does it add to the test anyhow? You’ve now taken steps to ensure that the listener isn’t wasting time (very nice solution btw). As far as I can tell, saving wasted time was the only valid reason for allowing the listener to watch the progress in the first place.

Yes, that's the whole idea of this exercise. I want to make it as convenient as possible for the listener to complete a valid ABX session. A no-look ABX test can be a huge time-waster.

I think I have the "terminate-in-between-look-points" problem solved, and I think I'll allow the listener to see the first 5 trials in the 28-trial profile (in addition to the look-point at trial 6).

Are there any other profiles that might be useful? Perhaps a very large profile, like 60 max trials? Although I'm not sure who would actually use such a profile.

ff123

Reply #23 – 2002-08-29 20:11:44

Quote

The best way to minimize this problem would be to take fewer "looks" at the results in progress. For example, suppose I looked at the results after 7 trials, then 14, then 21, and 28, where 28 would be the maximum allowable trials before I stop the test altogether. 4 looks would mean that the nominal significance at each look would have to be about 0.016 to achieve an overall significance of 0.05 (according to table 1 at the MEDSEQ website).

Until now I was busy calculating the corrected p-vals where the user is allowed to look at his results after every trial, but it shouldn't be too difficult to modify my source code to incorporate certain look-points (which probably is the best solution).
This way, we could calculate exact values. Furthermore, the algorithm is acceptably fast (polynomial), so I think it's quite possible to integrate a real-time calculation into your program.

Quote

Edit: Ok, I believe I can create an entire profile by using the simulator to pick enough points that I can create a sort of look up table, so that if the listener terminates in between look points, I can still tell if the overall p was met. But basically, the in-between termination becomes an extra look point. I'll construct such a table later.

BTW: Here is the same program as above in Excel VBA (version 95), for those infidels: http://www.freewebz.com/aleph/CorrPVal.xls

Quote

The only problem I haven't figured out yet is what to do if the listener terminates the test in between the look points.

This might be a very difficult question. Allowing the listener to choose the time of termination will increase complexity.

Quote

One more thing: I'm not sure if it hurts to look at the progress of the first 5 trials for the 28-trial profile.

Theoretically, it shouldn't hurt, as the gained information is of no value to a guessing test person.

Quote

But the reason I'm not sure is because the listener may form an estimate of his chances of succeeding and decide to terminate because of this information.

Quote

You probably shouldn’t be too concerned about the listener quitting early.

I agree with shday here. If the listener still believes he can hear a difference, he will continue the test anyway. If not, I don't think he would be able to ABX something he can't hear.

Quote

That could be another profile: for example, mandatory 16 trials, no quitting early, only get results at the end. BTW, the profile scheme with look points is just as sound statistically as the no-look profile, provided that the overall p comes out at 0.05 or below.

Yes, profiles seem to be a good way to satisfy everyone.

Quote

Are there any other profiles that might be useful? Perhaps a very large profile, like 60 max trials? Although I'm not sure who would actually use such a profile.

Garf, for example. And I myself used long runs before.

Reply #24 – 2002-08-29 20:25:17

Here is the lookup table I would use for the 28-trial profile:

*0 wrong: at least 6 of 6 (can't have fewer than 6 trials with 0 wrong)
1 wrong: at least 9 of 10 (can't have fewer than 10 trials with 1 wrong)
*2 wrong: at least 10 of 12
3 wrong: at least 13 of 16
*4 wrong: at least 14 of 18
5 wrong: at least 17 of 22
*6 wrong: at least 17 of 23
7 wrong: at least 19 of 26
*8 wrong: at least 20 of 28

Notes:
* = look points
1. overall test significance is 0.05
2. listener is not allowed to perform ABX trials past the max of 28.
3. listener is allowed to see trials 1 through 5 in addition to the early-decision look points
4. ABX is terminated if listener gets 9 or more trials wrong.
5. listener can terminate at any time, with overall results taken from the above table.

ff123
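As an illustration only, the lookup table above could be applied along these lines in Python. The function name, the return strings, and the choice to index the table by the number of wrong answers are assumptions, not abchr's actual implementation.

```python
# wrong answers so far -> correct answers needed to pass (from the table above)
REQUIRED_CORRECT = {0: 6, 1: 9, 2: 10, 3: 13, 4: 14,
                    5: 17, 6: 17, 7: 19, 8: 20}

def verdict(correct, wrong):
    """Return 'pass', 'fail', or 'continue' for the 28-trial profile."""
    if wrong >= 9:
        return "fail"                 # note 4: 9 or more wrong ends the test
    if correct >= REQUIRED_CORRECT[wrong]:
        return "pass"                 # overall significance of 0.05 is met
    return "continue"                 # not decided yet; total is still under 28

print(verdict(6, 0))     # pass
print(verdict(9, 2))     # continue
print(verdict(12, 9))    # fail
```

If the listener quits while the verdict is still "continue", note 5 applies: the result is read from the table as it stands, which means the test was not passed.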