AAC @ 128kbps listening test discussion |
Mar 1 2004, 00:25
Post
#276
|
|
Group: Members (Donating) Posts: 478 Joined: 22-November 01 From: United Kingdom Member No.: 519 |
I stand by my proposal. Untamperable ABX scores are IMO the most reliable tool to know whether the person could hear the difference or not. Since we are in a position where every valid result is valuable, dumping ranked references which have been ABXed successfully seems a waste of data we usually cannot afford. Moreover, ABX tests provide data not only from the p-value but also, to a certain degree, from the number of trials the person needed to get the significant result (esp. if we have access to the ABX trial sequence for that sample). If a person needed 40 trials to get a statistically valid result, you can tell at least that the difference wasn't all that obvious on the part he ABXed (and much less so with 200 trials). However, a respectable result of, say, 10/11, 14/16 or 20/24 speaks for a consistently picked up difference. In such a case, counting ranked references as 5.0 also removes potentially relevant information about a codec's quality, since we are disregarding the evidence that the sample is indeed not transparent.
Ranking a sample >4.0 acknowledges that the sample doesn't really sound worse than the original to the listener, only "different"; and therefore, with good ABX results backing it, it is reasonable to assume that due to fatigue or a lapse of concentration, the reference could be mistakenly ranked, without making the result less valid. The course of action I would take is: 1) See if we have enough kosher results to get statistically valid results. 2) If not, or if the non-kosher, marginally valid results are too voluminous to be discarded with a clear conscience; then take the rank value of the ranked references with successful, non-extreme ABXs above 4.0. |
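To put numbers on the ABX scores mentioned in this post, here is a minimal sketch (not part of the original post) of the one-sided binomial p-value for an ABX score. It assumes each trial is an independent 50/50 guess under the null hypothesis; the function name `abx_p_value` is made up for illustration.

```python
from math import comb


def abx_p_value(correct: int, trials: int) -> float:
    """Chance of getting at least `correct` right out of `trials` by pure guessing."""
    if not 0 <= correct <= trials:
        raise ValueError("need 0 <= correct <= trials")
    # Sum the upper tail of a Binomial(trials, 0.5) distribution.
    tail = sum(comb(trials, k) for k in range(correct, trials + 1))
    return tail / 2 ** trials


for correct, trials in [(10, 11), (14, 16), (20, 24)]:
    print(f"{correct}/{trials}: p = {abx_p_value(correct, trials):.4f}")
```

All three scores come out well below 0.01 under this test, which is consistent with calling them "respectable".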
|
|
|
Mar 1 2004, 00:52
Post
#277
|
|
|
Group: Members Posts: 245 Joined: 10-February 04 From: London Member No.: 11923 |
Sorry I didn't help for this test
For the discussion about ranked references, I would take them as 5.0, unless there is evidence of a mistake at the time of ranking (ABX results with small p-value) in which case I would use the specified rank. I would also contact the authors to inform them of what you do with their ranked references. Bogus data (someone playing with the sliders) can be recognised in that it doesn't fit with the other results and makes the error margin excessive. |
|
|
|
Mar 1 2004, 00:59
Post
#278
|
|
|
Group: Members Posts: 245 Joined: 10-February 04 From: London Member No.: 11923 |
Just realised that a ranked reference with a small ABX p-value does not necessarily indicate a mistake. As commented before, the tester could prefer the encoded sample. Make them all 5.0, I think.
|
|
|
|
Mar 1 2004, 01:11
Post
#279
|
|
Group: Members (Donating) Posts: 3474 Joined: 7-November 01 From: Strasbourg (France) Member No.: 420 |
From my own experience:
it happens that I've rated the reference but ABXed the file without any difficulty. How? Often, the mistake occurs with the first pair of the test. I'm not careful, and I hear a weird sound which is a property of the reference (some samples are weird for some users). Then I rate all the other files (correctly), but I forget to reevaluate the first file. This mistake is rare (occurring when I'm performing the test too quickly), but it happens to me sometimes... In my opinion, if the ABX tests are correct, the wrong rating must be inverted, not cancelled. |
|
|
|
Mar 1 2004, 01:24
Post
#280
|
|
Rarewares admin Group: Members Posts: 7515 Joined: 30-September 01 From: Brazil Member No.: 81 |
Fortunately, Phong's wonderful chunky result parser can take into consideration ABX pvals to accept or discard ranked references (you may now bow to Phong), so it would be quite easy and fast for me to calculate results without ranked references and with ABXd ranked references, and then decide if the ranked references will make a difference in final scores or not.
Considering I/we decide to go with ABX results to decide whether to use ranked references or not, what would be a good pval cutoff line? p < 0.05? p < 0.01?
BTW: It can also parse results taking into consideration the number of ABX trials (Phong rocks). Do you think this should be used to select results too?
This post has been edited by rjamorim: Mar 1 2004, 01:32
-------------------- Get up-to-date binaries of Lame, AAC, Vorbis and much more at RareWares:
http://www.rarewares.org |
|
|
|
Mar 1 2004, 01:29
Post
#281
|
|
Group: Members (Donating) Posts: 3474 Joined: 7-November 01 From: Strasbourg (France) Member No.: 420 |
Just a comment (to ABC/HR software development).
I'm sometimes tempted to perform multiple ABX tests on a difficult part of a sample. There are some points where the artifact is obvious, but I'd like to test something more difficult. I never do it, simply because I fear failing that test. It wouldn't look serious to rank a file 2.0/5 with bad ABX scores. Therefore, a comment field should be added for ABX sessions (and why not the possibility of doing multiple ABX tests for one single sample), simply to inform the reader of the final log about the real conditions of the ABX tests (and score).
This post has been edited by guruboolez: Mar 1 2004, 01:30 |
|
|
|
Mar 1 2004, 01:53
Post
#282
|
|
Rarewares admin Group: Members Posts: 7515 Joined: 30-September 01 From: Brazil Member No.: 81 |
So? Opinions on pvalues? Pretty please? :/
-------------------- Get up-to-date binaries of Lame, AAC, Vorbis and much more at RareWares:
http://www.rarewares.org |
|
|
|
Mar 1 2004, 02:11
Post
#283
|
|
|
Java ABC/HR developer Group: Developer Posts: 175 Joined: 17-September 03 Member No.: 8879 |
QUOTE (rjamorim) Considering I/we decide to go with ABX results to decide whether to use ranked references or not, what would be a good pval cutoff line? p < 0.05? p < 0.01?
By "using a ranked reference", do you mean using the score for the encoded sample or counting it as a 5.0? As for pval cutoff, that's not an easy question. I usually try to get it down to 1% if the sample is really hard. But if I can hear a somewhat clear difference, I usually don't bother going on after achieving 7/8 (pval=3.5%) or something similar. So if you cut off at 1% you might discard some instances where the listener heard a pretty good difference but just, well, missed the slider. I don't know how others handle this. Maybe you could try it and see what p-val most people go for, and if too much would be discarded at 1%, use 5% as cutoff. Or you could count the listener's rating if pval is <1% and count it as 5.0 if pval <5%.
QUOTE (rjamorim) BTW: It can also parse results taking into consideration the number of ABX trials (Phong rocks). Do you think this should be used to select results too?
No. A large number of trials is not necessarily a sign that the listener didn't hear a clear difference. He might have switched the playback range halfway through or needed a bit of practice (I remember one test where I got 1 correct during the first 8 trials, and 23 correct in the following 24 trials). The pval is the measurement of significance, after all, and it should be used as such.
QUOTE (guruboolez) I'm sometimes tempted to perform multiple ABX tests on a difficult part of a sample. There are some points where the artifact is obvious, but I'd like to test something more difficult. I never do it, simply because I fear failing that test. It wouldn't look serious to rank a file 2.0/5 with bad ABX scores.
Yeah, I know that problem. You could try a different part and, if it doesn't work out, add a few more trials from a part which is easy for you to get the pval back down. But I admit this is not the most elegant of solutions.
QUOTE (guruboolez) Therefore, a comment field should be added for ABX sessions (and why not the possibility of doing multiple ABX tests for one single sample), simply to inform the reader of the final log about the real conditions of the ABX tests (and score).
How would you want such a comment feature to work? (I mean, how would it be different from the sample comments?) Multiple tests might be a solution, but it also might be hard to handle. Or the program could just write a more detailed ABX log into the results file, maybe something like WinABX. But then the ABX tests would probably make up the largest part of the results files. Anyway, ideas are welcome.
edit: Woohoo! 100th post.
This post has been edited by schnofler: Mar 1 2004, 02:15 |
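The two-tier idea in this post (use the listener's rating below 1%, count the sample as 5.0 between 1% and 5%) could be sketched roughly as follows. This is only an illustration: the function names are invented, the p-value is the usual one-sided binomial tail, and what to do at p >= 5% is not spelled out in the post, so returning `None` (discard) is an assumption.

```python
from math import comb
from typing import Optional


def abx_p_value(correct: int, trials: int) -> float:
    """One-sided binomial p-value: chance of at least `correct` hits by guessing."""
    tail = sum(comb(trials, k) for k in range(correct, trials + 1))
    return tail / 2 ** trials


def score_for_ranked_reference(rating: float, correct: int, trials: int) -> Optional[float]:
    """Decide how to count an encoded sample whose reference was ranked.

    Assumed rule: p < 1% keeps the listener's rating, 1% <= p < 5% counts
    the sample as transparent (5.0), anything weaker is discarded (None).
    """
    p = abx_p_value(correct, trials)
    if p < 0.01:
        return rating
    if p < 0.05:
        return 5.0
    return None  # assumption: no usable ABX evidence, so drop the result


print(score_for_ranked_reference(3.5, 7, 8))    # 7/8 gives p = 9/256, so 5.0
print(score_for_ranked_reference(3.5, 24, 32))  # pooled warm-up example, rating kept
```

The 24/32 case is the warm-up example from the post (1 correct in the first 8 trials, 23 in the next 24): even pooled, the score still lands below 1%.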
|
|
|
Mar 1 2004, 02:13
Post
#284
|
|
ABC/HR developer, ff123.net admin Group: Developer (Donating) Posts: 1396 Joined: 24-September 01 Member No.: 12 |
QUOTE (rjamorim @ Feb 29 2004, 04:53 PM) So? Opinions on pvalues? Pretty please? :/
One other option: Recalculate the ABX p-value, including the trial where the listener ranks the reference. So if the ABX reads 14/16 in the ABX section and then he ranks the reference in the ABC/HR section, the total ABX would be 14/17. That's what I did for the first 64 kbit/s test. It's a lot of manual labor, though.
But to answer your question, I would use p = 0.05 |
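The recalculation described above can be sketched in a few lines. This is a hedged illustration, not the procedure actually used for the 64 kbit/s test: it treats the ranked reference as one extra failed trial and uses a one-sided binomial p-value; both function names are hypothetical.

```python
from math import comb


def abx_p_value(correct: int, trials: int) -> float:
    """One-sided binomial p-value: chance of at least `correct` hits by guessing."""
    tail = sum(comb(trials, k) for k in range(correct, trials + 1))
    return tail / 2 ** trials


def p_value_with_ranked_reference(correct: int, trials: int) -> float:
    """Count the mis-ranked reference as one additional, failed ABX trial."""
    return abx_p_value(correct, trials + 1)


before = abx_p_value(14, 16)
after = p_value_with_ranked_reference(14, 16)
print(f"14/16: p = {before:.4f}   14/17: p = {after:.4f}")
```

For the 14/16 example the penalty is visible but the result stays comfortably below a 0.05 cutoff.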
|
|
|
Mar 1 2004, 02:21
Post
#285
|
|
Group: Members (Donating) Posts: 478 Joined: 22-November 01 From: United Kingdom Member No.: 519 |
QUOTE (rjamorim @ Feb 29 2004, 06:24 PM) Considering I/we decide to go with ABX results to decide whether to use ranked references or not, what would be a good pval cutoff line? p < 0.05? p < 0.01? BTW: It can also parse results taking into consideration the number of ABX trials (Phong rocks). Do you think this should be used to select results too?
I'd say p < 0.05, since it's usually the scientifically accepted threshold of significance and I don't think we want to be too strict with the "ranked reference consideration threshold". I do, however, think we should take into consideration the number of ABX trials, since a large number of trials does speak for a hard-to-distinguish difference which might as well be considered as 5.0 (for the sake of Garf's rationale). (*) How large this number should be is arguable. I would set the cutoff at the point where the number of successes needed to get p < 0.05 for the given number of trials is less than 75% of the trials; this could be read from a table. (As a rough estimate I'd say the threshold should lie above 30, which is, from my experience, the number of trials I need for nearly-transparent samples.)
BTW, does Phong's chunky enable you to set a rank limit (such as my proposed >4.0) for consideration as well as how to consider the ranked references (as a 5.0 or the given value)? If so, I would also enforce the minimum rank of 4.0 for taking the given score and set a 5.0 for below-threshold scores. How credible is a test that states the original is annoying with respect to the encoding?
(*) EDIT: After reading schnofler's reply, I acknowledge he has a point regarding the "warming up" or range selection during ABXing. This kind of issue could be easily solved if the program tracked the ABX trials instead of the final result only. A long run of successes after several failures suggests the warm-up kicking in or a change of range to an easier one. However, interspersed successes and failures over many trials DO almost unequivocally speak for a hard-to-distinguish difference.
This post has been edited by Dologan: Mar 1 2004, 02:34 |
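The "table" mentioned in this post can be generated rather than looked up. The sketch below is an illustration under assumptions (one-sided binomial test, hypothetical function names): for a given number of trials it finds the smallest number of correct answers that reaches p < 0.05 and prints that as a fraction of the trials.

```python
from math import comb
from typing import Optional


def abx_p_value(correct: int, trials: int) -> float:
    """One-sided binomial p-value: chance of at least `correct` hits by guessing."""
    tail = sum(comb(trials, k) for k in range(correct, trials + 1))
    return tail / 2 ** trials


def min_correct(trials: int, alpha: float = 0.05) -> Optional[int]:
    """Smallest number of correct answers with p < alpha, or None if unreachable."""
    for correct in range(trials + 1):
        # The p-value only shrinks as `correct` grows, so the first hit is the minimum.
        if abx_p_value(correct, trials) < alpha:
            return correct
    return None  # e.g. 4 trials: even 4/4 gives p = 0.0625


for trials in (8, 16, 18, 24, 30, 40):
    needed = min_correct(trials)
    print(f"{trials:3d} trials: {needed} correct needed ({needed / trials:.0%})")
```

Under this rule, 16 trials need 12 correct (exactly 75%) and 18 trials need 13 (about 72%), so the required fraction tends to drop as sessions get longer, which is the effect the proposed cutoff relies on.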
|
|
|
Mar 1 2004, 02:48
Post
#286
|
|
3ivx developer Group: Developer Posts: 52 Joined: 10-December 03 Member No.: 10346 |
QUOTE (robUx4 @ Mar 1 2004, 03:14 AM) I'm conducting the test right now, but sometimes the applet crashes
Yes, it seems to freeze when it plays past the end point (which it shouldn't be doing). I just saved my sessions after any significant change... unfortunately I did lose hard ABX results sometimes
-------------------- http://www.3ivx.com
|
|
|
|
Mar 1 2004, 02:49
Post
#287
|
|
Rarewares admin Group: Members Posts: 7515 Joined: 30-September 01 From: Brazil Member No.: 81 |
QUOTE (Dologan @ Feb 29 2004, 10:21 PM) BTW, does Phong's chunky enable you to set a rank limit (such as my proposed >4.0) for consideration as well as how to consider the ranked references (as a 5.0 or the given value)?
Unfortunately, I don't think it does. Anyway, I am fairly sure I got enough "clean" results. And preliminary testing shows that there won't be much of a difference whether I count ranked results or not. So I will publish "clean" results at the results page, but will also make the ranked individual results available for download so that people can play around with them and see if they come up with different rankings somehow.
Just to clarify: If there are enough clean results (and I'm fairly sure there will be), the clean results will be used to create the "official" rankings that will be posted at rjamorim.com/test. I will discard any files where the reference was ranked, even if it was ABXd.
Edit: by "clean", I mean result files where the test participant didn't rank the reference files.
This post has been edited by rjamorim: Mar 1 2004, 02:56
-------------------- Get up-to-date binaries of Lame, AAC, Vorbis and much more at RareWares:
http://www.rarewares.org |
|
|
|
Mar 1 2004, 02:55
Post
#288
|
|
Rarewares admin Group: Members Posts: 7515 Joined: 30-September 01 From: Brazil Member No.: 81 |
An enlightening comment about the fun of ABX, courtesy of music_man_mpc:
QUOTE
1L File: Sample04\gone_5.wav
1L Rating: 4.9
1L Comment: I GOT IT!!!! I FINALLY GOT IT!!!!! 67/102 for ABX. GODDAMN $#@! HARD!!
@bond: People do take more than 100 trials, it seems
-------------------- Get up-to-date binaries of Lame, AAC, Vorbis and much more at RareWares:
http://www.rarewares.org |
|
|
|
Mar 1 2004, 03:00
Post
#289
|
|
3ivx developer Group: Developer Posts: 52 Joined: 10-December 03 Member No.: 10346 |
QUOTE (bond @ Mar 1 2004, 07:40 AM) the results of abx'es should in no way influence the decision whether to use the ranked sources results or not!!! noone is forced to do abx, you cant rely on whether someone did abx or not. in fact someone can do the whole test without abxing and without being unserious or having anything bad in mind. as i proposed, unserious testers should be sorted out via the way if there are far over the average ranked sources in the results
QUOTE (Garf @ Feb 29 2004, 09:37 PM) No, you are completely wrong, see my earlier example. You can ABX 110/200, which is significant, but your chance of pulling the correct slider is only 55%.
well your example is not usable in this case as its unrealistic/only theoretical. noone will do 200 abx'es
ABX results are part of the test dude
I know someone who abxed every single sample to >0.001
And yes... some people do abx 200+ times... although the most I ever went was about 50 I think
Luckily I don't claim to be a golden ear
-------------------- http://www.3ivx.com
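The 110/200 example in the quoted exchange separates two things: the per-trial hit rate and the statistical significance of the whole session. A small sketch, assuming a one-sided binomial test (the thread does not say which test was meant) and a made-up function name, computes both:

```python
from math import comb


def abx_p_value(correct: int, trials: int) -> float:
    """One-sided binomial p-value: chance of at least `correct` hits by guessing."""
    tail = sum(comb(trials, k) for k in range(correct, trials + 1))
    return tail / 2 ** trials


correct, trials = 110, 200
p = abx_p_value(correct, trials)
print(f"hit rate: {correct / trials:.0%}")
print(f"p-value:  {p:.3f}")

# Smallest score at 200 trials that would pass a 0.05 cutoff under this test.
needed = next(k for k in range(trials + 1) if abx_p_value(k, trials) < 0.05)
print(f"correct answers needed for p < 0.05 at {trials} trials: {needed}")
```

Under this particular test, 110/200 comes out at roughly p = 0.09, a little short of the 0.05 line, so the quoted figure may have been meant loosely or computed differently. The broader point still holds: with enough trials, a hit rate only slightly above 50% can become significant.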
|
|
|
|
Mar 1 2004, 03:01
Post
#290
|
|
Rarewares admin Group: Members Posts: 7515 Joined: 30-September 01 From: Brazil Member No.: 81 |
One hour to go!
After that, I will stop receiving results and will start calculating the plots. If you didn't send me your results yet, do so now! -------------------- Get up-to-date binaries of Lame, AAC, Vorbis and much more at RareWares:
http://www.rarewares.org |
|
|
|
Mar 1 2004, 03:03
Post
#291
|
|
3ivx developer Group: Developer Posts: 52 Joined: 10-December 03 Member No.: 10346 |
QUOTE (rjamorim @ Mar 1 2004, 07:59 AM) QUOTE (ff123 @ Feb 29 2004, 05:55 PM) This is a mess, isn't it? Jesus Christ, it is!
So, is it normal for a test to have this many ranked references? Or is it a sign of how aac@128kbps is fairly transparent for a large segment of the test population?
-------------------- http://www.3ivx.com
|
|
|
|
Mar 1 2004, 03:07
Post
#292
|
|
Rarewares admin Group: Members Posts: 7515 Joined: 30-September 01 From: Brazil Member No.: 81 |
QUOTE (Stux @ Feb 29 2004, 11:03 PM) So, is it normal for a test to have this many ranked references? Or is it a sign of how aac@128kbps is fairly transparent for a large segment of the test population?
Second answer. I guess several people are finding out their transparency threshold is around 128kbps, and not around 160-192 as has been believed for years (although the latter figures probably still apply to MP3)
-------------------- Get up-to-date binaries of Lame, AAC, Vorbis and much more at RareWares:
http://www.rarewares.org |
|
|
|
Mar 1 2004, 03:07
Post
#293
|
|
Group: Members (Donating) Posts: 478 Joined: 22-November 01 From: United Kingdom Member No.: 519 |
QUOTE (Stux @ Feb 29 2004, 08:03 PM) Or is it a sign of how aac@128kbps is fairly transparent for a large segment of the test population?
Yes. It's also a sign that people are not always careful, that placebo exists and that they dislike not being able to hear a difference and put a 5.0. |
|
|
|
Mar 1 2004, 03:15
Post
#294
|
|
3ivx developer Group: Developer Posts: 52 Joined: 10-December 03 Member No.: 10346 |
QUOTE (rjamorim @ Mar 1 2004, 11:24 AM) Considering I/we decide to go with ABX results to decide whether to use ranked references or not, what would be a good pval cutoff line? p < 0.05? p < 0.01? BTW: It can also parse results taking into consideration the number of ABX trials (Phong rocks). Do you think this should be used to select results too?
Whatever it is when ABC/HR.jar goes green.
I don't think non-ABX'd, non-ranked-reference results should be discarded; after all, ABXing is NOT compulsory...
-------------------- http://www.3ivx.com
|
|
|
|
Mar 1 2004, 03:17
Post
#295
|
|
3ivx developer Group: Developer Posts: 52 Joined: 10-December 03 Member No.: 10346 |
QUOTE (guruboolez @ Mar 1 2004, 11:29 AM) Just a comment (to ABC/HR software development). I'm sometimes tempted to perform multiple ABX tests on a difficult part of a sample. There are some points where the artifact is obvious, but I'd like to test something more difficult. I never do it, simply because I fear failing that test. It wouldn't look serious to rank a file 2.0/5 with bad ABX scores. Therefore, a comment field should be added for ABX sessions (and why not the possibility of doing multiple ABX tests for one single sample), simply to inform the reader of the final log about the real conditions of the ABX tests (and score).
Yes, for instance, you want to see if you can ABX one artifact, but know you can ABX another artifact...
Anywho, you could do this by opening up a new session and doing your 'hard' abx in that session...
-------------------- http://www.3ivx.com
|
|
|
|
Mar 1 2004, 03:36
Post
#296
|
|
Group: Members (Donating) Posts: 3474 Joined: 7-November 01 From: Strasbourg (France) Member No.: 420 |
QUOTE (Stux @ Mar 1 2004, 03:17 AM) Anywho, you could do this by opening up a new session and doing your 'hard' abx in that session...
Mmhhhh... It's possible if the whole test includes only one file against the reference. But if you have 5 files (and therefore 10, because of the blind reference for each encoded sample), you first have to find again the specific sample where you believe you heard something wrong. Not easy... and a waste of time. And with encrypted log files, it's even harder to be sure that you're ABXing the right file |
|
|
|
Mar 1 2004, 03:44
Post
#297
|
|
Rarewares admin Group: Members Posts: 7515 Joined: 30-September 01 From: Brazil Member No.: 81 |
Can someone enlighten me on the origins of Velvet?
http://lame.sourceforge.net/download/samples/velvet.wav
All I know is that it was submitted by Roel (r3mix). Does anybody know the artist (Velvet Underground?), title and album of this song? Also, what would be the style? (No way to figure it out from just the introduction.)
Thank you.
-------------------- Get up-to-date binaries of Lame, AAC, Vorbis and much more at RareWares:
http://www.rarewares.org |
|
|
|
Mar 1 2004, 04:01
Post
#298
|
|
Rarewares admin Group: Members Posts: 7515 Joined: 30-September 01 From: Brazil Member No.: 81 |
Listening test is now CLOSED
Expect results to be posted real soon... Regards; Roberto. -------------------- Get up-to-date binaries of Lame, AAC, Vorbis and much more at RareWares:
http://www.rarewares.org |
|
|
|
Mar 1 2004, 13:22
Post
#299
|
|
|
Group: Members Posts: 881 Joined: 11-October 02 Member No.: 3523 |
rjamorim
sorry for bugging again, but which way of treating ranked references did you use now? i couldn't see that from your posts
btw you also seem to have missed my sample 09 results!? maybe it was because they were all rated transparent?
QUOTE (Stux @ Mar 1 2004, 03:00 AM) ABX results are part of the test dude
no it's not
-------------------- I know, that I know nothing (Socrates)
|
|
|
|
Mar 1 2004, 18:12
Post
#300
|
|
Rarewares admin Group: Members Posts: 7515 Joined: 30-September 01 From: Brazil Member No.: 81 |
QUOTE (bond @ Mar 1 2004, 09:22 AM) rjamorim sorry for bugging again, but which way of treating ranked references did you use now? i couldn't see that from your posts
I simply ignored ranked references. Since I got enough "clean" results, it was OK.
QUOTE btw you also seem to have missed my sample 09 results!? maybe it was because they were all rated transparent?
Nope. I couldn't decrypt your sample 09 results. It's the only result file that gave me problems in the entire test. I sent it to schnofler so that he can investigate. Sorry about that.
This post has been edited by rjamorim: Mar 1 2004, 18:13
-------------------- Get up-to-date binaries of Lame, AAC, Vorbis and much more at RareWares:
http://www.rarewares.org |
|
|
|