QUOTE(Ivan Dimkovic @ Feb 6 2006, 05:13 PM)
Interesting, as it seems (although this is outside the scope of the tests) that on average PS was ranked significantly lower than HE. The tests were indeed separate, but the anchors were the same.
It is also worth noting that LAME -b128 was ranked 4.41 in the first test, and only 4.08 in the second one.
Dunno what exactly caused this difference - but I think the tests cannot be directly compared (as was planned), and the upcoming 48 kbps multivendor AAC test will finally give the most accurate answer on the HE-AAC v1 vs. v2 ranking.
I tested only 6 of the samples. How did the different samples score? Is there any pattern? Also, at one stage you had more results from Test1. Did that change? If not, is it possible that those who also sent Test2 results were more experienced listeners, and thus more critical in general?
Listener training is probably a big factor if most of the testers started with Test1 and then continued with Test2.
Perhaps my test order eliminated the listener training effect in my case. It was this:
1. Sandman Test2
2. Sandman Test1
3. Sympathy Test2
4. Sympathy Test1
5. Harp40_1 Test1
6. Harp40_1 Test2
I did all the testing in the same evening. It took about 5 hours, so I spent about 8 minutes with each codec/setting (including a couple of short breaks). Most of the time I tried to put the AAC encoders in order, and that was pretty tough. The AAC samples had more or less subtle differences, but it was not easy to decide which artifacts were less annoying. I listened to the samples over and over again until I was pretty sure about the results. All 28 samples would have taken about 24 hours to test at this speed. I guess I am hopelessly slow and need more training...
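For anyone who wants to check the time estimate, here is a rough sketch of the arithmetic. The function names are just made up for illustration, and the number of codec settings per sample is not something I counted; it is only what the 5 hours and 8 minutes figures imply.

```python
def minutes_per_sample(samples_done, hours_spent):
    """Average listening time per sample, in minutes."""
    return hours_spent * 60.0 / samples_done


def projected_hours(samples_done, hours_spent, samples_total):
    """Hours needed for all samples, assuming the same pace throughout."""
    return minutes_per_sample(samples_done, hours_spent) * samples_total / 60.0


if __name__ == "__main__":
    per_sample = minutes_per_sample(6, 5.0)   # 6 samples in about 5 hours
    total = projected_hours(6, 5.0, 28)       # all 28 samples at that pace
    # Assumption: about 8 minutes per codec/setting, as estimated above.
    implied_settings = per_sample / 8.0
    print(f"{per_sample:.0f} min per sample")
    print(f"{total:.1f} hours for 28 samples")
    print(f"about {implied_settings:.1f} settings per sample implied")
```

So 6 samples in 5 hours works out to 50 minutes per sample, and 28 samples at that pace comes to a bit over 23 hours, which is where the "about 24 hours" comes from.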
I didn't know it was LAME -b 128 that failed badly with the harp40_1 sample and got 1.4. I spent a lot of time trying to find the high anchor among the samples that were audibly better. Though, I was a bit surprised that the samples I rated 1.4 were quite good after the first two totally broken notes. As I said earlier, the only other obvious difference was with the harp40_1 Test1 AAC samples. Two of the AAC settings didn't have one particularly bad artifact on one of the higher notes. I thought one of those better contenders was LAME@128.