QUOTE(Kalamity @ Feb 9 2004, 03:00 AM)
Some of these codecs have an 'advertised' setting that is supposed to be transparent, coincidentally averaging around 160-200kbps. Perhaps holding them to this would be appropriate here?
True, but I would not base this test on how the codecs are marketed. They can call whatever they like "transparent" or "CD quality". But ABX results don't lie.
QUOTE(Kalamity @ Feb 9 2004, 03:00 AM)
A pass or failure would determine an appropriate direction (lower or higher) for a second test to determine operational tolerance.
That's exactly what I had in mind.
QUOTE(2Bdecided @ Feb 9 2004, 6:51 AM)
Just how many people are going to give you anything except 5.0 for all samples?
(I can think of some - but if you slashdot it to get a large number of listeners, I bet the percentage is low!)
There's also the possibility that people who can hear problems with various samples at the settings you suggest have already reported it here. Maybe you could somehow analyse this data?
Some people will have already made themselves more sensitive to one codecs artefacts than another. This would likely bias your test.
A rating scale wouldn't even be used for this kind of test. Only ABX. If the tester can get p<0.05, then they "move up" to the next higher encoding rate for the format. If they can't, then their transparency threshold for that sample encoded in that format lies below this rate and above the last one they could differentiate.
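For anyone who wants to check the numbers, here is a minimal sketch of the p-value behind an ABX score. It assumes the usual one-sided binomial test against pure guessing (50% per trial); the function name `abx_p_value` and the 16-trial session are only examples, not part of any existing tool.

```python
from math import comb


def abx_p_value(correct: int, trials: int) -> float:
    """Probability of scoring at least `correct` out of `trials` by guessing alone.

    Assumes each ABX trial is an independent coin flip (p = 0.5) under the
    null hypothesis that the tester hears no difference.
    """
    # Count every outcome with `correct` or more right answers,
    # then divide by the total number of equally likely outcomes.
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials


if __name__ == "__main__":
    # Hypothetical sessions: 16 trials each.
    for score in (11, 12):
        print(f"{score}/16 -> p = {abx_p_value(score, 16):.4f}")
```

Under these assumptions 12/16 falls below 0.05 (the tester "moves up"), while 11/16 does not.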
And "artifact familiarity" won't have a statistical impact if there are enough testers. Some people would be "attuned" to a format's particular artifacts, but many others won't be.
QUOTE(sthayashi @ Feb 9 2004, 11:47 AM)
There are two additional codecs that ought to be tested, WavPack Hybrid and OptimFrog DualStream. These are codecs that have never been formally tested in lossy modes, and Somebody should do it™
I agree that they should be tested at some point against the ones pointed out previously in this thread. But it should really be done in a future test, because a) there will be enough test groups as it is with the formats discussed, and b) I want to first pare down these five most commonly used formats, before tackling others.
QUOTE(music_man_mpc @ Feb 9 2004, 3:30 PM)
We should start making some preparations for this test right away. It will be exceedingly time consuming to come up with settings for all these different encoders that all have the same nominal bitrates. I suggest, since we are mainly testing Musepack here to start at --quality 4, then go to --quality 4.1, --quality 4.2 etc, until we reach statistical transparency, so to speak, presumably somewhere between --quality 4 and 5.
I agree. And the more I think about it and try to "envision" what the test would be like, the more I think we should have one test run for each format, close enough together to minimize unfairness by "version variance" between encoders.
And I don't want to only have 3 rates, as I previously stated. It wouldn't be enough. "Vorbis -q 4 isn't transparent to me, but -q 5 is." OK, so transparency for this tester on this sample in this format has been narrowed to within 32-52kbps of the "line". Not accurate enough. I want the scale to be as granular as possible.
As you point out in your example, I'd like to know a sample's transparency to a particular person with a format to within 10kbps or so.
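To make the "granular scale" idea concrete, here is a rough sketch of stepping up a ladder of settings until the tester can no longer ABX the sample. The 0.1 steps between --quality 4 and 5 come from the suggestion quoted above; the function `transparency_bracket` and the example listener are hypothetical, and a real run would plug in actual ABX results for each setting.

```python
from typing import Callable, Optional, Sequence, Tuple


def transparency_bracket(
    settings: Sequence[float],
    can_abx: Callable[[float], bool],
) -> Tuple[Optional[float], Optional[float]]:
    """Walk the settings in ascending order and bracket the threshold.

    Returns (last setting the tester could differentiate,
             first setting the tester could not differentiate).
    Either element is None if no such setting was found.
    """
    last_heard = None
    for setting in settings:
        if can_abx(setting):
            last_heard = setting  # still audible: move up a step
        else:
            return last_heard, setting  # threshold lies between the two
    return last_heard, None  # audible all the way up the ladder


# Ladder of quality settings: 4.0, 4.1, ..., 5.0
ladder = [round(4.0 + 0.1 * i, 1) for i in range(11)]

if __name__ == "__main__":
    # Hypothetical tester who passes ABX (p < 0.05) up to and including 4.3.
    print(transparency_bracket(ladder, lambda q: q <= 4.3))
```

The finer the ladder, the narrower the bracket, which is the whole point of wanting more than three rates.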
QUOTE(MGuti @ Feb 9 2004, 4:00 PM)
i recomend splitting the test up by encoder. if its a strictly ABX test, then it woun't be quite as time consuming, IMO, as normal test. you can either tell or you can't.
My thoughts exactly. I'm hoping it'll make the whole thing more manageable in "smaller chunks". But it would have to be almost a "marathon" of tests. If we wait a month between testing each format, then too many people would say "Yeah, but you tested the old MPC against the new Vorbis v1.3", etc. We could, if possible, prepare for testing all the formats at once (over, say, 6-8 weeks), then we could fire off one test, 11 days, then 3 days to compile and publish results, then fire off the next test, 11 days, 3 days to compile/publish, ...and so forth. Prep time in between tests would be minimal if we were set up at the beginning as much as possible. The whole thing, with 5 formats, would take about 10 weeks.
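The 10-week figure is just the per-format cycle multiplied out. A quick sketch of the arithmetic, using the numbers above (the constant names are mine, and the 6-8 weeks of up-front preparation is not included):

```python
FORMATS = 5        # formats under test
TEST_DAYS = 11     # days each test stays open
COMPILE_DAYS = 3   # days to compile and publish results

cycle_days = TEST_DAYS + COMPILE_DAYS   # one full cycle per format
total_days = FORMATS * cycle_days
total_weeks = total_days / 7

if __name__ == "__main__":
    print(f"{cycle_days} days per format, {total_days} days total, {total_weeks:g} weeks")
```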
QUOTE(ChristianHJW @ Feb 9 2004, 4:03 PM)
To make this test sensible, you have to remove the 'noise', i.e. the people who dont have the necessary training to differentiate between those codecs.
I recommend to achieve this by either
- doing a pretest, like users have to find out what the 320 kbps MP3 and the original CD is ( quite easy )
- add the original source ( CD ) to the listening test, and null every vote that ranks the original worse than one of the compressed samples
Pre-tests may be required to determine which particular format variants would be the "most fair to test at mid-high bitrates", but there would be no subjective rating. ABX only. It would not be possible to "rate a reference" in this kind of test. With each encoder setting tested, it's just p<0.05 or p>0.05. The former shows that the tester can tell the encode from the original, so it is not transparent to them; the latter does not. Maybe we should define a "gray area" of 0.05<p<0.07, perhaps, to show an "exploded view" of the threshold when compiling results. I'm not sure how much value this would hold, though. It can always be determined at the end, and even shown both ways if preferred.
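When compiling results, each ABX p-value could be sorted into one of three bins along these lines. This is only a sketch: the cut-offs 0.05 and 0.07 are the ones floated here, how to treat a p-value sitting exactly on a boundary is my own assumption, and the labels and the function name `classify_abx` are made up for illustration.

```python
DIFFERENTIATED_BELOW = 0.05  # p below this: tester told the samples apart
GRAY_AREA_BELOW = 0.07       # p from 0.05 up to (not including) this: borderline


def classify_abx(p_value: float) -> str:
    """Sort one ABX result into a reporting bin.

    Assumption: boundary values are placed in the less conclusive bin
    (p == 0.05 is "gray area", p == 0.07 is "not differentiated").
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must be between 0 and 1")
    if p_value < DIFFERENTIATED_BELOW:
        return "differentiated"
    if p_value < GRAY_AREA_BELOW:
        return "gray area"
    return "not differentiated"


if __name__ == "__main__":
    for p in (0.038, 0.06, 0.105):
        print(f"p = {p:.3f}: {classify_abx(p)}")
```

Because the raw p-values are kept, the results can be published both ways, with or without the gray area folded into "not differentiated".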
QUOTE(Continuum @ Feb 9 2004, 4:08 PM)
I think it's still quite safe to assume that MPC is the best encoder for transparent lossy.
It's that word, "assume", that we will be killing with this test. If MPC wins, no need to "assume" any more.

If not, or if it ties for the top position with other formats, then "assumptions" can summarily be corrected.
QUOTE(Continuum @ Feb 9 2004, 4:08 PM)
IIRC there are some optimizations that kick in at quality level 5 (and are important to quality).
Then, as Tyler says, that is simply the nature of MPC. As in Roberto's tests, we should try not to worry too much about how formats "scale" their quality settings. If MPC would indeed perform better with a shallower quality "slope" between q4.1 and q5, then maybe it should be modified to do just that.
This idea is simply to test the best encoder version that each of these formats brings to the table when measured at the threshold of perceptual transparency. And as mentioned before, we could spend the next few weeks pre-testing the different versions of each encoder (especially the ones with newer versions), and picking ideal samples for this kind of test.