Your post raises good questions. There may be cases where the original is not really music, but just crap, in which case I suppose that acoustic defects might actually enhance it (especially dropoffs in volume, smoothings or mufflings, or mistaken encodings of golden silence)--I've often wished my neighbors would use a "bad" encoder that just encodes silence. Also, some people prefer "brightened" tracks, so a brightened lossy might sound "better," esp. if the original itself is thin, hollow, watery, flat, etc., e.g. elevator "music" (I assume that's one reason for a high anchor sanity check). But other than issues like that, why wouldn't a double-blind test be possible and valid?
True, the test picks up only what the listener notices. That's a severe limitation if you're being asked to rank the worth or quality of music, since that would be sharply limited by possession of taste & uncommon abilities & passions (and also you can't double-blind the type of music, so a bias of bad taste can't be technically adjusted for). But purely low-level technical qualities are not really a problem like that, unless there's really a bias, as you suggest, in favor of certain types of even simple technical flaws (e.g. if HA posters were biased not to hear warbling as a defect: if their favored encoders produced warblings, and they had gradually convinced themselves not to notice or be bothered by that). It could also be a simpler bias: some people may just not hear a certain kind of technical defect that would bother other people. Likewise, there's the reverse (and, I suspect, a much more likely bias here): some people are bothered by defects that wouldn't even be noticed by other people.
But these bias issues don't seem to be a decisive objection to me. It's like a vision test. I doubt people are actually preferring blurrings or defects (and also there's the high-anchor check), with the possible exception of brightening or sharpening if the original is flat or blurry etc. One person may have poor eyesight or be colorblind, and perhaps most people will seem defective next to a guy with 20/10 or 20/5 vision, let alone a guy with UV or IR vision abilities, but it seems that's not a deep bias problem, as it would be with judging the worth of what's being seen or heard. It would be a deep bias problem if there were radical individual differences between most human beings in hearing (then a general lossy encoding would be impossible), but those differences seem extremely minor compared to differences in taste etc., and I don't see people generally preferring any significant types of simple technical defects or blurriness etc in this area (unlike other areas of life...). If you're saying, with one person *only*, he might rank high anchor best, but then rank e.g. discoloration as less bad than blurriness, that's true. But with more people in the test such uniform kinds of bias are unlikely (I agree there are some exceptions, e.g. children will hear frequencies that the old won't, and also have better eyesight).
So: (1) there are partisans of crappy music, and partisans of good music, but you don't see partisans of types of acoustic distortion -- i.e. no deep bias / judgment issues; and (2) I suspect--but I'm just an uneducated layman--that there's no significant preference for one type of distortion over another in these tests, when talking of distortions actually hearable by a human mind, if the distortion is of significance. I.e. maybe BUUUUUUUUUUUUU over MYOINGYOINGYOINGYOING if they're both extremely and equally slight effects, but not once either gets large or gets much larger than the other. That's just a guess. Perhaps that should be tested by double-blind tests: deliberate testing of sensitivity to pre-echo versus dampening versus cutoffs, or types of warbling etc. I'd be curious what the people who write the encoders, who probably know, say about which preferences exist for what distortions, but I suspect that, if those distortions are actually hearable by a human being (i.e., rather than being beyond human auditory range or being cancelled out by other sound etc.), such preferences are going to be very slight and few. The more serious bias is the non-random selection of the samples themselves, i.e. if the samples are skewed (apart from, of course, being music, and preferably music one actually wants to hear) such that one encoder tends to do well on them (certain problem samples etc.), but I think they try to be careful about that.
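For what it's worth, here is a rough sketch of how a double-blind ABX run is usually scored, since that's what makes "hearable" something you can check rather than argue about. It's my own illustration in Python, not the code of any actual ABX tool: the function name `abx_p_value` and the example scores are made up. It assumes each trial is an independent guess with a 50% chance of being right, and computes how likely it is to get at least that many trials correct by luck alone.

```python
from math import comb


def abx_p_value(correct: int, trials: int) -> float:
    """Chance of scoring at least `correct` out of `trials` by pure guessing.

    Assumes independent trials with a 0.5 probability of a correct answer
    (one-sided binomial tail). Hypothetical helper, for illustration only.
    """
    if not 0 <= correct <= trials:
        raise ValueError("correct must be between 0 and trials")
    # Count the outcomes with `correct` or more hits, out of 2**trials
    # equally likely guess sequences.
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials


if __name__ == "__main__":
    # Made-up scores, just to show how quickly the tail shrinks.
    for correct, trials in [(8, 16), (12, 16), (14, 16)]:
        print(f"{correct}/{trials}: p = {abx_p_value(correct, trials):.4f}")
```

A small p only says the listener heard *some* difference. It says nothing about which of two non-transparent encodes is "better", which is the separate ranking question discussed above.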
QUOTE(Porcupine @ Jun 1 2007, 19:06)

My personal view is that there is no way to 100% objectively compare any encoders at low bitrates such as 128 kbits/s. On many or most materials they are not transparent. Depending on your personal listening abilities, they may be quite far from transparent.
If you have 2 non-transparent, but different-sounding samples, how do you say which is better? You can only say what is "better" or "worse" to you but that's subjective. You can't double-blind test that. You just have to say it, and it's subject to your own opinion and what you happened to notice and what you didn't happen to notice.
For years I thought Fraunhofer 256 kbps (mp3 producer pro 2.1 version) was pretty good. Then one day I realized that on almost every song with bass notes, you can hear the bass notes wobble (this may be a violation of forum rules since I'm not providing proof, so if anyone requires me to retract this statement I will). On LAME or the original WAV file a bass note sounds powerful and steady....like this BUUUUUUUUUUUUU. But on Fraunhofer it sounds like this MYOINGYOINGYOINGYOING. To me, 99% of songs can be identified from the original in a 1-second clip because you just have to listen to this bass difference. (it's pretty hard though, takes lots of practice and probably only marginally ABX-able...again no proof so I will retract if required).
Different codecs function so differently (even different codecs of the same format...such as mp3...LAME and Fraunhofer have totally different encoding strengths and weaknesses and settings) that when you compare two 128 kbits/s recordings that happen to be ABXable from each other and the original...you're just going to be picking on whatever YOU notice...which is only a subset of all the things that can be noticed...and your judgement will be extremely biased in my opinion.
EDIT: You can test all you want and post it here, I have no problem with it. But I would raise a complaint if you tried to post the results of your personal 128 kbps listening test on wikipedia.