Well, if I read the quote correctly, they used low-volume music and *did* turn the volume up, just not unreasonably loud.
Again, I don't see any other way to practically test this. You can analyse the dither in other ways, but the actual tuning of the noise shaper is a psychoacoustic process.
Hearing beyond the 15th bit is extremely hard, so how are you going to test different dithers for audibility there? You'll have to go a similar way: set up a soundproof room and spend a lot of time on listening tests.
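To put a rough number on how quiet the 15th and 16th bits are, here is a small sketch using the textbook SNR formula for an ideal N-bit quantizer driven by a full-scale sine (6.02 N + 1.76 dB). The 85 dB SPL playback level is purely an assumption for illustration, and the formula ignores dither and the frequency dependence of hearing.

```python
def quantization_snr_db(bits):
    """Theoretical SNR (dB) of an ideal N-bit quantizer for a full-scale sine, no dither."""
    return 6.02 * bits + 1.76

def noise_floor_spl(bits, peak_spl_db):
    """Broadband noise floor in dB SPL if a full-scale sine plays at peak_spl_db."""
    return peak_spl_db - quantization_snr_db(bits)

# ASSUMPTION: a full-scale sine is reproduced at 85 dB SPL (hypothetical listening level)
PEAK_SPL_DB = 85.0

for bits in (15, 16):
    print(f"{bits} bit: SNR {quantization_snr_db(bits):.1f} dB, "
          f"noise floor {noise_floor_spl(bits, PEAK_SPL_DB):.1f} dB SPL")
```

Under that assumed level the 16-bit floor comes out below 0 dB SPL, which is why nobody can test this at a normal volume setting without some trick.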
Don't misunderstand me. I fully agree that the most solid test is under the actual playback conditions: 16 bit, volume at a default setting. It's just very hard. Each time you turn the knob up, testing gets easier, and you deviate very slightly from those conditions. I think you are right to criticise the test, but it's not sensible to say the results are completely flawed if you haven't been able to perform a better one, because while the hearing curves do change with loudness, they change smoothly. It's not like they suddenly turn upside down.
It's like saying Newton is wrong because you observed a relativistic effect. Yes, you might be right, but until Einstein comes along, Newton's results are certainly very usable.

If I get your proposal, you want to evaluate the SNR gain in the frequency domain, and then turn up the dither as loud as possible while still being inaudible.
This sounds great in theory, but you might run into practical issues. You'll end up blasting VERY LOUD HF noise through the reproduction system, and not every system will like that. Think blown tweeters, so be careful.
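For what it's worth, the frequency-domain part of that evaluation could be sketched roughly like this. It quantizes a quiet 1 kHz tone to 16 bit with TPDF dither, once flat and once with a first-order error-feedback shaper, and compares the noise power per band. The first-order shaper is just a textbook stand-in, not the noise shaper we are discussing, and the function names, band edges and test tone are all my own assumptions.

```python
import numpy as np

def quantize(x, bits=16, shape=False, seed=0):
    """Quantize x (floats in [-1, 1)) to `bits` with TPDF dither.

    With shape=True a first-order error-feedback loop is added
    (noise transfer function 1 - z^-1). Textbook stand-in only.
    """
    rng = np.random.default_rng(seed)
    q = 2.0 ** (1 - bits)                      # quantizer step (1 LSB)
    # TPDF dither: sum of two uniform variables, spanning +/- 1 LSB
    d = (rng.uniform(-0.5, 0.5, x.size) + rng.uniform(-0.5, 0.5, x.size)) * q
    y = np.empty_like(x)
    err = 0.0
    for n in range(x.size):
        v = x[n] - err if shape else x[n]      # feed back the previous total error
        y[n] = np.round((v + d[n]) / q) * q
        err = y[n] - v                         # dither plus quantization error
    return y

def band_noise_db(noise, fs, lo, hi):
    """Noise power in dB (arbitrary reference) between lo and hi Hz."""
    spec = np.fft.rfft(noise)
    freqs = np.fft.rfftfreq(noise.size, d=1.0 / fs)
    mask = (freqs >= lo) & (freqs < hi)
    return 10 * np.log10(np.sum(np.abs(spec[mask]) ** 2))

fs = 44100
t = np.arange(fs) / fs
x = 1e-3 * np.sin(2 * np.pi * 1000 * t)        # 1 kHz tone at -60 dBFS (assumed test signal)

flat = quantize(x) - x                          # error with plain TPDF dither
shaped = quantize(x, shape=True) - x            # error with first-order shaping

for lo, hi in [(0, 4000), (16000, 22050)]:
    change = band_noise_db(shaped, fs, lo, hi) - band_noise_db(flat, fs, lo, hi)
    print(f"{lo}-{hi} Hz: noise change {change:+.1f} dB")
```

The noise drops in the low band and rises in the high band, which is exactly the HF energy your tweeters end up handling. Deciding how much of it is still inaudible is the psychoacoustic part, and this sketch doesn't cover that at all.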
Good luck!
PS. The fact that audibility and masking levels are dependent on loudness and the listener isn't news... You will be able to apply a lot more dither if the person listening is 40 years old and deaf to 13 kHz than with a 15-year-old who hears 18 kHz.