I'm unsure of the goal. In the "Limitations" part you clearly state that this measure isn't useful for determining perceived quality.
Also, if you look at the four graphs in that chapter, on the second sample we see that the D_f scores of WMA and Vorbis are about 20 dB apart, while they score very similarly in the listening test (considering the probability margins, they have the same quality). If I understand correctly, you are comparing power, and 20 dB would then correspond to a factor of 100!
Said differently, this measure can deviate by a factor of 100 without change in perceived quality.
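To spell out the arithmetic, here is a quick sketch. It assumes D_f is a power ratio expressed in dB, which is only my reading of the article; the function names are mine and not taken from it. If D_f were an amplitude ratio instead, the same 20 dB would mean a factor of 10.

```python
def db_to_power_ratio(db):
    """Convert a level difference in dB to a power ratio (10 * log10 convention)."""
    return 10 ** (db / 10)


def db_to_amplitude_ratio(db):
    """Convert a level difference in dB to an amplitude ratio (20 * log10 convention)."""
    return 10 ** (db / 20)


# Approximate gap between the WMA and Vorbis D_f scores on the second sample.
gap_db = 20

print(db_to_power_ratio(gap_db))      # 100.0 under the power assumption
print(db_to_amplitude_ratio(gap_db))  # 10.0 under the amplitude assumption
```

Either way, the gap is at least an order of magnitude for two codecs that the listening test could not tell apart.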
In chapter four, another possible use is discussed:
QUOTE
... this method could be helpful when you need to determine the type of some coder (or another test object) by comparing it with the known ones or for example to trace changes of a coder engine from version to version.
I doubt this is practical.
For this to work you need to have both an original and a processed version.
As long as you have the original, you can run it through any encoder you like and compare the result to the processed version, which should be much more accurate.
Granted, this method requires constant access to many codecs and, in the long run, more processing power. But judging by the variance of the D_f values for a specific codec, it will be difficult to reach a satisfactory confidence level with your measure.
I'm sorry this turned out to be quite destructive criticism.