Full Version: Analysis by synthesis
Hydrogenaudio Forums > Hydrogenaudio Forum > Scientific Discussion
Gabriel
QUOTE
Below 1kHz, especially, overcoding is also a good idea because the ear is not only level sensitive but also waveform sensitive at those frequencies, and a full analysis by synthesis would be the only way to be sure you were 'under threshold' if you don't overcode there.

Overcoding there might be interesting for fast encoder implementations. But isn't the current Vorbis encoder using analysis by synthesis, like most encoders?
SebastianG
If by analysis by synthesis you mean the "inner loop" that adjusts scale factors according to current quantization errors (LAME does that, right?): no, AFAIK Vorbis(*) doesn't have anything like this. Considering how the floor curves are coded, this sort of inner loop would be pretty complicated, I guess. Vorbis? No analysis-by-synthesis that I know of...

Sebi

edit: (*) I mean all current encoder implementations.
Garf
As far as I understood Monty doesn't believe in the entire idea of measuring (non)audible quantization noise, which would kill a major reason to do analysis-by-synthesis.
Woodinville
QUOTE(Garf @ Mar 26 2006, 01:32 PM)
As far as I understood Monty doesn't believe in the entire idea of measuring (non)audible quantization noise, which would kill a major reason to do analysis-by-synthesis.
*




I don't understand. If one could efficiently use analysis by synthesis to find out where the noise can best be put so as to have as much inaudible noise as possible, why wouldn't one do that?
Gabriel
So current Vorbis is "blindly" allocating bits, without even checking in the last step?
I am surprised.

Analysis by synthesis pros:
*theoretically allows reaching the optimal combination
*overcomes the non-linear quantization problems in noise estimation

cons:
*should be slower than a direct computation.

Did I miss something?
Woodinville
QUOTE(Gabriel @ Mar 26 2006, 11:25 PM)
So current Vorbis is "blindly" allocating bits, without even checking in the last step?
I am surprised.

Analysis by synthesis pros:
*theoretically allows reaching the optimal combination
*overcomes the non-linear quantization problems in noise estimation

cons:
*should be slower than a direct computation.

Did I miss something?
*



Not that I've noticed.
foxyshadis
QUOTE(Gabriel @ Mar 27 2006, 12:25 AM)
So current Vorbis is "blindly" allocating bits, without even checking in the last step?
I am surprised.

Analysis by synthesis pros:
*theoretically allows reaching the optimal combination
*overcomes the non-linear quantization problems in noise estimation

cons:
*should be slower than a direct computation.

Did I miss something?
*


In order to trust rate-distortion optimization, you first have to fully trust your psychoacoustic model. If it's not accurate - and it probably needs to be much more accurate than your actual coding model - it will only give you useless results for the extra slowdown. Or at best, non-optimal.
Gabriel
QUOTE
In order to trust rate-distortion optimization, you first have to fully trust your psychoacoustic model. If it's not accurate - and it probably needs to be much more accurate than your actual coding model - it will only give you useless results for the extra slowdown. Or at best, non-optimal.

That is why, at least during development, you should have an encoding mode that fully trusts your model, as this allows you to find model flaws.

I do not think that this is the reason the current Vorbis encoder does not use AbS.
There must be something else. Perhaps Vorbis is able to accurately predict encoded size? (I do not know Vorbis deeply.) Then it could make AbS useless.
Garf
It's using a fully linear quantizer. That should help a bit, shouldn't it?

Vorbis bitrate control is based on having multiple psy tuning curves corresponding to multiple expected bitrates, and binary searching/interpolating between them to find a psy setting that produces the wanted bitrate.
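
A minimal sketch of the kind of search described above, not the libvorbis implementation. The function `estimated_bits` is a hypothetical stand-in for "encode the block at this psy setting and count the bits", and it is assumed to grow monotonically with the setting; the constants are made up for illustration.

```python
def estimated_bits(quality: float) -> float:
    """Hypothetical stand-in for 'encode at this psy setting, count bits'.

    Assumed to be monotonically increasing in the quality setting.
    """
    return 200.0 + 90.0 * quality ** 1.5


def find_quality_for_target(target_bits: float, lo: float = 0.0,
                            hi: float = 10.0, iterations: int = 30) -> float:
    """Bisect on the psy setting until the bit count matches the target."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if estimated_bits(mid) < target_bits:
            lo = mid   # too few bits: move towards higher quality
        else:
            hi = mid   # too many bits: move towards lower quality
    return 0.5 * (lo + hi)


quality = find_quality_for_target(1000.0)
print(round(quality, 3), round(estimated_bits(quality), 1))
```

Every call to `estimated_bits` inside the loop corresponds to one more evaluation of the model, so a search of this shape costs several model runs per block.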
Gabriel
QUOTE
It's using a fully linear quantizer. That should help a bit, shouldn't it?

Of course, it helps a lot.

QUOTE
Vorbis bitrate control is based on having multiple psy tuning curves corresponding to multiple expected bitrates, and binary searching/interpolating between them to find a psy setting that produces the wanted bitrate.

I understand now.
Fine for VBR, but would that mean that in CBR the Vorbis encoder has to run its psy model several times?
Garf
QUOTE(Gabriel @ Mar 28 2006, 04:31 PM)
Fine for VBR, but would that mean that in CBR the Vorbis encoder has to run its psy model several times?
*



Which explains why managed bitrate modes in Vorbis are an order of magnitude slower than quality-based VBR ones :)
takehiro
QUOTE(Gabriel @ Mar 28 2006, 08:53 PM)
I do not think that this is the reason the current Vorbis encoder does not use AbS.
There must be something else. Perhaps Vorbis is able to accurately predict encoded size? (I do not know Vorbis deeply.) Then it could make AbS useless.
*


I think Vorbis avoids AbS because AbS is heavily patented.
xiphmont
QUOTE(Garf @ Mar 26 2006, 05:32 PM)
As far as I understood Monty doesn't believe in the entire idea of measuring (non)audible quantization noise, which would kill a major reason to do analysis-by-synthesis.
*



Correct. The noise introduced by quantization in the frequency domain is one of the great red herrings of layer 3 encoding. It is not audible as or additive to spectral noise; it is audible only as loss of time-locality, which the layer 3 model doesn't know how to evaluate at that stage anyway. The only obvious way frequency domain quantization becomes audible in a frequency sense is when it inflates or causes a collapse of critical band noise energy.

A noise allocation loop makes sense in layer 2 where quantization is in the time domain. It makes no sense whatsoever in layer 3 where quantization is [essentially] in frequency. Early on, someone indulged a little too heavily in cargo-culting ideas without really thinking it through and the rest of the world has been copying that concept without questioning it ever since.

[Vorbis is guilty of some cargo-culting as well. It's why floor zero, tessellated codebooks and residue 0 are no longer used]

Just because a grad student somewhere got a thesis out of an idea doesn't necessarily mean it actually works well ;-)

[edit: the grad student crack is something of a snark; I don't know who originally established the AbS noise allocation loop, and it was mostly in reference to TwinVQ, not MPEG layer-foo.]

Monty
xiphmont
QUOTE(Gabriel @ Mar 28 2006, 10:31 AM)
QUOTE
It's using a fully linear quantizer. That should help a bit, shouldn't it?

Of course, it helps a lot.


Note that the quantization is part of the codebooks; our encoder uses linear quantization as initial testing showed non-linear quantization yielded no benefit (possibly a result of the fine-grained floor).

However, codebooks could be used that implement nonlinear quant if someone wanted. I know it's not part of the argument at hand, but that might be a useful factoid for someone reading.

QUOTE(Gabriel @ Mar 28 2006, 10:31 AM)
QUOTE
Vorbis bitrate control is based on having multiple psy tuning curves corresponding to multiple expected bitrates, and binary searching/interpolating between them to find a psy setting that produces the wanted bitrate.

I understand now.
Fine for VBR, but would that mean that in CBR the Vorbis encoder has to run its psy model several times?
*



It depends. I've written three Vorbis encoders at this point (two are commercial and closed source). The reference codec does indeed just generate a bunch of possible packets and chooses the one closest to the target size it decides it needs, primarily because this is a very simple and easy to understand method that still yields good quality.

The closed encoder I wrote for Mercora does not do this; it adjusts the encoding model on the fly and only generates a single packet, even for CBR. The tracking method is actually very similar to reference, it just adjusts the model target of the next frame and takes what it gets instead of choosing from a set of candidates for the current frame. If the next frame is too large/too small it simply ratchets up or down on model params.
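
A rough sketch of the single-pass tracking idea described above, under stated assumptions: `frame_bits` is a hypothetical toy encoder, and the step size and targets are invented for illustration. It is not the Mercora encoder, whose code is closed. Each frame is encoded exactly once, and only the setting for the next frame is nudged.

```python
def frame_bits(complexity: float, quality: float) -> float:
    """Hypothetical toy encoder: bits produced for one frame."""
    return complexity * (1.0 + quality)


def track_bitrate(complexities, target_bits, quality=5.0, step=0.25):
    """Encode each frame once, then ratchet the setting for the NEXT frame."""
    sizes = []
    for complexity in complexities:
        bits = frame_bits(complexity, quality)
        sizes.append(bits)                       # take what we get
        if bits > target_bits:
            quality = max(0.0, quality - step)   # too large: ratchet down
        elif bits < target_bits:
            quality = quality + step             # too small: ratchet up
    return sizes, quality


sizes, final_quality = track_bitrate([100.0] * 200, target_bits=400.0)
print(sizes[0], sizes[-1], final_quality)
```

The trade-off against choosing among several candidate packets is that individual frames may overshoot or undershoot the target while the setting catches up.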

Monty
xiphmont
QUOTE(takehiro @ Mar 28 2006, 12:08 PM)
QUOTE(Gabriel @ Mar 28 2006, 08:53 PM)
I do not think that this is the reason the current Vorbis encoder does not use AbS.
There must be something else. Perhaps Vorbis is able to accurately predict encoded size? (I do not know Vorbis deeply.) Then it could make AbS useless.
*


I think Vorbis avoids AbS because AbS is heavily patented.
*



I do not use AbS because I don't need it. There are many ways to code audio, AbS is one of them, not the best one. It's appropriate in some designs, inappropriate in others. It is inappropriate in Vorbis.

Monty
Woodinville
QUOTE(xiphmont @ Mar 29 2006, 10:02 AM)
I do not use AbS because I don't need it.  There are many ways to code audio, AbS is one of them, not the best one.  It's appropriate in some designs, inappropriate in others.  It is inappropriate in Vorbis.

Monty
*



So, cycles aside, when is it inappropriate?
Gabriel
QUOTE
I don't know who originally established the AbS noise allocation loop, and it was mostly in reference to TwinVQ, not MPEG layer-foo.

I think that AbS started in CELP encoders.
I would also be interested to know who introduced it into modern frequency-based encoders, especially if there is a paper available. My guess would be either JJ or Brandenburg, but this is only a guess.

In Layer III, because of the non-linear quantizer and Huffman coding, it's quite hard to predict frame size from quantizer params, so AbS provides a brutal but working way to overcome this problem.

QUOTE
Correct. The noise introduced by quantization in the frequency domain is one of the great red herrings of layer 3 encoding. It is not audible as or additive to spectral noise; it is audible only as loss of time-locality, which the layer 3 model doesn't know how to evaluate at that stage anyway. The only obvious way frequency domain quantization becomes audible in a frequency sense is when it inflates or causes a collapse of critical band noise energy.

A noise allocation loop makes sense in layer 2 where quantization is in the time domain. It makes no sense whatsoever in layer 3 where quantization is [essentially] in frequency.

Well, I do not agree about the "makes no sense", but you are pointing to an interesting idea:
Doing AbS on time domain subbands, even in a transform codec.
Does anyone know about a comparison between time domain subbands based AbS and frequency subbands based AbS?
xiphmont
QUOTE(Gabriel @ Mar 30 2006, 03:35 AM)
QUOTE

A noise allocation loop makes sense in layer 2 where quantization is in the time domain. It makes no sense whatsoever in layer 3 where quantization is [essentially] in frequency.

Well, I do not agree about the "makes no sense", but you are pointing to an interesting idea:
Doing AbS on time domain subbands, even in a transform codec.
Does anyone know about a comparison between time domain subbands based AbS and frequency subbands based AbS?
*



I'm sorry, you're right, 'makes no sense' is too strong a statement. It's not AbS that I object to, it's that it's being used to produce a metric which itself is misguided. Were the metric (bit allocation according to quantization noise) eliminated as I think it should be, that also means there's no reason to use AbS to compute it. I was applying 'makes no sense' transitively.

It's not the quantization noise that should be being looked at over the vast majority of the spectrum; it's the change in energy introduced by quantization that's important. They're not the same thing.

Monty
xiphmont
QUOTE(Woodinville @ Mar 29 2006, 05:02 PM)
QUOTE(xiphmont @ Mar 29 2006, 10:02 AM)
I do not use AbS because I don't need it.  There are many ways to code audio, AbS is one of them, not the best one.  It's appropriate in some designs, inappropriate in others.  It is inappropriate in Vorbis.

Monty
*



So, cycles aside, when is it inappropriate?
*



In the case of Vorbis, I have means to compute the metrics I need directly. For that reason, I don't need AbS. Ripping out a faster direct computation to replace it with an iterated system that converges on the answer or brute forces all possibilities would be inappropriate, unless that iterated system was demonstrably superior in some way (ie, the 'metric' is actually a 'heuristic' and the iterated system produces a usefully better answer than the direct computation).

Another example might be in rapid prototyping of code. AbS is often easier to grasp/deploy. It would be a perfectly valid choice in a proof-of-concept where there are great savings in coding time at the stage of determining whether or not a design can work well enough to pursue with greater resources.

Beyond those two concrete examples, the answer could easily turn into a computational complexity and optimization essay. I think I'll take the Justice Stewart cop out of "I know it when I see it" which isn't really an answer, but it is a reasonably practical approximation of an answer.

Monty
SebastianG
QUOTE(xiphmont @ Mar 30 2006, 11:20 PM)
In the case of Vorbis, I have means to compute the metrics I need directly.  For that reason, I don't need AbS.  Ripping out a faster direct computation to replace it with an iterated system that converges on the answer or brute forces all possibilities would be inappropriate, unless that iterated system was demonstrably superior in some way (ie, the 'metric' is actually a 'heuristic' and the iterated system produces a usefully better answer than the direct computation).
*


Makes perfect sense. But I have trouble figuring out how such a direct computation is possible. We are talking about estimating the power of the quantization noise given a certain floor curve level and quantizer, right? Since there's no dithering involved (neither additive nor subtractive *hint hint*) these errors (specifically their standard deviation from zero) cannot be predicted very well. In a lucky case you get very small errors (if the samples prior to quantization are already very close to the quantized ones); in the worst case you make a per-sample error with a magnitude of 0.5 times the quantizer step. This is why I believe AbS makes sense here.

Sebi
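
To put numbers on the lucky and worst cases mentioned in the post above, here is a small sketch of an undithered round-to-nearest quantizer. The step size, sample ranges and the "lucky" values are arbitrary assumptions chosen for illustration.

```python
import random


def quantize(x: float, step: float) -> float:
    """Uniform round-to-nearest quantizer, no dither."""
    return step * round(x / step)


random.seed(0)
step = 1.0
samples = [random.uniform(-20.0, 20.0) for _ in range(10000)]
errors = [quantize(x, step) - x for x in samples]

worst = max(abs(e) for e in errors)                      # bounded by step / 2
rms = (sum(e * e for e in errors) / len(errors)) ** 0.5  # near step / sqrt(12)

# "Lucky" input: values that already sit close to the quantizer grid.
lucky = [3.02, -7.01, 11.98]
lucky_rms = (sum((quantize(x, step) - x) ** 2 for x in lucky) / len(lucky)) ** 0.5

print(worst, rms, lucky_rms)
```

For inputs spread evenly across the quantizer cells the RMS error lands near step/sqrt(12), but for a particular block it can be far smaller, which is the unpredictability the post refers to.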
xiphmont
QUOTE(SebastianG @ Mar 31 2006, 05:40 AM)

Makes perfect sense. But I have trouble figuring out how such a direct computation is possible. We are talking about estimating the power of the quantization noise given a certain floor curve level and quantizer, right? Since there's no dithering involved (neither additive nor subtractive *hint hint*) these errors (specifically their standard deviation from zero) cannot be predicted very well. In a lucky case you get very small errors (if the samples prior to quantization are already very close to the quantized ones); in the worst case you make a per-sample error with a magnitude of 0.5 times the quantizer step. This is why I believe AbS makes sense here.

Sebi
*



The core of my original argument (are we arguing past each other?) is that it makes no sense to compute this value at all; there is no such thing as additive quantization noise when quantization is performed in frequency. Vorbis does not compute it; it cares only about total bark-band before-quantization and after-quantization energy.

OTOH, the first-stage quantizer in Vorbis is about as simple as you can get (round to nearest integer). Here, the direct computation and AbS line is blurred as Vorbis does use the known outcome of quantization to find energy levels. So, it's using what could be thought of as a partial decoding step. The concepts blur. What it is not using is some sort of optimal fitting loop.

[In fact, because Vorbis dithers in frequency to maintain noise energy, the worst case magnitude error can be as high as 1.5 times the quantizer step. Factor in stereo coupling, and it can/will be larger still.]
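
A minimal sketch of the before/after band energy comparison described above. This is not Vorbis code: the band contents are invented, and a single band with round-to-nearest quantization stands in for the real floor and residue machinery.

```python
def band_energy(coeffs):
    """Total energy of the spectral coefficients in one band."""
    return sum(c * c for c in coeffs)


def quantize_band(coeffs):
    """First-stage quantizer as described: round to nearest integer."""
    return [float(round(c)) for c in coeffs]


def energy_ratio(coeffs):
    """After-quantization energy divided by before-quantization energy."""
    return band_energy(quantize_band(coeffs)) / band_energy(coeffs)


# Assumed example bands (floor-normalized values are made up).
quiet_band = [0.4, -0.3, 0.45, -0.2, 0.35, 0.1, -0.4, 0.3]  # all round to 0
loud_band = [3.6, -2.7, 4.2, -1.8, 2.9, 0.7, -3.4, 2.6]

print(energy_ratio(quiet_band), energy_ratio(loud_band))
```

The quiet band shows the energy collapse case: every coefficient rounds to zero, so the ratio is 0 even though no single error exceeds half a step. The loud band keeps its energy within a few percent.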
Gabriel
Monty,

You are telling us that quantization in the frequency domain produces quantization noise that can not be measured using the same methods as used in the time domain.

It seems clear to me that frequency quantization is producing time-domain smearing, but I do not really understand why you could not correctly measure/compute quantization noise in the frequency domain.

Of course, it seems a little strange to me, but considering that you are a sane guy (at least it seems so) and that current Vorbis encoder works correctly, I am quite curious about it.

As your position goes against the vast majority of common codec engineering (but could very well be right), did you encounter any reference sharing your opinion? Your ideas must come from somewhere, right?
xiphmont
QUOTE(Gabriel @ Apr 3 2006, 05:13 PM) *

Monty,

You are telling us that quantization in the frequency domain produces quantization noise that can not be measured using the same methods as used in the time domain.


Yes. The mapping is symmetrical and not unexpected:
Quantization in the time domain shows up as spread-spectrum noise in frequency.
Quantization in the frequency domain shows up as spread-time noise.

If you quantize a spectrum to, say, +/-6dB of the original values but the total energy in any given band is unchanged, you'll only perceive time effects. You don't add 6dB of noise to the spectrum on top of the noise already there. Note that the time effects will include rapid random modulation of the envelopes of pure tones as well, but we're specifically talking about noise, not tones.

Is this really not obvious? Nothing in theory or previous experimentation has ever suggested otherwise, regardless of what the millions of projects originally derived from dist8 have put into code and propagated forward ;-)

Monty
Gabriel
QUOTE
Quantization in the time domain shows up as spread-spectrum noise in frequency.
Quantization in the frequency domain shows up as spread-time noise.

Of course.
QUOTE

If you quantize a spectrum to, say, +/-6dB of the original values but the total energy in any given band is unchanged, you'll only perceive time effects.

It's "only" that I do not understand. To me this could only be the case on very small frequency bands.
Take a given frequency band going from coeff number N0 to M0. Now, swap the coeffs in this band, ie N1 is M0 and so on up to M1 which is N0.
I believe that while the total energy in the band is the same, frequency change will be audible.
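
The thought experiment above can be sketched in a few lines. The band values are invented, and the energy-weighted centroid is only a crude assumed proxy for where the energy sits inside the band, not a perceptual measure.

```python
def energy(coeffs):
    """Total energy of a band of coefficients."""
    return sum(c * c for c in coeffs)


def spectral_centroid(coeffs):
    """Energy-weighted mean bin index within the band."""
    return sum(i * c * c for i, c in enumerate(coeffs)) / energy(coeffs)


# Assumed band: one strong line at the low edge, little energy elsewhere.
band = [9.0, 0.5, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1]
swapped = band[::-1]   # coefficient N0 becomes M0 and so on

print(energy(band), energy(swapped))
print(spectral_centroid(band), spectral_centroid(swapped))
```

Band energy is identical before and after the swap, yet the strong line has moved from one edge of the band to the other, so total band energy alone cannot distinguish the two spectra.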
SebastianG
QUOTE(xiphmont @ Apr 6 2006, 07:49 PM) *

Is this really not obvious?

No, I'm not sure I understand what you mean by "you quantize a spectrum to, say, +/-6dB of the original values". --- I seriously think that you might be on the wrong track here, though. It's like saying "I don't care about SNR but I calculate a floor curve anyway for some reason and use a quantizer step of 1". IMHO you just get lucky most of the time (the actual SNR is close enough to what the psychoacoustic model suggests (SMR)).

Let's not forget that the DCT4 is an orthogonal linear mapping. The whole high dimensional space is just rotated. Quantization, SNR, energy ... all that stuff doesn't change that much in the rotated space. You probably overrate the representation the MDCT gives us.

Sebi
Woodinville
Uh, I'm confused, I have to admit. I'm not sure who is saying what here...

The MDCT is 1:1 and onto over the whole signal (although not per-block). Therefore, over the whole signal (or any overlapped part of it, i.e. for which both halves of the IMDCT are done) it obeys Parseval's theorem.

That means that the noise also obeys Parseval's theorem. As does the signal. As does everything.

So noise in transform domain == noise in time domain.

Yes? No?
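
A small numerical check of the energy claim above, as a sketch under simplifying assumptions: it uses a single block of an orthonormal DCT-IV (the kernel underlying the MDCT) rather than a full lapped MDCT over a whole signal, and the block length and quantizer step are arbitrary.

```python
import math
import random


def dct4(x):
    """Orthonormal DCT-IV; with this scaling it is its own inverse."""
    n = len(x)
    scale = math.sqrt(2.0 / n)
    return [scale * sum(x[j] * math.cos(math.pi / n * (j + 0.5) * (k + 0.5))
                        for j in range(n))
            for k in range(n)]


random.seed(1)
signal = [random.uniform(-1.0, 1.0) for _ in range(16)]
spectrum = dct4(signal)

step = 0.25                                    # assumed quantizer step
quantized = [step * round(c / step) for c in spectrum]
decoded = dct4(quantized)                      # inverse transform

freq_error_energy = sum((q - c) ** 2 for q, c in zip(quantized, spectrum))
time_error_energy = sum((d - s) ** 2 for d, s in zip(decoded, signal))
print(freq_error_energy, time_error_energy)
```

The two error energies agree to rounding precision, which is the Parseval statement. The check says nothing about how that error is distributed in time or how it is perceived.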

Gabriel
QUOTE(Woodinville @ Apr 8 2006, 12:15 AM) *

So noise in transform domain == noise in time domain.

Yes? No?

I say yes, and it seems that Monty said no.

(However, the change introduced in the signal, i.e. the noise, might have different properties in the two representations.)

xiphmont
QUOTE(Woodinville @ Apr 7 2006, 07:15 PM) *

Uh, I'm confused, I have to admit. I'm not sure who is saying what here...

The MDCT is 1:1 and onto over the whole signal (although not per-block). Therefore, over the whole signal (or any overlapped part of it, i.e. for which both halves of the IMDCT are done) it obeys Parseval's theorem.

That means that the noise also obeys Parseval's theorem. As does the signal. As does everything.

So noise in transform domain == noise in time domain.

Yes? No?


Yes and No.

Parseval's theorem is intact, but the quantization noise in-time is not perceived like quantization noise in-frequency. I offer Vorbis as practical proof.

Assuming we have only the barest background (I know we all know this, I'm trying to figure out where we're diverging):

The ear has separate perception hardware for frequency-based perception (pure tones/tonal color) and time-events (sudden changes in energy, such as attacks, clicks, etc). The nervous pathways are separate (the signals are sent to the brain over separate distinct nerve bundles) and the way the brain processes the two different kinds of signals is also separate. Tone perception is the slower pathway, and the brain backdates events from both pathways to keep them in sync with each other and visual stimulus.

Noise perception (in the frequency domain) has been shown to have roughly bark-width resolution; that is, narrowband noise contributes energy to a critical band of about one bark-width in the ear's cochlear hardware. Just like a windowed FFT shows a pure tone as a narrow smear (not a single spectral line), the ear's frequency filterbank isn't perfect. The banks are very narrow, but there's also a lot of energy leakage. You can quantize noise *very* roughly... so long as the overall energy envelope, smoothed with a roughly one-bark window (I'll gloss over the details of the window for now) across the spectrum, is preserved. The frequency detection apparatus of the ear will not notice the difference unless the quantization is so rough that the cochlear filterbank has enough time and the remaining quantized lines are few and far enough apart that they can start being resolved as individual tones (the mp3 'sparkling' effect). So long as we avoid this, it's the final, decoded energy delivered to the filterbank that that part of the ear hears. Again, I'm neglecting how pure tones mesh with this, although they don't really complicate things much. The model is the same.

...the ear's time apparatus is perceiving a completely orthogonal audio metric; it's *excellent* at detecting abrupt energy changes and resolves the time of occurrence to better than 5 ms. If you quantize very roughly in frequency, the cochlear frequency filterbank probably won't notice (it's only detecting energy), but the time-energy detector *will* notice.... if the quantization causes enough temporal smearing of the signal.

When one quantizes in time (assume monolithic spectrum, no subbanding), one contributes spread-spectrum noise to the frequency domain. Because the HF octaves usually have low energy to begin with, this tends to be easily audible. It is detected easily by the slow cochlear filterbank because some parts of the spectrum are noticeably inflated (even if total energy drops, energy somewhere increases).

When quantizing in frequency in such a way that the bark-band energy remains the same, the cochlear filterbank has no way of detecting the difference. It is designed to hear energy. If the energy is preserved, what it outputs doesn't change. However, quantization in frequency sprays spread-block noise into time. If some part of the 'time' block is much quieter, energy there is 'inflated' and the time hardware hears it as pre-echo-- an entirely different kind of 'noise'. So different that it's not called noise-- it's called 'pre-echo' :-)
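
A toy illustration of the spread-block effect described above, with assumptions stated up front: one 32-sample block, a single-block orthonormal DCT-IV standing in for the codec's MDCT, a lone click late in the block, and a deliberately coarse step. Unlike the case discussed in the post, no attempt is made here to preserve band energy.

```python
import math


def dct4(x):
    """Orthonormal DCT-IV; with this scaling it is its own inverse."""
    n = len(x)
    scale = math.sqrt(2.0 / n)
    return [scale * sum(x[j] * math.cos(math.pi / n * (j + 0.5) * (k + 0.5))
                        for j in range(n))
            for k in range(n)]


N = 32
block = [0.0] * N
block[24] = 1.0                      # silence, then a click late in the block

step = 0.2                           # deliberately coarse (assumed value)
spectrum = dct4(block)
quantized = [step * round(c / step) for c in spectrum]
decoded = dct4(quantized)

# Energy in the first half of the block, which was silent in the original.
pre_original = sum(v * v for v in block[:16])
pre_decoded = sum(v * v for v in decoded[:16])
print(pre_original, pre_decoded)
```

The original has no energy before the click, while the decoded block does: the frequency-domain rounding error comes back as signal in the part of the block that used to be silent.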

Hopefully, this gets past remaining semantic confusion... perhaps we've been agreeing with each other for some time.

Monty
xiphmont
...oh, and if your blocks are large enough that even the cochlear frequency filterbank can detect the pre-echo, it sounds really bad :-) The time pathway seems to be somewhat less sensitive about changes of the quality of an attack, it seems designed to determine *when* things are happening, not specifically what (that data is backfilled from the filterbank). This last bit is still wide open research AFAIK, or at least it was in my last binge of research which was admittedly a few years ago.

Monty
Woodinville
QUOTE(xiphmont @ Apr 10 2006, 01:25 PM) *
Yes and No.

Parseval's theorem is intact, but the quantization noise in-time is not perceived like quantization noise in-frequency. I offer Vorbis as practical proof.



So, yes.

Perception is a separate issue, and you may assume I know a bit about perception.

None the less, the noise in the frequency domain exactly describes the noise in the time domain, and vice versa, that's a simple question of duality of transform.

You can span the space in either domain, of course simple noise injection in the two domains is quite different.

I don't think anyone has claimed that the two domains have the same perceptibility, only that the two domains are duals and, as a result, what you can describe in one domain you can also describe in the other, of course in different form.



QUOTE(xiphmont @ Apr 10 2006, 01:36 PM) *

...oh, and if your blocks are large enough that even the cochlear frequency filterbank can detect the pre-echo, it sounds really bad :-) The time pathway seems to be somewhat less sensitive about changes of the quality of an attack, it seems designed to determine *when* things are happening, not specifically what (that data is backfilled from the filterbank). This last bit is still wide open research AFAIK, or at least it was in my last binge of research which was admittedly a few years ago.

Monty



This is not the same question. When you introduce perception you have to address the mostly-minimum-phase, nonlinear response of the hearing apparatus, which is distinctly a horse of another color.

Consider, if you have a perceptible pre-echo, you'll start to depolarize outer hair cells to start the compression process. This is going to substantially change the perception of the actual attack in an egregious case.

But this does not in any fashion address the issue of spanning the space, which is how the original question seemed to evolve. You better be able to span the same space from either domain, eh? Certainly the representations will differ.
xiphmont
QUOTE(Woodinville @ Apr 10 2006, 05:37 PM) *


I don't think anyone has claimed that the two domains have the same perceptibility, only that the two domains are duals and, as a result, what you can describe in one domain you can also describe in the other, of course in different form.



I fear that a number of people, both doing research and writing code, do implicitly expect the perceptibility to be the same or similar. I may be wrong; I hope so.

QUOTE(Woodinville @ Apr 10 2006, 05:37 PM) *


This is not the same question. When you introduce perception you have to address the mostly-minimum-phase, nonlinear response of the hearing apparatus, which is distinctly a horse of another color.

Consider, if you have a perceptible pre-echo, you'll start to depolarize outer hair cells to start the compression process. This is going to substantially change the perception of the actual attack in an egregious case.

But this does not in any fashion address the issue of spanning the space, which is how the original question seemed to evolve. You better be able to span the same space from either domain, eh? Certainly the representations will differ.


Yes, we agree completely here. I think I now see the semantic disconnect.

The transform must be a complete, unique mapping and noise introduced quantizing in one domain is also fully and uniquely described in the other; it is mathematically correct to call the error from spectral quantization 'noise'. Nevertheless, noise introduced by spectral quantization is not perceived as noise defined in the colloquial sense. This semantic disconnect appears to have confused the living daylights out of a full generation of codec engineers.

The two meanings of 'noise' represent nearer the same thing in layer 2 where quantization was in the time domain, albeit a subbanded time domain. The two meanings are practically [that is, perceptually] very different in layer 3. The noise shaping loop in mpeg layer 3 certainly behaves as if it freely conflates the two. ...or perhaps that little nugget in the ISO example code is just one of the better practical jokes of all time?

Monty
Gabriel
So what you are saying, if I understood you right, is that you think that using local frequency distortion (by subband) is not a good indicator of time domain smearing.
To me this does not imply that AbS is not working, but only that the analysis should be different depending on whether you want to check time smearing or frequency quantization noise.

I also think that in the case of transients it is better to focus on energy preservation (by subband) than on constant SMR (of course, as "our" usual SMR computations are done using the assumption that the signal is steady over the window).
But under steady conditions, I really do not see how NOT computing distortion per subband could improve the quality of your analysis.
Ivan Dimkovic
I have a hypothesis here:

I don't think the noise "footprint" in various domains is of issue here - under proper conditions, and good block switching algorithm + some temporal noise shaping tool, such as TNS (as in AAC) there should be no problems with noise smearing / mismatch in time domain at all (anyway much less than what our auditory system should perceive).

I think the problem that some people are seeing here is rather more complex, and it requires an assessment of whether the NMR measure of "noise" to "mask threshold" is actually in accordance with the physical masking experiments done under lab conditions.

The problem with NMR itself is that, as a measure, it describes the difference between the original spectrum and the requantized spectrum grouped per frequency band. As such, it is unable to detect something I call "microholes" that happen with MDCT and quantization: small periodic shutdowns of the spectral bins.

These small shutdowns lead to a couple of different artifacts: "fluttering" in the case of a low-energy noise-like signal, "sound coloration" in the case of a more complex harmonic signal, etc...

So, what am I trying to say - noise measured with NMR is not "narrowband white noise superimposed (mixed) in the test signal to test the masking conditions under scientific noise masking experiments" - rather, it is just a difference of requantized MDCT spectrum and original signal - called "noise" in the signal-processing terminology.

And, since our ears do not listen to the "difference" of two signals (one of which we usually do not have ;) ), in "DSP terms" this difference called "quantization noise" cannot be directly related to superimposed noise in physical experiments. It could up to some degree, but there are obvious limits and problems.

The problem with the plain NMR approach is that we have a whole range of cases where small holes in the spectrum resulting from quantization are not deemed relevant by the noise allocation module in the perceptual codec, because NMR is a SUM of all squared differences between the original and requantized signal within the given band, and holes are just a small part of it if the scalefactor band is wide enough... and I have a lot of experimental proof that many people are quite sensitive to those microholes (Guru, Juha, etc. :)
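
A toy calculation of the point about summed band error, with everything in it assumed for illustration: a 16-bin band of equal-amplitude coefficients, an arbitrary mask energy, and a band NMR defined simply as summed squared error over mask energy. Real encoders and PEAQ compute these quantities with far more care.

```python
import math


def band_nmr_db(original, decoded, mask_energy):
    """Band noise-to-mask ratio in dB from summed squared differences."""
    noise = sum((o - d) ** 2 for o, d in zip(original, decoded))
    return 10.0 * math.log10(noise / mask_energy)


original = [1.0] * 16
mask_energy = 2.0                       # assumed allowed noise energy

# Case A: a small error spread evenly over every bin.
spread = [o - 0.25 for o in original]

# Case B: every bin perfect except one that is switched off (a "microhole").
hole = list(original)
hole[5] = 0.0

print(band_nmr_db(original, spread, mask_energy))
print(band_nmr_db(original, hole, mask_energy))
```

Both cases give the same band NMR of about -3 dB, so an allocator that looks only at this number cannot tell an evenly spread error from a bin that has vanished.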

So, possible solutions for this are:

- Prevent microholes and similar harmonic distortions by some additional processing algorithm
- Overcode, so the allowed NMR is always too low for microholes to be introduced (not good IMO)

Interestingly enough, NMR alone as a measure had considerably lower correlation with SDG in the PEAQ standardization tests than the final (advanced) PEAQ version - this does not mean that NMR is bad, it just means that it alone, as a measure, obviously does not quantify the impairment very well.
Enig123
QUOTE(Ivan Dimkovic @ Apr 11 2006, 07:44 PM) *

So, possible solutions for this are:

- Prevent microholes and similar harmonic distortions by some additional processing algorithm


Ivan, did you already have ideas in mind and thus can give a more specific description of the algorithm?
Ivan Dimkovic
Unfortunately, I don't have any more information I could share. But I think my hypothesis is already a good starter for thinking in the right direction ;)
SebastianG
QUOTE(xiphmont @ Apr 11 2006, 06:46 AM) *

Nevertheless, noise introduced by spectral quantization is not perceived as noise defined in the colloquial sense.

It depends. If you add white gaussian noise in the frequency domain it'll be no different to adding white gaussian noise in the time domain -- no difference at all. But why are we experiencing those weird metallic sounding artefacts when we do quantization in the freq domain? (me being rhetorical) Given that the MDCT (on the whole signal) is an orthogonal mapping, the quantization error you do in the frequency domain just gets "rotated" in the inverse transform. Why should it sound differently compared to quantizing in the time domain? (again rhetorical)

Things get clearer if we really imagine a high dimensional space where every point in the space is one possible "block of audio". By quantizing each component linearly with a constant quantizer stepsize we map each point to another point on an orthogonal grid. So, when you say it sounds different when done in the frequency domain compared to quantizing in the time domain it must have something to do with the orientation of the grid, no? Because this is the only thing that changes by applying an orthogonal transform.

At this point I'd like to include dithering. Why do we do it (in a DSP kind of sense) again? To keep the error's samples decorrelated! (Yes, we can do stuff like noise shaping too but dithering is required nonetheless to be able to guarantee a certain degree of independence of the error from the signal).

My proposition: If we do quantize a signal with proper dithering it does not matter at all in what domain we do it, it'll sound the same (real noise, no metallic sounding rubbish) on condition that we try to keep a similar noise power distribution in the time/frequency space by applying noise shaping.

Justification: Let x be a random vector (each component is a random variable independent of the others, all with the same standard deviation sigma). Suppose we do this mapping Ax=y where A is an orthonormal matrix. What can we say about the distributions of y's components?
1) Each component is a random variable with the same standard deviation sigma
2) There's no correlation between components in y (there might be a dependence though, in case of non-gaussian distributions).
=> We get uncorrelated white noise with a constant per-sample standard deviation in the time domain. YAY!
This is just a case for a flat noise distribution in the time/frequency space but it extends to other more practical cases as well.
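
The two properties follow from Cov(y) = A Cov(x) A^T = sigma^2 A A^T = sigma^2 I. As a rough numerical check, the sketch below uses an orthonormal DCT-IV matrix as a stand-in for "an orthogonal mapping". That choice is an assumption for illustration only; a real MDCT is lapped and only orthogonal when taken over the whole signal. It verifies that A A^T is the identity and that an arbitrary error vector keeps its energy when "rotated" into the other domain.

```python
import math

def dct4_matrix(n):
    """Orthonormal DCT-IV: A[k][j] = sqrt(2/n) * cos(pi/n * (j + 0.5) * (k + 0.5))."""
    scale = math.sqrt(2.0 / n)
    return [[scale * math.cos(math.pi / n * (j + 0.5) * (k + 0.5))
             for j in range(n)] for k in range(n)]

def matvec(a, x):
    """Matrix-vector product y = A x."""
    return [sum(a_kj * x_j for a_kj, x_j in zip(row, x)) for row in a]

def gram(a):
    """A times A-transpose; equals the identity for an orthonormal A."""
    n = len(a)
    return [[sum(a[i][k] * a[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

N = 8
A = dct4_matrix(N)

# Largest deviation of A * A^T from the identity matrix.
G = gram(A)
max_dev = max(abs(G[i][j] - (1.0 if i == j else 0.0))
              for i in range(N) for j in range(N))

# An arbitrary (made-up) quantization error in the frequency domain ...
err_freq = [0.3, -0.1, 0.25, -0.4, 0.05, 0.2, -0.35, 0.15]
# ... and the same error after the transform.
err_time = matvec(A, err_freq)

energy_freq = sum(e * e for e in err_freq)
energy_time = sum(e * e for e in err_time)
print(max_dev, energy_freq, energy_time)
```

This only checks second-order statistics (energy and the identity covariance); it says nothing about whether the error is independent of the signal, which is where dithering comes in.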

So, if we quantize in the frequency domain with dithering properly applied, we're doing fine. BTW: I don't consider "noise normalization" to be dithering in the usual DSP sense. NN just tries to keep the energy of the quantized signal close to the original signal's energy whereas dithering is mainly supposed to avoid nonlinear artefacts. NN helps avoid these a bit too, but not completely.

Now, the experienced reader might argue about dithering not being helpful when it comes to compression since it slightly increases entropy. But this only applies to additive dithering. So, if I had to design the next codec I'd seriously consider a subtractive dither. Subtractive dithering + quantization can be done simultaneously by switching randomly between different quantizers with the same step size but other offsets.
Example:
q1 = { ... -19, -15, -11, -7, -3, 1, 5, 9, 13, 17 ... }
q2 = { ... -17, -13, -9, -5, -1, 3, 7, 11, 15, 19 ... }
The quantizer (q1 or q2) will now be selected "randomly" for the current sample. A simple PRNG will do. Encoder and decoder need to know the exact state of this PRNG of course. The beauty of this example is the following: Notice that q1 = -q2. We can order the elements of q1 and q2 in ascending absolute values and assign an index to each of those starting at zero:
q1 = { q10= 1, q11=-3, q12= 5, q13=-7, q14= 9, q15=-11 ... }
q2 = { q20=-1, q21= 3, q22=-5, q23= 7, q24=-9, q25= 11 ... }
Assuming symmetrical sample distributions around zero this enables us to use the same entropy coder for both quantizers since Prob(sample=q1i) = Prob(sample=q2i) for all i. So, all the indices can be grouped and coded using a single Huffman code table for example.
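
A minimal sketch of the q1/q2 example, assuming a step size of 4 as in the level sets listed and a seeded PRNG shared by encoder and decoder. The function names, the seed and the sample values are made up for illustration, and entropy coding of the indices is left out.

```python
import random

STEP = 4  # both quantizers use the same step size

def index_to_level(i, which):
    """Level number i (ascending magnitude) of q1 (which=0) or q2 (which=1)."""
    level = (-1) ** i * (2 * i + 1)          # q1: 1, -3, 5, -7, 9, -11, ...
    return level if which == 0 else -level   # q2 = -q1

def quantize(x, which):
    """Index of the level of the selected quantizer that is nearest to x."""
    offset = 1 if which == 0 else -1         # q1 = 1 + 4k, q2 = -1 + 4k
    level = offset + STEP * round((x - offset) / STEP)
    return (abs(level) - 1) // 2             # inverse of index_to_level

def encode(samples, seed):
    rng = random.Random(seed)                # PRNG state known to both sides
    return [quantize(x, rng.getrandbits(1)) for x in samples]

def decode(indices, seed):
    rng = random.Random(seed)                # replays the same quantizer choices
    return [index_to_level(i, rng.getrandbits(1)) for i in indices]

samples = [0.2, -6.4, 10.9, 3.0, -1.7, 14.2, -12.5, 7.7]
indices = encode(samples, seed=1234)
decoded = decode(indices, seed=1234)
print(indices)
print(decoded)
```

Only the non-negative indices would be transmitted; the sign and the offset are recovered from the index parity and the PRNG state, which is why one code table can serve both quantizers.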

Hey, we might even want to go one step further. How about trellis-coded quantization? (Switching quantizers not only dependent on the output of the PRNG but also dependent on the previously quantized sample) This allows some clever rate/distortion optimization using the Viterbi algorithm. Benefits: Better rate/distortion ratio (approaching VQ) while still being simple to encode and decode.

Oh boy! It took some time writing this. :)


Sebi
xiphmont
QUOTE(Enig123 @ Apr 11 2006, 09:23 AM) *

Ivan, did you already have ideas in mind and thus can give a more specific description of the algorithm?


In Vorbis it's the primary function of the noise normalization code. We dither additional 'noise lines' into the spectrum to avoid these holes. The practicality is, of course, in the details. Yes, this does exacerbate smearing.

QUOTE(SebastianG @ Apr 11 2006, 10:45 AM) *

QUOTE(xiphmont @ Apr 11 2006, 06:46 AM) *

Nevertheless, noise introduced by spectral quantization is not perceived as noise defined in the colloquial sense.

It depends. If you add white gaussian noise in the frequency domain it'll be no different to adding white gaussian noise in the time domain -- no difference at all.


We're never adding *gaussian* noise. Our noise is always correlated in some way unless what we're quantizing is already gaussian noise.

QUOTE(SebastianG @ Apr 11 2006, 10:45 AM) *

At this point I'd like to include dithering. Why do we do it (in a DSP kind of sense) again? To keep the error's samples decorrelated! (Yes, we can do stuff like noise shaping too but dithering is required nonetheless to be able to guarantee a certain degree of independence of the error from the signal).

My proposition: If we do quantize a signal with proper dithering it does not matter at all in what domain we do it, it'll sound the same (real noise, no metallic sounding rubbish) on condition that we try to keep a similar noise power distribution in the time/frequency space by applying noise shaping.


Even if your dithering is 'uncorrelated', your signal itself still is. You are not adding/removing decorrelated Gaussian noise to time: you are only smearing the signal temporally. Unless, of course, your original signal is also Gaussian.

Monty
Ivan Dimkovic
QUOTE

We're never adding *gaussian* noise. Our noise is always correlated in some way unless what we're quantizing is already gaussian noise.


That is, I believe, the source of the problem - the MDCT quantizer does not inject uncorrelated gaussian noise and, therefore, directly relating the NMR of MDCT requantized spectra to the real-world "JND" in psychoacoustic experiments is definitely not valid.

However, I would still not rule NMR out as a good method of evaluating signal quality - it does indeed have drawbacks, but equipped with additional protection tools it suits the purpose exactly as it should, with satisfactory correlation with masking for day-to-day use (and it is quite fast, too)

I think that we are also now getting into the more complex matter of perceiving vs. objectively grading the distortion - as said above, we found it true that "quantization noise" (in MPEG terms) is not always equal to the "noise" injected in human psychoacoustic experiments, but there is more to it - there are other types of distortion perceived by humans that are hard to quantify with NMR.

For example, take a look at parametric coding tools, like SBR (used in mp3PRO and HE-AAC) - NMR-wise, they are a total disaster - horrible. But the subjective quality rating of such signals would in many cases contradict what we would measure with NMR. A careful person would find out that, energy-wise, they are much more correlated. Does it mean that band energy should also be taken into account, especially when coding at very low bit-rates?
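
A toy illustration of how a band can be "horrible" by a waveform-difference measure while matching perfectly in energy. The coefficients are made up, and the "reconstruction" is just the same values reversed and sign-flipped, which is only a crude stand-in for parametric regeneration and not how SBR actually works.

```python
import math

def energy(v):
    """Band energy: sum of squared coefficients."""
    return sum(c * c for c in v)

def noise_to_signal_db(original, decoded):
    """Energy of the difference signal relative to the original, in dB."""
    noise = sum((o - d) ** 2 for o, d in zip(original, decoded))
    return 10.0 * math.log10(noise / energy(original))

# Hypothetical high-frequency band of the original signal.
original = [3.0, -1.0, 2.0, 0.5, -2.5, 1.5]

# Stand-in for a parametric reconstruction: same magnitudes, unrelated waveform.
reconstructed = [-c for c in reversed(original)]

nsr = noise_to_signal_db(original, reconstructed)
energy_ratio_db = 10.0 * math.log10(energy(reconstructed) / energy(original))
print(round(nsr, 1), "dB noise-to-signal,", round(energy_ratio_db, 1), "dB energy mismatch")
```

The difference signal carries more energy than the original band itself, so any difference-based measure rates it as a failure, while the band energy is reproduced exactly.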

This, and many other things (micro-holes, small temporal instabilities of the signal missed by the NMR analysis, etc...) make the whole process of encoder optimizing much more fun :)

It is also worth noting that the traditional 2-pass iteration loop that Monty bashed :) is today hardly of any use in many encoders - for example, most state-of-the-art AAC encoders estimate scale factors based on signal statistics, and then do a small refinement (by NMR or not, this is a different issue)
Woodinville
QUOTE(xiphmont @ Apr 10 2006, 09:46 PM) *
The two meanings of 'noise' represent nearer the same thing in layer 2 where quantization was in the time domain, albeit a subbanded time domain.


But, but, the Layer 3 filterbank is also a QMF, and the MDCT is nothing more or less than a critically sampled filterbank.

They're all filterbanks, the only real question is how narrow the bands are.

And, for Layer 3, the distortion loop works great, because it controls noise as a function of frequency, which is of course the whole point to a distortion control loop.

So I don't quite see your point.

QUOTE(Ivan Dimkovic @ Apr 11 2006, 10:47 AM) *
It is also worth noting that the traditional 2-pass iteration loop that Monty bashed :) is today hardly of any use in many encoders - for example, most state-of-the-art AAC encoders estimate scale factors based on signal statistics, and then do a small refinement (by NMR or not, this is a different issue)


Well... Actually, it's not that bad, but I don't think I'm at liberty to say more, and furthermore, the IP belongs to a former employer.

Things like TNS do not break anything in any real fashion, either.
Ivan Dimkovic
I also don't think it (the double iteration loop) is bad, but nowadays it is just quite outdated as it is computationally demanding, especially if done properly, and some faster methods result in almost equal or sometimes even better quality.

However, most of those also use some way of analysis-by-synthesis.