Differenciam
Apr 11 2003, 20:36
This is a simple question, but one that's on my mind. How does a VBR codec (like all the ones I can think of, such as MPC and Ogg Vorbis, which are native VBR codecs) choose when to use a higher bitrate and when to use a lower one? Is it the volume in certain frequencies, or the number of sounds coming from a certain number of instruments? What makes the bitrate go to 278k in one part of a song and 167k in another? How does the encoder 'know'? Thanks B)
tangent
Apr 12 2003, 01:46
Generally, they decide on an "allowable noise threshold", or in the case of MPC, a "signal to mask ratio". Then they use as many bits as required to just meet the required quality. LAME ABR uses a different method, calculating "perceptual entropy" and guessing the frame size to use from there.
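To put rough numbers on "as many bits as required", here is a minimal sketch. It assumes the textbook rule of thumb that each extra quantiser bit lowers quantisation noise by about 6 dB. Real codecs use non-uniform quantisers and entropy coding, so this is only an illustration; the function names and the band levels are made up.

```python
import math

DB_PER_BIT = 6.02  # rule of thumb: one more quantiser bit = about 6 dB less noise

def bits_for_band(signal_db, allowed_noise_db):
    """Bits per sample needed to keep quantisation noise at or below the allowed level."""
    smr = signal_db - allowed_noise_db  # signal-to-mask ratio in dB
    if smr <= 0:
        return 0  # the band is fully masked, so nothing needs to be sent
    return math.ceil(smr / DB_PER_BIT)

def frame_demand(bands):
    """Sum the per-band demand over a frame (one sample per band, for simplicity)."""
    return sum(bits_for_band(signal, noise) for signal, noise in bands)

# Hypothetical (signal level, allowed noise level) pairs in dB for two frames.
quiet_masked_frame = [(40, 35), (30, 32), (20, 25)]
busy_frame = [(70, 30), (65, 28), (60, 35)]

print(frame_demand(quiet_masked_frame))
print(frame_demand(busy_frame))
```

The second frame asks for far more bits than the first, even though nothing here looks at volume directly: what matters is how far each band's signal sits above the noise level the model allows.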
QHOBBES 2.0
Apr 12 2003, 17:40
I think part of it has to do with volume, because when they were developing it they thought, "Hey, 320 kbps worth of silence is gonna be a big waste, better drop it to 64 kbps" or something like that.
QUOTE(QHOBBES 2.0 @ Apr 12 2003 - 06:40 PM)
I think part of it has to do with volume, because when they were developing it they thought, "Hey, 320 kbps worth of silence is gonna be a big waste, better drop it to 64 kbps" or something like that.
Sounds like a myth to me. It was discovered early on that some signals are more difficult to encode than others. What followed was a way of quantifying this 'difficulty' and devising a heuristic for a minimum-bits-required strategy (format-dependent).
VBR is the only way to maintain a more or less constant level of quality without wasting bits.
EDIT: Recorded silence is usually not easy to code, because there is inaudible noise.
QUOTE(VLSI @ Apr 12 2003 - 06:17 PM)
Recorded silence is usually not easy to code, because there is inaudible noise.
imo psychoacoustic algorithms should determine this noise is inaudible and encode it as digital silence. I don't know if any codecs do that though.
QUOTE(floyd @ Apr 12 2003 - 07:54 PM)
QUOTE(VLSI @ Apr 12 2003 - 06:17 PM)
Recorded silence is usually not easy to code, because there is inaudible noise.
imo psychoacoustic algorithms should determine this noise is inaudible and encode it as digital silence. I don't know if any codecs do that though.
I think it's difficult to detect all inaudible noise. I remember listening to a LAME track that had a silent part (I couldn't hear any noise), but used a lot of bits there. I'll see if I can find it.
EDIT: A noise gate might help, but I'm not sure if it's a silver bullet.
NeoRenegade
Apr 12 2003, 20:25
If the noise falls under the ATH dB level for the specific frequency it's at, it is encoded as digital silence. If it's above the ATH, it'll be kept, and random noise is like an infinitely complex signal, so it's likely to take a lot of bits.
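For reference, a commonly used approximation of the ATH curve is Terhardt's formula. The sketch below assumes levels are already expressed in dB SPL; mapping digital sample levels to dB SPL depends on the listener's playback volume, which an encoder can only guess at. The function names are made up for illustration.

```python
import math

def ath_db_spl(freq_hz):
    """Terhardt's approximation of the absolute threshold of hearing, in dB SPL."""
    khz = freq_hz / 1000.0
    return (3.64 * khz ** -0.8
            - 6.5 * math.exp(-0.6 * (khz - 3.3) ** 2)
            + 1e-3 * khz ** 4)

def is_below_ath(freq_hz, level_db_spl):
    """True if a component at this frequency and level should be inaudible in quiet."""
    return level_db_spl < ath_db_spl(freq_hz)

for f in (100, 1000, 3300, 16000):
    print(f, "Hz:", round(ath_db_spl(f), 1), "dB SPL")
```

The curve dips lowest around 3 to 4 kHz and rises steeply at both ends, so the same noise level can be inaudible at 100 Hz and audible at 3.3 kHz.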
But if it's pure noise, then PNS or SBR should be able to take care of reproducing it reasonably accurately, right? And at near-inaudible levels, it should be rather difficult to tell the "original" noise apart from generated noise with the same level and spectral content. I know there's an experimental PNS switch in the MPC encoder... Maybe you could play around a bit with that to see if it gives the desired effect...
sven_Bent
Apr 13 2003, 03:44
I don't think PNS and SBR are the same thing.
PNS = Perceptual Noise Substitution = generates "fake" noise instead of the real noise
SBR = Spectral Band Replication = generates a "fake" HF signal instead of the real HF signal
lucpes
Apr 13 2003, 05:17
QUOTE(floyd @ Apr 13 2003 - 12:54 AM)
QUOTE(VLSI @ Apr 12 2003 - 06:17 PM)
Recorded silence is usually not easy to code, because there is inaudible noise.
imo psychoacoustic algorithms should determine this noise is inaudible and encode it as digital silence. I don't know if any codecs do that though.
Most CDs have an SNR of around 60-70 dB or less... this noise might get 'very audible' on good equipment, so the codecs should not encode it as digital silence. Besides (hope I'm not talking b_sht here - someone correct me if I am), what's the difference for the encoders between -50 dB noise and a -50 dB tonal component of a classical performance recording? Try to remove the noise from a quiet piece using Cool Edit or Sound Forge and you'll see that all the air and openness is gone. Even 'recent' recordings use analog equipment (tape) in the first stage (e.g. Eric Clapton's Unplugged), so not encoding the noise that 'is inaudible' will get you somewhere in the Blade 128 area.
Aren't we missing the point?
Surely
Constant Bitrate = Variable Perceptual Quality
Variable Bitrate = Constant Perceptual Quality
So the VBR (or CPQ) encoder does not decide in advance how many bits to allocate.
Instead it would do something like this (not exact, but it gives a picture for the layman):
1. Do I need a short block for transients (to avoid pre-echo)?
2. Which stereo mode do I need for this block?
3. Looking at the loudest spectral bands, how much masking do the frequencies present provide? Work this out for all bands and determine how much distortion would be inaudible in each band according to the psychoacoustic model of masking in human hearing.
4. Quantize the signals in each spectral band at the coarsest level that's within the allowable distortion calculated in step 3. (This is the lossy bit)
5. Compress (e.g. Huffman coding similar to zip) the information that's left.
6. Add padding (or set aside bit reservoir) if you're forced to by a format like MP3, to reach one of the set bitrates for the current frame.
7. If the bitrate exceeds the maximum allowed (320kbps for MP3), follow a strategy for discarding bits in the least audible manner possible (e.g. use unsafe stereo mode, discard the highest frequencies, or allow distortion to marginally exceed the masking threshold).
That's a very crude model for how it works, and certainly omits certain steps, but it's the general kind of approach, as I understand it (I could be wrong).
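Steps 6 and 7 of the crude model above can be sketched roughly as follows for MP3. This is only an illustration: `choose_bitrate` and `frame_capacity_bits` are hypothetical names, a 44.1 kHz sample rate is assumed, and the bit reservoir, frame header and side information are all ignored. The bitrate list is the set of MPEG-1 Layer III rates.

```python
# Bitrates (kbps) that an MPEG-1 Layer III frame is allowed to use.
MP3_BITRATES_KBPS = [32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320]
SAMPLES_PER_FRAME = 1152   # samples per MPEG-1 Layer III frame
SAMPLE_RATE_HZ = 44100     # assumed: CD audio

def frame_capacity_bits(kbps):
    """Bits available in one frame at the given bitrate."""
    return kbps * 1000 * SAMPLES_PER_FRAME / SAMPLE_RATE_HZ

def choose_bitrate(bits_needed, min_kbps=32):
    """Pick the smallest allowed bitrate that holds the frame.

    Returns (kbps, overflow). overflow is True when even 320 kbps is too
    small, i.e. the encoder must discard something (step 7).
    """
    for kbps in MP3_BITRATES_KBPS:
        if kbps >= min_kbps and frame_capacity_bits(kbps) >= bits_needed:
            return kbps, False
    return MP3_BITRATES_KBPS[-1], True

print(choose_bitrate(3000))                # a frame that needs about 3000 bits
print(choose_bitrate(100, min_kbps=128))   # a trivial frame with a 128 kbps floor
print(choose_bitrate(10000))               # more than any frame can hold
```

Whatever the chosen frame can hold beyond `bits_needed` is the padding or reservoir space mentioned in step 6.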
The MP3 format imposes certain restrictions. For example, without the -Y switch, the sfb21 scalefactor problem often forces LAME --alt-preset standard to waste bits encoding inaudible details well below the masking threshold in the other spectral bands.
At the moment, lame --alt-preset standard also sets a minimum bitrate (except for silence) of 128 kbps (I think it may be 96 kbps in LAME 3.94, which isn't recommended here; use 3.90.2). This could be defeated for near-mono recordings and/or low-bandwidth vintage audio by using the -b 32 switch and trusting the psychoacoustic model to use just enough bits. But for true stereo CD music, -b 32 saves very few bits because so few frames need less than 128 kbps, and the 128 kbps floor fixes some problems where the psychoacoustic model might get too aggressive in allowing distortion.
Formats like MPC, however, are more flexible and are allowed to drop to very low bitrates.
Some encoders enforce the Absolute Threshold of Hearing (ATH), but they do this when examining the frequency spectrum, not the sample values. So a very low-level signal buried in white noise (such as dither) will stand well above the noise floor in the frequency spectrum, while the dither itself will usually be below the threshold and won't be encoded. It's difficult to appreciate that a frequency-domain representation can easily reveal signals that can't be seen as a sinusoidal wave in the time-domain signal because of the noise swamping them. The spread-out time-domain signal is concentrated into a single frequency peak, while the noise (e.g. dither or tape hiss) that's hiding it covers a flat, broad range of frequencies, each at a much lower level. The signal therefore sticks up above the noise in the frequency spectrum, even though the noise's total power summed over all frequencies might be much greater than that of the repetitive sinusoidal signal.
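This is easy to check numerically. The sketch below is a toy example, not taken from any encoder: it buries a sine wave in white noise about 17 dB louder than the tone, then compares the DFT bin holding the tone with its neighbours. All the constants are arbitrary choices, and the tone is placed exactly on a bin to keep the arithmetic clean.

```python
import math
import random

N = 4096          # analysis window length (arbitrary)
TONE_BIN = 200    # the tone sits exactly on this DFT bin, so there is no leakage
TONE_AMP = 0.2    # sine amplitude
NOISE_STD = 1.0   # standard deviation of the white noise

def bin_power(samples, k):
    """Power of DFT bin k, computed directly (no FFT library needed)."""
    n = len(samples)
    re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(samples))
    im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(samples))
    return re * re + im * im

rng = random.Random(0)
tone = [TONE_AMP * math.sin(2 * math.pi * TONE_BIN * i / N) for i in range(N)]
noise = [rng.gauss(0.0, NOISE_STD) for _ in range(N)]
mixed = [t + w for t, w in zip(tone, noise)]

# Time domain: the noise's total power dwarfs the tone's power.
tone_power_time = sum(t * t for t in tone) / N
noise_power_time = sum(w * w for w in noise) / N

# Frequency domain: the tone's energy piles into one bin,
# while the noise is spread thinly across all of them.
tone_bin_power = bin_power(mixed, TONE_BIN)
neighbours = [k for k in range(TONE_BIN - 10, TONE_BIN + 11) if k != TONE_BIN]
avg_noise_bin_power = sum(bin_power(mixed, k) for k in neighbours) / len(neighbours)

print("time-domain tone/noise power ratio: %.3f" % (tone_power_time / noise_power_time))
print("tone bin vs neighbouring bins: %.1fx" % (tone_bin_power / avg_noise_bin_power))
```

In the waveform the tone is a small fraction of the total power, yet its bin stands well clear of the surrounding noise bins, which is the situation a frequency-domain ATH check sees.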
Actually it's probably harder to make the optimum choices in Constant Bit Rate (CBR = VPQ) mode to minimise the loss of quality while rigidly sticking to an excessively low bitrate. Only high-bitrate CBR makes the job easy, and one can happily waste any excess bits by encoding things that are predicted to be inaudible according to the psychoacoustic model.