Full Version: MSC Thesis about audio compression
rutra80
I'm writing an MSc thesis about audio compression, and I want to cover lossless, lossy, and hybrid compression as technically as I can. Since my knowledge is quite general and not too specialized, I'll surely need some help, which is why I'm starting this topic: to ask questions. I hope that a few of the thousands of great brains gathered here on HA are willing to help me a bit :)

For starters, let me ask for links to resources about:

1. MP3 encoding: a step-by-step description of the path a PCM audio stream must take to become an MP3. I need quite detailed (but not too detailed! ;) ) information about everything involved, like Huffman coding, quantization, M/S matrixing, MDCT, filter banks, noise shaping, masking, the psychoacoustic model, the bit reservoir, ATH, time/frequency domains, etc.

2. Speech encoding (Speex, G.72x, etc.): the technologies used and how they differ from general audio/music encoders like MP3.

3. Lossless/hybrid compression, as used by FLAC, WavPack, OptimFROG, etc.: what technologies are used, how they work, and what makes them better for audio compression than the technologies used by ZIP, RAR, etc.

Of course things like formulas or code snippets are beyond me; I'm just asking for information on how things work and what they do.
And yes, I know about the wiki and that Google is my friend. I use them of course, but if someone already knows of suitable resources, please spare me some research and post them.

Thank you in advance! :)

P.S.: I only know English and Polish, so I won't be able to follow resources in other languages.
ssamadhi97
Want us to write a couple chapters for your thesis too, while we're at it?
Garf
Putting it a bit differently, the problem is that your question is so broad that it's hard to give any answer that is going to be more useful to you than those two links:

http://google.com
http://wikipedia.org

Now, if there are any more specific questions like "I don't understand this", or "I think this works like that, am I correct", then it's going to be easy to get useful answers. But right now, hardly.


QUOTE
I'm writing MSC Thesis about audio compression
[...]
Of course things like formulas or code snippets are beyond me


Say again?
rutra80
QUOTE(ssamadhi97 @ Nov 4 2005, 11:48 PM)
Want us to write a couple chapters for your thesis too, while we're at it?
*


So far I have only asked for a few links.
QUOTE(Garf @ Nov 5 2005, 12:03 AM)
Putting it a bit differently, the problem is that your question is so broad that it's hard to give any answer that is going to be more useful to you than those two links:

http://google.com
http://wikipedia.org

Now, if there are any more specific questions like "I don't understand this", or "I think this works like that, am I correct", then it's going to be easy to get useful answers. But right now, hardly.

I guess you're right, I put it badly. I wrote above what I'm going to cover, and I hope to figure out most of these things on my own. What I was asking for was a few links which others think are good for a start. Anyway, I think I've found some already...
QUOTE
QUOTE
I'm writing MSC Thesis about audio compression
[...]
Of course things like formulas or code snippets are beyond me


Say again?
*


I wrote it to avoid things like this:
[image not preserved]
I don't know what an MSc is in your countries, but here in Poland it's more about theory than practice; practice is what the engineer degree is for.
If I were to dig through such formulas (and there surely are a lot of them in audio compression) and through encoder sources, I could just as well write my own encoder, document it, and use it for the thesis. I'm not able to do that and am not expected to be able to.
ssamadhi97
I doubt that you'll be able to get away without looking at any formulae, since you'll need to understand the basic theory behind different kinds of audio compression if you want to write about the differences between these.

For a basic understanding of how things work, you could try your luck with poking around on some codec homepages for not-too-in-depth explanations. (works well for lossless audio, for example)

For a more theoretical understanding and if you want to find useful scientific references which you can quote in your thesis, you should spend some time with search engines for scientific papers (citeseer and scholar.google.com come to mind) and look for a couple books in your library that cover the basics of (audio) compression.
robert
Psychoacoustics, facts and models; E. Zwicker, H. Fastl; Springer, ISBN 3-540-65063-6

Applications of digital signal processing to audio and acoustics; Kahrs, Brandenburg; Kluwer Academic Publishers; ISBN 0-792-38130-0
Latexxx
QUOTE(rutra80 @ Nov 5 2005, 03:28 AM)
QUOTE
QUOTE
I'm writing MSC Thesis about audio compression
[...]
Of course things like formulas or code snippets are beyond me


Say again?
*


I wrote it to avoid things like this:
[image not preserved]
I don't know what an MSc is in your countries, but here in Poland it's more about theory than practice; practice is what the engineer degree is for.
If I were to dig through such formulas (and there surely are a lot of them in audio compression) and through encoder sources, I could just as well write my own encoder, document it, and use it for the thesis. I'm not able to do that and am not expected to be able to.
*



Somehow I believe that you are trying something you shouldn't do. I have this one comment and I won't repeat it. Don't try to write anything about audio compression unless you study signal processing as a major or are willing to study all the mathematical formulae by yourself.
legg
QUOTE(Latexxx @ Nov 5 2005, 11:10 AM)
Somehow I believe that you are trying something you shouldn't do. I have this one comment and I won't repeat it. Don't try to write anything about audio compression unless you study signal processing as a major or are willing to study all the mathematical formulae by yourself.
*



I second that.

A suggested reading for a solid start:
Introduction to digital audio coding and standards. Marina Bosi.

A somewhat good book if you don't intend to see equations:
The MPEG handbook, 2nd edition. Watkinson.

If you're more into psychoacoustics:
Psychoacoustics, facts and models. Zwicker and Fastl.

A subscription to AES and IEEE is almost a must.
rutra80
That's what I needed, thank you robert, ssamadhi97, and legg :)
Woodinville
For a summary of perceptual coding, there's Jayant, Johnston, and Safranek, from a while ago.

Good, tutorial paper.

HotshotGG
QUOTE
1. MP3 encoding: a step-by-step description of the path a PCM audio stream must take to become an MP3. I need quite detailed (but not too detailed! ;) ) information about everything involved, like Huffman coding, quantization, M/S matrixing, MDCT, filter banks, noise shaping, masking, the psychoacoustic model, the bit reservoir, ATH, time/frequency domains, etc.

2. Speech encoding (Speex, G.72x, etc.): the technologies used and how they differ from general audio/music encoders like MP3.

3. Lossless/hybrid compression, as used by FLAC, WavPack, OptimFROG, etc.: what technologies are used, how they work, and what makes them better for audio compression than the technologies used by ZIP, RAR, etc.


Math, math, and math! :D If you are doing perceptual codecs I recommend

Psychoacoustics, facts and models; E. Zwicker, H. Fastl; Springer, ISBN 3-540-65063-6

A new edition is coming out sooner or later. For the filterbanks you are going to need some basic background knowledge in DSP, although most of the material here is quite advanced in signal-processing terms. To cover Huffman coding and vector quantization you might want to look into some information theory and rate-distortion theory. Speech coding is usually Code-Excited Linear Prediction (CELP) and similar techniques (most research is focused on this area due to mobile networks, speech recognition, etc.). Lossless is usually source coding theory and linear prediction. If you feel truly enlightened you might want to read a paper by the founding father, or the messiah himself, who started it all: Claude Shannon. He passed away in 2001 :(

http://cm.bell-labs.com/cm/ms/what/shannon...shannon1948.pdf

Out of all of them, digital signal processing baffles me the most. I have a TA who is getting a master's degree in applied mathematics, and she said that one Fourier transform problem took her six pages to complete! Thank god for Cooley-Tukey, in a lesser sense. :o One of my coding problems is a "simpler" version of the Walsh transform.
Garf
QUOTE(rutra80 @ Nov 5 2005, 03:28 AM)
I don't know what an MSc is in your countries, but here in Poland it's more about theory than practice; practice is what the engineer degree is for.
If I were to dig through such formulas (and there surely are a lot of them in audio compression) and through encoder sources, I could just as well write my own encoder, document it, and use it for the thesis. I'm not able to do that and am not expected to be able to.
*



AFAIK in most countries engineers are MSc. I don't see how you are expecting to tackle the theoretical side if you don't want to touch any formulas. All of audio coding is DSP, which is heavy on mathematics. No, you don't always need the in-depth mathematics, but going along without any isn't going to work well either.

That being said, another classic work to look at is

Painter and Spanias, Perceptual Coding of Digital Audio

which has a good general introduction and overview of current and past methods.
Garf
QUOTE(Woodinville @ Nov 6 2005, 04:21 AM)
For a summary of perceptual coding, there's Jayant, Johnston, and Safranek, from a while ago.

Good, tutorial paper.
*



Unfortunately, even Google doesn't know about it. Do you have more specifics?
p0l1m0rph1c
He probably means

Jayant, N. S., Johnston, J. D. and Safranek, R. J., “Signal compression based on models of human perception,” Proc. IEEE, Oct. 1993, pp. 1385-1422
Garf
QUOTE(p0l1m0rph1c @ Nov 6 2005, 05:59 PM)
He probably means

Jayant, N. S., Johnston, J. D. and Safranek, R. J., “Signal compression based on models of human perception,” Proc. IEEE, Oct. 1993, pp. 1385-1422
*



Thanks, found it:

http://ieeexplore.ieee.org/xpl/freeabs_all...arnumber=241504

For the original poster, also see:

http://www.hydrogenaudio.org/forums/index....showtopic=34385
legg
QUOTE(Garf @ Nov 6 2005, 10:07 AM)
QUOTE(Woodinville @ Nov 6 2005, 04:21 AM)
For a summary of perceptual coding, there's Jayant, Johnston, and Safranek, from a while ago.

Good, tutorial paper.
*



Unfortunately, even Google doesn't know about it. Do you have more specifics?
*



He probably meant this one:

Signal compression: coding of speech, audio, text, image and video. I own a copy of it; it is more like a "mid"-depth survey.

http://www.amazon.com/exec/obidos/tg/detai...=books&n=507846
Gabriel
Google accidentally cached it:
http://66.249.93.104/search?q=cache:V_JzF0...eption%22&hl=fr
rutra80
QUOTE(HotshotGG @ Nov 6 2005, 08:43 AM)
For the filterbanks you are going to need some basic background knowledge in DSP, although most of the material here is quite advanced in signal-processing terms. To cover Huffman coding and vector quantization you might want to look into some information theory and rate-distortion theory. Speech coding is usually Code-Excited Linear Prediction (CELP) and similar techniques (most research is focused on this area due to mobile networks, speech recognition, etc.). Lossless is usually source coding theory and linear prediction.
*


Thanks for that Hotshot.
QUOTE(Garf @ Nov 6 2005, 05:05 PM)
AFAIK in most countries engineers are MSc. I don't see how you are expecting to tackle the theoretical side if you don't want to touch any formulas. All of audio coding is DSP, which is heavy on mathematics. No, you don't always need the in-depth mathematics, but going along without any isn't going to work well either.
*


Here you can be either an engineer, an MSc, or an MSc engineer (and of course there are things like prof, doc, or the bachelor's, which I already have). Now I'm going for the MSc only (not MSc engineer).
Maybe I was a bit too humble in the original post. I'm not a total newbie with signal processing. In fact I studied "computer science & econometrics" with a specialisation in "computer systems & networks" (instead of "multimedia", which now after 5 years I know I should have chosen), but what has fascinated me for more than 10 years, and what I play with at home, is stuff connected with audio processing (that's why I'm an active member of this forum, after all). So I know this and that about the things mentioned in the original post, but there's a difference (and a gap) between knowing what an engine consists of and what its parts are for, and knowing all the physical and chemical laws and formulas behind it. It's all about getting a bit deeper into that... but not too deep. You devs have knowledge worthy of a good doctorate, so don't expect too much from a citizen who just wants to get his MSc. Long story short: no more destructive posts like "you are trying something you shouldn't do", please [:
QUOTE(Gabriel @ Nov 6 2005, 07:57 PM)

Is a free original version available online (in PDF, PS, or something)?

Thanks again to everyone who posted something constructive :)
ssamadhi97
Random hints: try Tony Robinson's paper on Shorten for the basics of lossless and lossy compression; try the FLAC and Monkey's Audio homepages for more general, less theoretical explanations and methods to improve compression (all off the top of my head, so there might be better resources. Do some research! ;) ).

QUOTE(rutra80 @ Nov 6 2005, 11:56 PM)
Is a free original version available online (in PDF, PS, or something)?
*


Highly doubtful. Try getting it via your library, for example. And sign me up for a copy -- errr...
HotshotGG
QUOTE
"computer science & econometrics" with specialisation of "computer systems & networks" (instead of "multimedia" which now after 5 years I know I should choose instead), but what fascinates me for more than 10 years and what I am playing with at home, is stuff connected with audio processing


At my school you can major in (or get a B.S. in) Computer Science or Electrical Engineering and minor in a separate area called Sound Recording Technology. It's a cross-disciplinary field. The regular SRT major leads to a Bachelor of Music degree though, and is unfortunately more music oriented. The SRT department has about ten different courses, and one of them is Psychoacoustics and Physical Acoustics. I am taking a class in C programming this semester myself just for spite, although I don't really want to be a CS major that much. I have seen enough coding problems to last me a lifetime ;-D. It's a great thing to learn about as a hobby in your spare time though, IMO. "Multimedia" is usually the "creative" side. It's a little too "artsy" for me; graphic designers and web developers love that stuff. CS is a great, diverse field, but it's a little cut and dried for most people. Audio signal processing is mostly engineering related. I don't know what you have in mind, but most of the focus is on different types of algorithmic implementations, like a circular buffer for delay, convolution for reverberation, etc.
rutra80
Exactly what kind of dB is used in most audio processing apps, where the loudest = 0dB and quieter = negative dB values? What's the anchor?

EDIT: Never mind, I already figured out that the anchor = the maximum amplitude which can be handled before clipping occurs :rolleyes:
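
A minimal sketch of that convention, which is usually written dBFS (decibels relative to full scale). It assumes floating-point samples whose full scale is 1.0; the function name `dbfs` is made up for illustration and is not taken from any particular audio application.

```python
import math

def dbfs(amplitude, full_scale=1.0):
    """Level of a peak amplitude relative to full scale, in dB."""
    if amplitude <= 0:
        return float("-inf")  # digital silence has no finite level
    return 20 * math.log10(amplitude / full_scale)

print(dbfs(1.0))   # full scale maps to 0 dB
print(dbfs(0.5))   # half the amplitude is roughly 6 dB lower
print(dbfs(0.1))   # a tenth of the amplitude is roughly 20 dB lower
```

Anything quieter than full scale comes out negative, which is why meters in editors read 0 dB at the top and negative values below it.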
Mike Giacomelli
QUOTE(rutra80 @ Nov 20 2005, 09:08 PM)
Exactly what kind of dB is used in most audio processing apps, where the loudest = 0dB and quieter = negative dB values? What's the anchor?

EDIT: Never mind, I already figured out that the anchor is the maximum amplitude which can be handled before clipping occurs :rolleyes:
*



The anchor is the maximum intensity (0dB).

Edit: I really should read your edits before I post.
rutra80
Heh, thanks anyway :)
rutra80
More about dB...

Let's say I have 3 sounds of the same frequency:

A - quiet (an anchor)
B - medium
C - loud

B is perceived 2 times louder than A if B has 10 times more sound intensity than A.
B is perceived 2 times louder than A if B has ~3 times more sound pressure than A.

C is perceived 2 times louder than B if C has 10 times more sound intensity than B.
C is perceived 2 times louder than B if C has ~3 times more sound pressure than B.

C is perceived 4 times louder than A if C has 100 times more sound intensity than A.
C is perceived 4 times louder than A if C has 10 times more sound pressure than A.

Would that be correct?
If so, is it safe to say that our ears are more sensitive to slight changes of sound pressure when sound pressure is low than when it is high? I mean, the change of sound pressure from 21µPa to 22µPa will be more noticeable than the change from 2221µPa to 2222µPa, right?
And is the amplitude of an acoustic wave linearly proportional to sound pressure?
smack
At the risk of bothering you with obvious things: the articles at Wikipedia about sound measurement seem to be quite good. You can start at sound pressure - the other articles are linked there.

Be careful not to mix up things like "sound pressure" (Pa) and "sound pressure level" (dB), "sound intensity" (W/m^2) and "sound intensity level" (dB) etc.
jmvalin
QUOTE(rutra80 @ Nov 22 2005, 02:07 PM)
Let's say I have 3 sounds of the same frequency:


If you only have one frequency, it's a sinusoid and boring to study in the first place.

QUOTE(rutra80 @ Nov 22 2005, 02:07 PM)
B is perceived 2 times louder than A if B has 10 times more sound intensity than A.
B is perceived 2 times louder than A if B has ~3 times more sound pressure than A.


There's no such thing as "perceived 2 times louder". How do you quantify that? Just in case you were even remotely tempted to extrapolate: no, you definitely can't say "twice as loud" when you double the value in dB.

QUOTE(rutra80 @ Nov 22 2005, 02:07 PM)
Would that be correct?


Definitely not.

QUOTE(rutra80 @ Nov 22 2005, 02:07 PM)
If so, is it safe to say that our ears are more sensitive to slight changes of sound pressure when sound pressure is low than when it is high? I mean, the change of sound pressure from 21µPa to 22µPa will be more noticeable than the change from 2221µPa to 2222µPa, right?


That's mostly right.

QUOTE(rutra80 @ Nov 22 2005, 02:07 PM)
And is the amplitude of an acoustic wave linearly proportional to sound pressure?
*



The amplitude *is* the reading of the sound pressure. At least with an omni-directional mic.
rutra80
QUOTE(smack @ Nov 22 2005, 11:40 AM)
At the risk of bothering you with obvious things: the articles at Wikipedia about sound measurement seem to be quite good. You can start at sound pressure - the other articles are linked there.

Be careful not to mix up things like "sound pressure" (Pa) and "sound pressure level" (dB), "sound intensity" (W/m^2) and "sound intensity level" (dB) etc.
*


I already did read all of them. Thanks anyway.
QUOTE(jmvalin @ Nov 22 2005, 11:53 AM)
QUOTE(rutra80 @ Nov 22 2005, 02:07 PM)
B is perceived 2 times louder than A if B has 10 times more sound intensity than A.
B is perceived 2 times louder than A if B has ~3 times more sound pressure than A.

There's no such thing as "perceived 2 times louder". How do you quantify that? Just in case you were even remotely tempted to extrapolate: no, you definitely can't say "twice as loud" when you double the value in dB.

I didn't say anywhere that doubling dB = 2 x louder. If I were to say something like that, I would rather say that raising the level by 10 dB (from 30dB to 40dB, for example) doubles the volume. I asked for confirmation that 10 times higher sound intensity = raising dB by 10, and that ~3 (square root of 10) times higher sound pressure (and amplitude?) = raising dB by 10.

EDIT: OK, I proved on my own that I'm right; here is the graph I made on QuickMath:
[image not preserved]
x = sound intensity (W/m^2) & sound pressure (Pa)
y = dB
green = sound intensity level
red = sound pressure level

For sound intensity level I used: y=10*(log(10,x/(10^-12)))
For sound pressure level I used: y=20*(log(10,x/(2*(10^-5))))

As you can more or less see, sound intensity of 1W/m^2 = 120dB, ten times more (10W/m^2) = 130dB, sound pressure of 2Pa = 100dB, sound pressure of ~6Pa (2*sqrt10) = 110dB.
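
The four values quoted above can be rechecked with a few lines. This is only a sketch of the two level formulas from the post, using the same reference values (10^-12 W/m^2 and 2*10^-5 Pa); the function names are made up for illustration.

```python
import math

I_REF = 1e-12  # W/m^2, reference intensity used in the post
P_REF = 2e-5   # Pa, reference pressure used in the post

def intensity_level(intensity):
    """Sound intensity level in dB."""
    return 10 * math.log10(intensity / I_REF)

def pressure_level(pressure):
    """Sound pressure level in dB."""
    return 20 * math.log10(pressure / P_REF)

for i in (1.0, 10.0):
    print(f"{i} W/m^2 -> {intensity_level(i):.1f} dB")
for p in (2.0, 2.0 * math.sqrt(10)):
    print(f"{p:.3f} Pa -> {pressure_level(p):.1f} dB")
```

The levels come out as 120, 130, 100 and 110 dB, so a tenfold intensity and a sqrt(10)-fold pressure both add 10 dB. Note that this confirms only the dB arithmetic; it says nothing about how much louder the sound is perceived to be.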
rutra80
In MP3, is the bit reservoir more like a long-term buffer or just a temporary cache? Let's say there are some unused bits in the 1st frame: can they be used in frame 666, or only in the very next one?
ErikS
QUOTE(rutra80 @ Nov 29 2005, 05:20 AM)
In MP3, is the bit reservoir more like a long-term buffer or just a temporary cache? Let's say there are some unused bits in the 1st frame: can they be used in frame 666, or only in the very next one?
*


Of course it can't be saved forever. But it's a bit longer than just one frame, so neither of your alternatives is correct.
Omion
QUOTE(ErikS @ Nov 28 2005, 09:35 PM)
Of course it can't be saved forever. But it's a bit longer than just one frame, so neither of your alternatives is correct.
*


Actually, they can be saved forever, sort of...

Say you have a file which is (for example) CBR 128, and every frame uses exactly 128kbps of data. That is, each frame is "full" and the bit reservoir is not used or needed.
Now, you find a way to squeeze out 1 byte from the first frame. Using the bit reservoir, the second frame's data will start on the last byte of the first frame, and the third frame will start on the last byte of the second, etc..
The result of this is that all the audio data will be shifted up by 1 byte. You can "carry" that spare byte arbitrarily far in the file, and therefore use it in the 666th frame as you mentioned.

You can't use that exact byte in the 666th frame, but you can use the savings created by freeing up the byte anywhere after the frame.

QUOTE(rutra80)
In MP3, is the bit reservoir more like a long-term buffer or just a temporary cache? Let's say there are some unused bits in the 1st frame: can they be used in frame 666, or only in the very next one?

The bit reservoir is implemented as a data offset. Each frame says that its data starts from 0 to 511 bytes before the frame's header. In my example above, all frames starting from the 2nd would have an offset of 1. So if there are some unused bits in the first frame, the following frames could be shifted back to fill in the gaps.
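
The "carried byte" idea can be sketched as a toy simulation. Everything here is a simplification for illustration: the per-frame capacity of 100 bytes is a made-up number, `simulate_reservoir` is a hypothetical helper, and a real encoder has further constraints that this model ignores. The only figure taken from the thread is the 511-byte maximum offset.

```python
MAX_OFFSET = 511  # bytes; largest offset a frame can signal (MPEG-1)

def simulate_reservoir(capacity, needs):
    """Return one (offset, stuffing) pair per frame of a toy CBR stream.

    capacity: main-data bytes every frame can hold (hypothetical value)
    needs:    main-data bytes each frame actually wants to spend
    """
    reservoir = 0
    result = []
    for need in needs:
        offset = reservoir                # data starts this many bytes early
        available = reservoir + capacity
        if need > available:
            raise ValueError("frame does not fit; it must spend fewer bytes")
        leftover = available - need
        reservoir = min(leftover, MAX_OFFSET)
        stuffing = leftover - reservoir   # bytes that cannot be carried over
        result.append((offset, stuffing))
    return result

# Frame 0 saves one byte, frames 1..4 are exactly full,
# and frame 5 finally spends the saved byte.
print(simulate_reservoir(100, [99, 100, 100, 100, 100, 101]))
```

Every frame after the first reports an offset of 1, which is the single spare byte being passed along until some later frame uses it.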

[rant] This all stems from the fact that MP3 only has 14 frame sizes to choose from, probably to make CBR easier. But true CBR really stinks so the bit reservoir was added. Vorbis can use any size frame, and therefore doesn't use a bit reservoir. This is why CBR Vorbis is worse than CBR MP3; the MP3 is more like ABR thanks to the bit reservoir. [/rant]

PS. Even though it's called a bit reservoir, it actually counts bytes.

[edit] Added quote
rutra80
Brilliant explanation.

Just one question - since it's byte aligned, then:
- when savings are less than 8 bits, they will be wasted
- when savings are more than 4096 bits, they will be wasted (4196 bits of savings = 100 bits of waste)
- when savings are not i*8 bits, they will be wasted (12bits of savings = 4 bits of waste)

Is that correct? Or maybe in such cases the bits which I wrote that will be wasted, will in fact be used to encode the frame which was supposed to save them?

And is that offset information (I guess it takes 9 bits) kept in the frame sync fields?
Omion
QUOTE
- when savings are less than 8 bits, they will be wasted

Yup. If you're 7 bits shy of filling up the frame, those bits are wasted.

QUOTE
- when savings are more than 4096 bits, they will be wasted (4196 bits of savings = 100 bits of waste)

Exactly. However, if you save enough to go to the next smaller frame size (between 27 and 210 bytes, depending on frame size), then you can do that instead of adding to the reservoir.

For example, you could put 68 bytes of frame data in a 40kbps frame and add 26 bytes to the reservoir, or you could use a 32kbps frame and keep the reservoir the same. In general it's best to use the smallest frame size which can contain all the data, to avoid bit reservoir overflow. (this is what my MP3 Repacker does)
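
The 68-byte example works out exactly if one assumes 44.1 kHz stereo MPEG-1 without padding or CRC (an assumption; the post does not state the sample rate). The sketch below uses the frame-size relation discussed later in this thread plus the 4-byte header and 32-byte side info mentioned in this post; the helper names are hypothetical and this is not the actual MP3 Repacker code.

```python
HEADER = 4      # bytes
SIDE_INFO = 32  # bytes, stereo MPEG-1
BITRATES = (32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320)

def frame_bytes(bitrate_kbps, sample_rate=44100):
    """Whole frame size in bytes, without the optional padding byte."""
    return 144 * bitrate_kbps * 1000 // sample_rate

def main_data_capacity(bitrate_kbps, sample_rate=44100):
    """Bytes left for audio data after header and side info."""
    return frame_bytes(bitrate_kbps, sample_rate) - HEADER - SIDE_INFO

def smallest_bitrate_for(data_bytes, sample_rate=44100):
    """Smallest bitrate whose frame can hold data_bytes of main data."""
    for kbps in BITRATES:
        if main_data_capacity(kbps, sample_rate) >= data_bytes:
            return kbps
    raise ValueError("data does not fit in any frame size")

print(main_data_capacity(40))    # room in a 40 kbps frame
print(main_data_capacity(32))    # room in a 32 kbps frame
print(smallest_bitrate_for(68))  # frame size a repacker would pick
```

Under these assumptions a 40 kbps frame has room for 94 bytes and a 32 kbps frame for 68, so 68 bytes of data either leave 26 bytes for the reservoir or fit exactly into the smaller frame.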

QUOTE
- when savings are not i*8 bits, they will be wasted (12bits of savings = 4 bits of waste)

It would probably be better to say that if the amount of data in the frame is not i*8 bits then the last bits will be wasted. If there are 9 bits of data then they will need to be padded to 16, wasting 7 bits.

QUOTE
Or maybe in such cases the bits which I wrote that will be wasted, will in fact be used to encode the frame that was supposed to save them?

I guess you could re-encode the frame and try to fill up the last few bits, but I don't think anything does this. Trying to get an exact amount of compressed data is difficult.

QUOTE
And is that offset information (I guess it takes 9 bits) kept in the frame sync fields?

It's actually stored in the 32-byte "side info" of each frame. The side info comes right after the 4-byte header. (And yes, it does take 9 bits for MPEG-1)

NOTE: Most of these numbers are for stereo MPEG-1 files. Mono MPEG-1 and stereo MPEG-2 use 17 bytes for the side info, and mono MPEG-2 uses 9. Also, the bit reservoir is limited to 255 for MPEG-2 (8-bit field). But for the majority of MP3s in existence the numbers hold.


[Ruminating] Actually, it should be possible to overlap the frames' data. That is, have the bit reservoir point to the last data bits of the previous frame, assuming they're the same. By my calculations that would save on average 1 bit per frame. That's 0.038 kbps! Hmm... It's probably not worth it. It would only save 1KB on a 4 minute song. [/Ruminating]
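
The back-of-the-envelope figures above can be reproduced as follows, assuming a 44.1 kHz file (the sample rate is an assumption; the post does not state it) and 1152 samples per frame.

```python
SAMPLE_RATE = 44100        # Hz, assumed
SAMPLES_PER_FRAME = 1152   # per channel, MPEG-1 Layer III
SONG_SECONDS = 4 * 60

frames_per_second = SAMPLE_RATE / SAMPLES_PER_FRAME
saved_kbps = frames_per_second * 1 / 1000              # 1 bit saved per frame
saved_bytes = frames_per_second * SONG_SECONDS * 1 / 8

print(round(saved_kbps, 3))   # 0.038
print(round(saved_bytes))     # 1148
```

That is about 38 frames per second, so one bit per frame is 0.038 kbps, or a little over 1 KB across four minutes.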
SebastianG
QUOTE(Omion @ Dec 2 2005, 12:17 PM)
QUOTE
- when savings are more than 4096 bits, they will be wasted (4196 bits of savings = 100 bits of waste)

Exactly. However, if you save enough to go to the next smaller frame size (between 27 and 210 bytes, depending on frame size), then you can do that instead of adding to the reservoir.
*



Actually the bit reservoir size is 4088 bits since 511 (111111111b) is the highest possible bit reservoir pointer.

Sebi
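
The corrected limit follows directly from the width of the pointer field. A small sketch, with a made-up helper name:

```python
def reservoir_limit_bits(pointer_bits):
    """Largest reservoir, in bits, addressable by a byte pointer of this width."""
    max_pointer_bytes = 2 ** pointer_bits - 1
    return max_pointer_bytes * 8

print(reservoir_limit_bits(9))         # MPEG-1, 9-bit field: 4088
print(reservoir_limit_bits(8))         # MPEG-2, 8-bit field: 2040
print(4196 - reservoir_limit_bits(9))  # waste in the 4196-bit example: 108
```

So in the earlier example, 4196 bits of savings would waste 108 bits rather than 100.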
rutra80
More questions regarding MP3:

1. I read that audio is processed in frames of 1152 PCM samples (also, at decoding, every frame is decoded to 1152 PCM samples). Do these 1152 samples consist of 576 samples from the left channel and 576 from the right one, or 1152 samples from left and 1152 from right (giving a total of 2304 samples)?
Also, are these values constant, independent of the source file's sample rate (the same for 32kHz, 44.1kHz, and 48kHz)?

2. The polyphase filter bank takes these 1152 samples and splits them into 32 subbands. I read that it results in 36 time-domain (PCM?) samples per subband. How? Shouldn't it still be 1152 samples per subband? After all, when I apply a band-pass filter to some audio stream, I don't get a shorter audio stream, but an audio stream of the same size with less frequency content. If the polyphase filterbank transformed samples into the frequency domain I could understand it, but since they're said to still be in the time domain I don't get it.

3. Time-domain samples are transformed into the frequency domain by the MDCT. I understand it as taking a number of samples and making a kind of spectrum plot out of them, so on the X axis there's frequency and on the Y axis level. But then there are long & short blocks: short blocks have higher time resolution and lower frequency resolution, and long blocks the other way around. Again, how? Since the MDCT takes a fixed number of samples, and there are only frequency & level axes (no time), how can there be higher or lower time resolution?
Maybe the MDCT isn't like a spectrum plot, but more like a spectrogram? So there's time on the X axis, frequency on the Y axis, and a kind of colour notation for level? Then I could explain to myself that short blocks = higher resolution of the time axis & lower resolution of the frequency axis, and long blocks = lower resolution of the time axis & higher resolution of the frequency axis. But then it's not quite the frequency domain...
And one more thing: I read that the MDCT is lossless. It takes "n" time-domain samples and outputs "n/2" frequency coefficients, but it can be reversed without quality loss. Then why is there a need for long/short blocks if it is supposed to be lossless no matter what (and why is there no lossless codec which uses the MDCT to guarantee a 50% compression ratio)?

4. There's a relation that:
FrameSize [bytes] = 144 * BitRate [bps] / SampleRate [Hz] (+ padding)
It seems to be correct, but I can't figure out where the 144 comes from.
And how do I find the frame sizes in VBR (the bitrate is not constant then)?

5. Why is M/S coding lossy in MP3 (I read that in Ogg Vorbis it's lossless)?

As you can see I'm quite lost. Can someone shed some light on these things, please?
smack
QUOTE(rutra80 @ Dec 5 2005, 01:53 PM)
1. I read that audio is processed by frames of 1152 PCM samples [...]

MPEG-1 Layer3 frames contain 1152 samples for each channel. This is true for all three supported sampling rates.

QUOTE
2. The polyphase filter bank takes these 1152 samples and splits them into 32 subbands. I read that it results in 36 time-domain (PCM?) samples per subband. How? [...]

This is not correct. MPEG-1 Layer3 frames consist of 2 granules, each granule contains 576 samples. Processing is done for the granules, not the entire frame! The polyphase filterbank processes 32 subbands, so each subband contains 18 samples. The filterbank is not a simple stack of bandpasses! It is a Polyphase quadrature filter so the subbands are subsampled by a factor of 32 (number of bands).
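
The sample bookkeeping can be written out in a few lines. The constants are the ones given in this thread; the point of the sketch is only that subsampling each band by 32 keeps the total number of values unchanged.

```python
SAMPLES_PER_FRAME = 1152  # per channel
GRANULES_PER_FRAME = 2
SUBBANDS = 32

samples_per_granule = SAMPLES_PER_FRAME // GRANULES_PER_FRAME  # 576
per_subband_per_granule = samples_per_granule // SUBBANDS      # 18
per_subband_per_frame = SAMPLES_PER_FRAME // SUBBANDS          # 36

# Critical sampling: 32 bands x 18 values is again 576 values per granule.
assert SUBBANDS * per_subband_per_granule == samples_per_granule

print(samples_per_granule, per_subband_per_granule, per_subband_per_frame)
```

Read this way, the "36 samples per subband" figure counts a whole frame, while 18 is the count for one granule.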

QUOTE
3. Time-domain samples are transformed into frequency-domain by MDCT, I understand it as it takes a number of samples and makes a kind of spectrum plot out of them, so on X axis there's frequency and on Y level. But then there are long & short blocks - short blocks have higher time resolution and lower frequency resolution, and long blocks the other way around. Again - how?

Long and short blocks use different MDCTs, 36-tap and 3 x 12-tap respectively. The three consecutive MDCTs of short blocks have a higher resolution in time but lower resolution in frequency domain.

QUOTE
And one more thing, I read that MDCT is lossless - it takes "n" time-domain samples and outputs "n/2" frequency coefficients, but it can be reversed without quality loss. Then why there's a need for long/short blocks if it is supposed to be lossless no matter what (and why there's no lossless codec which would use MDCT to guarantee 50% compression ratio)?

Please read up on the MDCT. The consecutive blocks of data have to overlap by 50%.
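
To make the overlap point concrete, here is a sketch of a generic textbook MDCT with a sine window and 50% overlap-add. It is not the MP3 hybrid filterbank (which applies the MDCT to subband samples and adds its own window shapes and alias reduction); the function names and the test signal are made up for illustration. Each block reads 2N samples but advances by only N, so N new samples yield N coefficients and there is no built-in 50% saving.

```python
import math

def sine_window(length):
    return [math.sin(math.pi * (t + 0.5) / length) for t in range(length)]

def mdct(block):
    """2N windowed samples -> N coefficients."""
    n = len(block) // 2
    return [sum(block[t] * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
                for t in range(2 * n))
            for k in range(n)]

def imdct(coeffs):
    """N coefficients -> 2N time-aliased samples."""
    n = len(coeffs)
    return [2.0 / n * sum(coeffs[k] * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
                          for k in range(n))
            for t in range(2 * n)]

def roundtrip(signal, n):
    """Analyse and resynthesise with block length 2n and hop n."""
    window = sine_window(2 * n)
    out = [0.0] * len(signal)
    coefficient_count = 0
    for start in range(0, len(signal) - 2 * n + 1, n):
        block = [signal[start + t] * window[t] for t in range(2 * n)]
        coeffs = mdct(block)
        coefficient_count += len(coeffs)
        aliased = imdct(coeffs)
        for t in range(2 * n):
            out[start + t] += aliased[t] * window[t]  # overlap-add
    return out, coefficient_count

signal = [math.sin(0.3 * i) + 0.5 * math.cos(1.1 * i) for i in range(72)]
for n in (18, 6):  # long block: 36 in, 18 out; short block: 12 in, 6 out
    out, count = roundtrip(signal, n)
    # Samples covered by two overlapping blocks are recovered; the first
    # and last n samples are covered only once and stay aliased.
    error = max(abs(out[i] - signal[i]) for i in range(n, len(signal) - n))
    print(n, count, error)
```

A single block on its own cannot be inverted; the aliasing cancels only when neighbouring blocks are added together. Short blocks gain time resolution simply because each one spans fewer samples.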

QUOTE
4. There's a relation that:
FrameSize [bytes] = 144 * BitRate [bps] / SampleRate [Hz] (+ padding)
It seems to be correct but I can't figure out where's 144 from?

Well, this is magic. :D Um, no...
I'm too lazy at the moment to derive that. Please help us, SebastianG. ;)

QUOTE
5. Why M/S coding is lossy in MP3 (I read that in OggVorbis it's lossless)?

M/S coding is lossless. See this thread for the formulae.
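
Since the linked thread is not reproduced here, the following is only a sketch of the commonly cited orthogonal form, M = (L+R)/sqrt(2) and S = (L-R)/sqrt(2); treat the exact scaling as an assumption. It shows that the matrixing step itself can be undone, so any loss comes from quantizing M and S afterwards.

```python
import math

SQRT2 = math.sqrt(2.0)

def ms_encode(left, right):
    """Left/right -> mid/side."""
    return (left + right) / SQRT2, (left - right) / SQRT2

def ms_decode(mid, side):
    """Mid/side -> left/right; the same matrix applied again."""
    return (mid + side) / SQRT2, (mid - side) / SQRT2

left, right = 0.25, -0.75
mid, side = ms_encode(left, right)
left2, right2 = ms_decode(mid, side)
print(abs(left2 - left) < 1e-12, abs(right2 - right) < 1e-12)
```

With floating-point values the round trip is exact up to rounding error.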
robert
QUOTE(smack @ Dec 5 2005, 02:52 PM)
QUOTE
4. There's a relation that:
FrameSize [bytes] = 144 * BitRate [bps] / SampleRate [Hz] (+ padding)
It seems to be correct but I can't figure out where's 144 from?

Well, this is magic. :D Um, no...
I'm too lazy at the moment to derive that. Please help us, SebastianG. ;)

well, 144 * 8 = 1152 samples per channel in frame
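
Spelled out as a short derivation (a sketch that follows from the numbers in this thread): a frame always holds 1152 samples per channel, so it lasts 1152 / SampleRate seconds; multiplying by the bitrate gives bits, and dividing by 8 gives bytes.

```latex
\text{FrameSize [bytes]}
  = \frac{1152\ \text{samples}}{\text{SampleRate [Hz]}}
    \cdot \frac{\text{BitRate [bps]}}{8\ \text{bits/byte}}
  = 144 \cdot \frac{\text{BitRate [bps]}}{\text{SampleRate [Hz]}}
```

For example, 128000 bps at 44100 Hz gives 144 * 128000 / 44100 = 417.96, so frames are 417 bytes, or 418 with the padding byte. For VBR the same relation can be applied frame by frame, using the bitrate stated in each frame's own header.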
Garf
QUOTE(smack @ Dec 5 2005, 03:52 PM)
QUOTE(rutra80 @ Dec 5 2005, 01:53 PM)

And one more thing, I read that MDCT is lossless - it takes "n" time-domain samples and outputs "n/2" frequency coefficients, but it can be reversed without quality loss. Then why there's a need for long/short blocks if it is supposed to be lossless no matter what (and why there's no lossless codec which would use MDCT to guarantee 50% compression ratio)?

Please read up on the MDCT. The consecutive blocks of data have to overlap by 50%.
*



Hint: There's also the issue of what you are actually putting in the MDCT and what you are getting out...
rutra80
QUOTE(smack @ Dec 5 2005, 03:52 PM)
MPEG-1 Layer3 frames contain 1152 samples for each channel.

Please pick a number:
1. When one frame gets decoded then you get 1152 samples for 1st channel and another 1152 samples for 2nd channel, so you get a total of 2304 samples.
2. When one frame gets decoded then you get 1152 samples for only one channel, 1152 samples for another channel are in another frame.
QUOTE
MPEG-1 Layer3 frames consist of 2 granules, each granule contains 576 samples. Processing is done for the granules, not the entire frame!

Does number of granules have anything to do with number of channels?
QUOTE(robert @ Dec 5 2005, 04:24 PM)
well, 144 * 8 = 1152 samples per channel in frame
*


Thanks, hopefully I'll figure out where the 8 comes from once I get some sleep...
QUOTE(Garf @ Dec 5 2005, 07:18 PM)
Hint: There's also the issue of what you are actually putting in the MDCT and what you are getting out...
*


I input "consecutive blocks of a larger dataset, where subsequent blocks are overlapped", so I input "output of a 32-band polyphase quadrature filter", so I input "real numbers". It outputs "real numbers", half as many as went in. So far I know that, but it doesn't explain much. Are the samples that come out of the polyphase filter still PCM samples?
I guess I input an array of samples, every next one being further in time, but what do I get out? An array of levels, every next one being for higher frequency?

At the moment that's how I think it is:
I have following samples from polyphase filter: 1 2 3 4 5 6

Long block:
[1 2 3 4] -> MDCT -> [f(1 2 3 4)]
[3 4 5 6] -> MDCT -> [f(3 4 5 6)]

Short block:
[1 2] -> MDCT -> [f(1 2)]
[2 3] -> MDCT -> [f(2 3)]
[3 4] -> MDCT -> [f(3 4)]
[4 5] -> MDCT -> [f(4 5)]
[5 6] -> MDCT -> [f(5 6)]

Does it more or less hold water?
robert
QUOTE(rutra80 @ Dec 6 2005, 12:36 AM)
QUOTE(robert @ Dec 5 2005, 04:24 PM)
well, 144 * 8 = 1152 samples per channel in frame
*


Thanks, hopefully I'll figure out where's 8 from when I'll get some sleep...

1152 * BitRate [bps] / SampleRate [Hz] (+ padding) = frame size in bits. If you want to know the frame size in bytes, you'll have to divide by 8 (if your machine has 8 bits per byte).
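
Spelled out as code, the derivation looks like the sketch below. The function name frame_size_bytes is hypothetical, and it assumes MPEG-1 Layer 3, where the padding slot is one byte and the result is rounded down.

```python
SAMPLES_PER_FRAME = 1152   # MPEG-1 Layer 3
BITS_PER_BYTE = 8

def frame_size_bytes(bitrate_bps, sample_rate_hz, padding=0):
    """Frame size in bytes; padding is 0 or 1 (one extra byte)."""
    # 1152 / 8 = 144, which is where the constant in the formula comes from.
    constant = SAMPLES_PER_FRAME // BITS_PER_BYTE
    return constant * bitrate_bps // sample_rate_hz + padding

print(frame_size_bytes(128000, 44100))      # unpadded frame at 128 kbps, 44.1 kHz
print(frame_size_bytes(128000, 44100, 1))   # same frame with the padding byte
```

At 44.1 kHz the division does not come out even, which is why the padding byte exists: some frames get one extra byte so that the average bitrate stays on target.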
robert
QUOTE(rutra80 @ Dec 6 2005, 12:36 AM)
QUOTE(smack @ Dec 5 2005, 03:52 PM)
MPEG-1 Layer3 frames contain 1152 samples for each channel.

Please pick a number:
1. When one frame gets decoded then you get 1152 samples for 1st channel and another 1152 samples for 2nd channel, so you get a total of 2304 samples.
2. When one frame gets decoded then you get 1152 samples for only one channel, 1152 samples for another channel are in another frame.

A frame decodes to 1152 samples.
A sample of a stereo frame consists of two values, one for the left channel and one for the right channel; it's a tuple. A sample of a 5.1 MPEG-2 MC frame consists of six values.
Woodinville
QUOTE(Omion @ Dec 2 2005, 01:14 AM)
But true CBR really stinks so the bit reservoir was added. [edit] Added quote
*



The bit reservoir was in the original ASPEC design, and the only issues around introducing it were that it made Layer 3 even more competitive with Layer 2.
rutra80
QUOTE(robert @ Dec 6 2005, 02:25 AM)
1152 * BitRate [bps] / SampleRate [Hz] (+ padding) = frame size in bits. If you want to know the frame size in bytes, you'll have to divide by 8 (if your machine has 8 bits per byte).
*


/me slaps his forehead.
Actually the first thing I thought was that it stands for bits, but then I confused myself by thinking "samples don't need to be 8-bit". I really need to get some sleep...
QUOTE(robert @ Dec 6 2005, 02:33 AM)
A frame decodes to 1152 samples.
A sample of a stereo frame consists of two values, one for the left channel and one for the right channel; it's a tuple. A sample of a 5.1 MPEG-2 MC frame consists of six values.
*


Got it. Thank you robert.
smack
QUOTE(rutra80 @ Dec 6 2005, 12:36 AM)
QUOTE
MPEG-1 Layer3 frames consist of 2 granules, each granule contains 576 samples. Processing is done for the granules, not the entire frame!

Does number of granules have anything to do with number of channels?

No, the "2" is a fixed number which applies to all MPEG-1 Layer3 streams (mono and stereo).
In MPEG-2 Layer3 LSF (low sampling frequency - 16/22.05/24 kHz) each frame consists of only one granule - that's 576 samples per frame.
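
As a quick sketch of that rule (the function name and the error handling are made up for illustration, and only the sample rates mentioned in this thread are covered):

```python
MPEG1_RATES_HZ = (32000, 44100, 48000)
LSF_RATES_HZ = (16000, 22050, 24000)
GRANULE_SAMPLES = 576

def layer3_frame_layout(sample_rate_hz):
    """Return (granules per frame, samples per frame, frame duration in ms)."""
    if sample_rate_hz in MPEG1_RATES_HZ:
        granules = 2          # fixed for MPEG-1, mono or stereo
    elif sample_rate_hz in LSF_RATES_HZ:
        granules = 1          # MPEG-2 low sampling frequency
    else:
        raise ValueError("sample rate not covered by this sketch")
    samples = granules * GRANULE_SAMPLES
    return granules, samples, 1000.0 * samples / sample_rate_hz

print(layer3_frame_layout(44100))
print(layer3_frame_layout(22050))
```

One consequence, by the same reasoning as the "144" formula: with 576 samples per frame, the constant for LSF frames would be 576 / 8 = 72.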

QUOTE(rutra80 @ Dec 6 2005, 12:36 AM)
QUOTE(Garf @ Dec 5 2005, 07:18 PM)
Hint: There's also the issue of what you are actually putting in the MDCT and what you are getting out...

I input "consecutive blocks of a larger dataset, where subsequent blocks are overlapped", so I input "output of a 32-band polyphase quadrature filter", so I input "real numbers". It outputs "real numbers", half as many as went in. So far I know that, but it doesn't explain much. Are the samples that come out of the polyphase filter still PCM samples?

The PQF (analysis) output consists of 32 sub-bands, each containing 18 PCM samples. You can think of the 18 samples in the sub-bands being "stretched" over the duration of the granule (576 samples). That's what "subsampled by a factor of N=32" means.

MDCT has twice as many input samples (time domain) as output samples (frequency domain). The first half of the input comes from the previous block and the second half from the current block of data. That's what "overlapping transform" means.

example - mp3 long blocks:
36-tap MDCT, block size 18 samples
input: previous+current block - 36 samples in time domain
output: 18 samples in frequency domain
The previous block is actually from the mp3 granule before the current one.
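
The overlap is also why the MDCT gives no "free" 50% compression. A minimal Python sketch, assuming a sine window and a textbook MDCT/IMDCT pair (not the actual mp3 window shapes or any encoder's code; all function names are made up): every block of 36 inputs yields 18 coefficients, but each block only advances by 18 new samples, so the transform outputs exactly as many numbers as it takes in.

```python
import math

def sine_window(N):
    return [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]

def mdct(block):
    """2N windowed time samples -> N coefficients."""
    N = len(block) // 2
    return [sum(block[n] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]

def imdct(coeffs):
    """N coefficients -> 2N time samples that still contain aliasing."""
    N = len(coeffs)
    return [2.0 / N * sum(coeffs[k] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                          for k in range(N))
            for n in range(2 * N)]

def roundtrip(signal, N):
    """Analyse with 50% overlap, then overlap-add the inverse transforms."""
    w = sine_window(N)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * N + 1, N):   # hop of N samples
        block = [signal[start + n] * w[n] for n in range(2 * N)]
        rec = imdct(mdct(block))
        for n in range(2 * N):
            out[start + n] += rec[n] * w[n]
    return out

N = 18  # long-block size from the example above
signal = [math.sin(0.3 * i) + 0.5 * math.cos(1.7 * i) for i in range(5 * N)]
out = roundtrip(signal, N)
covered_twice = max(abs(out[i] - signal[i]) for i in range(N, 4 * N))
covered_once = max(abs(out[i] - signal[i]) for i in range(N))
print(covered_twice, covered_once)
```

Samples covered by two overlapping blocks come back to within floating-point rounding; the first 18 samples, covered by a single block, do not, because the aliasing of one block is only cancelled by its neighbour.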

QUOTE
And how to find out frame's sizes in VBR (BitRate is not constant then)?

Each mp3 frame contains a bitrate indicator in the header. Using the lookup table for bitrates from the MPEG standard and the "144" formula the frame size can be calculated.
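
A rough sketch of that lookup for MPEG-1 Layer 3 only, in Python. The function name is hypothetical and the example header bytes are made up, not taken from a real file. For a VBR file the same calculation is simply repeated for every frame header.

```python
# Bitrate table for MPEG-1 Layer 3, in kbps. Index 0 is "free format" and
# index 15 is not allowed; this sketch rejects both.
BITRATES_KBPS = [None, 32, 40, 48, 56, 64, 80, 96, 112,
                 128, 160, 192, 224, 256, 320, None]
SAMPLE_RATES_HZ = [44100, 48000, 32000, None]   # index 3 is reserved

def mpeg1_layer3_frame_size(header):
    """Frame size in bytes from the 4 header bytes of an MPEG-1 Layer 3 frame."""
    if len(header) < 4 or header[0] != 0xFF or (header[1] & 0xE0) != 0xE0:
        raise ValueError("no frame sync")
    version = (header[1] >> 3) & 0x3
    layer = (header[1] >> 1) & 0x3
    if version != 0b11 or layer != 0b01:
        raise ValueError("not MPEG-1 Layer 3")
    bitrate = BITRATES_KBPS[header[2] >> 4]
    sample_rate = SAMPLE_RATES_HZ[(header[2] >> 2) & 0x3]
    padding = (header[2] >> 1) & 0x1
    if bitrate is None or sample_rate is None:
        raise ValueError("free-format or reserved value")
    return 144 * bitrate * 1000 // sample_rate + padding

print(mpeg1_layer3_frame_size(bytes([0xFF, 0xFB, 0x90, 0x00])))  # 128 kbps, 44.1 kHz
print(mpeg1_layer3_frame_size(bytes([0xFF, 0xFB, 0xE0, 0x00])))  # 320 kbps, 44.1 kHz
```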
Garf
QUOTE(smack @ Dec 5 2005, 03:52 PM)
QUOTE
5. Why is M/S coding lossy in MP3 (I read that in OggVorbis it's lossless)?

M/S coding is lossless. See this thread for the formulae.
*



You could say that in MP3/AAC the input signal to the codec is split mid/side, whereas in Vorbis the output quantized coefficients are mapped via a square polar representation. Because Vorbis is already dealing with quantized data that transformation itself can be lossless. MP3/AAC is in the float area so there is always roundoff error.
rutra80
Yay! I think all is clear now, thank-you very much smack & Garf smile.gif
Woodinville
QUOTE(Garf @ Dec 6 2005, 02:35 AM)
MP3/AAC is in the float area so there is always roundoff error.
*



Odd, you're saying one could not use a "lifting" approach for M/S calculation and decoding?
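
For readers unfamiliar with the term, a lifting-style M/S transform on integers can be sketched like this in Python. This only illustrates the general technique on integer samples; nothing in the thread says that any MP3 or AAC encoder works this way, and the function names are made up.

```python
def ms_encode(left, right):
    """Integer mid/side via lifting steps; every step is exactly invertible."""
    side = left - right
    mid = right + (side >> 1)   # equals floor((left + right) / 2)
    return mid, side

def ms_decode(mid, side):
    """Undo the lifting steps in reverse order."""
    right = mid - (side >> 1)
    left = right + side
    return left, right

# Round-trip check over a small range of integer sample pairs.
lossless = all(ms_decode(*ms_encode(l, r)) == (l, r)
               for l in range(-8, 9) for r in range(-8, 9))
print(lossless)
```

The rounding in `side >> 1` loses nothing because the decoder recomputes the same rounded value from `side` and subtracts it again.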
Paul Sanders
Hi Omion,

I know this thread has been dead for a long time, but I have a question. Well, more of an observation really. It seems to me that the bit reservoir cannot ever stretch back more than one frame. The reason is that the decoder doesn't know where to look for the bits (or, rather, bytes) to be carried forward from the preceding frames if there is more than one of them (frames, that is, not bytes).

With the bit reservoir entirely contained in one frame, the calculation is simple:

CODE
start_of_data_to_be_carried_forward = start_of_frame + frame_size - next_frame->bit_reservoir;


But with two preceding frames, what do you do? Specifically, you don't know (or, at least, cannot easily find out) how much of the immediately preceding frame (let's say there are two, for the sake of argument) is available for reservoir bytes and how many bytes are being used to encode the frame itself (I hope this makes some kind of sense!). A quick glance at the LAME source code (check out maxmp3buf) supports this view, and my own experiments (with files encoded using LAME) bear this out, but I would be interested in your opinion.

<EDIT>
OK, I worked it out! If the bit reservoir stretches back, say, two frames, then *all* of the audio data in the immediately preceding frame is to be considered as part of the bit reservoir. All coded and working now smile.gif

But what a mess the MP3 spec is! Diabolical. They should have thrown MP's 1 and 2 away and started again with a proper framing scheme. And the ISO spec is just plain wrong in this regard - it talks about main_data_end when it really means main_data_begin. Clearly the person who wrote it up did not understand - it simply would not work as specified.
</EDIT>
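
One way to picture this, as a hedged Python sketch rather than anything taken from LAME or the ISO reference code: treat the main-data bytes of all frames (everything after header, CRC and side info) as one continuous stream, and read main_data_begin as a backward byte offset into that stream. The function name and the example numbers are hypothetical.

```python
def main_data_starts(frames):
    """frames: list of (main_data_begin, payload_length) in stream order.

    payload_length counts only the bytes after header, CRC and side info.
    Returns, per frame, (index of the frame holding the first main-data byte,
    byte offset inside that frame's payload).
    """
    starts = []
    payload_offsets = []   # where each frame's payload sits in the joined stream
    total = 0
    for index, (main_data_begin, payload_length) in enumerate(frames):
        payload_offsets.append(total)
        position = total - main_data_begin   # back-pointer, in bytes
        if position < 0:
            raise ValueError("main_data_begin points before the first frame")
        # Walk back to the frame whose payload contains that position.
        owner = index
        while payload_offsets[owner] > position:
            owner -= 1
        starts.append((owner, position - payload_offsets[owner]))
        total += payload_length
    return starts

# Three frames with 100 payload bytes each. The third frame points back
# 140 bytes, so its data starts inside frame 0 and the whole payload of
# frame 1 belongs to the reservoir.
print(main_data_starts([(0, 100), (60, 100), (140, 100)]))
```

Seen this way the decoder never needs to know how the bytes of an intermediate frame are split between that frame's own data and the reservoir; it only needs the running total of payload bytes.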

Incidentally, I found your posts very helpful. Up to now, we have been recording to VBR format with the bit reservoir disabled to make it easy to cut and splice our MP3 files, but this is clearly not optimal and in any case we have to deal with files imported from other sources. We are now in the process of cleaning up our act, thanks, in part, to you.

Regards,

Paul Sanders
http://www.alpinesoft.co.uk