Topic: MSC Thesis about audio compression

MSC Thesis about audio compression

Reply #25
Quote
Let's say I have 3 sounds of the same frequency:


If you only have one frequency, it's a sinusoid and boring to study in the first place.

Quote
B is perceived 2 times louder than A if B has 10 times more sound intensity than A.
B is perceived 2 times louder than A if B has ~3 times more sound pressure than A.


There's no such thing as "perceived 2 times louder". How do you quantify that? Just in case you were even remotely tempted to extrapolate: no, you definitely can't say "twice as loud" when you double the value in dB.

Quote
Would that be correct?


Definitely not.

Quote
If so, is it safe to say that our ears are more sensitive to slight changes of sound pressure when sound pressure is low than to slight changes of sound pressure when sound pressure is high? I mean, the change of sound pressure from 21uPa to 22uPa will be more noticeable than the change from 2221uPa to 2222uPa, right?


That's mostly right.

Quote
And is the amplitude of an acoustic wave linearly proportional to sound pressure?


The amplitude *is* the sound pressure reading, at least with an omnidirectional mic.

MSC Thesis about audio compression

Reply #26
Quote
At the risk of bothering you with obvious things: the articles at Wikipedia about sound measurement seem to be quite good. You can start at sound pressure - the other articles are linked there.

Be careful not to mix up things like "sound pressure" (Pa) and "sound pressure level" (dB), "sound intensity" (W/m^2) and "sound intensity level" (dB) etc.

I already read all of them. Thanks anyway.
Quote
Quote
B is perceived 2 times louder than A if B has 10 times more sound intensity than A.
B is perceived 2 times louder than A if B has ~3 times more sound pressure than A.

There's no such thing as "perceived 2 times louder". How do you quantify that? Just in case you were even remotely tempted to extrapolate. No, you definitely can't "say twice as loud" when you double the value in dB.

I didn't say anywhere that doubling dB = 2x louder. If I were to say something like that, I would rather say that raising the level by 10 dB (from 30 dB to 40 dB, for example) doubles the perceived loudness. I asked for confirmation that 10 times higher sound intensity = raising the level by 10 dB, or ~3 (square root of 10) times higher sound pressure (and amplitude?) = raising the level by 10 dB.

EDIT: Ok, I proved on my own that I'm right. Here is the graph I made on QuickMath (http://www.quickmath.com/):

x = sound intensity (W/m^2) & sound pressure (Pa)
y = dB
green = sound intensity level
red = sound pressure level

For sound intensity level I used:  y=10*(log(10,x/(10^-12)))
For sound pressure level I used: y=20*(log(10,x/(2*(10^-5))))

As you can more or less see, sound intensity of 1W/m^2 = 120dB, ten times more (10W/m^2) = 130dB, sound pressure of 2Pa = 100dB, sound pressure of ~6Pa (2*sqrt10) = 110dB.
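The same check takes a few lines of Python, using the standard reference values quoted earlier in the thread (10^-12 W/m^2 for intensity, 20 uPa for pressure); the function names are mine, just for illustration:

```python
import math

I0 = 1e-12  # reference sound intensity, W/m^2
P0 = 2e-5   # reference sound pressure, Pa (20 uPa)

def intensity_level_db(intensity):
    """Sound intensity level in dB for an intensity in W/m^2."""
    return 10 * math.log10(intensity / I0)

def pressure_level_db(pressure):
    """Sound pressure level in dB for a pressure in Pa."""
    return 20 * math.log10(pressure / P0)

print(intensity_level_db(1.0))               # 120 dB
print(intensity_level_db(10.0))              # 130 dB: 10x intensity = +10 dB
print(pressure_level_db(2.0))                # 100 dB
print(pressure_level_db(2 * math.sqrt(10)))  # 110 dB: ~3.16x pressure = +10 dB
```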

MSC Thesis about audio compression

Reply #27
In MP3, is the bit reservoir more like a long-term buffer or just a temporary cache? Let's say there are some unused bits in the 1st frame - can they be used in frame 666 or only in the very next one?

MSC Thesis about audio compression

Reply #28
Quote
In MP3, is the bit reservoir more like a long-term buffer or just a temporary cache? Let's say there are some unused bits in the 1st frame - can they be used in frame 666 or only in the very next one?

Of course they can't be saved forever. But the reach is a bit longer than just one frame, so neither of your alternatives is correct.

 

MSC Thesis about audio compression

Reply #29
Quote
Of course it can't be saved forever. But it's a bit longer than just one frame, so neither of your alternatives is correct.

Actually, they can be saved forever, sort of...

Say you have a file which is (for example) CBR 128, and every frame uses exactly 128kbps of data. That is, each frame is "full" and the bit reservoir is not used or needed.
Now, you find a way to squeeze out 1 byte from the first frame. Using the bit reservoir, the second frame's data will start on the last byte of the first frame, and the third frame will start on the last byte of the second, etc..
The result of this is that all the audio data will be shifted up by 1 byte. You can "carry" that spare byte arbitrarily far in the file, and therefore use it in the 666th frame as you mentioned.

You can't use that exact byte in the 666th frame, but you can use the savings created by freeing up the byte anywhere after the frame.

Quote
In MP3, is bit-reservoir more like long-term buffer or just a temporary cache? Lets say there are some unused bits in 1st frame - can they be used in frame 666 or only in a very next one?

The bit reservoir is implemented as a data offset. Each frame says that its data starts from 0 to 511 bytes before the frame's header. In my example above, all frames starting from the 2nd would have an offset of 1. So if there are some unused bits in the first frame, the following frames could be shifted back to fill in the gaps.
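A sketch of reading that offset, assuming you already have the raw side-info bytes that follow the 4-byte header (in MPEG-1 the 9-bit main_data_begin field comes first):

```python
def main_data_begin(side_info: bytes) -> int:
    """First 9 bits of MPEG-1 Layer III side info: how many bytes *before*
    this frame's header the frame's main data starts (0 = no reservoir use)."""
    return (side_info[0] << 1) | (side_info[1] >> 7)

main_data_begin(bytes([0x00, 0x00]))  # 0: data starts right after the side info
main_data_begin(bytes([0xFF, 0x80]))  # 511: the maximum reach of the reservoir
```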

[rant] This all stems from the fact that MP3 only has 14 frame sizes to choose from, probably to make CBR easier. But true CBR really stinks, so the bit reservoir was added. Vorbis can use any frame size, and therefore doesn't use a bit reservoir. This is why CBR Vorbis is worse than CBR MP3; the MP3 is more like ABR thanks to the bit reservoir. [/rant]

PS. Even though it's called a bit reservoir, it actually counts bytes.

[edit] Added quote
"We demand rigidly defined areas of doubt and uncertainty!" - Vroomfondel, H2G2

MSC Thesis about audio compression

Reply #30
Brilliant explanation.

Just one question - since it's byte aligned, then:
- when savings are less than 8 bits, they will be wasted
- when savings are more than 4096 bits, they will be wasted (4196 bits of savings = 100 bits of waste)
- when savings are not i*8 bits, they will be wasted (12bits of savings = 4 bits of waste)

Is that correct? Or maybe in such cases the bits I said would be wasted will in fact be used to encode the frame that was supposed to save them?

And is that offset information (I guess it takes 9 bits) kept in the frame sync fields?

MSC Thesis about audio compression

Reply #31
Quote
- when savings are less than 8 bits, they will be wasted

Yup. If you're 7 bits shy of filling up the frame, those bits are wasted.

Quote
- when savings are more than 4096 bits, they will be wasted (4196 bits of savings = 100 bits of waste)

Exactly. However, if you save enough to go to the next smaller frame size (between 27 and 210 bytes, depending on frame size), then you can do that instead of adding to the reservoir.

For example, you could put 68 bytes of frame data in a 40kbps frame and add 26 bytes to the reservoir, or you could use a 32kbps frame and keep the reservoir the same. In general it's best to use the smallest frame size which can contain all the data, to avoid bit reservoir overflow. (this is what my MP3 Repacker does)

Quote
- when savings are not i*8 bits, they will be wasted (12bits of savings = 4 bits of waste)

It would probably be better to say that if the amount of data in the frame is not i*8 bits then the last bits will be wasted. If there are 9 bits of data they will need to be padded to 16, wasting 7 bits.

Quote
Or maybe in such cases the bits I said would be wasted will in fact be used to encode the frame that was supposed to save them?

I guess you could re-encode the frame and try to fill up the last few bits, but I don't think anything does this. Trying to get an exact amount of compressed data is difficult.

Quote
And is that offset information (I guess it takes 9 bits) kept in the frame sync fields?

It's actually stored in the 32-byte "side info" of each frame. The side info comes right after the 4-byte header. (And yes, it does take 9 bits for MPEG-1)

NOTE: Most of these numbers are for stereo MPEG-1 files. Mono MPEG-1 and stereo MPEG-2 use 17 bytes for the side info, and mono MPEG-2 uses 9. Also, the bit reservoir is limited to 255 bytes for MPEG-2 (8-bit field). But for the majority of MP3s in existence the numbers hold.
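Those side-info sizes fit in a small lookup - a sketch (the function name is mine; the numbers are the ones from this post):

```python
def side_info_size(mpeg1: bool, mono: bool) -> int:
    """Bytes of side info per frame, following the MPEG-1/MPEG-2 figures above."""
    if mpeg1:
        return 17 if mono else 32
    return 9 if mono else 17

side_info_size(True, False)  # stereo MPEG-1 -> 32
```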


[Ruminating] Actually, it should be possible to overlap the frames' data. That is, have the bit reservoir point to the last data bits of the previous frame, assuming they're the same. By my calculations that would save on average 1 bit per frame. That's 0.038 kbps! Hmm... It's probably not worth it. It would only save 1KB on a 4 minute song. [/Ruminating]
"We demand rigidly defined areas of doubt and uncertainty!" - Vroomfondel, H2G2

MSC Thesis about audio compression

Reply #32
Quote
Quote
- when savings are more than 4096 bits, they will be wasted (4196 bits of savings = 100 bits of waste)

Exactly. However, if you save enough to go to the next smaller frame size (between 27 and 210 bytes, depending on frame size), then you can do that instead of adding to the reservoir.


Actually the maximum bit reservoir size is 4088 bits, since 511 (111111111b) is the highest possible bit reservoir pointer.

Sebi

MSC Thesis about audio compression

Reply #33
More questions regarding MP3:

1. I read that audio is processed by frames of 1152 PCM samples (also, at decoding, every frame is decoded to 1152 PCM samples) - do these 1152 samples consist of 576 samples from left channel & 576 samples from the right one, or 1152 samples from left & 1152 from right (giving total of 2304 samples)?
Also, are these values constants, independent of source file sample-rate (the same for 32kHz, 44.1kHz, and 48kHz)?

2. The polyphase filter bank gets these 1152 samples and splits them into 32 subbands. I read that it results in 36 time-domain (PCM?) samples per subband - how? Shouldn't it still be 1152 samples per subband? After all, when I apply a band-pass filter to some audio stream, I don't get a shorter audio stream, but an audio stream of the same size with less frequency content. If the polyphase filterbank transformed samples into the frequency domain I could understand it, but since they're said to still be in the time domain, I don't get it.

3. Time-domain samples are transformed into the frequency domain by the MDCT. I understand it as taking a number of samples and making a kind of spectrum plot out of them, so the X axis is frequency and the Y axis is level. But then there are long & short blocks - short blocks have higher time resolution and lower frequency resolution, and long blocks the other way around. Again - how? Since the MDCT takes a fixed number of samples, and there are only frequency & level axes (no time), how can there be higher or lower time resolution?
Maybe the MDCT isn't like a spectrum plot, but more like a spectrogram? So there's time on the X axis, frequency on the Y axis, and a kind of colour notation for level? Then I could explain to myself that short blocks = higher resolution of the time axis & lower resolution of the frequency axis, and long blocks = the other way around. But then it's not quite a frequency domain...
And one more thing, I read that the MDCT is lossless - it takes "n" time-domain samples and outputs "n/2" frequency coefficients, but it can be reversed without quality loss. Then why is there a need for long/short blocks if it is supposed to be lossless no matter what (and why is there no lossless codec which would use the MDCT to guarantee a 50% compression ratio)?

4. There's a relation that:
FrameSize [bytes] = 144 * BitRate [bps] / SampleRate [Hz] (+ padding)
It seems to be correct but I can't figure out where the 144 comes from.
And how do I find out frame sizes in VBR (BitRate is not constant then)?

5. Why is M/S coding lossy in MP3 (I read that in Ogg Vorbis it's lossless)?

As you can see I'm quite lost, can someone shed some light on these things please?

MSC Thesis about audio compression

Reply #34
Quote
1. I read that audio is processed by frames of 1152 PCM samples [...]

MPEG-1 Layer3 frames contain 1152 samples for each channel. This is true for all three supported sampling rates.

Quote
2. The polyphase filter bank gets these 1152 samples and splits them into 32 subbands. I read that it results in 36 time-domain (PCM?) samples per subband - how? [...]

This is not correct. MPEG-1 Layer3 frames consist of 2 granules; each granule contains 576 samples. Processing is done per granule, not for the entire frame! The polyphase filterbank produces 32 subbands, so each subband contains 18 samples. The filterbank is not a simple stack of bandpasses! It is a polyphase quadrature filter, so the subbands are subsampled by a factor of 32 (the number of bands).

Quote
3. Time-domain samples are transformed into frequency-domain by MDCT, I understand it as it takes a number of samples and makes a kind of spectrum plot out of them, so on X axis there's frequency and on Y level. But then there are long & short blocks - short blocks have higher time resolution and lower frequency resolution, and long blocks the other way around. Again - how?

Long and short blocks use different MDCTs, 36-tap and 3 x 12-tap respectively. The three consecutive MDCTs of short blocks have a higher resolution in time but lower resolution in frequency domain.

Quote
And one more thing, I read that MDCT is lossless - it takes "n" time-domain samples and outputs "n/2" frequency coefficients, but it can be reversed without quality loss. Then why there's a need for long/short blocks if it is supposed to be lossless no matter what (and why there's no lossless codec which would use MDCT to guarantee 50% compression ratio)?

Please read up on the MDCT. The consecutive blocks of data have to overlap by 50%.

Quote
4. There's a relation that:
FrameSize [bytes] = 144 * BitRate [bps] / SampleRate [Hz] (+ padding)
It seems to be correct but I can't figure out where the 144 comes from.

Well, this is magic.  Um, no...
I'm too lazy at the moment to derive that. Please help us, SebastianG

Quote
5. Why is M/S coding lossy in MP3 (I read that in Ogg Vorbis it's lossless)?

M/S coding is lossless. See this thread for the formulae.

MSC Thesis about audio compression

Reply #35
Quote
Quote
4. There's a relation that:
FrameSize [bytes] = 144 * BitRate [bps] / SampleRate [Hz] (+ padding)
It seems to be correct but I can't figure out where the 144 comes from.

Well, this is magic.  Um, no...
I'm too lazy at the moment to derive that. Please help us, SebastianG

well, 144 * 8 = 1152, the number of samples per channel in a frame

MSC Thesis about audio compression

Reply #36
Quote
Quote

And one more thing, I read that the MDCT is lossless - it takes "n" time-domain samples and outputs "n/2" frequency coefficients, but it can be reversed without quality loss. Then why is there a need for long/short blocks if it is supposed to be lossless no matter what (and why is there no lossless codec which would use the MDCT to guarantee a 50% compression ratio)?

Please read up on the MDCT. The consecutive blocks of data have to overlap by 50%.


Hint: There's also the issue of what you are actually putting in the MDCT and what you are getting out...

MSC Thesis about audio compression

Reply #37
Quote
MPEG-1 Layer3 frames contain 1152 samples for each channel.

Please pick a number:
1. When one frame gets decoded you get 1152 samples for the 1st channel and another 1152 samples for the 2nd channel, so you get a total of 2304 samples.
2. When one frame gets decoded you get 1152 samples for only one channel; the 1152 samples for the other channel are in another frame.
Quote
MPEG-1 Layer3 frames consist of 2 granules, each granule contains 576 samples. Processing is done for the granules, not the entire frame!

Does number of granules have anything to do with number of channels?
Quote
well, 144 * 8 = 1152 samples per channel in frame

Thanks, hopefully I'll figure out where the 8 is from when I get some sleep...
Quote
Hint: There's also the issue of what you are actually putting in the MDCT and what you are getting out...

I input "consecutive blocks of a larger dataset, where subsequent blocks are overlapped", so I input "the output of a 32-band polyphase quadrature filter", so I input "real numbers". It outputs "real numbers", half as many as were input. So far I know that, but it doesn't explain much. The samples that come out of the polyphase filter - are they still PCM samples?
I guess I input an array of samples, every next one being further in time, but what do I get out? An array of levels, every next one being for a higher frequency?

At the moment that's how I think it is:
I have following samples from polyphase filter: 1 2 3 4 5 6

Long block:
[1 2 3 4] -> MDCT -> [f(1 2 3 4)]
[3 4 5 6] -> MDCT -> [f(3 4 5 6)]

Short block:
[1 2] -> MDCT -> [f(1 2)]
[2 3] -> MDCT -> [f(2 3)]
[3 4] -> MDCT -> [f(3 4)]
[4 5] -> MDCT -> [f(4 5)]
[5 6] -> MDCT -> [f(5 6)]

Does it more or less hold water?

MSC Thesis about audio compression

Reply #38
Quote
Quote
well, 144 * 8 = 1152, the number of samples per channel in a frame

Thanks, hopefully I'll figure out where the 8 is from when I get some sleep...

1152 * BitRate [bps] / SampleRate [Hz] (+ padding) = frame size in bits. If you want to know the frame size in bytes, you'll have to divide by 8 (if your machine has 8 bits per byte).

MSC Thesis about audio compression

Reply #39
Quote
Quote
MPEG-1 Layer3 frames contain 1152 samples for each channel.

Please pick a number:
1. When one frame gets decoded you get 1152 samples for the 1st channel and another 1152 samples for the 2nd channel, so you get a total of 2304 samples.
2. When one frame gets decoded you get 1152 samples for only one channel; the 1152 samples for the other channel are in another frame.

A frame decodes to 1152 samples.
A sample of a stereo frame consists of two values, one for the left channel and one for the right channel - it's a tuple. A sample of a 5.1 MPEG-2 MC frame consists of six values.

MSC Thesis about audio compression

Reply #40
Quote
But true CBR really stinks so the bit reservoir was added.


The bit reservoir was in the original ASPEC design, and the only issues around introducing it were that it made Layer 3 even more competitive with Layer 2.
-----
J. D. (jj) Johnston

MSC Thesis about audio compression

Reply #41
Quote
1152 * BitRate [bps] / SampleRate [Hz] (+ padding) = frame size in bits. If you want to know the frame size in bytes, you'll have to divide by 8 (if your machine has 8 bits per byte).

/me slaps his forehead.
Actually the first thing I thought was that it stands for bits, but then I confused myself by thinking "samples don't need to be 8-bit". I really need to get some sleep...
Quote
A frame decodes to 1152 samples.
a sample of a stereo frame consists of two values, one for the left channel and one for the right channel, it's a tuple. a sample of a 5.1 MPEG-2 MC frame consists of six values.

Got it. Thank you robert.

MSC Thesis about audio compression

Reply #42
Quote
Quote
MPEG-1 Layer3 frames consist of 2 granules, each granule contains 576 samples. Processing is done for the granules, not the entire frame!

Does number of granules have anything to do with number of channels?

No, the "2" is a fixed number which applies to all MPEG-1 Layer3 streams (mono and stereo).
In MPEG-2 Layer3 LSF (low sampling frequency - 16/22.05/24 kHz) each frame consists of only one granule - that's 576 samples per frame.

Quote
Quote
Hint: There's also the issue of what you are actually putting in the MDCT and what you are getting out...

I input "consecutive blocks of a larger dataset, where subsequent blocks are overlapped", so I input "the output of a 32-band polyphase quadrature filter", so I input "real numbers". It outputs "real numbers", half as many as were input. So far I know that, but it doesn't explain much. The samples that come out of the polyphase filter - are they still PCM samples?

The PQF (analysis) output consists of 32 sub-bands, each containing 18 PCM samples. You can think of the 18 samples in a sub-band as being "stretched" over the duration of the granule (576 samples). That's what "subsampled by a factor of N=32" means.

MDCT has twice as many input samples (time domain) as output samples (frequency domain). The first half of the input comes from the previous block and the second half from the current block of data. That's what "overlapping transform" means.

example - mp3 long blocks:
36-tap MDCT, block size 18 samples
input: previous+current block - 36 samples in time domain
output: 18 samples in frequency domain
The previous block is actually from the mp3 granule before the current one.

Quote
And how do I find out frame sizes in VBR (BitRate is not constant then)?

Each mp3 frame contains a bitrate indicator in the header. Using the lookup table for bitrates from the MPEG standard and the "144" formula the frame size can be calculated.
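A sketch of that calculation for MPEG-1 Layer III, with the bitrate and samplerate tables indexed by the raw 4-bit and 2-bit header fields (names are mine, for illustration):

```python
# MPEG-1 Layer III tables, indexed by the raw header fields
BITRATES_KBPS = [None, 32, 40, 48, 56, 64, 80, 96, 112,
                 128, 160, 192, 224, 256, 320, None]  # 0 = "free", 15 = invalid
SAMPLE_RATES = [44100, 48000, 32000, None]

def frame_size(bitrate_index, samplerate_index, padding_bit):
    """Frame length in bytes: 144 * bitrate / samplerate (+ padding).
    144 = 1152 samples per frame / 8 bits per byte."""
    bitrate = BITRATES_KBPS[bitrate_index] * 1000
    return 144 * bitrate // SAMPLE_RATES[samplerate_index] + padding_bit

print(frame_size(9, 0, 0))  # 128 kbps @ 44.1 kHz -> 417 bytes
print(frame_size(9, 0, 1))  # same, with the padding bit set -> 418 bytes
```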

MSC Thesis about audio compression

Reply #43
Quote
Quote
5. Why is M/S coding lossy in MP3 (I read that in Ogg Vorbis it's lossless)?

M/S coding is lossless. See this thread for the formulae.


You could say that in MP3/AAC the input signal to the codec is split mid/side, whereas in Vorbis the output quantized coefficients are mapped via a square polar representation. Because Vorbis is already dealing with quantized data that transformation itself can be lossless. MP3/AAC is in the float area so there is always roundoff error.
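Square polar aside, an integer mid/side transform *can* be made lossless with a lifting step - a sketch of that trick (this is what lossless codecs such as FLAC do on integer samples; it is not what MP3/AAC do, since they form M/S on floating-point spectral values before quantization):

```python
def ms_forward(l: int, r: int) -> tuple[int, int]:
    """Lossless integer mid/side: compute the side first, then fold its
    floor-halved value into the mid so no rounding information is lost."""
    s = l - r
    m = r + (s >> 1)  # equals floor((l + r) / 2); arithmetic shift floors
    return m, s

def ms_inverse(m: int, s: int) -> tuple[int, int]:
    """Exact inverse: undo the lifting steps in reverse order."""
    r = m - (s >> 1)
    return r + s, r

# round-trips exactly, including negative samples
assert all(ms_inverse(*ms_forward(l, r)) == (l, r)
           for l in range(-4, 5) for r in range(-4, 5))
```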

MSC Thesis about audio compression

Reply #44
Yay! I think all is clear now, thank you very much smack & Garf

MSC Thesis about audio compression

Reply #45
Quote
MP3/AAC is in the float area so there is always roundoff error.


Odd, you're saying one could not use a "lifting" approach for M/S calculation and decoding?
-----
J. D. (jj) Johnston

MSC Thesis about audio compression

Reply #46
Hi Omion,

I know this thread has been dead for a long time, but I have a question.  Well, more of an observation really.  It seems to me that the bit reservoir cannot ever stretch back more than one frame.  The reason is that the decoder doesn't know where to look for the bits (or, rather, bytes) to be carried forward from the preceding frames if there is more than one of them (frames, that is, not bytes).

With the bit reservoir entirely contained in one frame, the calculation is simple:

Code:
start_of_data_to_be_carried_forward = start_of_frame + frame_size - next_frame->bit_reservoir;


But with two preceding frames, what do you do?  Specifically, you don't know (or, at least, cannot easily find out) how much of the immediately preceding frame (let's say there are two, for the sake of argument) is available for reservoir bytes and how many bytes are being used to encode the frame itself (I hope this makes some kind of sense!).  A quick glance at the LAME source code (check out maxmp3buf) supports this view, and my own experiments (with files encoded using LAME) bear this out, but I would be interested in your opinion.

<EDIT>
OK, I worked it out!  If the bit reservoir stretches back, say, two frames, then *all* of the audio data in the immediately preceding frame is to be considered as part of the bit reservoir.  All coded and working now.

But what a mess the MP3 spec is!  Diabolical.  They should have thrown MP1 and MP2 away and started again with a proper framing scheme.  And the ISO spec is just plain wrong in this regard - it talks about main_data_end when it really means main_data_begin.  Clearly the person who wrote it up did not understand - it simply would not work as specified.
</EDIT>

Incidentally, I found your posts very helpful.  Up to now, we have been recording to VBR format with the bit reservoir disabled to make it easy to cut and splice our MP3 files, but this is clearly not optimal and in any case we have to deal with files imported from other sources.  We are now in the process of cleaning up our act, thanks, in part, to you.

Regards,

Paul Sanders
http://www.alpinesoft.co.uk
I am an independent software developer (VinylStudio) based in the UK