Why are 128kbit/s MP3s usually 44,100 Hz?

2012-10-14 01:44:58

Since 128kbit/s MP3s usually have a low-pass filter at 16,000 Hz, and the Nyquist-Shannon theorem states that all frequencies under f/2 Hz can be fully described at a sampling rate of f Hz, why not encode at 32,000 Hz? Wouldn't it save more space?

Reply #1 – 2012-10-14 01:53:23

Quote from: eamon123 on 2012-10-14 01:44:58

... Wouldn't it save more space?

Maybe. Maybe not. It depends. Try it with a few tracks and see. Remember to check whether your encoder uses a cutoff lower than 16 kHz when it encodes a signal sampled at 32 kHz.

Reply #2 – 2012-10-14 04:03:02

Quote from: eamon123 on 2012-10-14 01:44:58

Since 128kbit/s MP3s usually have a low-pass filter at 16,000 Hz, and the Nyquist-Shannon theorem states that all frequencies under f/2 Hz can be fully described at a sampling rate of f Hz, why not encode at 32,000 Hz? Wouldn't it save more space?

Since encoding happens in the frequency domain anyway, there isn't much of a saving. The encoder simply doesn't encode those frequencies, which is pretty close to downsampling but much easier to implement.

That said, at very low bitrates LAME does downsample.

Reply #3 – 2012-10-14 08:12:22

Compression is more effective at 32 kHz, and the quality of tonal parts of the music improves. However, pre-echo issues get worse, and you don't necessarily have a 16 kHz lowpass when using 128 kbps MP3 (though staying below 16 kHz is perfectly adequate at this bitrate).

Reply #4 – 2012-10-14 21:10:26

Indeed, many encoders *do* resample for low bit rates. I've never seen any that do it at 128 kb/s, but once you drop below 100 kb/s it's more common. I forget exactly where it kicks in with LAME, but LAME does this if you set the quality low enough (V6 maybe? V7?).

Reply #5 – 2012-10-15 03:02:59

A 128 kbps encoding is the same size regardless of the sample rate.
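
To put a number on that, here is a minimal sketch of the arithmetic. It assumes a pure constant-bitrate stream and ignores tags and frame padding; the function name and the 4-minute duration are made up for illustration.

```python
def cbr_size_bytes(bitrate_kbps: float, duration_s: float) -> float:
    """Approximate size of a constant-bitrate stream in bytes (tags and padding ignored)."""
    return bitrate_kbps * 1000 * duration_s / 8

# Hypothetical 4-minute track at 128 kbps.
# The sample rate is deliberately not an input to the size calculation.
for sample_rate_hz in (32000, 44100, 48000):
    size = cbr_size_bytes(128, 240)
    print(f"{sample_rate_hz} Hz: {size / 1e6:.2f} MB")
```

At a fixed bitrate, lowering the sample rate changes how the bits are spent, not how many there are.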

Reply #6 – 2012-10-15 05:02:38

Quote from: saratoga on 2012-10-14 04:03:02

Since encoding happens in the frequency domain anyway, there isn't much of a saving.

Listen to these test files and tell me what you think (24 vs 48 kHz, both with 11.5 kHz lpf):

https://rapidshare.com/files/666351554/Baba...y24.11.5lpf.mp3
https://rapidshare.com/files/2623776446/Bab...y48.11.5lpf.mp3

Quote from: halb27

pre-echo issues get worse

If you ask me the attack sounds as good in the 24 kHz file as the 48 kHz, and everything else sounds better. What do you make of it?

Reply #7 – 2012-10-15 05:38:04

Quote from: eamon123 on 2012-10-15 05:02:38

If you ask me the attack sounds as good in the 24 kHz file as the 48 kHz, and everything else sounds better. What do you make of it?

Didn't want to deal with RapidShare, but I bet at such a low bitrate the saving is more than worthwhile, since LAME recommends it. Probably not so much at 128k though.

Reply #8 – 2012-10-15 05:43:32

If you listen to tonal problems at 128 kbps, you'll find that resampling to 32 kHz helps a lot.

Reply #9 – 2012-10-16 18:07:33

Quote from: eamon123 on 2012-10-15 05:02:38

Listen to these test files and tell me what you think (24 vs 48 kHz, both with 11.5 kHz lpf):

Quote from: halb27

pre-echo issues get worse

If you ask me the attack sounds as good in the 24 kHz file as the 48 kHz, and everything else sounds better. What do you make of it?

Counterintuitively, that's actually a different situation from comparing 32 kHz with either 44.1 kHz or 48 kHz, and the attack should be expected to sound as good!

MPEG-1 Layer 3 uses 1152-sample frames at 32, 44.1 or 48 kHz sampling rates
MPEG-2 Layer 3 uses 576-sample frames at 16, 22.05 or 24 kHz sampling rates

Thus, the short-block duration (in milliseconds) to handle pre-echo and transients is the same for both 24 kHz and 48 kHz.

The full frame durations are:
36 ms for 16 or 32 kHz sampling rates
26 ms for 22.05 or 44.1 kHz sampling rates
24 ms for 24 or 48 kHz sampling rates

and this duration can be divided into three short blocks of 192 samples each (each a third of the duration) to handle transients with greater time resolution, at the expense of worse frequency resolution for the same bitrate (a very high bitrate overcomes this, but you're talking about CBR).

So, to get the maximum 50% difference in short-block length, compare 32 kHz against 48 kHz or 16 kHz against 24 kHz (or indeed 32 kHz (poorer short blocks) against 24 kHz (better short blocks)). Conversely, as halb27 says, tonal problem samples are helped by the longer frame durations.
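
The full frame durations listed above follow directly from the frame sizes. A minimal sketch, assuming only the samples-per-frame figures given in this post (the names are made up for illustration):

```python
# Samples per frame as stated above: 1152 for MPEG-1, 576 for MPEG-2 Layer 3.
SAMPLES_PER_FRAME = {
    32000: 1152, 44100: 1152, 48000: 1152,  # MPEG-1 sampling rates
    16000: 576, 22050: 576, 24000: 576,     # MPEG-2 sampling rates
}

def frame_duration_ms(sample_rate_hz: int) -> float:
    """Duration of one full frame in milliseconds."""
    return 1000 * SAMPLES_PER_FRAME[sample_rate_hz] / sample_rate_hz

for rate in sorted(SAMPLES_PER_FRAME):
    print(f"{rate} Hz: {frame_duration_ms(rate):.2f} ms")
```

Each MPEG-2 rate is half of an MPEG-1 rate and its frame holds half as many samples, which is why the frame durations come out in matching pairs.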

Reply #10 – 2012-10-16 18:46:15

MPEG-1 Layer 3 frames consist of 2 granules of 576 samples each. So a long block at 32 kHz has a duration of 18 ms, a short block 6 ms.
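
These figures can be checked with a quick calculation. A minimal sketch, restricted to the MPEG-1 sampling rates and assuming only what is stated in this thread (576-sample granules, each divisible into three short blocks); the names are made up for illustration:

```python
GRANULE_SAMPLES = 576                       # one granule = one long block
SHORT_BLOCK_SAMPLES = GRANULE_SAMPLES // 3  # 192 samples per short block

def block_durations_ms(sample_rate_hz: int) -> tuple:
    """Return (long block, short block) durations in milliseconds."""
    long_ms = 1000 * GRANULE_SAMPLES / sample_rate_hz
    short_ms = 1000 * SHORT_BLOCK_SAMPLES / sample_rate_hz
    return long_ms, short_ms

for rate in (32000, 44100, 48000):
    long_ms, short_ms = block_durations_ms(rate)
    print(f"{rate} Hz: long {long_ms:.2f} ms, short {short_ms:.2f} ms")
```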

Reply #11 – 2012-10-16 19:39:12

Oops, thanks Robert. If I could edit my post now, I would. I missed that step out: it's the granules, not the frames, that are divided by three. Nonetheless, the relative difference in short-block lengths is still that it's 50% greater length at 16 or 32 kHz than it is at 24 or 48 kHz and that it's the same for both 24 and 48 kHz.

Reply #12 – 2012-10-17 19:31:44

Quote from: Dynamic on 2012-10-16 18:07:33

tonal problems

I'm not really sure what those are. Could you explain that for me please?

Reply #13 – 2012-10-17 21:48:56

Quote from: eamon123 on 2012-10-17 19:31:44

Quote from: Dynamic on 2012-10-16 18:07:33
tonal problems

I'm not really sure what those are. Could you explain that for me please?

OK, let's do this rather thoroughly...

You can consider sounds to be of I guess three types: tonal, transient and continuous noise.

Tonal is like a whistle - a pure note, with a sharp frequency response and quite often overtones (harmonics) at multiples of the base frequency which give the timbre or character of the instrument's note. Vocally, vowel sounds are tonal and can be sung.
A chord - multiple notes played at once - is also tonal.

A transient is like a click, a cymbal or hi-hat hit or the breathy or plucked onset of a note from an instrument. (as an aside: often the type of onset gives the human as much information about the instrument as the timbre, which is why although first gen synthesizers tried mainly to reproduce timbre and overtones, later generations improved onset transients for more realism). In the frequency domain, most transients are spread over a wide spectrum (like noise) but in the time domain they are of short duration. Vocally, plosive consonant sounds and similar like p, b, f, k, t have transient nature.

Continuous noise is largely uncorrelated to previous samples, it's an essentially random signal that has components over a broad frequency spectrum. While transients are noiselike in the frequency domain - a broad spectrum with little in the way of frequency peaks - they last only a short time. A brushed snare drum or tape hiss is a good example of continuous noise. Vocally, breath sounds such as sh, ss, ff, dh/th are continuous noiselike sounds.

So the word COSINE for example
starts with a sharp transient C with a clicking noise
the long O is a tonal, singable vowel
the S is noiselike and lasts longer than the transient C without a particularly sharp onset
the I is a tonal, singable vowel
and the N is mostly tonal with a fairly abrupt but not really transient ending.
(the E is silent and modifies the vowel sound represented by letter I)

To accurately match the frequency or pitch of a slow-varying tonal signal, a long block in a transform codec provides more points in the frequency domain, each representing a narrower frequency band. The frequency resolution can be said to be good. Because of the long duration, the time resolution is poor.

Imagine a grossly oversimplified example pretty much plucked from thin air:

If you imagine a bunch of frequency components in a tonal signal represented as decimal integer numbers, in a long block lasting 12 ms a small selection of them (12 by coincidence only) might be:

Code:

160Hz 240Hz 320Hz 400Hz 480Hz 560Hz 640Hz 720Hz 800Hz 880Hz 960Hz 1040Hz
   30   119   475   879  3049 10234  4520   960   214    53   178   422

If you need to encode them only to the nearest 100, say, you might then get

Code:

160Hz 240Hz 320Hz 400Hz 480Hz 560Hz 640Hz 720Hz 800Hz 880Hz 960Hz 1040Hz
    0   100   500   900  3000 10200  4500  1000   200   100   200   400

and since the units and tens digits are now all zero, we only need to send the higher digits (hundreds, thousands, tens-of-thousands etc). This is similar to how we save bitrate in lossy encoding compared to lossless.

This gives a pretty good match for the frequency and amplitude of that tone when reconstructed, which our psychoacoustic model tells us is indistinguishable from the original on this occasion.
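
The rounding in this toy example can be sketched in a few lines. This is only an illustration of the oversimplified example above, not of how a real MP3 quantizer works; the function and variable names are made up.

```python
def round_to_step(value: int, step: int = 100) -> int:
    """Round a non-negative integer to the nearest multiple of step (halves round up)."""
    return (value + step // 2) // step * step

# The 12 made-up long-block components from the example above.
coefficients = [30, 119, 475, 879, 3049, 10234, 4520, 960, 214, 53, 178, 422]

quantized = [round_to_step(c) for c in coefficients]
to_send = [q // 100 for q in quantized]  # only the hundreds and above need sending

print(quantized)
print(to_send)
```

The values in `to_send` are much smaller numbers than the originals, which is where the saving comes from in this picture.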

In a short block, there might be only a third of the number of samples, and a third of the number of frequency bins, each representing a 3-times wider frequency band but over a shorter time (e.g. 4ms), so while the frequency resolution is poor, the time resolution to represent rapid changes in the signal is good. The same bunch of frequencies over the same 12 ms is now divided into three short blocks, but instead of 12 frequency components in one long block, there are just 4 frequency components, each of which is three times broader in bandwidth, in each of three short-time blocks, lasting 4ms each.

Code:

        first 4ms       |        second 4ms       |        third 4ms
240Hz 480Hz 720Hz 960Hz | 240Hz 480Hz 720Hz 960Hz | 240Hz 480Hz 720Hz 960Hz
  375  3849  2022   111 |   208  4721  1898   218 |   142  5431  1468   102
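
The trade-off in this toy example can be written down directly: cutting the block into three divides its duration by three and multiplies the width of each frequency band by three. A minimal sketch using the made-up 12 ms / 80 Hz figures from above (the function name is hypothetical):

```python
def split_block(duration_ms: float, bin_spacing_hz: float, parts: int = 3) -> tuple:
    """Split one long block into shorter blocks: less time per block, wider bins."""
    return duration_ms / parts, bin_spacing_hz * parts

long_block = (12.0, 80.0)                # 12 ms long block, 80 Hz between components
short_block = split_block(*long_block)   # three short blocks of this shape

# Component frequencies of one short block over the range shown in the table.
short_bins_hz = [int(short_block[1]) * k for k in range(1, 5)]

print(short_block)
print(short_bins_hz)
```

In this simplified picture, duration times bin spacing is the same for both block types, so better time resolution is always paid for with coarser frequency resolution.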

If the psychoacoustic model has detected that there's a transient and requested a short block, it might well assume that the frequency spectrum is fairly broad, which is true for purely transient noiselike signals like hi-hat cymbals, and might calculate that rounding to the nearest 100, say, is enough:

Code:

        first 4ms       |        second 4ms       |        third 4ms
240Hz 480Hz 720Hz 960Hz | 240Hz 480Hz 720Hz 960Hz | 240Hz 480Hz 720Hz 960Hz
  400  3800  2000   100 |   200  4700  1900   200 |   100  5400  1500   100

However, there are cases where there is both a strong transient component and a strong tonal component. One example I've tested a few times is the problem sample Angels Fall First. This has a close-microphone on the right-channel guitarist's pick, producing strong clicking sounds (transients) as each string is picked. The string's notes are strongly tonal and the first string continues to sound as the next string is picked.

My guess is that the click of the pick triggers a short block to capture the short-duration sound. However, the bandwidth of each frequency bin in these three short blocks is a good deal broader now, and if the same rounding accuracy (e.g. to the nearest 100 in the above example) is provided, it sounds as though the frequency or amplitude of the continuous tones from the strings that sound throughout is wavering slightly.

To encode both the short time of the transient and preserve the sharp frequency spectrum of the tonal part of the signal over the whole time, a very high bitrate is required to produce finer-than-usual rounding accuracy for these broad bandwidth frequency bins to still result in fine frequency precision. This is a large part of what halb27's lame3.99.5z version does in the -Vn+ and -V0+eco settings when a short block is triggered and I think it's why it solves the Angels Fall First problem sample.

(The fact that there's a trade-off between fine rounding accuracy and high time & frequency precision is a subtle mathematical point in the field of windowed overlapping Fourier Transforms that's too advanced to explain in this context. There's some hope that the new Opus codec's band-by-band time/frequency preference will allow some frequency ranges to encode tonal components at low bitrate with poor time resolution while simultaneously providing good time resolution at low bitrate to other frequency bands.)

Reply #14 – 2012-10-17 22:19:06

Quote from: eamon123 on 2012-10-17 19:31:44

I'm not really sure what those are. ...

As an example of a really ugly tonal problem, encode lead-voice using any VBR level you like and listen to the first 2 seconds.
(BTW there's hope for the future: robert gave me a pre-3.100 version for testing which greatly improves upon samples like this.)

Reply #15 – 2012-10-18 15:24:21

The problem is - how exactly do you resample a 44.1kHz signal to 32kHz so that you preserve everything up to 15999Hz perfectly?

Reply #16 – 2012-10-18 16:55:32

Sure, you can't get the full 16 kHz bandwidth, only something like 15 kHz or a little more, if you don't want to run into audible artifacts from resampling.
Not a big difference, but a valid point.
