Full Version: Why DCT for MFCC?
yneedshelp
Hello,
To obtain MFCC coefficients, we would do the following:
(sound signal frame in time domain)->FFT->mel freq. scale filter->log->DCT

Our goal here is to extract a characteristic vector of the signal frame while reducing the number of samples.
For example, when the number of samples in the frame is 400, taking the FFT makes it 200. After mel frequency scale filtering, the number is, say, 40.
It seems to me that taking the DCT does not compress the information at all, as the number of points after the DCT is still 40...

Am I missing something here?
If DCT is for data compression, how can I get its effect?
If DCT is not for data compression, what is it doing?

Thanks!
Woodinville
Hmmm...

Lots missing here. Look up "prediction gain", "spectral flatness measure", and "transform gain" for starters.
hyeewang
Although you can get 40 MFCC coefficients, the leading 12 are enough to capture the essential characteristics of the signal.

In my humble opinion, that is where the data compression property of the DCT resides.
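
To make the "leading 12 out of 40" point concrete, here is a rough sketch in Python with NumPy. The 40-band "log mel spectrum" is made up for illustration (a tilt plus two broad bumps), not real speech data, and the orthonormal DCT-II scaling is my own choice so that energies can be compared directly. It only shows the mechanism: for a smooth input, most of the energy lands in the first few DCT coefficients.

```python
import numpy as np

n_bands = 40
bands = np.arange(n_bands)

# Made-up smooth "log mel spectrum" (assumption, not measured data):
# a spectral tilt plus two broad, formant-like bumps.
log_mel = (-0.05 * bands
           + 2.0 * np.exp(-((bands - 8) / 5.0) ** 2)
           + 1.5 * np.exp(-((bands - 25) / 8.0) ** 2))

# Orthonormal DCT-II matrix: row k is the k-th cosine basis vector.
k = np.arange(n_bands)[:, None]
basis = np.sqrt(2.0 / n_bands) * np.cos(np.pi * (bands[None, :] + 0.5) * k / n_bands)
basis[0] /= np.sqrt(2.0)

coeffs = basis @ log_mel

# Fraction of the total energy held by the first 1, 2, ..., 40 coefficients.
energy_fraction = np.cumsum(coeffs ** 2) / np.sum(coeffs ** 2)

# Keep only the leading 12 coefficients and transform back.
truncated = coeffs.copy()
truncated[12:] = 0.0
reconstruction = basis.T @ truncated
rel_error = np.linalg.norm(reconstruction - log_mel) / np.linalg.norm(log_mel)

print("energy in first 12 coefficients:", energy_fraction[11])
print("relative reconstruction error:", rel_error)
```

The DCT itself still outputs 40 numbers. The reduction only happens when you drop the trailing ones, and how much you lose by doing so depends on how smooth the input is.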
Woodinville
Sigh.

Some terms to look up

"transform gain"
"Diagonalization"
"matched filtering"

This may lead you both in the right direction.

To explain:

If we have a sine wave of maximum amplitude, we have something whose average amplitude (i.e. mean absolute value) is roughly .5 of full scale (2/π ≈ .64, to be exact). While what I'm saying is not mathematically correct and is only approximate, that means that you have, say, for 16 bits, about 15 bits on average per sample.

Now, if we take a 65536 point transform that happens to exactly match the sine wave in one particular basis vector (please look up transforms to see what a basis vector is!), you will have 65535 lines with ZERO information, and 1 line with an amplitude of 65536.

This gives you 16 bits above 1 (and 16 below), for a total of 32 bits in the signal representation, divided over 65536 samples.

That's hardly 15 bits per sample.

This is, of course, a massively extreme case of transform gain; in practice, with windows etc., you cannot achieve this kind of gain. It does, however, explain the basic gain.

Basically log_2 (n) is a lot smaller than n for most n.
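
The extreme case described above can be checked numerically. This is only a sketch in Python with NumPy, using a plain FFT as the transform; the choice of 1000 cycles per block is arbitrary (any whole number of cycles makes the sine line up with exactly one basis vector), and the amplitude of the single line depends on the transform's scaling convention.

```python
import numpy as np

n = 65536
cycles = 1000  # arbitrary choice; a whole number of cycles matches one FFT basis vector
t = np.arange(n)
signal = np.sin(2 * np.pi * cycles * t / n)

# Time domain: every sample carries a sizeable value.
mean_abs = np.mean(np.abs(signal))

# Frequency domain: magnitude of each line (real-input FFT, unnormalized).
spectrum = np.abs(np.fft.rfft(signal))
peak = spectrum.max()
peak_bin = int(np.argmax(spectrum))

# Count lines that are not just floating-point rounding noise.
significant = np.count_nonzero(spectrum > 1e-6 * peak)

print("mean absolute sample value:", mean_abs)
print("lines carrying information:", significant, "of", len(spectrum))
print("peak at line", peak_bin, "with magnitude", peak)
```

With NumPy's unnormalized FFT the one surviving line has magnitude n/2 rather than n, but the picture is the same: one large number plus its position, instead of 65536 samples that all need most of their bits.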
jmvalin
The first thing to consider is that the DCT is actually just a shortcut and is equivalent to computing the FFT of the "full spectrum" (i.e. including the negative frequencies, which mirror the positive ones). About compression, it depends on how you see things. You just cannot obtain a smaller number of samples without throwing away information. What the DCT does, however, is concentrate the information in just the first few points (i.e. with 10 MFCCs, you have almost as much information as with the 40 mel bands).
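
The "shortcut" can be sketched in Python with NumPy. The function names `dct2_direct` and `dct2_via_fft` are made up for this example, and the DCT-II is left unnormalized. The idea is to mirror the 40 values into a symmetric sequence of 80 (the "full spectrum"), take an ordinary FFT, and undo a half-sample phase shift; the result should match the DCT computed directly from its cosine definition.

```python
import numpy as np

def dct2_direct(x):
    """Unnormalized DCT-II straight from the cosine definition."""
    n = len(x)
    k = np.arange(n)[:, None]   # output index
    m = np.arange(n)[None, :]   # input index
    return np.cos(np.pi * (m + 0.5) * k / n) @ x

def dct2_via_fft(x):
    """Same DCT-II, computed as an FFT of the mirrored (symmetric) sequence."""
    n = len(x)
    mirrored = np.concatenate([x, x[::-1]])   # length 2n, even-symmetric
    spectrum = np.fft.fft(mirrored)[:n]
    # The symmetry point sits half a sample before index 0, hence the phase term.
    phase = np.exp(-1j * np.pi * np.arange(n) / (2 * n))
    return 0.5 * np.real(phase * spectrum)

x = np.random.default_rng(0).standard_normal(40)
print(np.allclose(dct2_direct(x), dct2_via_fft(x)))
```

Because the mirrored sequence is symmetric, the phase-corrected FFT is purely real, which is why only cosines are needed.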
neocambell
You may find the following articles useful.

Discrete Cosine Transformation (DCT)
http://www.expertcore.org/viewtopic.php?f=71&t=435

JPEG Image Compression Explained
http://www.expertcore.org/viewtopic.php?f=71&t=436
Martel
IIRC the target of MFCC is to characterize the properties of the human vocal tract during speech (outputting a small set of aggregated representative data). The basic idea is that you want to separate the driving force (breath) from the signal (since this is basically the same for everyone and is thus useless information) and keep only the things that differ among humans (so you're analyzing the vocal tract "filter"). The ->log->DCT part serves, I guess, this purpose (since you usually discard the portion of the coefficients which carries the "breath" information).
It has been a long time since I last saw this so I may be wrong.
PS: shouldn't the formula be like this?
signal->FFT->magnitude spectrum (absolute values of the complex frequency samples)->square each sample->mel...
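
For reference, the whole chain including the squared-magnitude step can be sketched in Python with NumPy as below. Everything specific here is an assumption for illustration, not something stated in this thread: the 16 kHz sample rate, the Hamming window, the common 2595 * log10(1 + f/700) mel formula, triangular filters, an unnormalized DCT-II, and keeping 12 coefficients including the zeroth. The names `mel_filterbank` and `mfcc_frame` are made up for the example. Real implementations differ in many of these details.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with edges equally spaced on the mel scale."""
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    bin_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((n_filters, len(bin_freqs)))
    for i in range(n_filters):
        lo, mid, hi = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mfcc_frame(frame, sample_rate=16000, n_filters=40, n_coeffs=12):
    """signal -> FFT -> |.|^2 -> mel filters -> log -> DCT, for one frame."""
    windowed = frame * np.hamming(len(frame))          # assumed window
    power = np.abs(np.fft.rfft(windowed)) ** 2         # power spectrum
    mel_energies = mel_filterbank(n_filters, len(frame), sample_rate) @ power
    log_mel = np.log(mel_energies + 1e-10)             # small offset avoids log(0)
    k = np.arange(n_coeffs)[:, None]
    m = np.arange(n_filters)[None, :]
    basis = np.cos(np.pi * (m + 0.5) * k / n_filters)  # first n_coeffs DCT-II rows
    return basis @ log_mel

# Usage: one 400-sample frame of a 440 Hz tone at the assumed 16 kHz rate.
frame = np.sin(2 * np.pi * 440.0 * np.arange(400) / 16000.0)
coeffs = mfcc_frame(frame)
print(coeffs.shape)  # (12,)
```

Note that the reduction from 40 values to 12 happens only in the last step, by computing just the first 12 DCT rows.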