Help - Search - Members - Calendar
Full Version: exploiting repetition in music for lossy compression?
Hydrogenaudio Forums > Hydrogenaudio Forum > Scientific Discussion
sanecyclist
My first post, asking about an issue I've been wondering about for a while.

Music by its very nature tends to be rather repetitive. Obviously two chunks of a song are very unlikely to be exactly equal, unless perhaps you're looking at purely synthetic music. But surely a lot of it will be rather similar to something that went before, and I'd imagine that could be exploited for compression. Yet afaik, all the well-known lossy codecs divide a song into frames and compress only within those frames, thereby disregarding redundancies between frames. Why is that?

I'd imagine you could have a buffer with decoded samples, with the size depending on speed and memory limitations. When encoding a chunk of a song, the encoder would correlate it with the buffer to find the most similar chunk, try encoding the difference to that, and use that if it turns out smaller than encoding the chunk on its own. Has this sort of scheme been tried but found not to be worthwhile?
SamHain86
While not a programmer for the next lossy codec, I can tell you that that kind of recognition system is difficult to program. This is to my knowledge since I have never seen mathematics that would somehow locate similar sets of numbers in a seemingly random noise of numbers.

I'll give an example:
CODE
ASFLKYHBADFHNJLVFSCLVFRGVSDKGUKBYNTVADFHNJLVFSCNMZFXLDSTAKURIOPQPIREUPWPMRTWMTPOWUKAGFJNMRLTDFHNJLVF
SWOFACDFRPOT4IUKMTCZDEROUTOMCAMSLRJDFHNJLVFSTIOUMHGDKLSJERTOIHMHGBJGLPDDFHNJLVFOFPIEKRUTYMFJLPCUJDFH
NJLVFSOIYW3N4LKERPIUYIKEDMLNDFHNJLVFSJHCXVFYUITORE4PKSDOIDUHRE4KLGDFIDFHNLVFSDJHKNKLG

Can you find the repeating bits? I'll make it easier, the repeating code is DFHNJLVFS.

In this code I'll mark the parts that do repeat and look like they're repeating:
CODE
ASFLKYHBA--**DFHNJLVFS**--CLVFRGVSDKGUKBYNTVA--**DFHNJLVFS**--CNMZFXLDSTAKURIOPQPIREUPWPMRTWMTPOWUKA
GFJNMRLT--**DFHNJLVFS**--WOFACDFRPOT4IUKMTCZDEROUTOMCAMSLRJ--**DFHNJLVFS**--TIOUMHGDKLSJERTOIHMHGBJG
LPD--**DFHNJLVF**--OFPIEKRUTYMFJLPCUJ--**DFHNJLVFS**--OIYW3N4LKERPIUYIKEDMLN--**DFHNJLVFS**--JHCXVFY
UITORE4PKSDOIDUHRE4KLGDFI--**DFHNLVFS**--DJHKNKLG

Now which ones are the real repeaters, and which ones look like it?

I once wished for this kind of compression. And yes I wrote an algorithm that would search for these repetitions in strings... it is insanity unless there are magical mathematics to locate this.
sanecyclist
It wouldn't need to find an exact match. A close one would be fine, as long as the difference compresses better (using the usual tools in MP3&Co.) than the actual signal. I think the magic mathematics for finding a similar signal is cross correlation.
SamHain86
That is actually neat. I minored in mathematics in college, but statistics is not my strong suit.

As I am still seeing it, one would need to know the general region to sample the repeating segments to use this correlation formula.

Again statistics does not make much sense to me, but anyone else more fluent care to continue?
muaddib
Findinf exact match is easier than finding similar match. Time needed for cross correlation for comparing sequence A against sequence B is short, but if you don't know the length of sequence and you don't know where sequence starts then you must try every sequence length and position that make sense. Since there is large number of such possible starting points and lengths in music you would need to run too many cross corelation which would lead to uselessly slow searching for similarities.
You can try setting constant sequence length (lets say equal to frame length) and fixed sequence starts (i.e. frame beginning) but then it would be very very very rare that frame A is "correlated" well enough to frame B.
Then there is also problem that simple correlation would not be good thing for audio. You should find some other similarity measure which should be connected to psychoacoustics and coding used in encoder.
This is just to complicated and thus never used.
Who would anyhow need encoder which runs 0.00001x realtime?
Mike Giacomelli
QUOTE(SamHain86 @ Mar 7 2007, 08:45) *

While not a programmer for the next lossy codec, I can tell you that that kind of recognition system is difficult to program. This is to my knowledge since I have never seen mathematics that would somehow locate similar sets of numbers in a seemingly random noise of numbers.

I'll give an example:
CODE
ASFLKYHBADFHNJLVFSCLVFRGVSDKGUKBYNTVADFHNJLVFSCNMZFXLDSTAKURIOPQPIREUPWPMRTWMTPOWUKAGFJNMRLTDFHNJLVF
SWOFACDFRPOT4IUKMTCZDEROUTOMCAMSLRJDFHNJLVFSTIOUMHGDKLSJERTOIHMHGBJGLPDDFHNJLVFOFPIEKRUTYMFJLPCUJDFH
NJLVFSOIYW3N4LKERPIUYIKEDMLNDFHNJLVFSJHCXVFYUITORE4PKSDOIDUHRE4KLGDFIDFHNLVFSDJHKNKLG

Can you find the repeating bits? I'll make it easier, the repeating code is DFHNJLVFS.

In this code I'll mark the parts that do repeat and look like they're repeating:
CODE
ASFLKYHBA--**DFHNJLVFS**--CLVFRGVSDKGUKBYNTVA--**DFHNJLVFS**--CNMZFXLDSTAKURIOPQPIREUPWPMRTWMTPOWUKA
GFJNMRLT--**DFHNJLVFS**--WOFACDFRPOT4IUKMTCZDEROUTOMCAMSLRJ--**DFHNJLVFS**--TIOUMHGDKLSJERTOIHMHGBJG
LPD--**DFHNJLVF**--OFPIEKRUTYMFJLPCUJ--**DFHNJLVFS**--OIYW3N4LKERPIUYIKEDMLN--**DFHNJLVFS**--JHCXVFY
UITORE4PKSDOIDUHRE4KLGDFI--**DFHNLVFS**--DJHKNKLG

Now which ones are the real repeaters, and which ones look like it?



Look up what the field of bioinformatics does. Its been a huge and fruitful area of research for a long time. Its actually pretty easy to do this; for instance the NCBI database can search for a thousand line pattern (or statistically similar variants of it) in the 65 billion base pair Genebank database in a couple seconds. Depending on accuracy, you can do a lot faster too.

QUOTE(SamHain86 @ Mar 7 2007, 08:45) *

I once wished for this kind of compression. And yes I wrote an algorithm that would search for these repetitions in strings... it is insanity unless there are magical mathematics to locate this.


Well, i've got a few textbooks on it if you're really interested. "Bioinformatics" by David Mount is a good start, it introduces a lot of the core algorithms.

I think the real reason this isn't done is because it'd be too slow for decoders. Who wants a decoder that can randomly seek across the entire 5MB file each second? Unless you've got MBs worth of cache on your Ipod, you'd never be able to decode it in real time out of a 16 bit SDRAM. And generally people aren't interested in audio formats that require PC class hardware, just to save a few kb/s of disk space.
sanecyclist
muaddib, yes, sequence length would definitely need to be fixed to frame length. And like you say, fixing the position wouldn't make sense, i.e. the frame would have to be correlated the entire buffer. I don't know how long the buffer would need to be to be useful, but I'd guess a few beats might already give decent results. Then the question becomes: how slow is FFT-accelerated cross correlation? 0.00001x realtime seems rather over-pessimistic.

Perhaps cross correlation wouldn't find the ideal match, but it should get pretty close anyway. Psychoacoustics would certainly come in when encoding the difference to the match, which would need to be done with reference to the levels in the actual signal, so as not too waste bits on tiny differences.

(Btw, video compression of course does something similar in P and B frames.)
sanecyclist
QUOTE
I think the real reason this isn't done is because it'd be too slow for decoders.

I don't think it would be significantly slower, not for straightforward playback anyway. The only extra complexity is adding the decoded difference signal to the reference sequence from the buffer, which is trivial.

Seeking would be slowed down a lot, but many MP3 players don't even support that anyway.

QUOTE
Unless you've got MBs worth of cache on your Ipod

With 44.1kHz 16-bit stereo you get about 6 seconds in one megabyte, which might already be enough for decent results.

QUOTE
And generally people aren't interested in audio formats that require PC class hardware, just to save a few kb/s of disk space.

That's why I was asking. Has anyone tried this sort of thing to see how much compression can be had there?
spoon
There are many dance tracks which could have 1000:1 compression with this method wink.gif

Looking at an FFT trace would be more revealing. The trouble is matching what is obviously the same piece of repetition (even if it was an instrument), where there are subtle differences in each repetition.

The repetition could be removed from the singal, and what is left is compressed with a normal codec.
Mike Giacomelli
QUOTE(sanecyclist @ Mar 7 2007, 10:27) *

QUOTE
I think the real reason this isn't done is because it'd be too slow for decoders.

I don't think it would be significantly slower, not for straightforward playback anyway. The only extra complexity is adding the decoded difference signal to the reference sequence from the buffer, which is trivial.



Its not trivial if it's not in cache. Have you considered the effects of locality on performance? I think not.

QUOTE(sanecyclist @ Mar 7 2007, 10:27) *

Seeking would be slowed down a lot, but many MP3 players don't even support that anyway.


Seeking wouldn't be any different.

QUOTE(sanecyclist @ Mar 7 2007, 10:27) *

QUOTE
Unless you've got MBs worth of cache on your Ipod

With 44.1kHz 16-bit stereo you get about 6 seconds in one megabyte, which might already be enough for decent results.


Ipods do not have 1MB of cache. That was essentially the point I'm getting at. MBs worth of cache means you're on a desktop CPU.

On an Ipod, you have 8KB of L1 cache (which will be mostly filled with code segments) and some additional IRAM which may or may not be available. How much do you think you could fit into a couple KB worth of memory?
eofor
QUOTE(spoon @ Mar 7 2007, 18:50) *

There are many dance tracks which could have 1000:1 compression with this method wink.gif


It would be even better for Coldplay fans, you'd only need to analyze one track and just store the tiny differences to store their last three albums!

(grabs coat, leaves)
knutinh
My University professor used to claim that the lower bound for lossless image compression was the storage needed for a unique ID pointing to that image in a database.

That is, of course, if you only want to compress allready existing images. Turns out that if you estimate the number of pictures out there to be reasonably small and that they all have lots of megapixels this isnt all that bad.

I guess this only expresses that the "room" occupied by pixels in an image is really really large, and a large percentage of that room does not occur in practice. The problem with lossless compression is finding that pattern.

Moreover, for many uses, one could model the human understanding on a very high level. For instance, a car within a picture may be percieved as the same, even if a different image of a similar car, featuring very different pixel values is pasted in. I guess the audio analogy would be MIDI-files, where the "distortion" is insane, but it is still quite possible to hear what song it is and it might even be pleasant. And the real information (besides sample-set) is only a few kb of keyboard-times/velocities.

Now, well. Isnt the mpeg4 synthetic audio what really was supposed to exploit this kind of things? I mean, as long as 99% of all pop music is produced by essentially trigging samples at given points in time from a MIDI sequencer, if you have access to the "score-to-audio" rendering, then you have a lot of info on the statistical properties of the audio track?


BTW, doesnt any lossy sound codecs have any "training" properties that ensures that large-scale statistical patterns are somehow factored in?


What is the "lower pattern repeat frequency" of lempel-ziv etc? I am guessing that if I paste two copied word documents where one has a few altered letters, then Winzip and Winrar will compress quite nicely compared to a single file?

-k
Dynamic
QUOTE(knutinh @ Mar 8 2007, 15:48) *

What is the "lower pattern repeat frequency" of lempel-ziv etc? I am guessing that if I paste two copied word documents where one has a few altered letters, then Winzip and Winrar will compress quite nicely compared to a single file?


Not sure, but in my experience, ZIP takes no account of redundancy among multiple similar files and compresses each separately. I believe that zipping an uncompressed single archive file (such as .TAR.GZ common in Unix) or using CAB files takes advantage of inter-file redundancy.

In fact I have a couple of folders, one with 50 to 200 similarly laid-out Excel files and one with roughly as many accounting software backup files which include incrementally more information, both of which compress to a CAB the size of about 2 of the original Excel or Accounts files, clearly taking account of the vast similarities among files.

Returning to repetition in music, however...

Thinking more about lossless compression than lossy of the thread's title, I guess it's conceivable that autocorrelation (cross-correlation of a signal with itself to spot the period of repeating patterns) could be used to seed a predictor. This differs from the usual short-range fitting function predictor that I believe is employed in lossless compression schemes (and Wavpack's hybrid lossy), but might help to reduce the residuals that have to be stored by providing an additional longer-range predictor to help out the usual predictor if the music suits it.

With around 120 bpm rhythms, one might assume percussive hits every 0.5 seconds (88200 bytes of stereo PCM data in 44.1kHz/16bit), so this is much more than the 8kB L1 cache Mike pointed out is typical of an iPod, let alone considering longer term repetition such as every 12-bars (around 25-35 seconds) in twelve bar blues, for example.

Storing in much lower resolution or encoding only the important transients in detail might just help for lossless codecs. I'd imagine that lower resolution wouldn't be too helpful an approach. I guess you could detect such transients or prioritise the most important moments for lossless compression by finding out where the conventional lossless residuals are largest, or perhaps by autocorrelating the sequence of conventional lossless residuals in a second pass, correlating them over a discrete range of time delays (usually denoted by variable s in the mathematics of digital correlation) of the the order of seconds.

One might also detect positive and negative autocorrelations related to delayed and attenuated reflections of the direct sound of instruments from walls etc. that create ambience (or artificial ambience) unless the recording is to sound very "dry". In fact this might also work by picking up cross-correlation peaks between Left and Right channels or Mid and Side channels, rather than only autocorrelation of each channel with itself.
The first ambient reflections are typically 25 milliseconds to 50 ms from the original sound if they don't tend to audibly comb-filter the sound, or from a few ms onwards if they might create a little coloration. A fairly constant ambience might provide some useful prediction.

Instinct tells me not to anticipate huge gains to lossless compression in the majority of music. Of course, it's possible that this is already done in lossless schemes that aren't really compatible with digital audio players like the iPod.

Thinking more about lossless compression than lossy of the thread's title, I guess it's conceivable that autocorrelation (cross-correlation of a signal with itself to spot the period of repeating patterns) could be used to seed a predictor. This differs from the usual short-range fitting function predictor employed in lossless compression schemes (and Wavpack's hybrid lossy), but might help to reduce the residuals that have to be stored by providing an additional longer-range predictor to help out the usual predictor if the music suits it.

With around 120 bpm rhythms, one might assume percussive hits every 0.5 seconds (88200 bytes of stereo PCM data in 44.1kHz/16bit), so this is much more than the 8kB L1 cache Mike pointed out is typical of an iPod, let alone considering longer term repetition such as every 12-bars (around 25-35 seconds) in twelve bar blues, for example.

Storing in much lower resolution or encoding only the important transients in detail might just help for lossless codecs. I'd imagine that lower resolution wouldn't be too helpful an approach. I guess you could detect such transients or prioritise the most important moments for lossless compression by finding out where the conventional lossless residuals are largest, or perhaps by autocorrelating the sequence of conventional lossless residuals in a second pass, correlating them over a discrete range of time delays (usually denoted by variable s in the mathematics of digital correlation) of the the order of seconds.

One might also detect positive and negative autocorrelations related to delayed and attenuated reflections of the direct sound of instruments from walls etc. that create ambience (or artificial ambience) unless the recording is to sound very "dry". In fact this might also work by picking up cross-correlation peaks between Left and Right channels or Mid and Side channels, rather than only autocorrelation of each channel with itself.
The first ambient reflections are typically 25 milliseconds to 50 ms from the original sound if they don't tend to audibly comb-filter the sound, or from a few ms onwards if they might create a little coloration. A fairly constant ambience might provide some useful prediction.

Instinct tells me not to anticipate huge gains to lossless compression in the majority of music. Of course, it's possible that this is already done in lossless schemes that aren't really compatible with digital audio players like the iPod.
smack
QUOTE(Dynamic @ Mar 9 2007, 06:34) *
Not sure, but in my experience, ZIP takes no account of redundancy among multiple similar files and compresses each separately. I believe that zipping an uncompressed single archive file (such as .TAR.GZ common in Unix) or using CAB files takes advantage of inter-file redundancy.
This is quite a common feature of file archivers. For instance 7-Zip and RAR can create "solid archives" (.7z and .rar, respectively) in order to exploit the combined redundancy inherent in similar files.
sanecyclist
Using this for lossless compression might well be worth a try, but I was specifically talking about lossy, based on the idea that for each frame, the difference to a previous chunk of the song might be more compressible (using the usual lossy tools) than the frame by itself.

I don't see why cache size would be critical for decoding such a scheme. Basically each frame would have a field to say what previous chunk in the buffer of previously decoded samples it needs to be added to after decoding. So for each new sample only one sample would need to be retrieved from the buffer. At 44kHz that's a non-issue.
Mercurio
I know only a bit about audio compression, but maybe working in the frequency domain, as usually is done, can be thought (more or less) close to look for "repetitions".

Yep, the best would be a good wave to "midi" converter!
pest
In my opinion finding matches in the temporal domain introduces lot of problems.
First you've got to find a good psychoaccoustic criterion for the best match.
It's verly likley that the removed match will distort the spectrum used for further psychoaccoustic analysis.
Another way of doing something similar is using previous blocks for entropy encoding.
But I supect the enlarged decoding delay is not worth the possible benefit.


knutinh
One relevant question:
To us human beings, the "repetition frequency" of music is obvious:
1. The primal and harmonics of instruments used
2. Modulation/vibrato applied to those instruments
3. percussive patterns
4. 1/4-note, one bar,
5. A blues 12-bar section
6. Verse/refrain sections, AABA structures

But on what level does this repetition actually take place? If I do autocorrelation, how good a fit do I get between verse#1 and verse#2? Does it take the perceptual processing of my mind to get the really good correlation that I feel between the two? And if they sound "similar" to me, even though they are quite dissimilar on a sample-by-sample basis, does this mean that to me as an observer one could simply substitute one for the other for a 2:1 compression with virtually no perceptible loss?

Psychoacoustical models taken one step higher?

-k
Dynamic
QUOTE(knutinh @ Mar 9 2007, 11:51) *

But on what level does this repetition actually take place? If I do autocorrelation, how good a fit do I get between verse#1 and verse#2?


Good question to which I don't know the answer. I'd imagine that naturally produced music will often have an autocorrelation peak not perfectly aligned with the sampling period.

Something like a drum hit or a transient like the plucking of a guitar or bass string is likely to correlate pretty well assuming the microphone placement remains constant. In the mix or the mastering, drums and bass are both likely to have been dynamically compressed somewhat for more consistent loudness. This should improve correlation.

I'd imagine that violins, vocal vibrato, tremelo and pitch-bending effects are likely to vary too much to provide strong correlation.
Woodinville
QUOTE(SamHain86 @ Mar 7 2007, 08:45) *

While not a programmer for the next lossy codec, I can tell you that that kind of recognition system is difficult to program. This is to my knowledge since I have never seen mathematics that would somehow locate similar sets of numbers in a seemingly random noise of numbers.

I'll give an example:
CODE
ASFLKYHBADFHNJLVFSCLVFRGVSDKGUKBYNTVADFHNJLVFSCNMZFXLDSTAKURIOPQPIREUPWPMRTWMTPOWUKAGFJNMRLTDFHNJLVF
SWOFACDFRPOT4IUKMTCZDEROUTOMCAMSLRJDFHNJLVFSTIOUMHGDKLSJERTOIHMHGBJGLPDDFHNJLVFOFPIEKRUTYMFJLPCUJDFH
NJLVFSOIYW3N4LKERPIUYIKEDMLNDFHNJLVFSJHCXVFYUITORE4PKSDOIDUHRE4KLGDFIDFHNLVFSDJHKNKLG

Can you find the repeating bits? I'll make it easier, the repeating code is DFHNJLVFS.

In this code I'll mark the parts that do repeat and look like they're repeating:
CODE
ASFLKYHBA--**DFHNJLVFS**--CLVFRGVSDKGUKBYNTVA--**DFHNJLVFS**--CNMZFXLDSTAKURIOPQPIREUPWPMRTWMTPOWUKA
GFJNMRLT--**DFHNJLVFS**--WOFACDFRPOT4IUKMTCZDEROUTOMCAMSLRJ--**DFHNJLVFS**--TIOUMHGDKLSJERTOIHMHGBJG
LPD--**DFHNJLVF**--OFPIEKRUTYMFJLPCUJ--**DFHNJLVFS**--OIYW3N4LKERPIUYIKEDMLN--**DFHNJLVFS**--JHCXVFY
UITORE4PKSDOIDUHRE4KLGDFI--**DFHNLVFS**--DJHKNKLG

Now which ones are the real repeaters, and which ones look like it?

I once wished for this kind of compression. And yes I wrote an algorithm that would search for these repetitions in strings... it is insanity unless there are magical mathematics to locate this.


Ziv-Lempel will do that, but it won't work for the "mostly the same" kind of thing you'd see in music. Since (unless we're talking about computer performance here) the musicians will not do the exact same thing twice, you'll not see any real redundancy.

QUOTE(Dynamic @ Mar 9 2007, 15:57) *
Something like a drum hit or a transient like the plucking of a guitar or bass string is likely to correlate pretty well assuming the microphone placement remains constant. In the mix or the mastering, drums and bass are both likely to have been dynamically compressed somewhat for more consistent loudness. This should improve correlation.

I'd imagine that violins, vocal vibrato, tremelo and pitch-bending effects are likely to vary too much to provide strong correlation.


Actually, drum hits are very dependent on where the stick hits the drum, as well as the mic placement, the amount of force (much more than just the level is determined by the force, this is some seriously nonlinear acoustics we're talking about here), direction of strike...

Ditto for most any instrument.

Now, drum machines ...
SebastianG
There's something called "AAC LTP" profile -- LTP = long term prediction -- which first sounds like what you are suggesting. However "long" is restricted to somewhat around 1000-2000 samples IIRC and works similarily to what is done in CELP-based coders to improve performance on "voiced" segments.

IIRC someone tried to do "ELTP" -- even longer LTP -- to predict MDCT samples via previously encoded samples for Vorbis I. But his approach didn't work out that well. You might want to check the Xiph mailing lists archive.

Even if it was possible to exploit even longer long term dependencies you would have to pay for the saved bits with increased encoder complexity and memory requirements for the en- and decoder. I guess it is not worth the hassle.

BTW: I did something similar many years ago for synthetic music. I decoded a .MOD file to .WAV, checked for longer repeating sequences (exact match) and created a single file containing blocks of sound and pointers. The new file was pretty easy to "decode". It was much faster to decode than rendering the .MOD which was initially my goal (for a little game's soundtrack back in the 386 days). But the "compression" wasn't that good. So, I ditched the whole idea.

Cheers!
SG
mat128
QUOTE(Mercurio @ Mar 9 2007, 07:21) *

I know only a bit about audio compression, but maybe working in the frequency domain, as usually is done, can be thought (more or less) close to look for "repetitions".

Yep, the best would be a good wave to "midi" converter!


More like a wave to mod/xm converter!!

Ahh good old days =)
NeonRain
The repetitions in music may be able to be utilized better with a wavelet-based codec, instead of a block-oriented MDCT. Note, this may or may not be iPod compatible, it depends mostly on clever ordering of wavelet coefficients, but if one were to compute a wavelet packet transform on an entire song, a compressor could easily pick out large repeating sections, like a chorus, or "12 bar blues".

I've been toying with the idea for a while, and haven't completely figured it out. I've more or less abandoned wavelets as the basis of my own audio codec in favor of the FFT, and so haven't given it much thought in the past year.
knutinh
Before discussing possibilities, it would be interesting to find some kind of practical upper bound for the attainable compression.

Are there any good "redundancy estimators" out there, like winzip or winrar that can exploit correlation at any timescale? Or can Matlab xcor() be used? Are lossless long-scale redundancy exploits fundamentally different from lossy ones?

Ie: is it possible that "verse#1" is more or less uncorrelated to "verse#2" on a sample-by-sample basis, but highly correlated on a perceptual basis?

-k
yerma
I guess any slight shift in pitch, volume and/or speed should be sufficiant to alter your data completely...
SebastianG
QUOTE(NeonRain @ Mar 28 2007, 12:14) *

The repetitions in music may be able to be utilized better with a wavelet-based codec, instead of a block-oriented MDCT.

Why do you think so?

QUOTE(NeonRain @ Mar 28 2007, 12:14) *

[...] I've more or less abandoned wavelets as the basis of my own audio codec in favor of the FFT [...]

on what grounds? and why FFT instead of MDCT?

Cheers!
SG
knutinh
I was under the impression that MDCT, DCT, FFT, STFT, filterbanks and wavelets could be considered subgroups of the generalised "filter-bank" concept? (IE concolving some N-dimensional signal with a set of functions (often orthogonal) that represent the information in a desirable manner).

The main question being if implemented as frame/block-based or not, and how this information is represented (for compression primarily in a way that lets one use simple quantisers for a perceptually optimal information loss, instead of vector quantisers)

People with a physics degree consider wavelets to be revolutionary, while EE people view wavelets as "just another" subgroup of filterbanks with certain desirable features?

-k
Woodinville
QUOTE(knutinh @ Mar 28 2007, 09:13) *

I was under the impression that MDCT, DCT, FFT, STFT, filterbanks and wavelets could be considered subgroups of the generalised "filter-bank" concept? (IE concolving some N-dimensional signal with a set of functions (often orthogonal) that represent the information in a desirable manner).


Yes and no. There are several things to consider:

1) Orthonormality: An MDCT and an FFT (STFT, DCT, proper wavelet etc) are all orthnormal. a PQMF or a QMF tree are not.
2) Critical sampling: an MDCT, PQMF, QMF, Wavelet are critically sampled. The analyzed domain has exactly the same number of values as the original domain.
2a) An FFT/STFT/DCT, with no overlap, is also critically sampled.
2b) An FFT... with overlap is NOT critically sampled.
3) Frequency representation across analysis boundaries.
3a) Filter banks (wavelets, MDCT's, OBT's, LOT's, QMF's, PQMF's) control the meaning of frequency across the boundaries (overlaps). STFT's using overlaps control the meaning of frequency across boundaries, but not in quite as solid a fashion, unless you use 1/2 or more overlap.
3b) STFT's, FFT's, and pure transforms without overlap ARE critically sampled, as above, but offer exactly NO control WHATSOEVER of frequency content across boundaries in analysis blocks.

In short, orthonormal filter banks (and some others, like biorthogonal filter banks) have critical sampling AND control of frequency across blocks.

Block transforms can have EITHER control of frequency across block boundaries, XOR they can have critical sampling. Not both.

Now, things like bi-orthogonal filter banks have another interesting characteristic, they are not power-complimentary (i.e. more power in analyzed than in original) domain, but they are critically sampled and they do offer frequency meaning across "blocks".
QUOTE


The main question being if implemented as frame/block-based or not, and how this information is represented (for compression primarily in a way that lets one use simple quantisers for a perceptually optimal information loss, instead of vector quantisers)


None of this is really relevant to the issue of using repeats in music. Performances are so likely to vary by so much as to make recovery of the redundancy there a chancy thing at the very best, unless you started with MIDI, and even then the latency will get you sometimes.
QUOTE


People with a physics degree consider wavelets to be revolutionary, while EE people view wavelets as "just another" subgroup of filterbanks with certain desirable features?

-k


Wavelets are yet another way to make sub-bands of a signal in a critically sampled fashion.

They are an advance in that they are orthonormal (1:1 and onto), that avoids some of the problems with QMF's.

QMF's can be put into a tree (assymetric) only if the filters used at the later stages (smaller subbands) have very little ripple, otherwise you mess up the aliasing cancellation of the first stages.

Since wavelets are orthonormal (exact reconstruction) you do not have the same problem.

That's what wavelets are good for that QMF's aren't.

There is, of course, a gotcha. Wavelet design is more constrained than QMF design, and as a result you might wind up needing more MIPS. Sometimes this matters, sometimes it doesn't.
SebastianG
QUOTE(knutinh @ Mar 28 2007, 18:13) *

I was under the impression that MDCT, DCT, FFT, STFT, filterbanks and wavelets could be considered subgroups of the generalised "filter-bank" concept? (IE concolving some N-dimensional signal with a set of functions (often orthogonal) that represent the information in a desirable manner).

Yes, you can implement block transforms as bank of filters + decimation like you already noted. That's why I really don't differentiate much between, say, the "JPEG filterbank" (DCT) and orthogonal wavelets. Of course they differ in terms of frequency response and decimation (which might be subband-dependent, like typical hierachical wavelet decompositions) but that's about it.

That's why I fail to see why the "wavelet representation" should be better suited compared to, say "MDCT representation" in terms of long term prediction. Suppose you want to exploit a simple delay (the current segment is close to a segment 42331 samples ago (segment = temporal fraction of the signal). How would you do prediction exactly? AAC-LTP does prediction in time domain. the "segment 42331 samples ago" is MDCT transformed and used as prediction for the current MDCT coeffs (not necessarily for all subbands). This of course works with any kind of filterbank. But how would you do transform-domain prediction (= prediction only based on previously coded transform coeffs without further transforms)? This is not only a problem for audio but also for motion compensation in videos. Experimental Wavelet codecs use lapped and windowed blocks that are moved around and composed again in the spatial domain as initial prediction (without block edges due to the overlaps) and then code the difference in form of wavelet coeffs. Again: No transform domain prediction is used.

I don't want to rule out that a clever algorithm exists that does transform-domain prediction (of arbitrary delays) for a specific transform. But it's certainly easier to do this in the time domain. At least "easier" as in "not so difficult to understand". ;-)

QUOTE(Woodinville @ Mar 28 2007, 23:23) *

There is, of course, a gotcha. Wavelet design is more constrained than QMF design, and as a result you might wind up needing more MIPS. Sometimes this matters, sometimes it doesn't.

I would't fully agree with this. Orthogonal (non-linear phase) 2-channel filterbanks are easily designed *. Also, low order orthogonal wavelet filters already provide good frequency responses whereas (linear phase) QMF filters need higher orders to cover up their non-PR property.

Biorthogonal linear phase wavelet filters are a different story. Filters of orders up to 18 have been published with various properties but their design process is beyond my level of understanding. Personally, I wasn't successful at designing high order biorthogonal filters.

( * Just factor a half-band lowpass filter with only "double-zeros" on the unit circle into a product of equally long minimum and maximum phase filters and then you're done. )

Cheers!
SG
pest
QUOTE(SebastianG @ Mar 29 2007, 13:42) *

This is not only a problem for audio but also for motion compensation in videos. Experimental Wavelet codecs use lapped and windowed blocks that are moved around and composed again in the spatial domain as initial prediction (without block edges due to the overlaps) and then code the difference in form of wavelet coeffs. Again: No transform domain prediction is used.


I read papers about wavelet-codecs with motion prediction in the wavelet-domain. The difficulty is a good way to compensate for phase-shifts. Another approach is MCTF where the whole picture is decomposed by a number of motion-aligned temporal transforms. In my opinion the current wavelet-coders mostly use motion prediction in the temporal-domain because you can use all the magic of existing codecs like variable block sizes and multiple reference frames very easy.
Woodinville
QUOTE(SebastianG @ Mar 29 2007, 05:42) *
QUOTE(Woodinville @ Mar 28 2007, 23:23) *

There is, of course, a gotcha. Wavelet design is more constrained than QMF design, and as a result you might wind up needing more MIPS. Sometimes this matters, sometimes it doesn't.

I would't fully agree with this. Orthogonal (non-linear phase) 2-channel filterbanks are easily designed *. Also, low order orthogonal wavelet filters already provide good frequency responses whereas (linear phase) QMF filters need higher orders to cover up their non-PR property.

Biorthogonal linear phase wavelet filters are a different story. Filters of orders up to 18 have been published with various properties but their design process is beyond my level of understanding. Personally, I wasn't successful at designing high order biorthogonal filters.

( * Just factor a half-band lowpass filter with only "double-zeros" on the unit circle into a product of equally long minimum and maximum phase filters and then you're done. )

Cheers!
SG


There are some issues, though. Wavelets are exact-reconstruction at low order, yes, however their frequency separation is not very good. If you want good rate gain, or good codec performance, you need good frequency separation, and this is where wavelets have always seemed to fall down and go boom. Either you use longer wavelets than you want to, or you wind up having lots of cross-terms that will affect your processing.
SebastianG
QUOTE(Woodinville @ Mar 29 2007, 20:21) *

There are some issues, though. Wavelets are exact-reconstruction at low order, yes, however their frequency separation is not very good.

If you restrict yourself to filters with a high count of vanishing moments (= very high rejection for a narrow range around 0 or pi but a rather large transition band) then you're right. But you don't have to. Frequency response properties of orthogonal filters can be very good. And as stated before, designing those isn't really difficult. (There's a matlab m file collection on the net that can do the job. It calculates minimum phase filters only, though. Unfortunately, I lost the link.)

You might want to compare the properties of this 26 tap filter to a linear phase QMF filter of similar order / rejection. It's a "symlet"-style filter (factorization optimized for a low group delay discrepancy) with a wide rejection band. I might publish some code for designing those if there's any interest.

Cheers!
SG
Woodinville
QUOTE(SebastianG @ Mar 31 2007, 05:03) *

QUOTE(Woodinville @ Mar 29 2007, 20:21) *

There are some issues, though. Wavelets are exact-reconstruction at low order, yes, however their frequency separation is not very good.

If you restrict yourself to filters with a high count of vanishing moments (= very high rejection for a narrow range around 0 or pi but a rather large transition band) then you're right.



It some sense it doesn't matter, wavelets have differentconstraints, but of course since they are ornthonormal they also have some advantages.

You need at least two vanishing moments, in my experience, but more is a waste of space. At low orders (8, 12, 16 ...) this does eat up some performance.
SebastianG
What's your definition of frequency separation?
The filter I linked has a transition band of 1/3 pi with a stopband rejection of over 65 dB.
The "24 tap Johnston QMF" has a narrower transition band of 1/5 pi but with a stopband rejection of only 40 dB.
It should be obvious that a narrow transition band can be traded for a high stopband rejection and vice versa for both kinds of filters. Now, why exactly have wavelet filters inferior "frequency separation"? Don't you think a wavelet filter of order 24 can do at least 40 dB rejection with a transition band of 1/5 pi?
(Unfortunately I can't test it right now because I don't have access to all the necessary tools. But I'm quite confident that giving up the linear phase requirement can lead to filters with superior amplitude responses.)

Cheers!
SG

edit: s/passband/stopband
Woodinville
QUOTE(SebastianG @ Apr 2 2007, 07:43) *

What's your definition of frequency separation?
The filter I linked has a transition band of 1/3 pi with a passband rejection of over 65 dB.
The "24 tap Johnston QMF" has a narrower transition band of 1/5 pi but with a passband rejection of only 40 dB.
It should be obvious that a narrow transition band can be traded for a high passband rejection and vice versa for both kinds of filters. Now, why exactly have wavelet filters inferior "frequency separation"? Don't you think a wavelet filter of order 24 can do at least 40 dB rejection with a transition band of 1/5 pi?
(Unfortunately I can't test it right now because I don't have access to all the necessary tools. But I'm quite confident that giving up the linear phase requirement can lead to filters with superior amplitude responses.)

Cheers!
SG


Well, consider, say, if you're trying to use that filter in a cochlear filter tree. At what point will that amount of transition band lead to a meaningful analysis?

Consider, if you're trying to get good rate gain out of the filterbank. At what point in the tree will that filter give you a good rate gain on something like a pipe organ, say, with lots of dense, stationary harmonics?

That's the problem, in my opinion.
SebastianG
BTW: I used the word "passband" above but I meant to say "stopband".

Are we still discussing QMF filters versus wavelet filters? My point was that one can easily design "wavelet filters" of any order, with any transition bandwidth. Naturally, the maximum possible stopband rejection is bounded by these two parameters (even for QMF filters):
low filter order, narrow transition bandwidth, high stopband rejection form a goal conflict triangle.

You were suggesting that wavelet filters are inferior in terms "frequency separation" compared to QMFs. Correct me if I'm wrong but "frequency separation" is directly related to the filters' transition bandwidth and stopband rejection, no? If so, you are implying that -- given a filter order and desired transition bandwidth -- QMF filters with a reasonable low reconstruction error outperform wavelet filters in terms of stopband rejection, no?

This is at least questionable.


Cheers!
SG
Woodinville
QUOTE(SebastianG @ Apr 2 2007, 11:24) *
You were suggesting that wavelet filters are inferior in terms "frequency separation" compared to QMFs. Correct me if I'm wrong but "frequency separation" is directly related to the filters' transition bandwidth and stopband rejection, no?


Well, yes to the second sentence. I've seen a lot of wavelets designed, and I haven't yet seen things that have as good a frequency response as a same-length QMF. Of course, the wavelets are exact reconstruction and the QMF isn't.

So, I'm not suggesting that anyone should use QMF's, either. Each, I think, has its own set of problems.

Both work poorly for most things compared to a proper orthonormal, critically sampled block transform, too.
SebastianG
QUOTE(Woodinville @ Apr 2 2007, 21:58) *

Well, yes to the second sentence. I've seen a lot of wavelets designed, and I haven't yet seen things that have as good a frequency response as a same-length QMF.

I take this as a challenge. If you want I can publish an equi-ripple filter with parameters of your choice:
- filter order
- count of vanishing moments
- transition bandwidth OR stopband rejection

For example:
- 24 taps, two vanishing moments, transition bandwidth of 1/5 pi => 37 dB stopband attenuation
- 24 taps, two vanishing moments, stopband attenuation of 40 dB => <2/9 pi transition band
- 24 taps, zero vanishing moments, transition bandwidth of 1/5 pi => 37.5 dB stopband attenuation
- 26 taps, one vanishing moment, transition bandwidth of 1/5 pi => >40 dB stopband attenuation

To prove the results I'm going to upload the filter coeffs in a few days ...

QUOTE(Woodinville @ Apr 2 2007, 21:58) *

Both work poorly for most things compared to a proper orthonormal, critically sampled block transform, too.

Ack. But I think they still have some use like in subband CELP speech coders where "voiced" segments can be predicted fairly well via LTP.

Cheers!
SG

edit: parameter restrictions:
only an even number of taps are possible, number of vanishing moments need to be even if the filter order is divisable by four, odd otherwise.
Woodinville
QUOTE(SebastianG @ Apr 6 2007, 03:41) *

QUOTE(Woodinville @ Apr 2 2007, 21:58) *

Well, yes to the second sentence. I've seen a lot of wavelets designed, and I haven't yet seen things that have as good a frequency response as a same-length QMF.

I take this as a challenge. If you want I can publish an equi-ripple filter with parameters of your choice:
- filter order
- count of vanishing moments
- transition bandwidth OR stopband rejection



Dude.

Publish it.

I mean PUBLISH it.

Not here.

Seriously. It ought to be possible, I think. If you can actually show that, you'll even bug Ingrid a bit.

Edited to add: In case it's not clear, this is not sarcasm. Publishing some examples of good wavelet filters, including coefficients, might help the whole field move along. Something like publishing some ok QMF designs did for QMF's.
knutinh
QUOTE(Woodinville @ Apr 6 2007, 19:31) *

QUOTE(SebastianG @ Apr 6 2007, 03:41) *

QUOTE(Woodinville @ Apr 2 2007, 21:58) *

Well, yes to the second sentence. I've seen a lot of wavelets designed, and I haven't yet seen things that have as good a frequency response as a same-length QMF.

I take this as a challenge. If you want I can publish an equi-ripple filter with parameters of your choice:
- filter order
- count of vanishing moments
- transition bandwidth OR stopband rejection



Dude.

Publish it.

I mean PUBLISH it.

Not here.

Seriously. It ought to be possible, I think. If you can actually show that, you'll even bug Ingrid a bit.

Edited to add: In case it's not clear, this is not sarcasm. Publishing some examples of good wavelet filters, including coefficients, might help the whole field move along. Something like publishing some ok QMF designs did for QMF's.

It seems that "Beatnik" is exploiting large-scale temporal redundancy in music files for mobile phones. Witty tounges may say that typical mobile phone music has more such redundancy than most other music ;-)

The only drawback is that this is some kind of enhanced general MIDI spec alongside sample triggering (tracker files anyone?) This suggests to me that designing a general "decoder" is far simpler than designing a general "encoder", while specialised encoders that imports structurised files may not be that difficult.

http://www.beatnik.com/technology/xmf.html

"Beatnik has played a key role in the development of the XMF subsequent Mobile XMF specification working with a number of industry leaders in the mobile handset and technology markets. Based in part on Beatnik’s proprietary Rich Music Format architecture, XMF is an open standard architecture providing a container in which to wrap and securely deliver multiple audio types in a single file.


With Mobile XMF, SP-MIDI music can be combined with digitally recorded custom samples (as Mobile DLS instruments) providing a richer audio experience for end users. Instead of delivering generic polyphonic ringtones constrained by the 128 instrument bank specified by general MIDI, Mobile XMF supports playback of Mobile DLS instruments for an infinite number of downloadable sound possibilities.


Practical applications for this technology abound. Instead of users listening to voiceless pop music ringers, voice samples of the pop artist can be included in the same file. Mobile game soundtrack audio can include original instrumentation and real-time effects. As multimedia messaging continues its successful rollout, XMF can be adapted to include multiple media types, including text and graphics, making it a great vehicle for MMS delivery.


XMF not only benefits end users, it also provides security and file metadata giving content publishers and distributors the flexibility to encrypt content and provide valuable data about copyrights, licensing terms, and composers. "


Claims apparently have been made of "10 times more efficient compression than mp3" which perhaps isnt that far-fetched for a copy&paste -based compression scheme when applied to copy & paste - based source music?

-k
MetalheadGautham
the next thing some one will ask is if it is posible to accurately convert an mp3 file to midi. the answer sadly, is a big no. crying.gif

But if you are an artist, you could try some stuff. Have you ever thought of Rap/HipHop, where synthetic beats are used? it is possible to sample your beats and place them at regular intervels as they are supposed to be, and add the solo samples too. and ofcourse, a Speex track of the vocals(speex is quite crystal clear. I tested it myself using some vocals only samples).

I think the MOD format works somewhat like this. has anyone tried such a thing?

PS: most other artists, like the heavy metal guys, hate shortcuts. metallica won't just sample the chorus and the riff! they will play the entire thing and make it seem similar to a human ear. wink.gif
This is a "lo-fi" version of our main content. To view the full version with more information, formatting and images, please click here.
Invision Power Board © 2001-2008 Invision Power Services, Inc.