
Seeking feedback on a blockless lossy audio coding scheme

Hi All,

I've developed a codec scheme based on mathematical techniques I used in my PhD research on semiclassical methods in physics.  I'm pretty sure the idea is novel, and I provisionally patented it just in case it's worth anything.  Problem is, no experts from industry will read it because they are afraid doing so will raise IP issues.

Roughly speaking, the method is the natural---canonical, I think---way to do multiband gain processing.  The key was a powerful result from the theory of deformation quantization which lays out the circumstances under which a time-varying filter can be easily inverted.  As a bonus, the method implies a phase space measure of perceptual entropy.

Anyway, I hope audio enthusiasts with a mathematical bent will read and enjoy the paper I've posted at

http://arxiv.org/abs/0707.0514

Looking forward to your take,
Matthew Cargo

Seeking feedback on a blockless lossy audio coding scheme

Reply #1
What time resolution do you achieve for masking with this scheme? It is interesting that you do away with blocks.

Seeking feedback on a blockless lossy audio coding scheme

Reply #2
In principle, any time resolution is possible, provided the key spectrogram is of bounded variation---that is, at any point on the time-frequency plane, the characteristic time scale times the characteristic frequency scale must be sufficiently larger (about 40 times) than the uncertainty scale Δt Δf = 1.  In practice, I'm using an extremely simple psymodel that first builds the spectrogram using a Gaussian window with a variance of 128 samples.  I then smooth it with a kernel with widths (to 100 dB down) of 1000 Hz in frequency, 0.012 s forward and 0.02 s backward in time.
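
In case a sketch helps, something like the following captures that psymodel (my own toy rendering, assuming a 44.1 kHz sample rate and reading "to 100 dB down" as the point where the kernel tails reach -100 dB; none of this is the actual codec source):

Code:
    import numpy as np
    from scipy.signal import stft, fftconvolve

    fs = 44100
    x = np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)        # 1 s test tone

    # spectrogram with a Gaussian window of variance 128 samples
    n = np.arange(-256, 256)
    win = np.exp(-n ** 2 / (2.0 * 128))
    hop = 32
    f, t, S = stft(x, fs=fs, window=win, nperseg=512, noverlap=512 - hop)
    P = np.abs(S) ** 2

    # a Gaussian is 100 dB down (in power) at about 6.8 sigma, so pick
    # sigmas that put the -100 dB point at the widths quoted above
    df, dt = f[1] - f[0], t[1] - t[0]
    sf = (1000.0 / 6.8) / df                                 # bins
    s_fwd, s_bwd = (0.012 / 6.8) / dt, (0.020 / 6.8) / dt    # frames
    m = np.arange(-25, 26)
    kf = np.exp(-m ** 2 / (2 * sf ** 2))
    kt = np.where(m >= 0, np.exp(-m ** 2 / (2 * s_fwd ** 2)),
                          np.exp(-m ** 2 / (2 * s_bwd ** 2)))
    kernel = np.outer(kf, kt)                                # (freq, time)
    kernel /= kernel.sum()
    mask = fftconvolve(P, kernel, mode="same")               # smoothed spectrogram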

So far, the biggest problem has been reliably identifying regions of the phase plane characterized by noise, where the masking threshold can be raised.  When I simply raise the whole shaped noise floor,  percussive sounds are the last to distort, so I'm not particularly worried about increasing temporal resolution.  In fact, if you want, you can listen to the shaped noise by itself, and the percussion is not too far off. Perhaps though, now that I think about it, I'd be getting better overall compression if I increased temporal sensitivity near transient events.

Seeking feedback on a blockless lossy audio coding scheme

Reply #3
Quote
I've developed a codec scheme based on mathematical techniques I used in my PhD research on semiclassical methods in physics. [...] I hope audio enthusiasts with a mathematical bent will read and enjoy the paper I've posted at http://arxiv.org/abs/0707.0514


Interesting. Have you considered sending this to the ASSP Speech and Audio Transactions?
-----
J. D. (jj) Johnston

Seeking feedback on a blockless lossy audio coding scheme

Reply #4
Quote
Interesting. Have you considered sending this to the ASSP Speech and Audio Transactions?


Is that the same as IEEE Speech and Audio Transactions?  If so, it was my plan to submit it there, but I thought I should first get some feedback from audio folks.  Being a physicist, I'm not sure whether the paper is written appropriately for the intended audience.  Also, I'm not 100% sure that the idea is new.  Perhaps it is an existing idea written in a different mathematical language, or maybe there is an existing method (such as TNS) that is a sufficiently accurate approximation to it.  I don't think either of those is true, but, as I said, I'm not certain.

Seeking feedback on a blockless lossy audio coding scheme

Reply #5
I don't want to seem whiny here, but did anyone read the paper?  I really did want feedback, even if it's negative.

Seeking feedback on a blockless lossy audio coding scheme

Reply #6
Well, it's interesting, but being a simple person I'd have to implement it to get any kind of feeling for how well it works.*

Given that you've got a provisional application filed, why would I spend my time doing that? It cannot be of any possible use to me, could be of use to you, and you aren't paying me.

That's not to say I have a problem with patents, but the co-operation you commonly see here at HA and other places is on work which is either patent-free, or in which it is unlikely (to say the least) that the patent owners are going to have any interest in what's happening here. People co-operate to make something better because they expect to be able to use it (for free) and share it.

However, if you shared code, someone might just be tempted to play with it in an idle moment, just out of curiosity. The problem is that there aren't many source-code-aware people here with idle moments!

* Maybe that's not strictly true. It appears it'll work as well as you make it work, if you see what I mean. Still, there are bound to be issues and an implementation is the best way of finding these.

I understand your need to get some feedback. By the end of the year, you need to know whether it's promising (and move on to a full patent application, incurring considerably more cost) or not (dropping the provisional application, and hence releasing the idea into the public domain).

I sincerely wish you luck, because it's tough for inventors. By the way, in the UK, we don't have the cheap "provisional" application option, so it's pain and expense straight away - though there's not much to do during the first year of a full application, thankfully, and UK inventors can always cheat and file provisional apps in the USA to grab early protection.

Cheers,
David.

Seeking feedback on a blockless lossy audio coding scheme

Reply #7
David,

Thanks so much for replying.  Your "it's interesting" is in fact the first piece of concrete feedback I have been able to get from anyone anywhere.  I know it was bad form to come to this cooperative forum asking for comment on an idea that I may eventually patent, but I was/am pretty desperate: more than a year after finishing graduate school, I still haven't found a job, and I had been certain that this invention, together with my PhD, would prove my employability.  I was blindsided that people in the field couldn't even look at the idea, for IP reasons.  I justified asking for feedback here by hoping that people would at least find the paper intellectually stimulating.

If I continue not to get responses from potential employers, I may well soon release my source code so that the idea has a chance at the cooperative development I think it deserves.  I had in fact already been thinking along those lines. You are exactly right that there are implementation issues.  The simple psychoacoustic model I suggest in the paper works well enough, but it isn't competitive for difficult-to-compress sources where I know the noise floor can be raised.  I've been tinkering with various ways to correct the noise floor, with somewhat positive results, but it's been hard finding a way to reliably raise the floor on hard-to-compress sources without tampering with the already-correct floor on easy-to-compress files.  I think the general idea is missing one crucial bit of insight.

Matt

Seeking feedback on a blockless lossy audio coding scheme

Reply #8
Your paper is very math-heavy, aimed at an audience of academics. I'm a first/second-year computer science student, and although I have several years of programming experience, I find it difficult to conceptualize exactly what's going on. Could you give us a simplified, math-light rundown of what's novel about your approach?

Maybe your problem with employment is that the HR departments that read your paper have no idea what's going on.

Seeking feedback on a blockless lossy audio coding scheme

Reply #9
David,

Thanks so much for replying.  Your "it's interesting" is in fact the first piece of concrete feedback I have been able to get from anyone anywhere. [...]  I was blindsided that people in the field couldn't even look at the idea, for IP reasons.


I am no longer active in a company in the field, so I don't have to care about "IP reasons", and this is my feedback: I do not understand your paper.

A paper that is very hard to understand except for a select few people is not a good reference for a prospective employer. It might indicate that you are not very good at explaining things understandably, which is a serious liability in a commercial environment.

Seeking feedback on a blockless lossy audio coding scheme

Reply #10
Two say they don't understand it. Great, now we're getting somewhere. I did run the paper by some physicist friends first, and they didn't have any problem understanding it.  But I was worried it might not be appropriate for DSP engineers. This was one reason I asked for feedback. The other was the slight concern that the idea might be an unoriginal one expressed in different mathematical language.

To answer Canar, I'll begin by summarizing my understanding of standard lossy perceptual coding.  In the simplest method, we divide the signal into blocks and then smooth each block's DFT energy spectrum by convolving it with a spreading function.  The square root of this smoothed function is proportional to the perceptual noise floor, which we use to set the quantization scale for the block's DFT spectrum.  In the active point of view, we divide the DFT by the noise floor, and quantize on a unit scale.  To decode, we re-multiply by the noise floor, and invert the DFT.

Quantizing the encoded signal introduces unit quantization noise, and this noise is then colored during the decoding stage. By choosing the spreading function according to psycho-acoustical experiments, this noise is masked by the rest of the signal. 
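
If code is clearer than prose, here's a toy version of that standard block-based scheme (just an illustration, with a stand-in Gaussian spreading kernel rather than a real psychoacoustic one; the floor itself is the side information that must be stored):

Code:
    import numpy as np

    def encode_block(x, kernel):
        X = np.fft.rfft(x)
        smoothed = np.convolve(np.abs(X) ** 2, kernel, mode="same")  # spreading
        floor = np.sqrt(smoothed) + 1e-12       # perceptual noise floor
        return np.round(X / floor), floor       # quantize on a unit scale

    def decode_block(q, floor, n):
        # re-multiplying by the floor colours the unit quantization noise,
        # ideally keeping it just under the masking threshold
        return np.fft.irfft(q * floor, n)

    # usage: one 1024-sample block
    rng = np.random.default_rng(0)
    x = rng.standard_normal(1024)
    kernel = np.exp(-0.5 * (np.arange(-8, 9) / 3.0) ** 2)
    kernel /= kernel.sum()
    q, floor = encode_block(x, kernel)
    x_hat = decode_block(q, floor, len(x))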

The well-known problem with this simple procedure is that the shaped quantization noise is spread evenly through its block, and so, if there is a time-localized sound, the noise might be heard in the relatively quiet parts of its block. These are blocking artifacts.

Temporal noise shaping (TNS), used in the most advanced codecs, is the standard fix. TNS shapes the quantization noise so that it more closely follows temporally localized sounds. To be honest, I don't fully understand TNS. I do know, however, that when it is used in the MDCT basis, which partially overlaps, it sometimes causes a pre-echo artifact, as quantization noise is bumped into neighboring blocks.
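
For what it's worth, here is my reading of the usual open-loop description (a sketch of my understanding, which, as I said, is incomplete): TNS runs linear prediction across the frequency coefficients of a block, and by time-frequency duality, quantization noise added to the prediction residual gets re-shaped at decode time to follow the block's temporal envelope.

Code:
    import numpy as np
    from scipy.linalg import solve_toeplitz

    def tns_analysis(X, order=4):
        # fit an AR model across frequency (normal equations built from
        # the autocorrelation of the spectral coefficients)...
        r = np.correlate(X, X, mode="full")[len(X) - 1 : len(X) + order]
        a = solve_toeplitz(r[:order], r[1 : order + 1])
        # ...then filter with the prediction-error filter; the residual is
        # what gets quantized, and 'a' travels as side information
        res = X.copy()
        for k in range(1, order + 1):
            res[k:] -= a[k - 1] * X[:-k]
        return res, a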

My method eliminates the pre-echo artifact. (This is the main reason I think my method is novel; the other reason is that I have looked at the matrix elements of my encoding transformation in a basis of juxtaposed DFTs, and there's no way anyone would write down a matrix like that.) It does this by avoiding blocks altogether and shaping noise simultaneously in time and frequency.

The price paid is that I encode and decode using pseudo-differential operators, which are not diagonal in any standard basis, and we must use the asymptotic properties of the Weyl symbol calculus to practically invert the decoding transformation. In practice, the added abstraction which justifies the method does not complicate its implementation. I was able to code it up in several weeks, despite not having coded in C for 10 years.

If you got confused by my paper, it was probably in the section where I summarize the Weyl symbol and the Moyal star product.  It's a shame that DSP engineers are not generally more familiar with this formalism. Most of the major results are over 50 years old, and I know Cohen in particular has strongly advocated for it in the signal processing area.  To be fair, even in physics and physical chemistry, we practitioners of the phase space point of view are often frustrated by how infrequently it informs our cohort. In physics, these methods are not part of standard graduate training; we usually get a few lectures on WKB approximation, but nothing on the rigorous formalism.

The crucial fact is that certain matrices (slowly changing filters) can be better understood as functions on the time-frequency plane. In physics, we are talking about the correspondence between quantum mechanical operators and functions on the corresponding classical phase space. There's a huge literature on this subject; common terms are geometric quantization, deformation quantization, symbol correspondences, WKB approximations, the method of stationary phase, micro-local analysis, Moyal star product, the metaplectic and Heisenberg groups, and coherent states.

In my method, we visualize the shaped quantization noise as a function on the time-frequency plane (rather than a juxtaposed series of functions of frequency alone in each block). We use the Weyl correspondence between operators and functions to directly convert this function into the decoding operator.  Fancy properties of the Weyl formalism are used to justify the inverse, encoding operator, but as I mentioned earlier, it is rather easy in practice.
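
For concreteness, in one common signal-processing convention (my transcription of textbook definitions, not a quote from the paper), the Weyl symbol of a time-varying filter with kernel K, the star product that represents operator composition, and its leading-order inverse are:

    a(t, f) = \int K\!\left(t + \tfrac{\tau}{2},\, t - \tfrac{\tau}{2}\right) e^{-2\pi i f \tau}\, d\tau

    (a \star b)(t, f) = a \exp\!\left[\tfrac{i}{4\pi}\left(\overleftarrow{\partial_t}\overrightarrow{\partial_f} - \overleftarrow{\partial_f}\overrightarrow{\partial_t}\right)\right] b = ab + \text{derivative corrections}

    a^{\star -1}(t, f) \approx \frac{1}{a(t, f)} \quad \text{for } a \text{ slowly varying on the } \Delta t\, \Delta f = 1 \text{ cell}

So, to leading order, the filter whose symbol is the noise floor is inverted by the pointwise reciprocal of that symbol; this is the sense in which the decoding transformation is easy to invert when the key spectrogram is of bounded variation.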

Seeking feedback on a blockless lossy audio coding scheme

Reply #11
It sounds utterly fascinating. You might be able to achieve some great quality at low bitrates from the sounds of it. However, most psychoacoustic encoders are already sufficiently transparent WRT pre-echo. So, pre-echo is a problem, sure, but we throw enough bitrate at it that we solve the problem.

I'll look into these formalisms more. They sound worthy of further study!

Seeking feedback on a blockless lossy audio coding scheme

Reply #12
Canar, you've hit the nail on the head.  This idea comes too late on the scene to be of real value.  Memory is too damn cheap.  There's some hope that it might be useful for voice encoding where bandwidth will always be an issue, but then there's a latency problem to think through.

However, I stand by my view that it's interesting, and that it's what should have been done in the first place!

Seeking feedback on a blockless lossy audio coding scheme

Reply #13
I do not have deep enough mathematical knowledge to fully understand your approach, but it sounds to me like a real breakthrough. Are you saying that your idea creates a blockless compression scheme, so that you could cut and join a sound file much more precisely than can be achieved with traditional methods? Lower encoding/decoding delay? Less prone to pre-echoes?
Better quality at lower bitrates?

Anyway, congratulations, it's quite interesting.

Enrico

Seeking feedback on a blockless lossy audio coding scheme

Reply #14
Enrico,
I don't think I can make any of those claims without running afoul of the house rules, except for the one about pre-echo---that definitely never happens with this method.  I do believe that when the psychoacoustical model is brought to a sufficient degree of sophistication, the method will be able to devour those extra bits used to suppress the jaggedness of quantization noise in block-based coding schemes. The bit rate might go down by 10 or 20 percent. 

The big advantage, though, as I see it, is that the method effects a clean separation between the lossy encoding stage, which exploits psychoacoustical phenomena, and the subsequent lossless encoding stage,  which exploits redundancy in the encoded signal.  After we "divide" off the noise floor (in phase space), we are then free to seek the unitarily equivalent basis whose coefficients, when quantized at a unit scale, take up the least space.  For example, it would be interesting to try storing the encoded signal in various wavelet bases.

Matt

Seeking feedback on a blockless lossy audio coding scheme

Reply #15
Thank you for the additional explication. Your observation that the formalism you use is not taught by default in most places is correct. This makes the paper read like a foreign language on a first pass.

Can you elaborate more on the following requirements of a practical implementation:

a) Processing power: i.e. what kind of operations are needed on what amount of data
b) Amount of memory needed
c) Which side information must be coded. (Or: what data must be transmitted, but see below)

In my opinion, getting rid of pre-echo is not a big issue. We have simple, understood methods that deal with it acceptably enough.

But the big pain of any block-based audio coding scheme is the need to have a large amount of side information per block. If you get rid of block-based side information in a modern codec, you will gain 20% instantly, and much more at very low bitrates. If you can reduce it, you won't win as much, but you will still win.

The advantage of block-based schemes is that they allow efficient seeking.

As far as I understand, the section on "storing the key" deals exactly with the problem of storing the side information, namely that of storing the scalefactors in a normal codec. Maybe existing implementations can help you along there.

Another observation I want to make is that on page 17, to efficiently store the data, you end up with a block-based scheme again, because you need an extra time->frequency transformation. I have to read your paper again now to understand the consequences of the DFT->quantization->dequantization->DFT cycle there, and why blocking isn't a problem. Or wait, it is: "If the chunks are too small, frequency localization in the DFTs can cause the quantization noise assumptions to break down and introduce a noticeable warble to the decoded signal." What's the effect of making those blocks larger?

Seeking feedback on a blockless lossy audio coding scheme

Reply #16
Garf,

These are all good questions.  Let me start with the last one.  How can I justify calling this a blockless coding scheme when I end up storing the information in DFT'd blocks anyway?  Well, to be glib, it's because I don't do the coding step in blocks.  I'm only using the basis of DFT'd blocks because it is more compressible, but any basis will do. There's a simple theorem, not in the paper, that uncorrelated white noise remains so in any unitarily equivalent basis. 
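
That claim is easy to check numerically: unit white noise has identity covariance, and U I U^T = I for any orthogonal (or unitary) U, so the noise statistics survive any such change of basis. A quick sanity check:

Code:
    import numpy as np

    rng = np.random.default_rng(1)
    N, trials = 64, 100000
    U, _ = np.linalg.qr(rng.standard_normal((N, N)))  # random orthogonal basis
    noise = rng.standard_normal((N, trials))          # E[n n^T] = I
    y = U @ noise
    cov = y @ y.T / trials                            # sample covariance
    print(np.allclose(cov, np.eye(N), atol=0.05))     # True: still white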

The warble I mentioned is, I've since learned, called musical noise, and it occurs here when many of a block's DFT coefficients are already near zero before quantization.  When shaped during the decode, this noise is still below the masking threshold.  However, since it is musical, it is less easily masked.  I should go back and check that statement you quoted: I haven't heard that warble in a while, and perhaps it was just a bug in earlier code. At the time I wrote it, I was convinced that using larger blocks meant that fewer coefficients were close to zero before quantization, hence making the quantization error statistics more uniform.

As I said before, any basis will do for the lossless stage.  I'm curious about wavelets. 

As a side question, I read that LPC is dual to coding in the frequency domain, but that can't be literally true, because I get better results compressing with DFT'd blocks than with flac.

Also, I've heard that the MDCT partially overlaps.  Does this mean it is an overcomplete basis?  I think I read somewhere that you need 10 percent more storage because of this.  If that's true, not needing the MDCT would be an advantage for my method.

Now for the side information questions.  The key spectrogram is the side information in this codec, and compressing it raises an interesting general question: how can we generate a smooth approximation to a given smooth function from the least amount of stored information?  I haven't done much work on the general question since I convinced myself that I could reduce the side information to about 2 percent (of the original file size) using simple linear interpolation.  I've since been focusing my attention on the part of the code that I think will give me the best immediate improvement in overall compression ratio.  Right now that's the psychoacoustical model. I don't have any precise numbers now, but I think, with my current storage method, the side info size is roughly comparable to the side info size in standard codecs.
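
To make "simple linear interpolation" concrete, a toy version of that kind of storage scheme (a sketch of the idea, not my actual implementation) just keeps every k-th sample of the smooth key and rebuilds it by separable linear interpolation:

Code:
    import numpy as np

    def downsample_key(key, kt=16, kf=8):
        # store only every kt-th frame and every kf-th frequency bin
        return key[::kt, ::kf]

    def upsample_key(coarse, shape):
        # separable linear interpolation back to full resolution:
        # along time (axis 0) first, then along frequency (axis 1)
        t = np.linspace(0, coarse.shape[0] - 1, shape[0])
        f = np.linspace(0, coarse.shape[1] - 1, shape[1])
        tmp = np.stack([np.interp(t, np.arange(coarse.shape[0]), coarse[:, j])
                        for j in range(coarse.shape[1])], axis=1)
        return np.stack([np.interp(f, np.arange(coarse.shape[1]), tmp[i])
                         for i in range(shape[0])], axis=0)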

Other implementation issues need to be worked out too.  The main decode loop is done in the time domain, with about 500 arithmetic operations per sample. This is in addition to operations necessary to unpack and to put the encoded signal back in the time domain.  The main decode loop takes up the most processor time. Overall it does run in real time, but it's processor heavy.  The transformation is probably more nearly diagonal in the basis of DFT'd blocks and that would speed up the decode, especially since I'm already storing the signal in that basis. (This was in fact my main fear for the method's novelty: that when I write the decode in this basis, I'd simply get the TNS matrix elements.)  Also, the decode loop could be parallelized or hardwired, though I'm quite sure the latter will never happen.

Loose ends: memory requirements are fairly minimal, at least for a computer.  I did write a seek capability into the decoder.  I'm not sure you'd call it efficient, but it works well enough for me.

To summarize, the method is very young, and it has in front of it a host of implementation/optimization questions that are already behind existing codecs.  Those, together with the ubiquity of other codecs, constitute a huge entry barrier.  It may just end up a mathematical curiosity.  In physics, we have a bias toward smooth processes governed by differential equations, and I suspect the masking mechanism of the ear is an example.  I believe, then, that the correct psychoacoustic model is based on smooth functions in phase space, and that such a model stands the best chance of maximizing the perceptually based coding gain.

Seeking feedback on a blockless lossy audio coding scheme

Reply #17
How about you write a side paper (or an "Open letter") that:
- Presents the mathematical premises of your work -with refs
- Lists the advantages, proven or expected, of your approach
- Gives numbers, for a selected set of sources and parameters, that show how good it is

In short, promote it! If you manage to generate some buzz, you stand a better chance.

Seeking feedback on a blockless lossy audio coding scheme

Reply #18
jido,
Sounds like a good thing to do.
Matt

BTW, do ignore my earlier blathering about the MDCT transform.  I just read up on it.  I confused the overlapping with the oversampling in the mp3 filterbank.