Full Version: Comparison of audio files
Hydrogenaudio Forums > Hydrogenaudio Forum > General Audio
machosry
Suppose I have two audio files, A and B, in any format. A contains a 10-minute speech by person X. B contains a short excerpt of that speech, but spoken by a different person, Y. I need to compare A and B and tell whether A contains the words (or some of the words) in B, or whether B is otherwise related to A. Please help me with how to do this, or point me to any algorithms that could help. It doesn't matter if it is hard to implement. The problems in this case are: 1) A and B can be in different file formats. 2) A can contain noise, music, etc., but B is clean, without any noise, and can sometimes be a computer-generated voice.

I also need to know whether there is a format in which these files can be compared easily. I heard that the Fast Fourier Transform can be used for the comparison, but I'm not sure about the implementation techniques; I just came across it on Google.
DVDdoug
This is called Speech Recognition. I suppose FFT is a "starting point", but the hard part is analyzing the FFT results. I assume there are programming libraries that can help... You're trying to write a program, right? I suggest you search SourceForge.net and/or MSDN.

And, you might look for a book or two on speech recognition and/or audiology. I read somewhere that an audiologist can sometimes look at a spectrogram of a single word and tell you what word was spoken (probably simple, common, words like "yes").

Voice recognition can work fairly well in some situations, especially with a limited vocabulary. For example, if the automated system at the credit card company asks you to say "yes" or "no", or if it asks you to speak the numbers in your account number. It can also work well if it's "trained" for your particular voice and vocabulary.

It doesn't work that well when you have unknown speakers (maybe some with different accents) and a big vocabulary. I don't know how well it works with singing, but I suspect this makes it more difficult. It also doesn't work as well with lots of noise, and I doubt it's going to work at all if you are trying to recognize the lyrics on a typical music CD (with mixed vocals and instrumentation).

P.S.
Now that I think about it... Most people can't understand the words on a typical music CD! And human "speech recognition" isn't as good as you'd think either... Usually, we hear the words in context and our brain fills in the blanks without us realizing it... You know how sometimes you have to think about what somebody said for a half-second or so before you realize what they said?
machosry
I think this is where everyone gets confused. It is kind of speech recognition, but not exactly. Let me put it this way: if I have two audio files, A1 and A2, and I want to compare them, what ways are there to do this?
DVDdoug
You might be right... This is way beyond my knowledge and abilities... But, I believe this requires the same techniques as speech recognition. You've got two different speakers, probably speaking at different loudness, different pitches, different vocal overtones & characteristics (everybody's voice sounds different), and at different rates (different speeds). So, although two words (or sentences) might be the same, the sound isn't really the same.
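To illustrate the "different rates" problem: one classic way to compare two sequences that run at different speeds is dynamic time warping (DTW). Nobody in this thread proposed it, so treat the following as a minimal sketch under assumptions, not a working speech matcher. The function name `dtw_distance` and the toy number sequences are made up for illustration; in practice each element would be a feature vector derived from the audio, and differences in pitch and voice character would still need separate handling.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Either sequence may "stall" while the other advances,
            # which is what absorbs the difference in speaking rate.
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b stalls
                                 cost[i][j - 1],      # b advances, a stalls
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# Toy data (hypothetical): the same "shape" at half speed, and a different shape.
slow = [0, 0, 1, 1, 2, 2, 3, 3, 2, 2, 1, 1, 0, 0]
fast = [0, 1, 2, 3, 2, 1, 0]
other = [3, 2, 1, 0, 1, 2, 3]

same_shape = dtw_distance(slow, fast)   # aligns perfectly despite the speed difference
different = dtw_distance(fast, other)   # cannot be aligned without cost
```

Here the slow and fast versions match exactly once the time axis is stretched, while the unrelated sequence does not.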

If you had a "working" speech recognition system that could turn your "A" and "B" files into text, it's a simple matter to compare the text.
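As a rough sketch of that last step, assuming some recognizer has already produced a text transcript for each file (the transcripts below are invented, and `clip_coverage` is a hypothetical helper name), the text comparison could look something like this using only Python's standard `difflib`:

```python
import re
from difflib import SequenceMatcher


def words(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def clip_coverage(long_text, clip_text):
    """Fraction of the clip's words that were matched, in order, in the long transcript."""
    a, b = words(long_text), words(clip_text)
    if not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(b)


# Invented transcripts standing in for recognizer output of files A and B.
transcript_a = ("Good evening everyone. Tonight I want to talk about "
                "the future of digital audio and why it matters.")
transcript_b = "the future of digital audio"

coverage = clip_coverage(transcript_a, transcript_b)
unrelated = clip_coverage(transcript_a, "my cat likes fish")
```

A coverage near 1.0 suggests B's words appear in A; a real recognizer makes mistakes, so some threshold below 1.0 would probably be needed.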

And, the system you've described could be converted into a speech recognition system.... If one file contains a "database" of known words and the other file contains a series of unknown words, whenever you get a match, you've "recognized" the word.

P.S.
Again, this is just a "starting point", but you might want to look into spectrograms. A spectrogram is a visual representation of the frequency content over time (usually done with FFT). For your application, you don't need to "see" the sound, but it might help in your pursuit. And, here is a FREE spectrogram program, so maybe you can visually compare your files.
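For anyone who wants to compute a spectrogram rather than look at one, here is a minimal sketch. It assumes the audio has already been decoded to a plain list of samples, and it uses a slow, naive DFT so that it needs nothing beyond the standard library; a real program would use an FFT library instead. The frame size, hop size, and test tone are arbitrary choices for illustration.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Magnitude spectrogram: one list of frequency-bin magnitudes per time frame."""
    # Hann window to reduce leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        mags = []
        # Naive DFT over the non-negative frequency bins only.
        for k in range(frame_size // 2 + 1):
            s = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                    for n in range(frame_size))
            mags.append(abs(s))
        frames.append(mags)
    return frames


# Synthetic test signal: a 1000 Hz sine sampled at 8000 Hz.
rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(512)]

spec = spectrogram(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * rate / 64  # bin index -> frequency in Hz
```

Each row of `spec` is one slice in time and each column is a frequency band, which is the same grid a spectrogram program draws as an image.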
Woodinville
What you are describing breaks down to multiuser, unlimited-vocabulary speech recognition.

Nobody's done it yet.