Speech Activity Detection- Adhoc Problem with multi source sounds(bird

2013-09-30 10:40:56

First, let me describe my application.

> I am sampling at 8 kHz and band-pass filtering (digitally) to isolate the speech components. That leaves the 0-4 kHz band, and since content below 250 Hz is essentially not speech, I remove that as well.
> I then compute the entropy of each frame and mark the frame as voice or non-voice, considering only the band 250 Hz <= f <= 3750 Hz.
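
For readers following along, the per-frame entropy step described above could look roughly like the sketch below. The post does not say which entropy is used, so this assumes normalised spectral entropy computed with NumPy; the function name `band_spectral_entropy`, the Hann window, and the frame length are illustrative choices, not the poster's actual code.

```python
import numpy as np

FS = 8000                     # sampling rate from the post (Hz)
F_LO, F_HI = 250.0, 3750.0    # band treated as speech in the post (Hz)

def band_spectral_entropy(frame, fs=FS, f_lo=F_LO, f_hi=F_HI):
    """Normalised spectral entropy (0..1) of one frame inside [f_lo, f_hi].

    Values near 0 mean the energy sits in a few bins (tonal sound);
    values near 1 mean the energy is spread evenly (noise-like sound).
    """
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hanning(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    band = power[(freqs >= f_lo) & (freqs <= f_hi)]
    if len(band) < 2:
        raise ValueError("frame too short to resolve the band")
    total = band.sum()
    if total <= 0.0:
        return 1.0  # assumption: a silent frame is treated as maximally flat
    p = band / total
    p = p[p > 0]  # skip empty bins so log2 stays finite
    entropy = -np.sum(p * np.log2(p))
    return float(entropy / np.log2(len(band)))
```

Note that a bell or a whistling bird is also tonal, so it scores low on this measure just as voiced speech does, which is consistent with the false alarms described in this post.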

The problem is that in my real-time application there are also non-human sounds, such as bird sounds or a ringing bell, which fall into the same frequency band that I am flagging as speech.

> This causes false speech alarms for bird sounds and bell ringing. How can I overcome this problem?

What else can I try for effective speech activity detection?

Isn't this the phenomenon of aliasing? My capture device has strong anti-aliasing filters, but bird sounds at different frequencies still cannot be eliminated. I think you understand the situation: this is simply speech activity detection in a multi-source environment. It seems to be a common problem for researchers, but I don't know how to solve it.

Please suggest a frame-wise solution rather than any training-based detection, which would be too much overhead in my application.

Reply #1 – 2013-09-30 12:05:43

Unfortunately, much existing work is driven by mobile telephony, but there it is different, since transmitting the sound of a bell ringing may well be desirable, and in any case, false triggers are not catastrophic, resulting only in slightly less optimal use of the battery.

Have a look at cepstral techniques (detecting harmonic patterns/levels that are typical of speech). An important factor is the allowed delay: higher delay allows more audio to be analysed in the voice detector, and so should increase reliability (but high delay can't be used in telephony).
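
As a rough illustration of the cepstral idea mentioned above: voiced speech has harmonics spaced at the pitch frequency, which shows up as a peak in the real cepstrum at the pitch period. The sketch below assumes NumPy, an 8 kHz sampling rate, and a 60-400 Hz pitch search range; `cepstral_pitch_peak` and its parameters are hypothetical names chosen for this example, and this is not the detector used by any particular tool.

```python
import numpy as np

def cepstral_pitch_peak(frame, fs=8000, f0_min=60.0, f0_max=400.0):
    """Return (estimated pitch in Hz, cepstral peak height) for one frame.

    The real cepstrum is the inverse FFT of the log magnitude spectrum.
    A strong peak in the pitch range suggests a harmonic (voiced) sound.
    """
    frame = np.asarray(frame, dtype=float)
    q_lo = int(fs / f0_max)   # shortest pitch period, in samples
    q_hi = int(fs / f0_min)   # longest pitch period, in samples
    if len(frame) <= 2 * q_hi:
        raise ValueError("frame must be longer than two of the longest pitch periods")
    windowed = frame * np.hamming(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    log_spec = np.log(spectrum + 1e-12)   # small floor avoids log(0)
    cepstrum = np.fft.irfft(log_spec)
    region = cepstrum[q_lo:q_hi + 1]
    peak_idx = int(np.argmax(region)) + q_lo
    return fs / peak_idx, float(cepstrum[peak_idx])
```

With these defaults the frame needs more than 266 samples, so 512 samples (64 ms at 8 kHz) is a convenient size; that is one concrete form of the delay trade-off described above. A peak alone does not prove the source is human, since many birds and sirens are harmonic too.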

Reply #2 – 2013-09-30 13:07:38

Quote from: bandpass on 2013-09-30 12:05:43

Unfortunately, much existing work is driven by mobile telephony, but there it is different, since transmitting the sound of a bell ringing may well be desirable, and in any case, false triggers are not catastrophic, resulting only in slightly less optimal use of the battery.

Have a look at cepstral techniques (detecting harmonic patterns/levels that are typical of speech). An important factor is the allowed delay: higher delay allows more audio to be analysed in the voice detector, and so should increase reliability (but high delay can't be used in telephony).

Thank you for the quick reply; you have clearly understood my scenario. My application area has many non-speech sounds, and among them I have to pick out only the speech frames (we will think about one source masking another later).

Cepstral, harmonic, or formant features are also exhibited by many bird sounds, as well as by bells and sirens.

You also raised an important parameter, delay, which matters a lot in my case because I encode the detected frames and transmit them over a wireless network. Isn't that cool? It is tiresome to implement, but it has to be achieved.

Reply #3 – 2013-09-30 21:34:59

Quote from: shyam.sunder91 on 2013-09-30 13:07:38

Cepstral, harmonic, or formant features are also exhibited by many bird sounds, as well as by bells and sirens.

I didn't say it was going to be easy
Yes, but they'll be different to those of voices: you have to make measurements, and characterise the sounds you want/don't-want to trigger on.

The only implementation I know of is the one in sox: the 'vad' effect, which triggers on a particular band of cepstral energy and has a bunch of parameters that you can play with. It might even be good enough for your needs, but if not, you could perhaps use it as a framework on which to make more specific measurements.

Also, try to mic as close as possible or use a directional mic — pushing up the SNR can save a lot of signal processing effort!

Reply #4 – 2013-10-01 05:04:16

Quote from: bandpass on 2013-09-30 21:34:59

you have to make measurements, and characterize the sounds you want/don't-want to trigger on.

What kind of measurements? Has anyone here done that?

Quote

Also, try to mic as close as possible or use a directional mic — pushing up the SNR can save a lot of signal processing effort!

LOL, in my application the speaker will be at least 10 m from the microphone, in an environment with an SNR below 0 dB. My microphone is also highly sensitive (-48 dBV), which gives me severe noise problems too. Anyway, I don't want to sidetrack my own topic, so back to SAD.

Reply #5 – 2013-10-02 04:57:13

Are there any audio experts here who could answer my query? Or can anyone suggest another forum where I can expect answers from experts (other than dsp.stackexchange)?

Reply #6 – 2013-10-02 12:14:15

In case you were running voice recognition immediately behind it, you could as well pass the non-voice sounds through and "train" the recognizer to "know" some typical non-voice sounds (would be a part of vocabulary but would have a special non-voice flag associated with them).

Without running some kind of feature extraction/analysis/pattern matching, you can't really tell voice and non-voice apart.

A simple way (maybe too simple/academic/naive) to tell between activity/non-activity is to have an adaptive energy-based threshold. You average out the signal energy over a long period of time and compare immediate waveform energy to the long-term energy average. You can then treat e.g. anything which has immediate energy 30%+ higher than the long-term average as "voice" activity and include some hysteresis/delay when detecting the start/end of activity (to avoid clicks from triggering it and to avoid premature "ends" in pauses between words). This involves a lot of experimental/empirical constants in the algorithm and likely considerable delay (introduced with hysteresis - even in reality, you don't immediately recognize that someone is talking to you until you have heard a whole word or at least a part of a word and then you have to look back/recall to recognize the word).
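
A minimal sketch of the adaptive energy threshold with hysteresis described above, in plain Python. The 1.3 ratio comes from the "30%+" figure in the post; the smoothing factor `alpha`, the `attack` and `hangover` frame counts, and the function name `energy_vad` are assumed values picked for illustration and would need tuning on real recordings.

```python
def energy_vad(frame_energies, ratio=1.3, alpha=0.99, attack=3, hangover=10):
    """Return one bool per frame: True while activity is detected.

    frame_energies: iterable of per-frame energies (e.g. mean of squared samples).
    ratio:    a frame is "loud" if its energy exceeds ratio * long-term average.
    alpha:    smoothing factor of the long-term average (closer to 1 = slower).
    attack:   consecutive loud frames needed to switch on (rejects clicks).
    hangover: consecutive quiet frames needed to switch off (bridges pauses).
    """
    decisions = []
    long_term = None
    above = below = 0
    active = False
    for energy in frame_energies:
        if long_term is None:
            long_term = energy  # initialise the average from the first frame
        loud = energy > ratio * long_term
        if loud:
            above += 1
            below = 0
        else:
            below += 1
            above = 0
        if not active and above >= attack:
            active = True
        elif active and below >= hangover:
            active = False
        # Exponential moving average stands in for the "long period" average.
        long_term = alpha * long_term + (1.0 - alpha) * energy
        decisions.append(active)
    return decisions
```

Because the average is updated on every frame, a long stretch of loud sound slowly raises the threshold; freezing the update while active is a common variation. Either way this only separates "something is happening" from background, not speech from birds.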

Reply #7 – 2013-10-03 04:54:06

Quote from: Martel on 2013-10-02 12:14:15

Quote

Without running some kind of feature extraction/analysis/pattern matching, you can't really tell voice and non-voice apart.

Yes, this is the important point: which speech parameter, once extracted, can really differentiate speech from non-speech? I recently recorded the sounds of crickets, which are quite common in our expected environment, and they too have the same formants as human speech. WTH.

Quote

A simple way (maybe too simple/academic/naive) to tell between activity/non-activity is to have an adaptive energy-based threshold.

This has been done and did not yield proper results in our very noisy environment, typically <0 dB SNR.

I am really vexed and cannot find a prominent feature of speech that meets my need. Serious help needed.

Reply #8 – 2013-10-03 08:08:18

There are techniques for speaker identification:

http://en.wikipedia.org/wiki/Speaker_recognition
(the article should have some useful links)

I have done only voice (content) recognition during college, not speaker recognition (who is speaking). I assume the latter would be a simpler form of the former (not requiring a vocabulary).
You could train the recognizer to treat crickets as a specific speaker and possibly so for other sources of noise. It could be highly accurate in case you trained it for a limited set of human speakers (e.g. your family plus the common sources of noise).

The general problem is that some (unknown) person's voice could have features which are closer to that of a cricket (or some other noise) than those of any other speaker used to train the recognizer (though it seems unlikely to me).

I learned that the most valuable asset in voice recognition is not the algorithm itself (the techniques are well-known - cepstrum, markov chains, dynamic time warping etc.). It's the voice sample bank used to train the recognizer (you need sufficiently clean samples, wide variety/coverage of speakers and utterances etc.).

Reply #9 – 2013-10-03 08:24:58

Quote

You could train the recognizer to treat crickets as a specific speaker and possibly so for other sources of noise. It could be highly accurate in case you trained it for a limited set of human speakers (e.g. your family plus the common sources of noise).

The general problem is that some (unknown) person's voice could have features which are closer to that of a cricket

It's the voice sample bank used to train the recognizer (you need sufficiently clean samples, wide variety/coverage of speakers and utterances etc.).

Essentially, speaker recognition is not the goal here; it is "human recognition", or "discriminating humans from all other living beings through speech".

I think you get my goal: I will place my system in a jungle, and it has to detect humans in the jungle, not animals.

Reply #10 – 2013-10-09 07:13:54

Discriminating speech from animal sounds: if you take a 150 ms frame of speech and one of a bird, as shown below, what feature can I extract from them to effectively state the difference?

The first is speech, the second is an animal (bird).

Zero crossings? The cross-correlation of successive frames seems high in the bird sound, doesn't it? Is there any other feature you would suggest?
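
The two candidate features named above can be measured per frame roughly as follows. This is a sketch assuming NumPy; `zero_crossing_rate` and `max_normalised_xcorr` are hypothetical helper names, and taking the maximum over all lags is one possible reading of "cross correlation of successive frames". Whether either measure actually separates speech from birds has to be checked on real recordings.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (0..1)."""
    signs = np.signbit(np.asarray(frame, dtype=float))
    return float(np.mean(signs[1:] != signs[:-1]))

def max_normalised_xcorr(prev_frame, cur_frame):
    """Largest normalised cross-correlation between two frames over all lags.

    Values near 1 mean the second frame repeats the first almost exactly,
    as a steady tone or repetitive call would; unrelated frames give values
    near 0. Speech changes from frame to frame, so it is expected to land
    somewhere in between (an expectation, not a measured result).
    """
    a = np.asarray(prev_frame, dtype=float)
    b = np.asarray(cur_frame, dtype=float)
    a = a - a.mean()   # remove DC so an offset does not inflate the score
    b = b - b.mean()
    denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
    if denom == 0.0:
        return 0.0     # at least one frame is silent or constant
    return float(np.max(np.correlate(a, b, mode="full")) / denom)
```

At 8 kHz a 150 ms frame is 1200 samples, so the full cross-correlation costs on the order of a million multiply-adds per frame pair; restricting the lag range would cut that down on a constrained device.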

Reply #11 – 2013-10-09 08:15:47

Try thinking about why you (personally) know that it's a bird and not a human (and can you really tell for sure 100% of the time?). Are you doing zero-crossings or cross-correlation analysis of the signal? Would you be able to tell that it's not somebody talking in case you didn't already know it was a bird (from a previous experience/training)?

Based on my (limited) experience, I don't think you can achieve satisfying results while avoiding some form of time/frequency analysis and algorithm "learning/teaching" process.

How many human voices, birds and other sounds have you heard/learned during your life?

Reply #12 – 2013-10-09 09:51:44

Quote from: Martel on 2013-10-09 08:15:47

Try thinking about why you (personally) know that it's a bird and not a human (and can you really tell for sure 100% of the time?). Are you doing zero-crossings or cross-correlation analysis of the signal? Would you be able to tell that it's not somebody talking in case you didn't already know it was a bird (from a previous experience/training)?

Based on my (limited) experience, I don't think you can achieve satisfying results while avoiding some form of time/frequency analysis and algorithm "learning/teaching" process.

How many human voices, birds and other sounds have you heard/learned during your life?

Can't there be a parameter that differentiates them without training? I am poor, i.e., my system is poor, so it cannot afford high memory and processing requirements.

Reply #13 – 2013-10-09 10:03:53

One more example: in the figure above, black is speech, yellow is a siren, red is a bird, saffron is heavy wind, and the spikes are footsteps.

The figure above shows the difference in formant positions: in bird sounds the formants are spaced differently and there are fewer of them.

Can anything be extracted from this analysis?

Reply #14 – 2013-10-09 12:19:56

Musepack's encoder has a feature called Clear Voice Detection (CVD) in its psychoacoustical model. Maybe you could have a look at the source code and see what/how it does.

Reply #15 – 2013-10-14 16:55:38

Quote from: Martel on 2013-10-09 12:19:56

Musepack's encoder has a feature called Clear Voice Detection (CVD) in its psychoacoustical model. Maybe you could have a look at the source code and see what/how it does.

No, it won't solve my problem of speech-only detection. Hmm, what else could I do? I am really vexed with this these days because of the limited memory and processing power.
