Speech Activity Detection - ad hoc problem with multi-source sounds (birds, bells)
shyam.sunder91
post Sep 30 2013, 10:40
Post #1

Here I will first lay out my application agenda:

> I am sampling at an 8 kHz rate and filtering out the speech components with a digital band-pass filter. That gives me the 0-4 kHz components, of which essentially everything below 250 Hz is not speech, so I eliminate that range as well.
> I then compute the entropy of each frame and mark frames as voice or non-voice according to the content in the band 250 <= f <= 3750 Hz (a rough sketch of what I mean is below).
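Roughly, the per-frame flow is something like this (a minimal Python/NumPy sketch of the idea, not my actual code; the 150 ms frame length is the one mentioned later in this thread, and the entropy threshold is only a placeholder that would need tuning):

CODE
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 8000            # sampling rate (Hz)
FRAME = 1200         # 150 ms frames at 8 kHz

def bandlimit(x, lo=250.0, hi=3750.0, fs=FS):
    """Keep only the 250-3750 Hz band with a digital band-pass filter."""
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

def spectral_entropy(frame):
    """Entropy of the normalised magnitude spectrum of one frame."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    p = spec / (np.sum(spec) + 1e-12)
    return float(-np.sum(p * np.log2(p + 1e-12)))

def flag_frames(x, threshold=4.0):
    """Mark each 150 ms frame as voice (True) / non-voice (False).
    The threshold is a placeholder and has to be tuned on real data."""
    x = bandlimit(x)
    flags = []
    for start in range(0, len(x) - FRAME + 1, FRAME):
        h = spectral_entropy(x[start:start + FRAME])
        flags.append(h < threshold)   # voiced speech tends to have lower spectral entropy
    return flags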

Now, the problem: on the real-time application side there are also non-human sounds, like bird calls or the ringing of a bell, which again fall into the frequency band I am flagging as speech.

> So a bird sound or a ringing bell causes a false speech alarm. How can I really overcome this problem?

What else can I try for effective speech activity detection?

Isn't this a phenomenon like aliasing? Even though my capture device has good anti-aliasing filters, bird sounds with different frequencies cannot be eliminated. I think you understand the situation: this is nothing more than speech activity detection in a multi-source environment, which seems to be a common problem for researchers, but I don't know how to solve it.

Please suggest a frame-wise solution rather than any training-based detection, which is too much overhead in my application.
bandpass
post Sep 30 2013, 12:05
Post #2

Unfortunately, much existing work is driven by mobile telephony, but there it is different, since transmitting the sound of a bell ringing may well be desirable, and in any case, false triggers are not catastrophic, resulting only in slightly less optimal use of the battery.

Have a look at cepstral techniques (detecting harmonic patterns/levels that are typical of speech). An important factor is the allowed delay: higher delay allows more audio to be analysed in the voice detector, and so should increase reliability (but high delay can't be used in telephony).
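As a rough illustration of the idea (a Python/NumPy sketch, not from any particular VAD implementation; the pitch range and peak-ratio threshold are just plausible starting values), one could check each frame for a strong cepstral peak in the typical human pitch range:

CODE
import numpy as np

def has_speechlike_pitch(frame, fs=8000, f_lo=60.0, f_hi=400.0, peak_ratio=4.0):
    """Very rough cepstral voicing check for one frame.
    Looks for a dominant peak in the real cepstrum at a quefrency that
    corresponds to a typical human pitch (60-400 Hz). peak_ratio is an
    arbitrary threshold that would need tuning on real recordings."""
    windowed = frame * np.hanning(len(frame))
    log_spec = np.log(np.abs(np.fft.rfft(windowed)) + 1e-12)
    cepstrum = np.abs(np.fft.irfft(log_spec))
    q_lo = int(fs / f_hi)    # smallest quefrency of interest (highest pitch)
    q_hi = int(fs / f_lo)    # largest quefrency of interest (lowest pitch)
    region = cepstrum[q_lo:q_hi]
    if region.size == 0:
        return False
    # A clear harmonic (voiced) structure shows up as a peak well above the local average.
    return bool(region.max() > peak_ratio * region.mean())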
shyam.sunder91
post Sep 30 2013, 13:07
Post #3

QUOTE (bandpass @ Sep 30 2013, 16:35) *
Unfortunately, much existing work is driven by mobile telephony, but there it is different, since transmitting the sound of a bell ringing may well be desirable, and in any case, false triggers are not catastrophic, resulting only in slightly less optimal use of the battery.

Have a look at cepstral techniques (detecting harmonic patterns/levels that are typical of speech). An important factor is the allowed delay: higher delay allows more audio to be analysed in the voice detector, and so should increase reliability (but high delay can't be used in telephony).



Thank you for your immediate reply; you have clearly understood my scenario. My application area has mostly non-speech sounds, among which I have to pick out only the speech frames (we will think about the masking of one source over another later :) ).

Cepstral, harmonic and formant parameters are also exhibited by many bird, bell and siren sounds.

You also raised an important parameter, delay, which is very important in my case as I usually encode the detected frames and transmit them over a wireless network :) Isn't it cool? :) But it's tiresome to implement :( and yet it has to be achieved.
bandpass
post Sep 30 2013, 21:34
Post #4

QUOTE (shyam.sunder91 @ Sep 30 2013, 13:07) *
Cepstral, harmonic and formant parameters are also exhibited by many bird, bell and siren sounds.

I didn't say it was going to be easy :)
Yes, but they'll be different to those of voices: you have to make measurements, and characterise the sounds you want/don't-want to trigger on.

The only implementation I know of is the one in SoX: the 'vad' effect, which triggers on a particular band of cepstral energy and has a bunch of parameters that you can play with. It might even be good enough for your needs, but if not, you could perhaps use it as a framework on which to make more specific measurements.
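For instance, a basic invocation from Python could look like this (assuming the sox binary is installed and on the PATH; the file names are placeholders, and running the effect again around 'reverse' is just one way to trim from both ends, since 'vad' itself only trims from the front):

CODE
import subprocess

# Trim leading non-speech with SoX's 'vad' effect (default parameters);
# 'vad' only trims from the front, so reverse/vad/reverse also trims the end.
# Assumes the 'sox' binary is installed; file names here are placeholders.
subprocess.run(
    ["sox", "noisy_input.wav", "trimmed_output.wav",
     "vad", "reverse", "vad", "reverse"],
    check=True,
)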

Also, try to mic as close as possible or use a directional mic: pushing up the SNR can save a lot of signal-processing effort!
shyam.sunder91
post Oct 1 2013, 05:04
Post #5

QUOTE (bandpass @ Oct 1 2013, 02:04) *
you have to make measurements, and characterize the sounds you want/don't-want to trigger on.

What kind of measurements? Is anyone present here who has done that?

QUOTE
Also, try to mic as close as possible or use a directional mic: pushing up the SNR can save a lot of signal-processing effort!


LOL, in my application the speaker will be at least 10 m from the microphone, and it's a <0 dB SNR kind of environment. Anyhow, my microphone is highly sensitive (-48 dBV) B-)
because of which I also face severe noise problems :'( OK, I don't want to sidetrack my own topic; coming back to SAD :(

This post has been edited by shyam.sunder91: Oct 1 2013, 05:15
shyam.sunder91
post Oct 2 2013, 04:57
Post #6

Are there any audio experts here who could answer my query, or can anyone suggest another forum where I can expect answers from experts (other than dsp.stackexchange)?
Martel
post Oct 2 2013, 12:14
Post #7

If you were running voice recognition immediately behind it, you could just as well pass the non-voice sounds through and "train" the recognizer to "know" some typical non-voice sounds (they would be part of the vocabulary but would have a special non-voice flag associated with them).

Without running some kind of feature extraction/analysis/pattern matching, you can't really tell voice and non-voice apart.

A simple way (maybe too simple/academic/naive) to tell between activity/non-activity is to have an adaptive energy-based threshold. You average out the signal energy over a long period of time and compare the immediate waveform energy to that long-term average. You can then treat, e.g., anything whose immediate energy is 30%+ higher than the long-term average as "voice" activity, and include some hysteresis/delay when detecting the start/end of activity (to avoid clicks from triggering it and to avoid premature "ends" in pauses between words). This involves a lot of experimental/empirical constants in the algorithm and likely considerable delay introduced by the hysteresis; even in reality, you don't immediately recognize that someone is talking to you until you have heard a whole word, or at least part of a word, and then you have to look back/recall to recognize it. A rough sketch of this idea is below.
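A minimal sketch of that idea (Python/NumPy; the 30% margin, the averaging constant and the hang-over length are exactly the kind of empirical constants mentioned above and would need tuning):

CODE
import numpy as np

def energy_vad(x, fs=8000, frame_ms=20, margin=1.3,
               long_term_alpha=0.999, hang_frames=15):
    """Adaptive energy-threshold activity detection with simple hysteresis.
    A frame is 'active' if its energy exceeds the slowly adapting long-term
    average by `margin` (e.g. 30%); activity is then held for `hang_frames`
    frames to bridge short pauses between words. All constants are
    empirical placeholders."""
    frame_len = int(fs * frame_ms / 1000)
    long_term = None
    hang = 0
    flags = []
    for start in range(0, len(x) - frame_len + 1, frame_len):
        e = float(np.mean(x[start:start + frame_len] ** 2))
        if long_term is None:
            long_term = e
        active = e > margin * long_term
        if active:
            hang = hang_frames                 # (re)start the hang-over period
        else:
            hang = max(0, hang - 1)
        flags.append(active or hang > 0)
        # Adapt the long-term average mainly during inactivity so that
        # sustained speech does not pull the threshold up.
        if not active:
            long_term = long_term_alpha * long_term + (1 - long_term_alpha) * e
    return flags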

This post has been edited by Martel: Oct 2 2013, 12:26


shyam.sunder91
post Oct 3 2013, 04:54
Post #8

QUOTE (Martel @ Oct 2 2013, 16:44) *
Without running some kind of feature extraction/analysis/pattern matching, you can't really tell voice and non-voice apart.


Yes, here is the important point: extraction of which speech parameter can really differentiate speech from non-speech? You know, recently I recorded the sounds of crickets, which are quite common in our expected environment, and they too have the same formants as human speech. WTH
QUOTE
A simple way (maybe too simple/academic/naive) to tell between activity/non-activity is to have an adaptive energy-based threshold.


This has been done and did not yield proper results in our highly noisy environment, typically <0 dB SNR.

I'm really vexed and not able to find a prominent feature of speech that could meet my need. Serious help needed :)
Martel
post Oct 3 2013, 08:08
Post #9

There are techniques for speaker identification:

http://en.wikipedia.org/wiki/Speaker_recognition
(the article should have some useful links)

I have done only voice (content) recognition during college, not speaker recognition (who is speaking). I assume the latter would be a simpler form of the former (not requiring a vocabulary).
You could train the recognizer to treat crickets as a specific speaker and possibly so for other sources of noise. It could be highly accurate in case you trained it for a limited set of human speakers (e.g. your family plus the common sources of noise).

The general problem is that some (unknown) person's voice could have features which are closer to that of a cricket (or some other noise) than those of any other speaker used to train the recognizer (though it seems unlikely to me).

I learned that the most valuable asset in voice recognition is not the algorithm itself (the techniques are well known: cepstrum, Markov chains, dynamic time warping, etc.). It's the voice sample bank used to train the recognizer (you need sufficiently clean samples, wide variety/coverage of speakers and utterances etc.).


shyam.sunder91
post Oct 3 2013, 08:24
Post #10

QUOTE
You could train the recognizer to treat crickets as a specific speaker and possibly so for other sources of noise. It could be highly accurate in case you trained it for a limited set of human speakers (e.g. your family plus the common sources of noise).

The general problem is that some (unknown) person's voice could have features which are closer to that of a cricket

It's the voice sample bank used to train the recognizer (you need sufficiently clean samples, wide variety/coverage of speakers and utterances etc.).


Essentially, speaker recognition is not the agenda here; instead it is "human recognition", or "discriminating humans from all other living beings through speech".

I think you got my agenda: I will place my system in a jungle, and it has to detect humans in the jungle, not animals :)
shyam.sunder91
post Oct 9 2013, 07:13
Post #11

Discrimination of speech from animal sound: if you take a 150 ms frame of speech and of a bird call, as shown below, what feature can I extract from them so that I can effectively state the difference?

(1st plot is speech, 2nd is the animal/bird.)

Zero crossings? The cross-correlation of successive frames seems high in the bird sound, doesn't it? Any other feature from your side? (A rough sketch of these two measures follows.)
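Something like this is what I mean by those two measures (a Python/NumPy sketch; 1200 samples is 150 ms at my 8 kHz rate, and nothing here is tuned):

CODE
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs where the signal changes sign."""
    signs = np.sign(frame)
    return float(np.mean(signs[:-1] != signs[1:]))

def successive_frame_correlation(x, frame_len=1200):
    """Normalised correlation between consecutive 150 ms frames.
    The observation above is that this comes out higher for the bird
    call than for speech, which changes from phone to phone."""
    frames = [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, frame_len)]
    corrs = []
    for a, b in zip(frames[:-1], frames[1:]):
        a = a - a.mean()
        b = b - b.mean()
        denom = np.sqrt(np.sum(a * a) * np.sum(b * b)) + 1e-12
        corrs.append(float(np.dot(a, b) / denom))
    return corrs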
Martel
post Oct 9 2013, 08:15
Post #12

Try thinking about why you (personally) know that it's a bird and not a human (and can you really tell for sure 100% of the time?). Are you doing zero-crossings or cross-correlation analysis of the signal? Would you be able to tell that it's not somebody talking in case you didn't already know it was a bird (from a previous experience/training)?

Based on my (limited) experience, I don't think you can achieve satisfying results while avoiding some form of time/frequency analysis and an algorithm "learning/teaching" process.

How many human voices, birds and other sounds have you heard/learned during your life? :)


shyam.sunder91
post Oct 9 2013, 09:51
Post #13

QUOTE (Martel @ Oct 9 2013, 12:45) *
Try thinking about why you (personally) know that it's a bird and not a human (and can you really tell for sure 100% of the time?). Are you doing zero-crossings or cross-correlation analysis of the signal? Would you be able to tell that it's not somebody talking in case you didn't already know it was a bird (from a previous experience/training)?

Based on my (limited) experience, I don't think you can achieve satisfying results while avoiding some form of time/frequency analysis and an algorithm "learning/teaching" process.

How many human voices, birds and other sounds have you heard/learned during your life? :)



Can't there be a parameter which could differentiate them without training? :( I am poor, i.e., my system is poor, so it cannot afford high memory and processing requirements.
shyam.sunder91
post Oct 9 2013, 10:03
Post #14

One more example: in the above figure, black is speech, yellow is a siren, red is a bird, saffron is heavy wind, and the spikes are footsteps.

The figure shows the difference in formant positions: in bird sounds the formants are spaced differently and there are fewer of them.

Could anything be extracted from this analysis? (A rough formant-spacing sketch follows.)
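A rough way to compare formant count and spacing per frame without any training (a Python sketch using LPC; the model order, the 400 Hz bandwidth cut-off and the frequency limits are assumptions that would need adjusting):

CODE
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_coefficients(frame, order=10):
    """LPC polynomial via the autocorrelation method (order ~ fs/1000 + 2 at 8 kHz)."""
    frame = frame * np.hanning(len(frame))
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])
    return np.concatenate(([1.0], -a))

def formant_frequencies(frame, fs=8000, order=10):
    """Estimate formant frequencies (Hz) from the angles of LPC poles near the unit circle."""
    roots = np.roots(lpc_coefficients(frame, order))
    roots = roots[np.imag(roots) > 0]                  # one of each conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)
    bandwidths = -fs / np.pi * np.log(np.abs(roots))   # narrower = sharper resonance
    return sorted(f for f, b in zip(freqs, bandwidths) if 90.0 < f < fs / 2 and b < 400.0)

# Per frame, one could then compare the number of formants below 4 kHz and the
# spacing between neighbouring formants for the speech vs. bird recordings.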
Martel
post Oct 9 2013, 12:19
Post #15

Musepack's encoder has a feature called Clear Voice Detection (CVD) in its psychoacoustic model. Maybe you could have a look at the source code and see what it does and how.


shyam.sunder91
post Oct 14 2013, 16:55
Post #16

QUOTE (Martel @ Oct 9 2013, 16:49) *
Musepack's encoder has a feature called Clear Voice Detection (CVD) in its psychoacoustic model. Maybe you could have a look at the source code and see what it does and how.


No, it won't solve my problem of speech-only detection :( Hmm, what else could I do?... Really vexed with this these days because of the limited memory and processing power :|