Speech/voice recognition: Audio signal electrical properties
With the exponential growth of computing resources and capabilities, image recognition and speech recognition are no longer daunting tasks. Everything from the lowest-cost smartphone to the inexpensive notebooks sold today has enough capability at the edge to process voice and video. Even a basic system with just a microphone, an encoder, and a simple processor that sends audio and video data over a high-speed link to a cloud computer can perform speech and image recognition.
In this article, let's look at the properties of an audio speech signal that a computer has to work with in order to convert speech into its text equivalent.
Let's start with the dictionary meaning of the word ‘audio’. Audio is anything related to sound that people can hear, so speech, voice, and sound are all audio. The generally accepted audio frequency band is 20 Hz to 20,000 Hz, which is the band the human ear can sense. A sound in which we can't make anything out, and which is annoying, is noise, or more precisely audio noise. Why call it audio noise? Because noise also exists in other frequency bands.
An audio signal can be recorded, transmitted, and replayed. We human beings produce sounds using the vocal cords with the support of the lungs, tongue, mouth, nose, and throat. For hearing, however, there is only the ear, an organ dedicated to that purpose. The electromechanical equivalents of the human speech system and the ear are the electromagnetic speaker and the microphone.
A microphone converts an acoustic sound wave into an electrical signal, and a speaker converts an electrical signal back into an acoustic sound wave. The electrical signal can be transmitted and stored using various methods. To bring computers and digital systems into the picture, you need to convert the analog signal into a digital signal. A whole range of algorithms is then available to process the digitized audio data.
In this first article, we will touch upon the various electrical properties of an analog audio signal, which are important to understand before processing audio in the digital domain.
A cathode-ray oscilloscope (CRO) is one of the best ways to see an audio signal. Below is the amplitude versus time plot of the spoken English letter 'D' displayed on a CRO. (For all the CRO plots in this article, the time scale is 5 milliseconds per division and the amplitude scale is 1 volt per division.)
From the picture above you can make out that it is an alternating-current signal made up of multiple frequencies with different amplitudes. Another good way to plot the signal is a three-dimensional graph with time on the x-axis, frequency on the y-axis, and brightness as the third dimension indicating amplitude. Such a plot is called a spectrogram.
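To make the idea of a spectrogram concrete, here is a minimal sketch of how such a time versus frequency grid could be computed from digitized samples. It is not the instrument used for the plots in this article: the 8 kHz sample rate, the 64-sample frame, the Hann window, and the 1 kHz test tone are all assumptions chosen to keep the example small, and a real system would use an FFT library instead of this slow direct transform.

```python
import cmath
import math


def spectrogram(samples, frame_len, hop):
    """Return one magnitude spectrum per frame (rows = time, columns = frequency bins)."""
    # A Hann window tapers each frame so its edges do not add spurious frequencies.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):  # bins from 0 Hz up to half the sample rate
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            spectrum.append(abs(acc))  # magnitude plays the role of brightness
        frames.append(spectrum)
    return frames


# Assumed test signal: a 1 kHz tone sampled at 8 kHz (not a recording from this article).
SAMPLE_RATE = 8000
FRAME_LEN = 64
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(512)]
spec = spectrogram(tone, frame_len=FRAME_LEN, hop=32)

# Each bin is SAMPLE_RATE / FRAME_LEN = 125 Hz wide, so 1 kHz falls in bin 8.
loudest_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(loudest_bin * SAMPLE_RATE / FRAME_LEN)  # 1000.0
```

Each row of `spec` corresponds to one vertical slice of a spectrogram picture, and the largest value in a row marks the brightest frequency at that moment.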
Below is a spectrogram capture of the pronunciation of the English letter 'D'.
The closest Indian letter to the English letter D is ‘?’ in Hindi, ‘?’ in Tamil, ‘?’ in Telugu, and ‘?’ in Kannada. Below are the plots of amplitude versus time (CRO) and frequency versus time (spectrogram) for this Indian letter.
Below is the spectrogram of the pronunciation of the Indian letter ‘Da’.
Below is the CRO plot of amplitude versus time for the pronunciation of the letter ‘S’.
Below is a spectrogram for the pronunciation of the letter ‘S’.
Let’s look at each electrical parameter of an audio signal:
Amplitude: The overall average amplitude of an audio signal indicates how loud the voice/audio signal is. Using various electronic circuit design techniques, you can normalize the amplitude by either amplifying or attenuating the signal; what is important is that the amplification or attenuation is applied uniformly across all frequency bands. One more important point: if there is audio noise, or a mix of multiple audio sources, the decibel levels of the unwanted audio and noise should be significantly lower than that of the required audio. Various signal processing techniques (both digital and analog) can be used to suppress or even eliminate noise.
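In the digital domain, the same normalization can be sketched as a single gain applied to every sample. Because one number multiplies the whole signal, every frequency component is scaled by the same factor, which is the uniformity requirement described above. The RMS measure, the target level of 0.5, and the two-tone test signal are assumptions made for this illustration only.

```python
import math


def rms(samples):
    """Root-mean-square level, a common measure of average amplitude."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def normalize(samples, target_rms):
    """Scale every sample by the same gain so the result has the target RMS level."""
    level = rms(samples)
    if level == 0:
        return list(samples)  # pure silence: there is nothing to scale
    gain = target_rms / level
    return [x * gain for x in samples]


# Assumed input: a quiet two-tone signal (300 Hz + 2 kHz) sampled at 8 kHz.
SAMPLE_RATE = 8000
quiet = [0.05 * math.sin(2 * math.pi * 300 * n / SAMPLE_RATE)
         + 0.01 * math.sin(2 * math.pi * 2000 * n / SAMPLE_RATE)
         for n in range(800)]
louder = normalize(quiet, target_rms=0.5)

# The applied gain expressed in decibels.
gain_db = 20 * math.log10(rms(louder) / rms(quiet))
print(round(rms(louder), 3), round(gain_db, 1))
```

A gain below 1 attenuates instead of amplifying; the code is the same in both cases.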
Sound is measured on a logarithmic scale. The scale commonly used is the A-weighted decibel (dBA), which is based on the relative loudness perceived by the human ear. A level below 10 dBA is barely audible; a pin dropping is an example. When we describe silence as “pin drop” silence, we therefore mean a level well below 20 dBA. Normal conversational speech falls in the range of 50 to 70 dBA. To give a better idea of loudness, a rock band produces about 110 dBA, and a jet engine at takeoff produces about 140 dBA.
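The logarithmic relationship can be illustrated with a short sketch. The reference pressure of 20 micropascals and the constants of the A-weighting curve are the commonly published standard values; the article itself does not quote them, so treat them as assumptions of this example. The function names are made up for illustration.

```python
import math

REFERENCE_PRESSURE_PA = 20e-6  # 20 micropascals, the usual 0 dB reference for sound in air


def spl_db(pressure_rms_pa):
    """Sound pressure level in decibels relative to 20 micropascals."""
    return 20 * math.log10(pressure_rms_pa / REFERENCE_PRESSURE_PA)


def a_weighting_db(freq_hz):
    """A-weighting correction in dB at one frequency (commonly published curve)."""
    f2 = freq_hz ** 2
    ra = (12194.0 ** 2 * f2 ** 2) / (
        (f2 + 20.6 ** 2)
        * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    return 20 * math.log10(ra) + 2.0


# A pressure 1000 times the reference corresponds to about 60 dB (conversation level).
print(round(spl_db(0.02), 1))

# Going from 60 dB to 140 dB is a 10,000-fold increase in sound pressure.
pressure_ratio = 10 ** ((140 - 60) / 20)
print(pressure_ratio)

# The ear is less sensitive to low tones, so the A curve subtracts level at 100 Hz
# and leaves 1 kHz almost unchanged.
print(round(a_weighting_db(100), 1), round(a_weighting_db(1000), 1))
```

Every step of 20 dB is a tenfold change in sound pressure, which is why a jet engine and a whisper fit on the same scale.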
You can see in the spectrogram of the pronunciation of the letter ‘D’ in one of the pictures above that a number of frequencies in the range of 200 to 400 Hz have a higher amplitude than the frequencies in the range of 2 to 4 kHz. So, for a digital or any other electronic system to process an audio signal and identify it as the letter ‘D’, it has to match this pattern, either graphically or by other mathematical methods.
So amplitude is a key parameter for recognizing an audio signal.
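One possible mathematical method for checking such a pattern is to compare the power in the two bands directly. The sketch below uses the Goertzel recurrence, a standard way of measuring the power at a single frequency, and sums it over each band. The signal is a synthetic stand-in (a strong 300 Hz tone plus a weak 3 kHz tone at an assumed 16 kHz sample rate), not the recording of ‘D’ shown in this article, and the function names are hypothetical.

```python
import math


def goertzel_power(samples, sample_rate, freq_hz):
    """Power of the DFT bin nearest freq_hz, computed with the Goertzel recurrence."""
    n = len(samples)
    k = round(n * freq_hz / sample_rate)  # nearest bin index
    coeff = 2 * math.cos(2 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def band_power(samples, sample_rate, low_hz, high_hz):
    """Sum the power of every DFT bin between low_hz and high_hz (inclusive)."""
    n = len(samples)
    first = math.ceil(low_hz * n / sample_rate)
    last = math.floor(high_hz * n / sample_rate)
    return sum(goertzel_power(samples, sample_rate, k * sample_rate / n)
               for k in range(first, last + 1))


# Synthetic stand-in signal: amplitude 1.0 at 300 Hz, amplitude 0.2 at 3 kHz.
SAMPLE_RATE = 16000
signal = [1.0 * math.sin(2 * math.pi * 300 * n / SAMPLE_RATE)
          + 0.2 * math.sin(2 * math.pi * 3000 * n / SAMPLE_RATE)
          for n in range(1600)]

low = band_power(signal, SAMPLE_RATE, 200, 400)
high = band_power(signal, SAMPLE_RATE, 2000, 4000)
print(low > high, round(low / high, 1))
```

An amplitude ratio of 5 between the two tones shows up as a power ratio of about 25, since power grows with the square of amplitude.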
Frequency: Anything below 20 Hz or above 20 kHz is not audible and should be filtered out. Much of the voice falls in the frequency range of 300 Hz to 4000 Hz. The plain old telephone system worked in the frequency range of 300 Hz to 3400 Hz. The human voice consists of tones at multiple frequencies. You can clearly see from the pictures above that the spectrogram and CRO amplitude versus time plots for the English letters ‘D’ and ‘S’ are very different in terms of frequency bands. When two letters differ this much, you can imagine how complex it becomes for a string of words; the picture looks a lot fuzzier. Mathematically, we use the Fourier transform or a similar time-to-frequency transform to find the various sinusoidal frequencies that make up the signal. Modern digital electronics using DSPs can do this job of identifying frequencies.
So the frequency versus time plot is a central part of speech recognition systems.
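The Fourier step can be sketched in a few lines. This is a direct discrete Fourier transform written for readability; DSPs and libraries use the much faster FFT, which gives the same result. The 8 kHz sample rate and the two test tones (500 Hz and 1500 Hz) are assumptions for the illustration, picked so that both tones land exactly on a frequency bin.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitude of each DFT bin from 0 Hz up to half the sample rate."""
    n = len(samples)
    return [abs(sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]


def strongest_frequencies(samples, sample_rate, count):
    """Return the `count` bin frequencies with the largest magnitude, lowest first."""
    mags = dft_magnitudes(samples)
    top = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:count]
    return sorted(k * sample_rate / len(samples) for k in top)


# Assumed test signal: two sinusoids mixed together, sampled at 8 kHz.
SAMPLE_RATE = 8000
mixed = [1.0 * math.sin(2 * math.pi * 500 * n / SAMPLE_RATE)
         + 0.5 * math.sin(2 * math.pi * 1500 * n / SAMPLE_RATE)
         for n in range(256)]

print(strongest_frequencies(mixed, SAMPLE_RATE, 2))  # [500.0, 1500.0]
```

With 256 samples at 8 kHz each bin is 31.25 Hz wide, so real speech, whose tones rarely sit exactly on a bin, produces broader and fuzzier peaks than this clean example.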
Timing: Since speech is a string of words and letters, the time at which a particular frequency and amplitude occurs is another key attribute for speech recognition.
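A simple way to attach timing to a digitized signal is to cut it into short frames and record when the energy in a frame rises above a threshold. The sketch below is one illustrative approach, not a description of any particular recognizer; the 10 ms frame length, the threshold, and the synthetic "silence, tone, silence" signal are all assumptions.

```python
import math


def frame_energies(samples, sample_rate, frame_ms=10):
    """Return (start_time_seconds, mean_square_energy) for consecutive frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    result = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        result.append((start / sample_rate, energy))
    return result


def active_interval(samples, sample_rate, threshold, frame_ms=10):
    """Return (onset, offset) in seconds spanning the frames above the threshold."""
    active = [t for t, e in frame_energies(samples, sample_rate, frame_ms) if e > threshold]
    if not active:
        return None  # nothing louder than the threshold was found
    return active[0], active[-1] + frame_ms / 1000


# Synthetic signal: 50 ms of silence, 100 ms of a 400 Hz tone, 50 ms of silence.
SAMPLE_RATE = 8000
silence = [0.0] * 400
burst = [math.sin(2 * math.pi * 400 * n / SAMPLE_RATE) for n in range(800)]
utterance = silence + burst + silence

onset, offset = active_interval(utterance, SAMPLE_RATE, threshold=0.01)
print(round(onset, 3), round(offset, 3))
```

Combining these time stamps with the frequency content of each frame gives the amplitude, frequency, and timing information described above.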
Speech recognition looks complex but doable. The real challenge is accent: the computer should be able to recognize words that are pronounced slightly differently from the expected form, and that is a huge challenge. That’s where artificial intelligence tools come in.
In our next article, let’s look into digitizing the analog audio signal.
Author: Srinivasa Reddy N