To get to the answer it is necessary to understand the question.
What do we mean when we say voice/speech recognition?
Speech recognition is basically the task of identifying, in text form, what a speaker is saying. The utterance can be an isolated word, a sentence, or even a whole paragraph.
Let us think of a simple speech recognition system that can identify only isolated words, say the digits 0 to 9. We will call this a digit recognition system.
What could be a naive approach for a digit recognition system?
One simple approach would be to just compare waveforms, i.e., you store the waveforms of the digits 0 to 9 as templates, compare your test utterance's waveform with each stored template, and pick the stored waveform that is the closest match to your test waveform.
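Here is a minimal sketch of that naive template matching, just to make the idea concrete. Everything in it is an assumption for illustration: the "recordings" are synthetic sine tones (digit d is a tone at 300 + 100*d Hz), "closest" is taken to mean smallest Euclidean distance, and the names make_tone, waveform_distance and recognize are made up here.

```python
import math

def make_tone(freq_hz, n_samples=400, sample_rate=8000):
    """Synthetic stand-in for a recorded digit: a pure sine tone (assumption)."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

def waveform_distance(a, b):
    """Euclidean distance between two equal-length waveforms."""
    if len(a) != len(b):
        raise ValueError("naive comparison needs equal-length waveforms")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize(test, templates):
    """Return the label of the stored template closest to the test waveform."""
    return min(templates,
               key=lambda label: waveform_distance(test, templates[label]))

# Hypothetical templates: digit d is represented by a tone at 300 + 100*d Hz.
templates = {d: make_tone(300 + 100 * d) for d in range(10)}

# A test utterance identical to the stored template for 7 (1000 Hz).
test = make_tone(1000)
result = recognize(test, templates)
print(result)
```

Note that this only works here because the test waveform is an exact copy of a stored template, and because both have the same length.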
So there you are, you are done with the task. But wait, is it really that simple? As you may have guessed, the answer is NO.
So where's the catch?
We can't really compare the waveforms directly, can we? Waveforms will be different even when the same person utters the same word at different times. Just think of how you say the same word in different emotional states: sometimes you stretch the word and at other times you don't. Correct?
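To see how badly a little stretching hurts, here is a small sketch under stated assumptions: the "word" is a synthetic 440 Hz tone, the "different word" is an 880 Hz tone, and stretching is simulated by linear interpolation. The helper names (tone, stretch, distance) are made up for this illustration.

```python
import math

def tone(freq_hz, n_samples=800, sample_rate=8000):
    """Synthetic stand-in for a spoken word (assumption)."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

def stretch(signal, factor):
    """Time-stretch by linear interpolation; factor > 1 slows the utterance."""
    out = []
    for n in range(int(len(signal) * factor)):
        pos = n / factor              # position in the original signal
        i = int(pos)
        frac = pos - i
        nxt = signal[min(i + 1, len(signal) - 1)]
        out.append((1 - frac) * signal[i] + frac * nxt)
    return out

def distance(a, b):
    """Euclidean distance over the samples the two waveforms share."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

template = tone(440)
same_word_stretched = stretch(template, 1.25)[:len(template)]
other_word = tone(880)

d_same = distance(template, template)
d_stretched = distance(template, same_word_stretched)
d_other = distance(template, other_word)
print(d_same, d_stretched, d_other)
```

In this toy setup the stretched copy of the same "word" ends up roughly as far from its own template as a completely different "word" does, which is exactly why the raw comparison is unreliable.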
Abstract Speech Recognition System
Okay, so now we at least know that we can't compare waveforms directly. Hence we need to process the speech waveform and convert it into a form that is suitable for comparison. This is called feature extraction, i.e., we extract suitable features from the speech signal and then compare those features instead of the raw waveforms.
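As a rough sketch of what "extract features, then compare" can look like: the answer above does not say which features to use, so the two below (per-frame mean energy and zero-crossing rate) are simple stand-ins chosen for illustration, and the frame length of 200 samples is an arbitrary assumption. The two test signals are the same synthetic tone with opposite polarity, so their raw waveforms disagree at every sample.

```python
import math

def frames(signal, frame_len):
    """Split a signal into consecutive non-overlapping frames (tail dropped)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

def extract_features(signal, frame_len=200):
    """Per frame: (mean energy, zero-crossing rate). Illustrative features only."""
    feats = []
    for frame in frames(signal, frame_len):
        energy = sum(x * x for x in frame) / len(frame)
        crossings = sum(1 for a, b in zip(frame, frame[1:])
                        if (a < 0) != (b < 0))
        feats.append((energy, crossings / (len(frame) - 1)))
    return feats

def flatten(feats):
    """Turn a list of per-frame tuples into one flat feature vector."""
    return [value for frame in feats for value in frame]

def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def tone(phase, freq_hz=400, n_samples=800, sample_rate=8000):
    """Synthetic stand-in for an utterance (assumption)."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate + phase)
            for n in range(n_samples)]

first = tone(0.3)
second = tone(0.3 + math.pi)   # same tone, opposite polarity

wave_dist = distance(first, second)
feat_dist = distance(flatten(extract_features(first)),
                     flatten(extract_features(second)))
print(wave_dist, feat_dist)
```

The raw waveforms are far apart while the feature vectors are practically identical. This only demonstrates the principle on one easy kind of variation; it says nothing about how well these particular features would work on real speech.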