How speech recognition works

Modern general-purpose speech recognition systems are generally based on hidden Markov models (HMMs). An HMM is a statistical model which outputs a sequence of symbols or quantities. One possible reason why HMMs are used in speech recognition is that a speech signal can be viewed as a piecewise stationary or short-time stationary signal. That is, over a short time span in the range of 10 milliseconds, speech can be approximated as a stationary process. Speech can thus be thought of as a Markov model for many stochastic processes (known as states).

Another reason why HMMs are popular is that they can be trained automatically and are simple and computationally feasible to use. In speech recognition, to give the very simplest setup possible, the hidden Markov model would output a sequence of n-dimensional real-valued vectors, with n around, say, 13, outputting one of these every 10 milliseconds. The vectors, again in the very simplest case, would consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short-time window of speech, decorrelating the spectrum using a cosine transform, and then taking the first (most significant) coefficients. The hidden Markov model will tend to have, in each state, a statistical distribution called a mixture of diagonal covariance Gaussians, which gives a likelihood for each observed vector. Each word, or (for more general speech recognition systems) each phoneme, will have a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individual trained hidden Markov models for the separate words and phonemes.

The above is a very brief introduction to some of the more central aspects of speech recognition. Modern speech recognition systems use a host of standard techniques which it would be too time consuming to explain properly, but just to give a flavor, a typical large-vocabulary continuous system would probably have the following parts. It would need context dependency for the phones (so that phones with different left and right contexts have different realizations); to handle unseen contexts it would need tree clustering of the contexts. It would of course use cepstral normalization to normalize for different recording conditions, and depending on the length of time that the system had to adapt to different speakers and conditions, it might use cepstral mean and variance normalization for channel differences, vocal tract length normalization (VTLN) for male-female normalization, and maximum likelihood linear regression (MLLR) for more general speaker adaptation. The features would have delta and delta-delta coefficients to capture speech dynamics and in addition might use heteroscedastic linear discriminant analysis (HLDA); or the system might skip the delta and delta-delta coefficients and use LDA, followed perhaps by heteroscedastic linear discriminant analysis or a global semitied covariance transform (also known as maximum likelihood linear transform (MLLT)).

A serious company with a large amount of training data would probably want to consider discriminative training techniques like maximum mutual information (MMI), minimum phone error (MPE), or (for short utterances) minimum classification error (MCE). If a large amount of speaker-specific enrollment data was available, a more wholesale speaker adaptation could be done using maximum a posteriori (MAP) adaptation or, at least, tree-based maximum likelihood linear regression. Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path, but there is a choice between dynamically creating a combined hidden Markov model which includes both the acoustic and language model information, or combining them statically beforehand (the AT&T approach, for which their FSM toolkit might be useful). Those who value their sanity might consider the AT&T approach, but be warned that it is memory hungry.

Neural network-based speech recognition

Another approach in acoustic modeling is the use of neural networks. They are capable of solving much more complicated recognition tasks, but do not scale as well as HMMs when it comes to large vocabularies. Rather than being used in general-purpose speech recognition applications, they are suited to handling low-quality, noisy data and speaker independence. Such systems can achieve greater accuracy than HMM-based systems, as long as there is training data and the vocabulary is limited. A more general approach using neural networks is phoneme recognition. This is an active field of research, but generally the results are better than for HMMs. There are also NN-HMM hybrid systems that use the neural network part for phoneme recognition and the hidden Markov model part for language modeling.

Dynamic time warping (DTW)-based speech recognition

Dynamic time warping is an algorithm for measuring similarity between two sequences which may vary in time or speed. For instance, similarities in walking patterns would be detected even if in one video the person was walking slowly and in another they were walking more quickly, or even if there were accelerations and decelerations during the course of one observation. DTW has been applied to video, audio, and graphics; indeed, any data which can be turned into a linear representation can be analyzed with DTW.

A well-known application has been automatic speech recognition, to cope with different speaking speeds. In general, it is a method that allows a computer to find an optimal match between two given sequences (e.g. time series) with certain restrictions, i.e. the sequences are "warped" non-linearly to match each other. This sequence alignment method is often used in the context of hidden Markov models.

Knowledge-based speech recognition

This method uses a stored database of commands and compares simple spoken words with the entries in that database.
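
As a rough illustration of the dynamic time warping idea described above, the following minimal sketch computes an alignment cost between two one-dimensional sequences. The function name dtw_distance, the absolute-difference local cost, and the toy sequences are assumptions made for this example; a real recognizer would compare sequences of feature vectors (such as cepstral coefficients) and would usually add path constraints.

```python
import math


def dtw_distance(a, b):
    """Return the cumulative DTW alignment cost between two numeric sequences."""
    n, m = len(a), len(b)
    # cost[i][j] holds the cheapest cumulative cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local cost: absolute difference (an assumption for this sketch).
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments:
            # advance in a only, advance in b only, or advance in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# The same "shape" produced at half speed and at full speed.
slow = [0, 0, 1, 1, 2, 2, 3, 3]
fast = [0, 1, 2, 3]
# A different shape of the same length as `fast`.
other = [3, 2, 1, 0]

print(dtw_distance(slow, fast))   # 0.0: the warping absorbs the speed difference
print(dtw_distance(fast, other))  # a strictly positive cost
```

The slow and fast sequences align at zero cost because each repeated value in the slower sequence can be matched to the same value in the faster one, which is the sense in which the sequences are "warped" non-linearly to match each other.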