Latest inventions in natural language tools


How speech recognition works

Modern general-purpose speech recognition systems are generally based on
hidden Markov models (HMMs). This is a statistical model which outputs a
sequence of symbols or quantities.

One possible reason why HMMs are used in speech recognition is that a
speech signal could be viewed as a piece-wise stationary signal or a
short-time stationary signal. That is, one could assume that over a short
time, in the range of 10 milliseconds, speech could be approximated as a
stationary process. Speech could thus be thought of as a Markov model for
many stochastic processes (known as states).

Another reason why HMMs are popular is because they can be trained
automatically and are simple and computationally feasible to use. In
speech recognition, to give the very simplest setup possible, the hidden
Markov model would output a sequence of n-dimensional real-valued vectors
with n around, say, 13, outputting one of these every 10 milliseconds. The
vectors, again in the very simplest case, would consist of cepstral
coefficients, which are obtained by taking a Fourier transform of a
short-time window of speech and decorrelating the spectrum using a cosine
transform, then taking the first (most significant) coefficients. The
hidden Markov model will tend to have, in each state, a statistical
distribution called a mixture of diagonal covariance Gaussians which will
give a likelihood for each observed vector. Each word, or (for more
general speech recognition systems), each phoneme, will have a different
output distribution; a hidden Markov model for a sequence of words or
phonemes is made by concatenating the individual trained hidden Markov
models for the separate words and phonemes.

The above is a very brief introduction to some of the more central aspects
of speech recognition. Modern speech recognition systems use a host of
standard techniques which it would be too time consuming to properly
explain, but just to give a flavor, a typical large-vocabulary continuous
system would probably have the following parts. It would need context
dependency for the phones (so phones with different left and right context
have different realizations); to handle unseen contexts it would need tree
clustering of the contexts; it would of course use cepstral normalization
to normalize for different recording conditions and depending on the
length of time that the system had to adapt on different speakers and
conditions it might use cepstral mean and variance normalization for
channel differences, vocal tract length normalization (VTLN) for
male-female normalization and maximum likelihood linear regression (MLLR)
for more general speaker adaptation. The features would have delta and
delta-delta coefficients to capture speech dynamics and in addition might
use heteroscedastic linear discriminant analysis (HLDA); or might skip the
delta and delta-delta coefficients and use LDA followed perhaps by
heteroscedastic linear discriminant analysis or a global semitied
covariance transform (also known as maximum likelihood linear transform
(MLLT)). A serious company with a large amount of training data would
probably want to consider discriminative training techniques like maximum
mutual information (MMI), MPE, or (for short utterances) MCE, and if a
large amount of speaker-specific enrollment data was available a more
wholesale speaker adaptation could be done using MAP or, at least,
tree-based maximum likelihood linear regression.

Decoding of the speech (the term for what happens when the system is
presented with a new utterance and must compute the most likely source
sentence) would probably use the Viterbi algorithm to find the best path,
but there is a choice between dynamically creating combination hidden
Markov models that include both the acoustic and language model
information, or combining it statically beforehand (the AT&T approach, for
which their FSM toolkit might be useful). Those who value their sanity
might consider the AT&T approach, but be warned that it is memory hungry.

Neural network-based speech recognition

Another approach in acoustic modeling is the use of neural networks. They
are capable of solving much more complicated recognition tasks, but do not
scale as well as HMMs when it comes to large vocabularies. Rather than
being used in general-purpose speech recognition applications they can
handle low quality, noisy data and speaker independence. Such systems can
achieve greater accuracy than HMM-based systems, as long as there is
training data and the vocabulary is limited. A more general approach using
neural networks is phoneme recognition. This is an active field of
research, but generally the results are better than for HMMs. There are
also NN-HMM hybrid systems that use the neural network part for phoneme
recognition and the hidden Markov model part for language modeling.

Dynamic time warping (DTW)-based speech recognition

Dynamic time warping is an algorithm for measuring similarity between two
sequences which may vary in time or speed. For instance, similarities in
walking patterns would be detected, even if in one video the person was
walking slowly and if in another they were walking more quickly, or even
if there were accelerations and decelerations during the course of one
observation. DTW has been applied to video, audio, and graphics; indeed,
any data which can be turned into a linear representation can be analyzed
with DTW.

A well-known application has been automatic speech recognition, to cope
with different speaking speeds. In general, it is a method that allows a
computer to find an optimal match between two given sequences (e.g. time
series) with certain restrictions, i.e. the sequences are "warped"
non-linearly to match each other. This sequence alignment method is often
used in the context of hidden Markov models.

Knowledge-based speech recognition

This method uses a stored database of commands that compares simple words
with ones in the database.
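
As a rough illustration of the cepstral coefficients described under "How
speech recognition works", the sketch below follows the steps named there:
a Fourier transform of one short-time window, a cosine transform to
decorrelate the spectrum, and truncation to the first 13 coefficients. The
function name real_cepstrum, the Hamming window, and the logarithm applied
to the power spectrum are assumptions added for the example; production
front ends usually also apply a mel filterbank, which is omitted here.

```python
import cmath
import math


def real_cepstrum(frame, num_coeffs=13):
    """Return the first num_coeffs cepstral coefficients of one frame.

    frame is a list of samples from one short-time window of speech.
    Hypothetical helper for illustration, not a production front end.
    """
    n = len(frame)
    # Assumption: taper the frame with a Hamming window before the transform.
    windowed = [
        x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]
    # Naive O(n^2) discrete Fourier transform; keep the non-redundant half.
    log_spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(
            windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
            for t in range(n)
        )
        # Assumption: log of the power spectrum, with a small floor
        # so that log(0) can never occur.
        log_spectrum.append(math.log(abs(acc) ** 2 + 1e-10))
    # A type-II cosine transform decorrelates the log spectrum.
    m = len(log_spectrum)
    return [
        sum(
            log_spectrum[j] * math.cos(math.pi * c * (j + 0.5) / m)
            for j in range(m)
        )
        for c in range(num_coeffs)
    ]
```

Calling real_cepstrum once per 10 millisecond step on overlapping windows
would give the sequence of 13-dimensional vectors that the text describes
as the observations of the hidden Markov model.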

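The decoding step mentioned above relies on the Viterbi algorithm to find
the best path. A minimal sketch over log-probabilities follows; the
function name viterbi and the toy two-state model are assumptions made for
the example, and a real decoder would search a far larger graph that
combines acoustic and language model scores.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Return (best_state_path, best_log_score) for one utterance.

    log_init[s]     : log-probability of starting in state s
    log_trans[p][s] : log-probability of moving from state p to state s
    log_emit[t][s]  : log-likelihood of the frame at time t in state s
    """
    num_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(num_states)]
    backpointers = []
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for s in range(num_states):
            # Best predecessor for state s at time t.
            best_prev = max(
                range(num_states), key=lambda p: score[p] + log_trans[p][s]
            )
            pointers.append(best_prev)
            new_score.append(
                score[best_prev] + log_trans[best_prev][s] + log_emit[t][s]
            )
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state.
    last = max(range(num_states), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


# Toy model (hypothetical numbers): two "sticky" states, four frames.
log = math.log
init = [log(0.5), log(0.5)]
trans = [[log(0.9), log(0.1)], [log(0.1), log(0.9)]]
emit = [[log(0.8), log(0.2)]] * 2 + [[log(0.2), log(0.8)]] * 2
print(viterbi(init, trans, emit)[0])  # → [0, 0, 1, 1]
```

Working in log-probabilities turns products into sums, which avoids
numerical underflow on utterances with many frames.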

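The dynamic time warping idea described above can be sketched with the
classic dynamic-programming recurrence. This is an unconstrained version
on one-dimensional sequences with an absolute-difference local cost; the
function name dtw_distance is an assumption for the example, and practical
systems usually compare feature vectors and add restrictions on how far
the warping path may stray.

```python
def dtw_distance(a, b):
    """Return the DTW alignment cost between numeric sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: advance in a only, advance in b only,
            # or advance in both sequences together.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


# The same contour "spoken" at normal speed and at half speed.
fast = [0, 1, 2, 3, 2, 1, 0]
slow = [0, 0, 1, 1, 2, 2, 3, 3, 2, 2, 1, 1, 0, 0]
print(dtw_distance(fast, slow))  # → 0.0
```

Because each element may be matched to several elements of the other
sequence, the slow and fast versions align with zero cost, which is the
property that lets DTW cope with different speaking speeds.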
