Modern general-purpose speech recognition systems are generally based on hidden Markov models (HMMs). An HMM is a statistical model that outputs a sequence of symbols or quantities.

One possible reason why HMMs are used in speech recognition is that a speech signal can be viewed as a piecewise stationary signal or a short-time stationary signal. That is, over a short time span, on the order of 10 milliseconds, speech can be approximated as a stationary process. Speech can thus be thought of as a Markov model for many stochastic processes (known as states).

Another reason why HMMs are popular is that they can be trained automatically and are simple and computationally feasible to use. In speech recognition, to give the very simplest setup possible, the hidden Markov model would output a sequence of n-dimensional real-valued vectors, with n around, say, 13, outputting one of these every 10 milliseconds. The vectors, again in the very simplest case, would consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short-time window of speech, decorrelating the spectrum using a cosine transform, and then taking the first (most significant) coefficients. The hidden Markov model will tend to have, in each state, a statistical distribution called a mixture of diagonal covariance Gaussians, which gives a likelihood for each observed vector. Each word, or (for more general speech recognition systems) each phoneme, will have a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individual trained hidden Markov models for the separate words and phonemes.

The above is a very brief introduction to some of the more central aspects of speech recognition. Modern speech recognition systems use a host of standard techniques which it would be too time consuming to explain properly, but just to give a flavor, a typical large-vocabulary continuous system would probably have the following parts. It would need context dependency for the phones (so that phones with different left and right contexts have different realizations); to handle unseen contexts it would need tree clustering of the contexts. It would of course use cepstral normalization to normalize for different recording conditions, and, depending on the length of time that the system had to adapt to different speakers and conditions, it might use cepstral mean and variance normalization for channel differences, vocal tract length normalization (VTLN) for male-female normalization, and maximum likelihood linear regression (MLLR) for more general speaker adaptation. The features would have delta and delta-delta coefficients to capture speech dynamics and in addition might use heteroscedastic linear discriminant analysis (HLDA); or the system might skip the delta and delta-delta coefficients and use LDA followed perhaps by heteroscedastic linear discriminant analysis or a global semi-tied covariance transform (also known as a maximum likelihood linear transform, or MLLT).

A serious company with a large amount of training data would probably want to consider discriminative training techniques like maximum mutual information (MMI), minimum phone error (MPE), or (for short utterances) minimum classification error (MCE). If a large amount of speaker-specific enrollment data were available, a more wholesale speaker adaptation could be done using maximum a posteriori (MAP) estimation or, at least, tree-based maximum likelihood linear regression. Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path, but there is a choice between dynamically creating combination hidden Markov models which include both the acoustic and language model information, or combining them statically beforehand (the AT&T approach, for which their FSM toolkit might be useful). Those who value their sanity might consider the AT&T approach, but be warned that it is memory hungry.

Neural network-based speech recognition

Another approach in acoustic modeling is the use of neural networks. They are capable of solving much more complicated recognition tasks, but do not scale as well as HMMs when it comes to large vocabularies. They are not typically used in general-purpose speech recognition applications; instead, they are applied where low-quality, noisy data and speaker independence must be handled. Such systems can achieve greater accuracy than HMM-based systems, as long as training data is available and the vocabulary is limited. A more general approach using neural networks is phoneme recognition. This is an active field of research, but generally the results are better than for HMMs. There are also NN-HMM hybrid systems that use the neural network part for phoneme recognition and the hidden Markov model part for language modeling.

Dynamic time warping (DTW)-based speech recognition

Dynamic time warping is an algorithm for measuring similarity between two sequences which may vary in time or speed. For instance, similarities in walking patterns would be detected even if in one video the person was walking slowly and in another they were walking more quickly, or even if there were accelerations and decelerations during the course of one observation. DTW has been applied to video, audio, and graphics; indeed, any data which can be turned into a linear representation can be analyzed with DTW.

A well-known application has been automatic speech recognition, to cope with different speaking speeds. In general, it is a method that allows a computer to find an optimal match between two given sequences (e.g. time series) with certain restrictions, i.e. the sequences are "warped" non-linearly to match each other. This sequence alignment method is often used in the context of hidden Markov models.

Knowledge-based speech recognition

This method uses a stored database of commands and compares simple spoken words with the ones in the database.
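
As an illustration of the dynamic time warping idea described in the DTW section above, the following is a minimal sketch rather than a production implementation. It assumes one-dimensional numeric sequences, the absolute difference as the local cost, and the simplest step pattern (match, insertion, deletion) with no warping-window restriction; the function name `dtw_distance` is hypothetical and chosen only for this example. A real recognizer would compare sequences of feature vectors (such as cepstral coefficients) rather than single numbers.

```python
from math import inf


def dtw_distance(a, b):
    """Return the cumulative DTW cost of aligning sequences a and b.

    Assumptions (not from the text): scalar samples, absolute
    difference as local cost, unrestricted warping path.
    """
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal cumulative cost of aligning
    # the first i samples of a with the first j samples of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats its sample
                cost[i][j - 1],      # b advances, a repeats its sample
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


# The same "pattern" produced at two different speeds.
slow = [0, 0, 1, 1, 2, 2, 3, 3]
fast = [0, 1, 2, 3]
print(dtw_distance(slow, fast))  # prints 0.0
```

The two example sequences differ in length and speed, yet their DTW cost is zero because each sample of the fast sequence can be stretched over two samples of the slow one. A plain sample-by-sample comparison could not even be computed for sequences of different lengths, which is why this kind of alignment is useful for coping with different speaking speeds.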