Modern general-purpose speech recognition systems are generally based on hidden Markov models (HMMs). An HMM is a statistical model which outputs a sequence of symbols or quantities.

One possible reason why HMMs are used in speech recognition is that a speech signal can be viewed as a piecewise stationary signal or a short-time stationary signal. That is, over a short time span in the range of 10 milliseconds, speech can be approximated as a stationary process. Speech can thus be thought of as a Markov model for many stochastic processes (known as states).

Another reason why HMMs are popular is that they can be trained automatically and are simple and computationally feasible to use. In speech recognition, to give the very simplest setup possible, the hidden Markov model would output a sequence of n-dimensional real-valued vectors, with n around, say, 13, outputting one of these every 10 milliseconds. The vectors, again in the very simplest case, would consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short-time window of speech, decorrelating the log spectrum using a cosine transform, and then taking the first (most significant) coefficients. The hidden Markov model will tend to have, in each state, a statistical distribution called a mixture of diagonal covariance Gaussians, which will give a likelihood for each observed vector. Each word or (for more general speech recognition systems) each phoneme will have a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individual trained hidden Markov models for the separate words and phonemes.

The above is a very brief introduction to some of the more central aspects of speech recognition. Modern speech recognition systems use a host of standard techniques which it would be too time consuming to explain properly, but just to give a flavor, a typical large-vocabulary continuous system would probably have the following parts. It would need context dependency for the phones (so that phones with different left and right context have different realizations), and to handle unseen contexts it would need tree clustering of the contexts. It would of course use cepstral normalization to normalize for different recording conditions. Depending on the length of time that the system had to adapt to different speakers and conditions, it might use cepstral mean and variance normalization for channel differences, vocal tract length normalization (VTLN) for male-female normalization, and maximum likelihood linear regression (MLLR) for more general speaker adaptation. The features would have delta and delta-delta coefficients to capture speech dynamics and in addition might use heteroscedastic linear discriminant analysis (HLDA); or the system might skip the delta and delta-delta coefficients and use LDA followed perhaps by heteroscedastic linear discriminant analysis or a global semitied covariance transform (also known as maximum likelihood linear transform (MLLT)).

A serious company with a large amount of training data would probably want to consider discriminative training techniques like maximum mutual information (MMI), minimum phone error (MPE), or (for short utterances) minimum classification error (MCE). If a large amount of speaker-specific enrollment data were available, a more wholesale speaker adaptation could be done using maximum a posteriori (MAP) adaptation or, at least, tree-based maximum likelihood linear regression.

Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path. There is, however, a choice between dynamically creating a combination hidden Markov model which includes both the acoustic and language model information, or combining it statically beforehand (the AT&T approach, for which their FSM toolkit might be useful). Those who value their sanity might consider the AT&T approach, but be warned that it is memory hungry.

Neural network-based speech recognition

Another approach in acoustic modeling is the use of neural networks. They are capable of solving much more complicated recognition tasks, but do not scale as well as HMMs when it comes to large vocabularies. Rather than being used in general-purpose speech recognition applications, they are applied where low-quality, noisy data and speaker independence must be handled. Such systems can achieve greater accuracy than HMM-based systems, as long as there is training data and the vocabulary is limited. A more general approach using neural networks is phoneme recognition. This is an active field of research, but generally the results are better than for HMMs. There are also NN-HMM hybrid systems that use the neural network part for phoneme recognition and the hidden Markov model part for language modeling.

Dynamic time warping (DTW)-based speech recognition

Dynamic time warping is an algorithm for measuring similarity between two sequences which may vary in time or speed. For instance, similarities in walking patterns would be detected even if in one video the person was walking slowly and in another they were walking more quickly, or even if there were accelerations and decelerations during the course of one observation. DTW has been applied to video, audio, and graphics; indeed, any data which can be turned into a linear representation can be analyzed with DTW.

A well-known application has been automatic speech recognition, to cope with different speaking speeds. In general, it is a method that allows a computer to find an optimal match between two given sequences (e.g. time series) with certain restrictions, i.e. the sequences are "warped" non-linearly to match each other. This sequence alignment method is often used in the context of hidden Markov models.

Knowledge-based speech recognition

This method uses a stored database of commands and compares simple spoken words with the ones in the database.
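
To make the dynamic time warping idea described above more concrete, the following is a minimal sketch of the classic dynamic-programming formulation. It assumes one-dimensional numeric sequences, an absolute-difference local cost, and the common three-way step pattern with no warping-window constraint; the function name `dtw_distance` and the sample sequences are illustrative choices, not taken from any particular recognizer.

```python
from math import inf


def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Return the minimal cumulative cost of warping sequence a onto b."""
    n, m = len(a), len(b)
    # cost[i][j] holds the best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = dist(a[i - 1], b[j - 1])
            # Extend the cheapest of the three allowed predecessor cells:
            # repeat an element of b, repeat an element of a, or advance both.
            cost[i][j] = local + min(
                cost[i - 1][j],
                cost[i][j - 1],
                cost[i - 1][j - 1],
            )
    return cost[n][m]


# The same "shape" produced at two different speeds (hypothetical data).
slow = [0, 0, 1, 1, 2, 2, 3, 3]
fast = [0, 1, 2, 3]
print(dtw_distance(slow, fast))
```

Because each element of `fast` can be matched against two consecutive elements of `slow`, the two sequences align at zero cost even though their lengths differ. For speech, the scalars would typically be replaced by feature vectors and `dist` by a vector distance.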
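
The Viterbi decoding step mentioned in the HMM discussion above can be sketched in a few lines. This is a toy illustration only: it assumes a two-state model with discrete observation symbols, and every state name, symbol, and probability below is invented for the example. A real recognizer would instead score continuous cepstral vectors with Gaussian mixture likelihoods and would work with log probabilities to avoid underflow.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden state sequence."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               predecessor state on that path)
    first = observations[0]
    best = [{s: (start_p[s] * emit_p[s][first], None) for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            # Pick the predecessor q that maximizes the path probability into s.
            p, q = max((prev[q][0] * trans_p[q][s], q) for q in states)
            column[s] = (p * emit_p[s][obs], q)
        best.append(column)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return prob, path


# Hypothetical toy model: silence vs. speech, observed as low/high energy.
states = ("sil", "speech")
start_p = {"sil": 0.8, "speech": 0.2}
trans_p = {
    "sil": {"sil": 0.7, "speech": 0.3},
    "speech": {"sil": 0.2, "speech": 0.8},
}
emit_p = {
    "sil": {"low": 0.9, "high": 0.1},
    "speech": {"low": 0.2, "high": 0.8},
}
prob, path = viterbi(["low", "low", "high", "high"], states,
                     start_p, trans_p, emit_p)
print(path)
```

The sketch keeps only the single best predecessor per state at each time step, which is what makes the search linear in the length of the utterance rather than exponential.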