Speech recognition (in many contexts also known as automatic speech recognition, computer speech recognition or, erroneously, voice recognition) is the process of converting a speech signal to a sequence of words by means of an algorithm implemented as a computer program.

Speech recognition applications that have emerged over the last few years include voice dialing (e.g., "Call home"), call routing (e.g., "I would like to make a collect call"), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g., a radiology report), control of domotic appliances, and content-based spoken audio search (e.g., finding a podcast where particular words were spoken).

Voice recognition or speaker recognition is a related process that attempts to identify the person speaking, as opposed to what is being said.

Speech recognition technology

In terms of technology, most technical textbooks nowadays emphasize the hidden Markov model as the underlying technology. The dynamic programming approach, the neural network-based approach and the knowledge-based learning approach were studied intensively in the 1980s and 1990s.

Performance of speech recognition systems

The performance of a speech recognition system is usually specified in terms of accuracy and speed. Accuracy is measured with the word error rate, whereas speed is measured with the real time factor.

Most speech recognition users would tend to agree that dictation machines can achieve very high performance in controlled conditions. Part of the confusion comes from the mixed usage of the terms speech recognition and dictation.

Speaker-dependent dictation systems requiring a short period of training can capture continuous speech with a large vocabulary at a normal pace with very high accuracy. Most commercial companies claim that recognition software can achieve 98% to 99% accuracy (getting one to two words out of one hundred wrong) if operated under optimal conditions. These optimal conditions usually mean that the test subjects have 1) speaker characteristics matching the training data, 2) proper speaker adaptation, and 3) a clean environment (e.g., office space). (This explains why some users, especially those with accents, might find that the recognition rate is perceptibly much lower than the expected 98% to 99%.)

Other, limited-vocabulary systems requiring no training can recognize a small number of words (for instance, the ten digits) from most speakers. Such systems are popular for routing incoming phone calls to their destinations in large organizations.

Both acoustic modeling and language modeling are important areas of study in modern statistical speech recognition. This entry focuses on explaining the use of the hidden Markov model (HMM) because it is very widely used in many systems. (Language modeling has many other applications, such as smart keyboards and document classification; please refer to the corresponding entries.)
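
As an illustration of the two performance measures named under "Performance of speech recognition systems", the following Python sketch computes a word error rate and a real time factor. It assumes the commonly used definitions, which this entry does not spell out: word error rate as the minimum number of word substitutions, deletions and insertions needed to turn the recognized text into the reference text, divided by the number of reference words, and real time factor as processing time divided by audio duration. The function names and the whitespace tokenization are illustrative choices, not part of any particular recognizer.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration (below 1.0 is faster than real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    reference = "i would like to make a collect call"
    hypothesis = "i would like to make collect call"  # the word "a" was dropped
    print(word_error_rate(reference, hypothesis))  # 0.125 (1 error in 8 words)
    print(real_time_factor(5.0, 10.0))             # 0.5
```

Under these definitions, the 98% to 99% accuracy quoted above corresponds to a word error rate of roughly 0.01 to 0.02. Because inserted words also count as errors, the word error rate can exceed 1.0 for a very poor transcript.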