Natural language processing (NLP) is a subfield of artificial intelligence and linguistics. It studies the problems of automated generation and understanding of natural human languages. Natural language generation systems convert information from computer databases into normal-sounding human language, and natural language understanding systems convert samples of human language into more formal representations that are easier for computer programs to manipulate.

Tasks and limitations

In theory, natural language processing is a very attractive method of human-computer interaction. Early systems such as SHRDLU, working in restricted "blocks worlds" with restricted vocabularies, worked extremely well. This led researchers to excessive optimism, which was soon lost when the systems were extended to more realistic situations with real-world ambiguity and complexity.

Natural language understanding is sometimes referred to as an AI-complete problem, because natural language recognition seems to require extensive knowledge about the outside world and the ability to manipulate it. The definition of "understanding" is one of the major problems in natural language processing.

Concrete problems

Some examples of the problems faced by natural language understanding systems:

The sentences "We gave the monkeys the bananas because they were hungry" and "We gave the monkeys the bananas because they were over-ripe" have the same surface grammatical structure. However, in one of them the word "they" refers to the monkeys, and in the other it refers to the bananas: the sentence cannot be understood properly without knowledge of the properties and behaviour of monkeys and bananas.

A string of words may be interpreted in myriad ways. For example, the string "Time flies like an arrow" may be interpreted in a variety of ways:

time moves quickly, just like an arrow does;

measure the speed of flying insects like you would measure that of an arrow, i.e. "(You should) time flies like you would an arrow";

measure the speed of flying insects like an arrow would, i.e. "Time flies in the same way that an arrow would (time them)";

measure the speed of flying insects that are like arrows, i.e. "Time those flies that are like arrows";

a type of flying insect, "time-flies", enjoy arrows (compare "Fruit flies like a banana").

English is particularly challenging in this regard because it has little inflectional morphology to distinguish between parts of speech.

English and several other languages don't specify which word an adjective applies to. For example, the string "pretty little girls' school" leaves several questions open:

Does the school look little?

Do the girls look little?

Do the girls look pretty?

Does the school look pretty?

Subproblems

Speech segmentation

In most spoken languages, the sounds representing successive letters blend into each other, so the conversion of the analog signal to discrete characters can be a very difficult process. Also, in natural speech there are hardly any pauses between successive words; locating those boundaries usually requires taking into account grammatical and semantic constraints, as well as the context.

Text segmentation

Some written languages, such as Chinese, Japanese and Thai, do not mark word boundaries explicitly either, so any significant text parsing usually requires the identification of word boundaries, which is often a non-trivial task.

Word sense disambiguation

Many words have more than one meaning; we have to select the meaning which makes the most sense in context.

Syntactic ambiguity

The grammar for natural languages is ambiguous, i.e. there are often multiple possible parse trees for a given sentence. Choosing the most appropriate one usually requires semantic and contextual information. Specific problem components of syntactic ambiguity include sentence boundary disambiguation.

Imperfect or irregular input

Examples include foreign or regional accents and vocal impediments in speech, and typing errors, grammatical errors and OCR errors in text.

Speech acts and plans

Sentences often don't mean what they literally say; for instance, a good answer to "Can you pass the salt" is to pass the salt; in most contexts "Yes" is not a good answer, although "No" is better and "I'm afraid that I can't see it" is better yet. Similarly, if a class was not offered last year, "The class was not offered last year" is a better answer to the question "How many students failed the class last year?" than "None" is.

Statistical NLP

Statistical natural language processing uses stochastic, probabilistic and statistical methods to resolve some of the difficulties discussed above, especially those which arise because longer sentences are highly ambiguous when processed with realistic grammars, yielding thousands or millions of possible analyses. Methods for disambiguation often involve the use of corpora and Markov models. The technology for statistical NLP comes mainly from machine learning and data mining, both of which are fields of artificial intelligence that involve learning from data.
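The use of Markov models for disambiguation, mentioned under Statistical NLP, can be illustrated with a minimal sketch: a hidden Markov model that assigns parts of speech to "Time flies like an arrow" using the Viterbi algorithm. Everything in it is an assumption made for illustration: the four-tag set, the names `TAGS`, `START_P`, `TRANS_P`, `EMIT_P` and `viterbi`, and above all the probabilities, which are set by hand here. A real system would estimate them from a tagged corpus.

```python
# Hypothetical tag set: N = noun, V = verb, P = preposition, D = determiner.
TAGS = ["N", "V", "P", "D"]

# Hand-set, illustrative probabilities (NOT estimated from any corpus).
START_P = {"N": 0.6, "V": 0.2, "D": 0.15, "P": 0.05}
TRANS_P = {  # TRANS_P[previous_tag][next_tag]
    "N": {"N": 0.2, "V": 0.5, "P": 0.2, "D": 0.1},
    "V": {"N": 0.2, "V": 0.05, "P": 0.35, "D": 0.4},
    "P": {"N": 0.3, "V": 0.05, "P": 0.05, "D": 0.6},
    "D": {"N": 0.9, "V": 0.03, "P": 0.02, "D": 0.05},
}
EMIT_P = {  # EMIT_P[tag][word]; words not listed get probability 0
    "N": {"time": 0.4, "flies": 0.3, "arrow": 0.3},
    "V": {"time": 0.2, "flies": 0.4, "like": 0.4},
    "P": {"like": 1.0},
    "D": {"an": 1.0},
}


def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return (most probable tag sequence, its joint probability)."""
    if not words:
        return [], 0.0
    # best[i][tag] = (probability of the best path ending in `tag`
    #                 at position i, previous tag on that path)
    best = [{
        t: (start_p.get(t, 0.0) * emit_p[t].get(words[0], 0.0), None)
        for t in tags
    }]
    for word in words[1:]:
        column = {}
        for t in tags:
            # Best way to arrive in tag `t` from any previous tag.
            prob, prev = max(
                (best[-1][p][0] * trans_p[p].get(t, 0.0), p) for p in tags
            )
            column[t] = (prob * emit_p[t].get(word, 0.0), prev)
        best.append(column)
    # Follow the back-pointers from the most probable final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]


if __name__ == "__main__":
    sentence = "time flies like an arrow".split()
    tag_path, probability = viterbi(sentence, TAGS, START_P, TRANS_P, EMIT_P)
    print(list(zip(sentence, tag_path)), probability)
```

With these hand-set numbers the sketch prefers the noun, verb, preposition, determiner, noun analysis, which corresponds to the first reading listed under Concrete problems ("time moves quickly, just like an arrow does"). Different probabilities could favour a different reading, which is why such models are normally trained on corpora instead of being tuned by hand.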