Morphology-based language modeling for Amharic

NatS OberSeminar, July 06, 2006, 13:30, F-235

Language models have a wide range of applications in speech and natural language processing. Although they are most widely used in speech recognition, they can also be applied in machine translation, character and handwriting recognition, spelling correction, etc. Since it is impossible to calculate conditional probabilities for all word sequences of arbitrary length in a given language, N-gram language models are generally used. Even with N-gram models, not all word sequences can be found in the training data. For morphologically rich languages in particular, even individual words might not be encountered in the training data, irrespective of how large it is. This is the problem of out-of-vocabulary (OOV) words. The easiest solution to these problems would be to increase the amount of training data, but this is not feasible for morphologically rich languages. A promising alternative is to abandon the word as a modeling unit and use sub-word units for language modeling. We opted to develop a morpheme-based language model for Amharic, which is a morphologically rich language. Since one word may be divided into several morphemes, calculating N-gram probabilities using just morphemes as units leads to a loss of word-level dependencies. Thus, a way to capture word-level dependencies has to be found.
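
The trade-off described above can be illustrated with a minimal sketch. It is not the model presented in the talk. The toy corpus, its English-like "morphemes", and all function names are hypothetical and chosen only for illustration; no Amharic data or real segmenter is involved. The sketch compares the OOV rate for word units and morpheme units, and then measures how few morpheme bigrams connect two different words.

```python
# Hypothetical toy data: each word is a tuple of morpheme-like pieces.
# These pieces are invented for illustration and are not Amharic.
train = [("un", "lock", "ed"), ("re", "play", "s"), ("lock", "s"), ("play", "ed")]
test = [("un", "play", "ed"), ("re", "lock", "s")]


def as_words(corpus):
    """Join the pieces of each word to obtain word-level units."""
    return ["".join(word) for word in corpus]


def as_morphemes(corpus):
    """Flatten the corpus into a single stream of morpheme-level units."""
    return [m for word in corpus for m in word]


def oov_rate(train_units, test_units):
    """Fraction of test units that never occur in the training units."""
    vocab = set(train_units)
    unseen = [u for u in test_units if u not in vocab]
    return len(unseen) / len(test_units)


def cross_word_fraction(corpus):
    """Fraction of morpheme bigrams whose two members come from different words."""
    stream = [(m, i) for i, word in enumerate(corpus) for m in word]
    bigrams = list(zip(stream, stream[1:]))
    crossing = [b for b in bigrams if b[0][1] != b[1][1]]
    return len(crossing) / len(bigrams)


word_oov = oov_rate(as_words(train), as_words(test))
morph_oov = oov_rate(as_morphemes(train), as_morphemes(test))
crossing = cross_word_fraction(test)

print(f"word-level OOV rate:     {word_oov:.2f}")
print(f"morpheme-level OOV rate: {morph_oov:.2f}")
print(f"bigrams spanning two words: {crossing:.2f}")
```

In this toy setting every test word is unseen as a whole word, while all of its pieces occur in the training data. At the same time, only one of the five morpheme bigrams in the test stream crosses a word boundary, and it links a suffix to a prefix rather than two stems, which is the loss of word-level dependency that the abstract refers to.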
