Morphology-based language modeling for Amharic
Language models have a wide range of applications in speech and natural language processing. Although they are most widely used in speech recognition, they are also applied in machine translation, character and handwriting recognition, spelling correction, etc. Since it is impossible to calculate conditional probabilities for all word sequences of arbitrary length in a given language, N-gram language models are generally used. Even with N-gram models, not all word sequences will be found in the training data. For morphologically rich languages in particular, there are individual words that may never be encountered in the training data, irrespective of how large it is. This is the problem of out-of-vocabulary (OOV) words. The easiest solution to these problems would be to increase the amount of training data, but this is not feasible for morphologically rich languages. A promising alternative is to abandon the word as a modeling unit and use sub-word units for language modeling. We opted to develop a morpheme-based language model for Amharic, which is a morphologically rich language. Since one word may be divided into several morphemes, calculating N-gram probabilities using morphemes as the unit leads to a loss of word-level dependency. Thus, a way to capture word-level dependencies has to be found.
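The trade-off described above can be sketched as follows: once words are segmented into morphemes, N-gram counts are taken over the morpheme stream, and the original word boundaries disappear from the model's context window. This is a minimal illustrative sketch, not the paper's actual model; the corpus, the morpheme segmentations, and the function name are invented for the example.

```python
from collections import Counter

def morpheme_ngrams(segmented_sentences, n=2):
    """Count N-grams over morphemes rather than whole words.

    `segmented_sentences` is a list of sentences; each sentence is a list
    of words, and each word is a list of its morphemes.
    """
    counts = Counter()
    for sentence in segmented_sentences:
        # Flatten each word's morphemes into one stream with sentence
        # markers; word boundaries vanish here, which is exactly the
        # word-level dependency loss the text describes.
        stream = ["<s>"] + [m for word in sentence for m in word] + ["</s>"]
        for i in range(len(stream) - n + 1):
            counts[tuple(stream[i:i + n])] += 1
    return counts

# Toy corpus with made-up morpheme segmentations (purely illustrative,
# not a real Amharic analysis).
corpus = [
    [["al", "fellig", "m"], ["new"]],
    [["fellig", "al", "ehu"]],
]
bigrams = morpheme_ngrams(corpus, n=2)
```

Note that the bigram ("m", "new") above spans a word boundary while ("al", "fellig") is word-internal, yet the plain morpheme model treats both identically; recovering that distinction is the problem the last sentence of the abstract raises.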