GermanMorphology < CDG

(17 Oct 2012, UnknownUser)

+ German Morphology

All human languages face the same problems, but solve them differently. The exact relations between phrases can be disambiguated in several ways, and every language must adopt one or more of the following methods:

words can be inflected, i.e. slightly changed, often at the end, to reflect systematically different usages. Example: German `bist' refers to the hearer while `ist' refers to a third party. Latin `laudat' talks about one actor, `laudant' about several.
the order of words can be considered meaningful. `Bob loves Alice' and `Alice loves Bob' mean different things.
different words can be selected to indicate relations. A Japanese clerk will use different words for `phone' in the sentence `Please phone us' and `We'll phone you'.
dedicated special-purpose words can be used as markers. Many languages express spatial relationships with prepositions; other relationships can also be marked with special particles.

Note that the intrinsic meaning of words also helps very much in disambiguation (a newspaper can publish an article but not vice versa), but this happens independently of the surface form of an utterance.

German uses both word order and inflection to disambiguate. However, the morphological variants among words are usually ambiguous. Consider the case of nouns: of the thirty-odd inflection classes for nouns, not one has a full set of eight different endings for the eight possible feature combinations; most have only two or three different endings. This means that a given noun form usually has three or four possible readings when taking morphology into account (and modern German is clearly evolving towards an even poorer set of endings).

Taking this into account, should a grammar of German try to model morphological variants of nouns? In some cases the information is clearly useful to select the correct structure. In other cases, it will merely lead to higher ambiguity as more alternatives must be considered at each step. These variants seem useful:

no morphological modelling: all noun forms are entered into the lexicon with underspecified features. Although the constraints impose agreement restrictions, these do not aid parsing because every noun will pass every test; on the other hand we represent the problem more compactly. We expect more speed at lower accuracy.
partial modelling: only as many different forms are modelled as actually exist in the language, and the lexicon entries are left partially specified. Thus, almost all noun forms have gender information but most have underspecified case information. This should give some benefit to accuracy without slowdown.
full modelling: all noun forms carry full case, gender and number information. Ambiguous forms occur several times in the lexicon. More distinctions can be made, but at a higher cost. We expect less speed for higher accuracy.
selective full modelling: This is what the perfect morphosyntactic tagger would produce. All nouns are forced to carry exactly those features that the particular utterance requires. This should increase speed and accuracy.

Let us see how the word `Pferd' would be modelled under each of these variants. The paradigm is as follows (inflection class n20; rows are nominative, genitive, dative and accusative, columns are singular and plural):

              sg           pl
     nom      Pferd        Pferde
     gen      Pferd[e]s    Pferde
     dat      Pferd[e]     Pferden
     acc      Pferd        Pferde

We have five different forms: `Pferd', `Pferds', `Pferdes', `Pferde' and `Pferden', with an ambiguity of 3, 1, 1, 4, and 1, respectively. Without morphological modelling we would simply have one underspecified entry for each form:

Pferd   := [cat:NN,case:bot,number:bot,gender:bot];
Pferde  := [cat:NN,case:bot,number:bot,gender:bot];
Pferds  := [cat:NN,case:bot,number:bot,gender:bot];
Pferdes := [cat:NN,case:bot,number:bot,gender:bot];
Pferden := [cat:NN,case:bot,number:bot,gender:bot];
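The ambiguity counts quoted above can be recomputed from the paradigm. The following is only a minimal Python sketch, not part of the CDG tools: the `PARADIGM` dictionary and the function name `readings_by_form` are made up for illustration, and the optional `[e]` endings are expanded by hand into both surface forms.

```python
from collections import defaultdict

# Paradigm of `Pferd' (inflection class n20), transcribed by hand.
# Each (case, number) cell lists every surface form that can fill it.
PARADIGM = {
    ("nom", "sg"): ["Pferd"],
    ("gen", "sg"): ["Pferds", "Pferdes"],
    ("dat", "sg"): ["Pferd", "Pferde"],
    ("acc", "sg"): ["Pferd"],
    ("nom", "pl"): ["Pferde"],
    ("gen", "pl"): ["Pferde"],
    ("dat", "pl"): ["Pferden"],
    ("acc", "pl"): ["Pferde"],
}


def readings_by_form(paradigm):
    """Map each surface form to the (case, number) cells it can fill."""
    readings = defaultdict(list)
    for cell, forms in paradigm.items():
        for form in forms:
            readings[form].append(cell)
    return dict(readings)


# Ambiguity of a form = number of paradigm cells it can realise.
ambiguity = {form: len(cells)
             for form, cells in readings_by_form(PARADIGM).items()}

for form, count in ambiguity.items():
    print(form, count)
```

The counts sum to the number of fully specified readings, which is also the number of lexicon entries that full modelling needs.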

With partial (conservative) modelling we also have five forms, but each of these carries as much information as possible (this requires that we know the inflection class of each noun, which is why we have not done it so far):

Pferd   := [cat:NN,case:nom_dat_acc,number:sg,gender:neut];
Pferde  := [cat:NN,case:bot,number:bot,gender:neut];
Pferds  := [cat:NN,case:gen,number:sg,gender:neut];
Pferdes := [cat:NN,case:gen,number:sg,gender:neut];
Pferden := [cat:NN,case:dat,number:pl,gender:neut];

Note 1: `Pferd' is marked case:nom_dat_acc, a special partially specified case that the grammar needs to be aware of. This is useful because we at least know that `Pferd' is not a genitive subordination.

Note 2: `Pferde' is marked case:bot because it might be a genitive plural, and number:bot because it might also be a dative singular. Unfortunately this means that the mismatch in `des Pferde' (a genitive singular article with a form that is never genitive singular) cannot be diagnosed. This is the drawback of partially underspecified feature representation. (It could be cured by more complicated representations that take into account the interplay between different features, effectively combining them into one big feature, but legibility suffers.)
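The effect described in the two notes can be reproduced with a small compatibility check. This is a deliberately simplified Python sketch and not how CDG constraints are actually written: the functions `denotation`, `compatible` and `agree` are hypothetical, `des' is reduced to a genitive singular reading, and gender is ignored.

```python
CASES = {"nom", "gen", "dat", "acc"}
NUMBERS = {"sg", "pl"}


def denotation(value, universe):
    """Return the set of fully specified values that `value` stands for."""
    if value == "bot":                 # bot is compatible with everything
        return set(universe)
    members = set(value.split("_"))    # e.g. nom_dat_acc -> {nom, dat, acc}
    if not members <= universe:
        raise ValueError(f"unknown feature value: {value}")
    return members


def compatible(a, b, universe):
    """Two values agree if they share at least one fully specified value."""
    return bool(denotation(a, universe) & denotation(b, universe))


def agree(x, y):
    """Check case and number independently, feature by feature."""
    return (compatible(x["case"], y["case"], CASES)
            and compatible(x["number"], y["number"], NUMBERS))


des = {"case": "gen", "number": "sg"}            # simplified reading of `des'
pferd_partial = {"case": "nom_dat_acc", "number": "sg"}
pferde_partial = {"case": "bot", "number": "bot"}
pferde_full = [                                  # the four full readings
    {"case": "dat", "number": "sg"},
    {"case": "nom", "number": "pl"},
    {"case": "gen", "number": "pl"},
    {"case": "acc", "number": "pl"},
]

print(agree(des, pferd_partial))                 # `des Pferd' is rejected
print(agree(des, pferde_partial))                # `des Pferde' slips through
print(any(agree(des, r) for r in pferde_full))   # full modelling rejects it
```

Because case and number are checked independently, the partially specified entry for `Pferde' accepts a combination (genitive singular) that none of its individual readings allows.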

Finally, with full modelling we have 10 entries because there are 10 morphological variants:

Pferd   := [cat:NN,case:nom,number:sg,gender:neut];
Pferds  := [cat:NN,case:gen,number:sg,gender:neut];
Pferdes := [cat:NN,case:gen,number:sg,gender:neut];
Pferd   := [cat:NN,case:dat,number:sg,gender:neut];
Pferde  := [cat:NN,case:dat,number:sg,gender:neut];
Pferd   := [cat:NN,case:acc,number:sg,gender:neut];
Pferde  := [cat:NN,case:nom,number:pl,gender:neut];
Pferde  := [cat:NN,case:gen,number:pl,gender:neut];
Pferden := [cat:NN,case:dat,number:pl,gender:neut];
Pferde  := [cat:NN,case:acc,number:pl,gender:neut];
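All three lexicon variants can in principle be derived mechanically from the paradigm. The sketch below, in Python, is one possible way to do so under stated assumptions: the paradigm is transcribed by hand, all helper names are hypothetical, and nom_dat_acc is the only partially specified case value assumed to exist in the grammar.

```python
from collections import defaultdict

# Paradigm of `Pferd' (class n20); [e] endings expanded into both forms.
PARADIGM = {
    ("nom", "sg"): ["Pferd"],
    ("gen", "sg"): ["Pferds", "Pferdes"],
    ("dat", "sg"): ["Pferd", "Pferde"],
    ("acc", "sg"): ["Pferd"],
    ("nom", "pl"): ["Pferde"],
    ("gen", "pl"): ["Pferde"],
    ("dat", "pl"): ["Pferden"],
    ("acc", "pl"): ["Pferde"],
}

# Assumption: the only partial case type known to the grammar.
CASE_TYPES = {frozenset({"nom", "dat", "acc"}): "nom_dat_acc"}


def readings_by_form(paradigm):
    """Map each surface form to the (case, number) cells it can fill."""
    readings = defaultdict(list)
    for cell, forms in paradigm.items():
        for form in forms:
            readings[form].append(cell)
    return dict(readings)


def summarize(values, partial_types):
    """Most specific value covering all readings; bot if none is known."""
    values = frozenset(values)
    if len(values) == 1:
        return next(iter(values))
    return partial_types.get(values, "bot")


def entry(form, case, number, gender):
    return f"{form:<7} := [cat:NN,case:{case},number:{number},gender:{gender}];"


def abstract_lexicon(paradigm):
    """No morphological modelling: one fully underspecified entry per form."""
    return [entry(f, "bot", "bot", "bot") for f in readings_by_form(paradigm)]


def partial_lexicon(paradigm, gender):
    """Partial modelling: one entry per form, as specific as possible."""
    entries = []
    for form, cells in readings_by_form(paradigm).items():
        case = summarize((c for c, _ in cells), CASE_TYPES)
        number = summarize((n for _, n in cells), {})
        entries.append(entry(form, case, number, gender))
    return entries


def full_lexicon(paradigm, gender):
    """Full modelling: one entry per (form, case, number) reading."""
    return [entry(form, case, number, gender)
            for (case, number), forms in paradigm.items()
            for form in forms]


for line in partial_lexicon(PARADIGM, "neut"):
    print(line)
```

The order of the generated entries follows the paradigm rather than the listings above, but the entries themselves are intended to be the same.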

The actual effect can only be determined by experiment, so let us do just that. Once again, the `verkaufen' sentences (as of 27 Jan 2003) were parsed with a time limit of 300s and the `dynamic' solution method. And here are tonight's results:

`abstract' nouns (no morphological modelling) result in a parsing time of 15.6 seconds on average, for a structural accuracy of 81.2%.
`partial' nouns take 14.3 seconds and result in an accuracy of 81.9%.
`full' nouns take much longer to parse: 26.4 seconds on the average, at an accuracy of 81.6%. This shows how unhelpful the German noun system really is.
Having a perfect morphosyntactic tagger would improve on all three results: 12.2 seconds are taken on average to produce an accuracy of 85.0%. (It is probable that a morphosyntactic tagger would have to be very nearly perfect to actually improve things, just like the category tagger, and we don't have such a beast.)

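This means that it does pay to know the inflection class for a noun: compared with `abstract' nouns, `partial' nouns reduce parsing time by about 8% (from 15.6 to 14.3 seconds) and raise accuracy by 0.7 percentage points (from 81.2% to 81.9%).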
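The comparison can be recomputed for all variants from the four measurements reported above. This is only an arithmetic sketch in Python; the `RESULTS` table and the `compare` function are illustrative names, and the figures are copied from the list of results.

```python
# (average parsing time in seconds, structural accuracy in percent)
RESULTS = {
    "abstract": (15.6, 81.2),
    "partial": (14.3, 81.9),
    "full": (26.4, 81.6),
    "perfect tagger": (12.2, 85.0),
}


def compare(baseline, variant):
    """Return (relative time change in %, accuracy change in points)."""
    t0, a0 = RESULTS[baseline]
    t1, a1 = RESULTS[variant]
    time_change = 100.0 * (t1 - t0) / t0
    accuracy_change = a1 - a0
    return time_change, accuracy_change


for name in ("partial", "full", "perfect tagger"):
    dt, da = compare("abstract", name)
    print(f"{name}: time {dt:+.1f}%, accuracy {da:+.1f} points")
```

Time is compared relatively and accuracy in absolute percentage points, since a relative change in an accuracy figure is easy to misread.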

-- KilianAFoth - 28 Jan 2003
