+ German Morphology
All human languages face the same problems, but solve them
differently. The exact relations between phrases can be
disambiguated in several ways, and every language must adopt one or
more of the following methods:
- words can be inflected, i.e. slightly changed, often at the end, to reflect systematically different usages. Example: German `bist' refers to the hearer while `ist' refers to a third party. Latin `laudat' talks about one actor, `laudant' about several.
- the order of words can be considered meaningful. `Bob loves Alice' and `Alice loves Bob' mean different things.
- different words can be selected to indicate relations. A Japanese clerk will use different words for `phone' in the sentence `Please phone us' and `We'll phone you'.
- dedicated special-purpose words can be used as markers. Many languages express spatial relationships with prepositions; other relationships can also be marked with special particles.
Note that the intrinsic meaning of words also helps very much in
disambiguation (a newspaper can publish an article but not vice
versa), but this happens independently of the surface form of an
utterance.
German uses both word order and inflection to disambiguate. However,
the morphological variants among words are usually ambiguous. Consider
the case of nouns: of the thirty-odd inflection classes for nouns, not
one has a full set of eight different endings for the eight possible
feature combinations; most have only two or three different endings.
This means that a given noun form usually has three or four possible
readings when taking morphology into account (and modern German is
clearly evolving towards an even poorer set of endings).
Taking this into account, should a grammar of German try to model
morphological variants of nouns? In some cases the information is
clearly useful to select the correct structure. In other cases, it
will merely lead to higher ambiguity as more alternatives must be
considered at each step. These variants seem useful:
- no morphological modelling: all noun forms are entered into the lexicon with underspecified features. Although the constraints impose agreement restrictions, these do not aid parsing because every noun will pass every test; on the other hand we represent the problem more compactly. We expect more speed at lower accuracy.
- partial modelling: model only as many different forms as there actually are in the language, and leave the lexicon entries partially specified. Thus, almost all noun forms have gender information but most have underspecified case information. This should give some benefit to accuracy without slowdown.
- full modelling: all noun forms carry full case, gender and number information. Ambiguous forms occur several times in the lexicon. More distinctions can be made, but at a higher cost. We expect less speed for higher accuracy.
- selective full modelling: This is what the perfect morphosyntactic tagger would produce. All nouns are forced to carry exactly those features that the particular utterance requires. This should increase speed and accuracy.
Let us see how the word `Pferd' would be modelled under each of these
variants. The paradigm is as follows (inflection class n20):
        singular    plural
nom     Pferd       Pferde
gen     Pferd[e]s   Pferde
dat     Pferd[e]    Pferden
acc     Pferd       Pferde
We have five different forms (`Pferd', `Pferds', `Pferdes', `Pferde'
and `Pferden'), with an ambiguity of 3, 1, 1, 4, and 1, respectively.
Without morphological modelling we would simply have one
underspecified entry for each form:
Pferd := [cat:NN,case:bot,number:bot,gender:bot];
Pferde := [cat:NN,case:bot,number:bot,gender:bot];
Pferds := [cat:NN,case:bot,number:bot,gender:bot];
Pferdes := [cat:NN,case:bot,number:bot,gender:bot];
Pferden := [cat:NN,case:bot,number:bot,gender:bot];
With partial modelling we also have five entries, but each of these
carries as much information as possible (this requires that we know
the inflection class of each noun, which is why we have not done it
so far):
Pferd := [cat:NN,case:nom_dat_acc,number:sg,gender:neut];
Pferde := [cat:NN,case:bot,number:bot,gender:neut];
Pferds := [cat:NN,case:gen,number:sg,gender:neut];
Pferdes := [cat:NN,case:gen,number:sg,gender:neut];
Pferden := [cat:NN,case:dat,number:pl,gender:neut];
Note 1: `Pferd' is marked case:nom_dat_acc, a special partially
specified case that the grammar needs to be aware of. This is useful
because at least we know that `Pferd' is not a genitive subordination.
Note 2: `Pferde' is marked case:bot because it might be a genitive
plural. Unfortunately this means that the number mismatch in `des
Pferde' cannot be diagnosed. This is the drawback of partially
underspecified feature representation. (It could be cured by more
complicated representations that take into account the interplay
between different features, effectively combining them into one big
feature, but legibility suffers.)
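The difference between separate and combined features can be sketched in a few lines of Python. This is only an illustration, not the representation used in the grammar: the set-of-pairs encoding, the function names and the entry for the article `des' (taken here as genitive singular) are assumptions made for the example.

```python
# Readings of each word form as sets of (case, number) pairs.
# PFERDE follows the paradigm of `Pferd'; DES is an assumed entry
# for the genitive singular article, added only for illustration.
PFERDE = {("dat", "sg"), ("nom", "pl"), ("gen", "pl"), ("acc", "pl")}
PFERDES = {("gen", "sg")}
DES = {("gen", "sg")}


def agree_separately(a, b):
    """Test case and number independently of each other.

    This mimics two separate underspecified features: any case that
    both words allow and any number that both words allow will do,
    even if no single reading offers both at once.
    """
    cases_ok = bool({c for c, _ in a} & {c for c, _ in b})
    numbers_ok = bool({n for _, n in a} & {n for _, n in b})
    return cases_ok and numbers_ok


def agree_combined(a, b):
    """Test case and number together, as one combined feature."""
    return bool(a & b)


separate_des_pferde = agree_separately(DES, PFERDE)
combined_des_pferde = agree_combined(DES, PFERDE)
combined_des_pferdes = agree_combined(DES, PFERDES)

print("des Pferde, separate features:", separate_des_pferde)
print("des Pferde, combined feature: ", combined_des_pferde)
print("des Pferdes, combined feature:", combined_des_pferdes)
```

With separate features `des Pferde' passes the agreement test, because `Pferde' allows the genitive (in the plural) and the singular (in the dative). Only the combined test notices that no single reading is both genitive and singular.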
Finally, with full modelling we have 10 entries because there are 10
morphological variants:
Pferd := [cat:NN,case:nom,number:sg,gender:neut];
Pferds := [cat:NN,case:gen,number:sg,gender:neut];
Pferdes := [cat:NN,case:gen,number:sg,gender:neut];
Pferd := [cat:NN,case:dat,number:sg,gender:neut];
Pferde := [cat:NN,case:dat,number:sg,gender:neut];
Pferd := [cat:NN,case:acc,number:sg,gender:neut];
Pferde := [cat:NN,case:nom,number:pl,gender:neut];
Pferde := [cat:NN,case:gen,number:pl,gender:neut];
Pferden := [cat:NN,case:dat,number:pl,gender:neut];
Pferde := [cat:NN,case:acc,number:pl,gender:neut];
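The relation between the fully specified and the partially specified lexicon can be made explicit with a small Python sketch. It takes the ten readings listed above, counts the ambiguity of each form, and collapses each feature into a single label. The helper names and the rule for building labels such as nom_dat_acc are assumptions chosen to reproduce the entries shown on this page, not a description of how the lexicon was actually generated.

```python
from collections import defaultdict

CASES = ["nom", "gen", "dat", "acc"]
NUMBERS = ["sg", "pl"]

# The ten fully specified readings of `Pferd': (form, case, number).
FULL = [
    ("Pferd", "nom", "sg"), ("Pferds", "gen", "sg"),
    ("Pferdes", "gen", "sg"), ("Pferd", "dat", "sg"),
    ("Pferde", "dat", "sg"), ("Pferd", "acc", "sg"),
    ("Pferde", "nom", "pl"), ("Pferde", "gen", "pl"),
    ("Pferden", "dat", "pl"), ("Pferde", "acc", "pl"),
]


def collapse(values, domain):
    """Merge the observed values of one feature into one label.

    If every value of the domain occurs, the feature is left
    underspecified (`bot'); otherwise the values are joined in the
    order of the domain, e.g. nom_dat_acc.
    """
    seen = [v for v in domain if v in values]
    if len(seen) == len(domain):
        return "bot"
    return "_".join(seen)


def partial_lexicon(full, gender="neut"):
    """Build one partially specified entry per distinct word form."""
    readings = defaultdict(list)
    for form, case, number in full:
        readings[form].append((case, number))
    lexicon = {}
    for form, rs in readings.items():
        lexicon[form] = {
            "ambiguity": len(rs),
            "case": collapse({c for c, _ in rs}, CASES),
            "number": collapse({n for _, n in rs}, NUMBERS),
            "gender": gender,
        }
    return lexicon


lexicon = partial_lexicon(FULL)
for form, e in lexicon.items():
    print(f"{form} := [cat:NN,case:{e['case']},number:{e['number']},"
          f"gender:{e['gender']}];  ambiguity {e['ambiguity']}")
```

The ambiguity counts add up to the ten entries of the full lexicon, and `Pferde' ends up with both case and number underspecified because its four readings cover every case and both numbers.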
The actual effect can only be determined by experiment, so let us do just
that. Once again, the `verkaufen' sentences (as of 27 Jan 2003) were
parsed with a time limit of 300s and the `dynamic' solution method.
And here are tonight's results:
- `abstract' nouns (no morphological modelling) result in a parsing time of 15.6 seconds on average, for a syntactic structural accuracy of 81.2%.
- `partial' nouns take 14.3 seconds and result in an accuracy of 81.9%.
- `full' nouns take much longer to parse: 26.4 seconds on the average, at an accuracy of 81.6%. This shows how unhelpful the German noun system really is.
- Having a perfect morphosyntactic tagger would improve on both results: 12.2 seconds are taken on average to produce an accuracy of 85.0%. (It is probable that a morphosyntactic tagger would have to be very nearly perfect to actually improve things, just like the category tagger, and we don't have such a beast.)
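This means that it does pay to know the inflection class of a noun:
compared with `abstract' nouns, `partial' nouns reduce parsing time by
about 8% (from 15.6 to 14.3 seconds) and increase accuracy by 0.7
percentage points (from 81.2% to 81.9%).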
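The comparison can be checked with a few lines of Python. The figures are the ones reported above; the dictionary keys and the function name are only labels chosen for this sketch (`tagger' stands for the hypothetical perfect morphosyntactic tagger).

```python
# (average parsing time in seconds, structural accuracy in percent)
results = {
    "abstract": (15.6, 81.2),
    "partial": (14.3, 81.9),
    "full": (26.4, 81.6),
    "tagger": (12.2, 85.0),
}


def compare(baseline, variant):
    """Return (relative time change in %, accuracy change in points)."""
    t0, a0 = results[baseline]
    t1, a1 = results[variant]
    time_change = (t1 - t0) / t0 * 100
    acc_change = a1 - a0
    return time_change, acc_change


for variant in ("partial", "full", "tagger"):
    dt, da = compare("abstract", variant)
    print(f"{variant}: time {dt:+.1f}%, accuracy {da:+.1f} points")
```

Relative to `abstract' nouns, `partial' nouns come out at roughly 8% less parsing time, while `full' nouns cost roughly 69% more time for a gain of only 0.4 points.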
--
KilianAFoth - 28 Jan 2003