+ Constraint Weights

The most persistently asked question about our parsing method is, `Where do you get your weights from?'. Usually, the answer is, `We just make them up.' This has proven to provoke discontent among the askers; for some reason, the thought of a grammar writer weighting constraints according to their own judgement is viewed as unsatisfactory.

Of course, the CDG writer also has the even more arbitrary and difficult task of writing the constraints in the first place, but strangely, no one ever asks, `Where do you get your formulas?'. Apparently the notion is that in this day of corpus-based linguistics, something as trivial as numbers should be learnable.

So far, we have persisted in attaching our weights by hand. It is important to remember that the exact weight of a constraint matters very little. As long as a weight is not switched between nonzero and zero (which can add or remove possibilities entirely from the optimization problem), a conflict remains a conflict. Whether we punish a misinflected determiner by 0.2 or 0.4 makes no great difference; in a normal sentence, neither penalty ever occurs, whereas in a misinflected sentence, the penalty cannot be avoided anyway.

The exact weight of a constraint comes into play only when we have to decide between several structures, all of them with conflicts in them. Suppose we can avoid the penalty for the misinflected determiner by pointing it to NIL instead. This will give a better solution if the combined penalty for the fragment and the missing determiner is still less than that of the misinflection.
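
To make the arithmetic concrete, here is a small Python sketch of this trade-off. It assumes that the weights of all violated constraints are multiplied into a single score between 0 and 1 (0 meaning hard rejection), and all penalty values in it are invented for illustration.

  # Minimal sketch of the trade-off above (invented penalty values).
  # Assumption: the weights of all violated constraints are multiplied
  # into one score between 0 and 1, and the structure with the higher
  # product wins; 0 would be a hard rejection.

  def score(violated_weights):
      s = 1.0
      for w in violated_weights:
          s *= w
      return s

  # Alternative A: keep the misinflected determiner attached.
  misinflection = score([0.2])

  # Alternative B: point the determiner to NIL instead, paying both the
  # fragment penalty and the missing-determiner penalty.
  nil_attachment = score([0.7, 0.4])

  # B wins exactly when its combined penalty is milder than the
  # misinflection penalty, i.e. when its product is higher.
  better = "NIL" if nil_attachment > misinflection else "misinflection"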

It is this kind of trade-off that is affected most by varying constraint weights. Experimentation along these lines is expensive; you have to parse an entire corpus many times to determine which weight of a particular constraint works best. Experiments with the weights of the tagger and fragment constraints showed that even the effect of varying a single weight is impossible to predict accurately.

In this paper we reported on experiments to compute all constraint weights by measuring how well particular weight vectors perform on a fixed corpus. Although we found that you can reach the same level of accuracy and speed as with manually selected weights, the process proved extremely expensive -- tens of thousands of runs across an entire corpus were necessary. Note that this was a very small corpus with rather short sentences; even so it took months.

Another idea would be to apply each constraint formula to a large corpus and simply count how often it holds and how often it fails; a constraint that rarely fails should then receive a low weight, and vice versa. Let us see exactly how the computation should go. Take a constraint like this:

{X!SYN} : Subjekt_Infinitiv_Numerus : agree : 0.1 :
  X.label = SUBJ &
  isa(X^,finit) &
  isa(X@,Infinitiv)
  ->
  X^number = sg;

This simply says that where an infinitive phrase is a subject, it behaves like a singular NP, i.e. the verb should be in the singular. This constraint is unlikely to fail in written language. Should we set it to 0 simply because we have found no exception? If we don't, the counting can never set any weight to 0 at all, and that would blow up the optimization problems out of all proportion. So let us say that a constraint without exceptions in the corpus should be hard.

Now what do we do with a constraint that does have exceptions? This constraint is at the opposite end of the scale:

{X!SYN} : Präpositionalattribut : category : 0.9 :
 (X.label = PP | X.label = KOM)
 ->
 ~isa(X^,Nomen);

It says that a PP modifying an NP is dispreferred, but in fact the numbers of nouns and verbs carrying a PP are rather similar, certainly within the same order of magnitude. The high score attached to this constraint indicates that violating it cannot really be called a language error; it is merely present as a guide in case the other evidence evens out.

What weight would this constraint receive from a corpus count? If we check 10000 syntactic edges and 400 of them violate the constraint, then numerically it has a 4% chance of failing. But of course, in most cases the constraint holds trivially; note that it can only fail at all if the edge has a label of PP or KOM. It would probably be better to count the number of cases where the constraint applies (i.e., the premise holds) and the number of cases where the constraint fails (the premise holds and the conclusion fails). On a typical corpus this gives us a much more reasonable ratio of about 1:3.

The question remains which weight to extract from these figures. The simplest formula is to divide the number of `fails' cases by the number of `applies' cases, which yields a score of 1293 / 4612 = 0.28. This is very different from the 0.9 we had previously, but let us accept it for the time being.
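
A sketch of how such a count could be computed follows; the function and the edge representation are invented for illustration, and it builds in the earlier decision to make exception-free constraints hard.

  # Illustrative sketch of the counting scheme (names and data structures
  # are invented). An edge counts as `applies' when the premise holds, and
  # as `fails' when the premise holds but the conclusion does not; the
  # estimated weight is fails/applies, and a constraint without any
  # exception in the corpus is made hard, as suggested above.

  def estimate_weight(premise, conclusion, edges, old_weight):
      applies = fails = 0
      for edge in edges:
          if premise(edge):
              applies += 1
              if not conclusion(edge):
                  fails += 1
      if applies == 0:
          return old_weight     # no evidence at all: keep the manual weight
      if fails == 0:
          return 0.0            # no exception found: make the constraint hard
      return fails / applies

  # With the counts quoted above:
  print(round(1293 / 4612, 2))  # -> 0.28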

Which corpus should we use for the counting? Since we now have large treebanks of German, it is tempting to use them. Unfortunately there is a serious problem here. The published NEGRA or TIGER corpora contain lots of vocabulary that is not in our lexicon, so estimating the weight of constraints that rely on lexicalized features will simply not work. For instance, we cannot accurately estimate how often a transitive verb loses its object, since many of the verbs in TIGER are not in our lexicon, and hence we cannot tell whether or not they are transitive.

A second problem is that both the grammar and the conversion process are incomplete. The NEGRA corpus, when translated automatically to dependency format, contains many structures that our grammar forbids with penalty 0. If we use this material for counting, lots of hard constraints would become slightly soft, leading to an immediate explosion of the search space: the ambiguity of a medium-length sentence suddenly rises from 24 to 880, which effectively prevents the parser from making any reasonable attempt at the problem. The best we can do, then, is to use the 1739 hand-approved dependency trees from the NEGRA corpus.

Look at these numbers for parsing the heise sentences:

 manual:    time 24.56, quality 82.312
 heise:     time 26.82, quality 82.028
 negra:     time 97.68, quality 68.674
 negrahard: time 40.89, quality 79.200

First of all, the original constraint weights perform best, despite having been selected rather haphazardly from a small number of possible values. Re-estimating the weights on the heise corpus results in a small loss of performance, even though the training used the same data as the evaluation.

When estimating on the `golden' negra sentences, the results are much worse. Two reasons can be given immediately: first, the remaining hard conflicts in the not-quite-golden annotations (fewer than one in ten sentences on average) are enough to turn a large number of constraints that were intended to be hard slightly soft, and that ruins performance. We can avoid this problem by reverting to the earlier policy of keeping hard constraints hard no matter what. This hardened weight vector was used in the last experiment (`negrahard'), and is only slightly worse than the one counted from the heise sentences themselves.
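
A sketch of this hardening step (the names are invented): whatever the counts say, a constraint that was hard in the manual grammar stays hard.

  # Sketch of the hardening policy (invented names): a constraint that the
  # grammar writer declared hard (weight 0) keeps weight 0, no matter what
  # the corpus counts suggest.

  def harden(manual_weights, estimated_weights):
      return {name: 0.0 if manual_weights[name] == 0.0 else weight
              for name, weight in estimated_weights.items()}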

What is more surprising is the much higher parsing time. This is the second influence at work: some very weak preference constraints such as mod_direction were slightly tightened by the counting process and, as a result, slipped below frobbing's `ignore' threshold. Whereas with the original grammar frobbing never even attempts to correct such conflicts, now it does, and usually fails, with a lot of time spent for nothing. Of course we could avoid this effect by forbidding the estimation to decrease a weight across this threshold, but the parsing accuracy is unlikely to rise much because of that.
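
Such a restriction could look like the following sketch. We have not actually implemented it, and the threshold value shown is only a placeholder, not the parser's real setting.

  # Sketch of the restriction mentioned above (not actually applied): the
  # estimation may not pull a weight from above frobbing's `ignore'
  # threshold to below it. The threshold value is a placeholder.

  IGNORE_THRESHOLD = 0.9        # hypothetical value

  def restrict(old_weight, new_weight):
      if old_weight >= IGNORE_THRESHOLD and new_weight < IGNORE_THRESHOLD:
          return old_weight     # keep the constraint above the threshold
      return new_weight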

I see two possible conclusions: first, we can now honestly answer, `We just make the weights up... but actually you can just estimate them from a corpus, and it works almost as well.' Second, we have learnt that hand-tweaking can yield small further improvements, so we should keep inventing weights along with constraint formulas, and perhaps try out different weights when unsure.
 