+ Constraint Weights
The most persistently asked question about our parsing method is,
`Where do you get your weights from?'. Usually, the answer is, `We
just make them up.' This has proven to provoke discontent among the
askers; for some reason, the thought of a grammar writer weighting
constraints according to their own judgement is viewed as
unsatisfactory.
Of course, the CDG writer also has the even more arbitrary and
difficult task of writing the constraints in the first place, but
strangely, no one ever asks, `Where do you get your formulas?'.
Apparently the notion is that in this day of corpus-based linguistics,
something as trivial as numbers should be learnable.
So far, we have persisted in attaching our weights by hand. It is
important to remember that the exact weight of a constraint is of
little importance. As long as a weight is not switched between nonzero
and zero (which can add or remove possibilities entirely from the
optimization problem), a conflict remains a conflict. Whether we
punish a misinflected determiner by 0.2 or 0.4 makes no great
difference; in a normal sentence the penalty never occurs with either
weight, whereas in a misinflected sentence it cannot be avoided anyway.
The exact weight of a constraint comes into play only when we have to
decide between several structures, all of them with conflicts in them.
Suppose we can avoid the penalty for the misinflected determiner by
pointing it to NIL instead. This will give a better solution if the
combined penalty for the fragment and the missing determiner is still
less than that of the misinflection.
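
To make the trade-off concrete, here is a minimal sketch. It assumes that the score of a structure is the product of the weights of all constraints it violates, so that weights closer to 0 punish harder. The weights for the fragment and the missing determiner are invented for illustration and are not taken from the grammar.

```python
from math import prod


def score(violated_weights):
    """Score of a structure, assuming weights combine by multiplication.

    A structure without conflicts scores 1.0 (empty product).
    """
    return prod(violated_weights)


# Hypothetical weights, chosen only for this example.
FRAGMENT = 0.5            # determiner points to NIL
MISSING_DETERMINER = 0.7  # noun is left without a determiner


def prefers_nil(misinflection_weight):
    """True if detaching the determiner scores better than keeping it."""
    attached = score([misinflection_weight])
    detached = score([FRAGMENT, MISSING_DETERMINER])
    return detached > attached


print(prefers_nil(0.2))  # True: 0.35 beats 0.2
print(prefers_nil(0.4))  # False: 0.35 loses to 0.4
```

Under these assumed numbers the choice between 0.2 and 0.4 decides which structure wins, although neither value matters when only one conflict is in play.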
It is this kind of trade-off that is affected most by varying
constraint weights. Experimentation along these lines is expensive;
you have to parse an entire corpus many times to determine which
weight of a particular constraint works best. Experiments with the
weights of the tagger and fragment constraints showed that the effect
of varying a single weight is already impossible to predict
accurately.
In this paper we reported on experiments to compute all constraint
weights by measuring how well particular vectors perform on a fixed
corpus.
Although we found that you can reach the same level of accuracy and speed
as with manually selected weights, the process proved extremely
expensive -- tens of thousands of runs across an entire corpus were
necessary. Note that this was a very small corpus with rather short
sentences; even so it took months.
Another idea would be to apply each constraint formula to a large
corpus and simply count how often it holds and fails; a constraint that
rarely fails should then receive a low weight and vice versa.
Let us see how exactly the computation should go. Take a constraint
like this:
{X!SYN} : Subjekt_Infinitiv_Numerus : agree : 0.1 :
X.label = SUBJ &
isa(X^,finit) &
isa(X@,Infinitiv)
->
X^number = sg;
This simply says that where an infinitive phrase is a subject, it
behaves like a singular NP, i.e. the verb should be in the singular.
This constraint is unlikely to fail in written language. Should we set
it to 0 simply because we have found no exception? If we don't, we
can't set any weights to 0, and that would blow up the optimization
problems beyond proportion. So let us say that a constraint without
exceptions in the corpus should be hard.
Now what do we do with a constraint that does have exceptions?
This constraint is at the opposite end of the scale:
{X!SYN} : Präpositionalattribut : category : 0.9 :
(X.label = PP | X.label = KOM)
->
~isa(X^,Nomen);
It says that a PP modifying an NP is dispreferred, but actually the
number of nouns vs. verbs carrying a PP is rather similar, certainly
in the same order of magnitude. The high score attached to this
constraint indicates that violating it cannot really be called a
language error; it is merely present as a guide in case the other
evidence balances out.
What weight would this constraint receive from a corpus count? If we
check 10000 syntactic edges and 400 of them violate the constraint,
then numerically it has a 4% chance of failing. But of course, in most
cases the constraint holds trivially; note that it can only fail at
all if the edge has a label of PP or KOM. It would probably be better
to count the number of cases where the constraint applies (i.e., the
premise holds) and the number of cases where the constraint fails (the
premise holds and the conclusion fails). On a typical corpus this
gives us a much more reasonable ratio of about 1:3.
The question remains which weight to extract from these figures. The
simplest formula is to divide the number of `fails' cases by the
number of `applies' cases, which yields a score of 1293 : 4612 =
0.28. This is very different from the 0.9 we had previously, but let
us accept it for the time being.
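
The counting itself can be sketched as follows. The edge format (a dict with `label` and `head_is_noun`) and both function names are made up for illustration; the real constraints would be evaluated by the parser, not by Python lambdas.

```python
def count_constraint(edges, premise, conclusion):
    """Return (applies, fails) for one constraint over a list of edges."""
    applies = fails = 0
    for edge in edges:
        if premise(edge):                # the constraint applies here
            applies += 1
            if not conclusion(edge):     # premise holds, conclusion fails
                fails += 1
    return applies, fails


def estimate_weight(applies, fails):
    """Simplest estimate: fails / applies.

    Returns None if the constraint never applies (nothing to estimate).
    """
    if applies == 0:
        return None
    return fails / applies


# Toy edges in a hypothetical format.
edges = [
    {"label": "PP", "head_is_noun": True},
    {"label": "PP", "head_is_noun": False},
    {"label": "KOM", "head_is_noun": False},
    {"label": "SUBJ", "head_is_noun": True},
]

applies, fails = count_constraint(
    edges,
    premise=lambda e: e["label"] in ("PP", "KOM"),
    conclusion=lambda e: not e["head_is_noun"],
)
print(applies, fails)                         # 3 1
print(round(estimate_weight(4612, 1293), 2))  # 0.28
```

Note that a constraint with no failures comes out as 0.0, which matches the decision to treat constraints without exceptions in the corpus as hard.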
Which corpus should we use for the counting? Since we have large tree
banks of German now, it is tempting to use them. Unfortunately there
is a great problem here. The published NEGRA or TIGER corpora contain
lots of vocabulary that is not in our lexicon, so estimating the
weight of constraints that rely on lexicalized features will simply
not work. For instance, we cannot accurately estimate how often a
transitive verb loses its object, since many of the verbs in TIGER are
not in our lexicon, and hence we cannot tell whether or not they are
transitive.
A second problem is that both the grammar and the conversion process
are incomplete. The NEGRA corpus, when translated automatically to
dependency format, contains many structures that our grammar forbids
with penalty 0. If we use this material for counting, lots of hard
constraints would become slightly soft, which leads to an immediate
explosion of the search space: the ambiguity of a medium-length
sentence suddenly rises from 24 to 880, which effectively prevents
parsing from making any reasonable attempt to solve the problem. The
best we can do, then, is to use the 1739 hand-approved dependency
trees from the NEGRA corpus.
Look at these numbers for parsing the heise sentences:
manual: time 24.56, quality 82.312
heise: time 26.82, quality 82.028
negra: time 97.68, quality 68.674
negrahard: time 40.89, quality 79.200
First of all, the original constraint weights perform best of all
despite having been rather haphazardly selected from a small number of
possible values. Reestimating the weights based on the heise corpus
results in a small loss of performance, even though the training used
the same data as the evaluation.
When estimating on the `golden' negra sentences, results are much
worse. Two reasons can immediately be given for that: first, the
remaining hard conflicts in the not-quite-golden annotations (less
than one in ten sentences on average) are enough to make a large
number of constraints intended to be hard slightly soft and ruin
performance. We can avoid this problem by reverting to the earlier
policy of keeping hard constraints hard no matter what. This hardened
weight vector was used in the last experiment, and is only slightly
worse than the one counted from the heise sentences themselves.
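
One reading of this hardening step, as a sketch: every constraint that is hard in the manual grammar stays at 0, whatever the count says. The function name, the dict representation and the example weight 0.004 are assumptions for illustration only.

```python
def harden(estimated, manual):
    """Return the estimated weights, but keep manually hard constraints hard."""
    return {
        name: 0.0 if manual.get(name) == 0.0 else weight
        for name, weight in estimated.items()
    }


# Hypothetical weight vectors; only the 0.9 and 0.28 appear in the text.
manual = {"hypothetical_hard_constraint": 0.0, "Präpositionalattribut": 0.9}
estimated = {"hypothetical_hard_constraint": 0.004, "Präpositionalattribut": 0.28}

hardened = harden(estimated, manual)
print(hardened)
```

The soft constraint keeps its counted weight; only the constraint that a few faulty annotations had made slightly soft is reset to 0.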
What is more surprising is the much higher parsing time. This is the
second influence at work: some very weak preference constraints such
as mod_direction were slightly tightened by the counting process, and
as a result slipped below frobbing's `ignore' threshold. Where with
the original grammar, frobbing never even attempts to correct such
conflicts, now it does and usually fails, with a lot of time spent for
nothing. Of course we could avoid this effect by forbidding the
estimation to decrease a weight across this threshold, but the parsing
accuracy is unlikely to rise much because of that.
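
Such a guard would be easy to write. The sketch below assumes that frobbing ignores conflicts whose weight is at or above the threshold; the value 0.95 is a placeholder, since the actual threshold is not given here.

```python
IGNORE_THRESHOLD = 0.95  # hypothetical value, not the real setting


def clamp_to_threshold(manual_weight, estimated_weight,
                       threshold=IGNORE_THRESHOLD):
    """Keep a weight from being estimated down across the ignore threshold.

    If the manual weight was at or above the threshold and the estimate
    falls below it, the threshold itself is returned instead.
    """
    if manual_weight >= threshold and estimated_weight < threshold:
        return threshold
    return estimated_weight


print(clamp_to_threshold(0.98, 0.90))  # 0.95: held at the threshold
print(clamp_to_threshold(0.98, 0.96))  # 0.96: still ignored, left alone
print(clamp_to_threshold(0.50, 0.30))  # 0.3: never ignored, left alone
```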
I see two possible conclusions. First, we can now honestly
answer, `We just make the weights up... but actually you can just estimate
them from a corpus, and it works almost as well.' Second, we have also
learnt that hand-tweaking can yield small further improvements, so we
should keep inventing weights along with constraint formulas, and
perhaps try out different weights if unsure.