PP Attachment
Prepositional phrases can attach at many places in a parse tree, and
which attachment site is correct is a difficult decision (even human
annotators usually disagree about 10% of the data). Hence, partial
parsers typically do not make it. Other related problems exist such as
attachment of adverbs or conjunctional phrases, but pp attachment is
the poster child of syntactic ambiguity resolution. In fact, it was to
represent the large number of possible pp attachments that Hiroshi
Maruyama first introduced CDG.
Here are some research reports about solving the pp attachment task
automatically.
Whittemore, Ferrara & Brunner 1990:
Empirical study of Predictive Powers of Simple Attachment
Schemes for Post-modifier Prepositional Phrases
The authors contrast the performance of traditional linguistic
attachment theories such as Minimal Attachment with their own scheme
of Lexical Preference, which they find to be superior. In the
restricted-domain, typed-input, low-speed context of travel planning
assistance, "attachments are essentially always predictable".
However, their procedure is a complicated multi-step algorithm using
hand-annotated features of the words, such as "this noun takes a
temporal qualifier".
Donald Hindle & Mats Rooth 1993:
Structural Ambiguity and Lexical Relations
This is the seminal paper about automatic pp attachment. It formulates
the canonical task: given a verb, a following noun and a preposition,
decide whether the preposition attaches to the verb or the noun. (Note
that this is a lot easier than what a parser actually needs to do.)
From the output from a partial parser, 224,000 triples were extracted
and used as training material. The Lexical Attraction score is
essentially the logged ratio of the frequencies of verb/noun
attachment for a particular triple. The method achieves 80%
correctness, where human annotators achieve 85% (88% if they can see
the entire example and not only the three head words).
Adwait Ratnaparkhi, Jeff Reynar and Salim Roukos 1994:
A Maximum Entropy Model for Prepositional Phrase Attachment
Here the problem is extended from judging a triple of words to a
quadruple by also looking at the object of the preposition in
question. The task is solved with a maximum entropy model, and 81.6%
correctness are reported on the WSJ corpus.
Eric Brill and Philip Resnik 1994:
A Rule-Based Approach to Prepositional Phrase Structure
Attachment Disambiguation
Eric Brill, of course, solves the same problem by transformation-based
error-driven learning. He achieves 80.8% accuracy with 471
transformations. Interestingly, he then uses the noun classes from
WordNet to create more general rules, and achieves 81.8% accuracy
with only 266 rules. The pros and cons are much the same as for
transformation-based POS tagging: huge learning time, and the method
is supervised-only.
Michael Collins and James Brooks 1995:
Prepositional Phrase Attachment through a Backed-Off Model
In a surprising step backward, Collins and Brooks solve the attachment
problem even better by simple tuple-counting. When no instance of a
4-tuple is known from training, they back off to triples and pairs,
They show that you should back off only to those tuples that contain
the preposition. The method achieves 84.5% accuracy. They also note
that, contrary to popular wisdom, low-frequency tuples should
not be
discarded, since they contribute about 3% to the results.
Adwait Ratnaparkhi 1997:
A Simple Introduction to Maximum Entropy Models for Natural
Language Processing
This is not about pp attachment per se, but is a very good explanation
of what maximum entropy means and how to compute it.
Jiri Stetina and Makato Nagao 1997:
Corpus Based PP Attachment Ambiguity Resolution with a Semantic Dictionary
This work uses an ID3 decision tree to predict noun or verb attachment
on the same now-standard data, and achieves an unsurpassed 88.1% of
accuracy. The authors explain that they were inspired by the fact that
Collins's backed-off model performs at 92.6% on instances when it does
not have to back off (i.e. the entire quadruple has been seen before).
Hence, one should try to increase the number of times that full
matches happen. For instance, `buy books for children' should be
considered as the same as `buy magazines for children'.
To do this, the WordNet hierarchy was employed and a concept of
semantic distance was introduced (a mixture of path length between two
nodes and their absolute depths). An entire unsupervised algorithm for
word sense disambiguation was invented to prepare the training data,
since hand-disambiguating word senses in the WSJ corpus is infeasible.
The resulting decision tree is extremely efficient and performs almost
as well as human readers (when they see only the quadruple).
Paola Merlo, Matthew W. Crocker, and Cathy Berthouzoz 1997:
Attaching Multiple Prepositional Phrases: Generalized Backed-off
Estimation
This work generalizes Collins's back-off model to deal with more than
one PP. (The main differences are that the kernel noun is never used.)
If a second PP must be attached, it is estimated under the assumption
that the first PP stays where it was already attached. In this way the
rich triple counts from the original algorithm can be reused for
further decisions rather than the extremely poor 6-tuple and 8-tuple
counts one would otherwise have to use. This is justified by observing
that the distance of PP to attachment does not change the attachment
behaviour very much.
Collins's figure is recreated for the first PP, but the second is only
correct 69.6% of the time and the third, 43.6%. This is explained by
both the larger space of possibilities and the sparser data for
training.
Michael Niemann 1998:
Determining PP Attachment through Semantic Associations and Preferences
Niemann employs semantic classes (such as `action' or `artifact')
instead of individual words to learn attachment preferences. His
Lexass
PP parser operates in several stages and makes a decision
whenever the criteria of the current stage seem decisive (are above
the cutoff). It can resolve multi-way ambiguities rather than the
one-bit problem usually posed. On two-way problems, it also achieves
80% when using all available data, but in all other instances much
less. The method requires additional hand-annotation of the training
corpus for word sense and syntactic function.
Adwait Ratnaparkhi 1998:
Statistical Models for Unsupervised Prepositional Phrase Attachment
This approach again uses the bigram count method to achieve 81.9%
accuracy. The new technique is to use unannotated text, but about 50
times as much of it as ever before, and select only those cases where
the attachment is unambiguous even without annotation (e.g.:
`lawyers in other jurisdictions'). With this trick, unsupervised
learning of attachment becomes as powerful as supervised learning.
The results are presented for English and Spanish, but
unfortunately none of the heuristics work for German.
Deniz Yuret 1998:
Discovery of Linguistic Relations Using Lexical Attraction
In his PhD thesis, Yuret builds a program that discovers relations
between words in free text through lexical attraction. Through
feedback from the processor to the memory of the system, even
long-distance relations can be found. His representation differs from
dependency grammar in that the relations are always undirected. He
gives impressive success stories (a perfect dependency tree for the
sentence `these people also want more government money for education',
with no prior information whatsoever), but the overall accuracy is
only 50% even between function words.
Josep M. Sopena; Agusti LLoberas; Joan L. Moliner 1998:
A Connectionist Approach to Prepositional Phrase Attachment for
Real World Texts
The quadruple task is solved on examples from the WSJ corpus using a
feed-forward neural network with one hidden layer. Wordnet is used but
simplified to those classes that actually occur in the training ans
test sets. All classes of all senses of all words in the quadruple are
presented as input simultaneously (and the ouput is a single bit, V or
not V). 86.8% are achieved in the best configuration.
The authors ascribe this improvement over the previous result using
only class information to a better use of class hierarchies. For
instance, Wordnet is called an inadequate hierarchy since it is
handcrafted. This approach can afford to present all components of the
quadruple simlutaneously because the parameter space is kept small by
using classes rather than words themselves.
It sounds very surprising that the identity of verbs should be
completely ignored. After all, `live' and `thrive' are fairly close in
meaning, but `live' selects `by' while `thrive' prefers `on'. However,
`Verbs can be classified on the basis of the kind of prepositions they
select', and that is in fact done for the best figures. That would
seem to mean that the technique does
not use only classes, as I
understand that word. It is also possible that English is more
logical in pairing senses with prepositions (rather than lexemes with
prepositions, as German does); in fact, both `live on' and `thrive by'
are perfectly acceptable as well. Interesting.
Patrick Pantel and Dekang Lin 2000:
An Unsupervised Approach to Prepositional Phrase Attachment using
Contextually Similar Words
This paper treats the same 3097 WSJ quadruples as Rathnaparkhi six
years previously and achieves similar results, but with only
unsupervised methods: 125,000,000 words of newspaper text were parsed
with
Minipar, which creates disambiguated dependency trees. A
special-purpose helper program finds additional possible attachment
sites. A training corpus of unambiguous examples is then constructed
by using only those sentences in which the helper fails.
Classification is done via Roth's generalized feature model which
allows up to 15 features (namely all possible subsequences of V, N1, P
and N2), but only features with P in it are used since Collins found
that those are most relevant. Also, features with both V and N1 in
them are discarded. To increase the number of matches,
contextually
similar words are computed by observing which lexical items appear
in the same relation to a particular word. This allows the set
eat:cook,drink,consume etc. to be extracted. The final result is 84.6%
accuracy once obvious errors in the test set (
the tagged as noun)
are removed.
Dimitrios Kokkinakis 2000:
Supervised PP-Attachment for Swedish (Combining Unsupervised and
Supervised Training Data)
Kokkinakis solves the quadruple task for Swedish with Memory-Based
Learning. Part of the training data comes from the extensively
annotated Gothenburg Lexical Database, which includes valency frames
for verbs etc. Other training material was extracted from raw text
exploiting the heuristic `the attachment is always V if N1 is a
pronoun', and yet more by manually reviewing the output of the shallow
parser Cass-SWE. From a large number of metrics and algorithms, the
best result on 207 test cases is 70.4% when using only GLDB
information, and 86.4% when using all information available.
The author claims that the `comparable and even slightly better'
results are due to the multiple sources of information and
lemmatization. But it is more likely that lemmatization is necessary
in Swedish but not in English, and that the higher number is due to
extensive parameter tuning and other limiting factors - for instance,
the 38 examples that the human arbiters disagreed on were excluded
from the test set, which of course makes the task easier. The entire
writeup is somewhat confused, the author claiming both that `The MBL
method is not sensitive to sparse data' and `a disadvantage with MBL
[...] is that the MBL's learning component requires a large number of
instances [...] for the good performance of the algorithms.'
Mark McLauchlin 2001:
Maximum Entropy Models and Prepositional Phrase Ambiguity
This thesis solves the binary attachment problem once more with
maximum entropy techniques, bringing the final performance to 85.5%.
The new contribution is to bring in knowledge from
latent semantic
analysis (finding lexical preferences from large unlabelled
corpora). Ratnaparkhi's 1994 data set is used - it seems to have
become a standard for this kind of task.
Martin Volk 2001:
Exploiting the WWW as a corpus to resolve PP Attachment Ambiguities
This appears to be both the first application of PP attachment
techniques to German, and the first to use search engine queries as
its corpus. 3000 sentences from the
Computerwoche CD were
decorated with the usual quintuples as test data. AltaVista's
NEAR
operator was used to estimate co-occurrence frequencies in the
entire German WWW. The work focuses on trading off reliability and
coverage, and so not all cases are always solved. In the final
configuration, lemmatization, compound analysis and backing off to one
data point is used, and all remaining cases are simply assigned to
N
. This achieves 73.1% accuracy at full coverage.
Martin Volk 2002:
Combining unsupervised and supervised methods for PP attachment
disambiguation
This paper anticipates my of our own decisions in PP attachment for
German. Volk uses most of the same obvious steps as we did:
co-occurrence frequencies as probability estimates, de-compounding,
lemmatization, and even the very same extraction method (prepositions
immediately following a noun or in the same clause as a verb,
and
the same noun boost factor: 5). 5803 NEGRA quadruples are used for
supervised training and
Computer-Zeitung for unsupervised training.
Comparisons of both show that supervised learning is better in
principle but not actually because of sparse data; so a 9-step
back-off algorithm is finally used that tries in turn to use FVG
triples (`zur Anwendung kommen'), then supervised quadruples,
then supervised triples, then unsupervised triples, etc. The final
result is 80.98%, which might once again indicate that German word
order makes the task harder than in English.
Eric Gaussier and Nicola Cancedda 2002:
Probabilistic models for PP-attachment Resolution and NP Analysis
The authors present a probabilistic model that generalizes several
previous approaches and explicitly list the independence assumptions
that it implies, e.g. "The possible tree shapes are uniformly
distributed", "The POS of a nucleus depends only on its semantic
class". A `nucleus' is a word sequence that contains the lexical head
whose internal structure is considered unambiguous; nuclei are
generalized to
safe chains: sequences of nuclei where all
following nuclei are attached to other preceding nuclei. A complex
generative model of language is constructed that can be estimated via
Maximum Likelihood. Using French dictionaries with extensive
subcategorization info, PP quadruple attachment achieved 85% accuracy
on sentences from
Le Monde. But general inter-nucleus
attachment achieves only 73.7%, which is not much more than the 72.1%
baseline.
Hang Li 2002:
Word Clustering and Disambiguation Based on Co-occurrence Data
This work addresses the quadruple task using word counts from the
tagged (not the parsed) WSJ corpus (126,000 sentences). An algorithm
called 2D-clustering is used which merges two noun classes and two
verb classes in alternation, while minimizing the total data
description length (this is equivalent with maximizing the mutual
information). By using co-occurrence in statistically extracted verb
phrases, very plausible groupings are achieved such as
"acquire/buy/hold/purchase/sell" or "share/asset". Ten-fold
cross-validation is then performed over the parsed WSJ corpus.
Unfortunately, the results are very difficult to decipher. Li claims
an accuracy of 98.6% at a coverage of 50.8%, and an accuracy of 76.2%
at total coverage, but the abstract claims 88.2%. That number,
however, only appears in the table "Comparison with previous work",
where 88.2% is assigned to the "Back-Off" method, a complex five-step
method which involves 2D clustering, previous work by Li and Abe
involving both an automatically extracted thesaurus and WordNet,
and further back-off techniques. (Presumably this is at a coverage of
100%, and the table caption is just misleading?)
Núria Gala Pavia and Salah Aït-Mokhtar 2003:
Lexicalising a robust grammar parser using the WWW
This contribution seems extremely exciting at first glance, since the
authors use a handwritten robust dependency-based parser of French
with soft rules (XIL). However, no explanation of the actual rules is
found here, no accuracy is ever given for full parsing (only for PP
attachment), and the test corpus comprises 89 sentences only.
The novelty here is that ambiguously parsed PP attachments are fed to
the search engine Altavista in the format X Prep NEAR N, for instance:
"(impact sur) NEAR comptes". The resulting documents are then parsed
again, and only those in which the parser indicates that this is
actually a PP and not a mere nearness are counted. For instance, to
estimate how probable the PP (impact+sur+comptes) is, the
phrase `impact sur les comptes' are counted as a positive example, but
`impact sur la fermeture de vos comptes' is not.
These counts are then used to form a preference lexicon which is used
in a second run of the entire parsing system. This increases PP
attachment precision from 71% to 83%, but decreases recall from 92% to
85%.
Lee Schwartz, Takako Aikawa and Chris Quirk 2003:
Disambiguation of English PP Attachment using Multilingual Aligned
Data
A bilingual aligned English-Japanese corpus was used to improve
PP attachment in machine translation. The key is that Japanese is
syntactically unambiguous with respect to the classical quadruple
task, so an aligned bilingual J-E corpus can be used as a corpus that
is annotated as far as PP attachment is concerned. The MSR-MT system
produces low PP attachments by default, so the corpus was used merely
to collect information on when a verb has greater affinity for a PP
than a noun does. With an additional dictionary of such V+P pairs, the
resulting translations were judged by human informers to be
significantly better, whether they were from English to Japanese or to
Spanish.
Mark McLauchlan 2004:
Thesauruses for Prepositional Phrase Attachment
McLauchlan reimplements the tuple-counting model of Collins and Brooks
but with a different smoothing method. Smoothing means adjusting
probability estimates to avoid the low scores caused by sparse data;
usually this is done by using distributions with less context, which
consequently are expected to have less of a problem with sparse data.
Instead, probabilities of similar tuples are added to the function in
the first place, weighted by their similarity.
Two general-purpose thesauruses were used as sources of similarity
measures. In addition, a specialized thesaurus was created by running
the extraction algorithm on a corpus consisting solely of PP tuples.
The results:
- using thesaurus smoothing gets the model from 84.3% to 85.2%
- the general thesauruses actually worked better then the
specialized one
- similarity between entire tuples is better measured by rank
counting than by average similarity of its components.