PP Attachment

Prepositional phrases can attach at many places in a parse tree, and which attachment site is correct is a difficult decision (even human annotators usually disagree about 10% of the data). Hence, partial parsers typically do not make it. Other related problems exist such as attachment of adverbs or conjunctional phrases, but pp attachment is the poster child of syntactic ambiguity resolution. In fact, it was to represent the large number of possible pp attachments that Hiroshi Maruyama first introduced CDG.

Here are some research reports about solving the pp attachment task automatically.

Whittemore, Ferrara & Brunner 1990:Empirical study of Predictive Powers of Simple Attachment Schemes for Post-modifier Prepositional Phrases

The authors contrast the performance of traditional linguistic attachment theories such as Minimal Attachment with their own scheme of Lexical Preference, which they find to be superior. In the restricted-domain, typed-input, low-speed context of travel planning assistance, "attachments are essentially always predictable". However, their procedure is a complicated multi-step algorithm using hand-annotated features of the words, such as "this noun takes a temporal qualifier".

Donald Hindle & Mats Rooth 1993: Structural Ambiguity and Lexical Relations

This is the seminal paper about automatic pp attachment. It formulates the canonical task: given a verb, a following noun and a preposition, decide whether the preposition attaches to the verb or the noun. (Note that this is a lot easier than what a parser actually needs to do.)

From the output from a partial parser, 224,000 triples were extracted and used as training material. The Lexical Attraction score is essentially the logged ratio of the frequencies of verb/noun attachment for a particular triple. The method achieves 80% correctness, where human annotators achieve 85% (88% if they can see the entire example and not only the three head words).

Adwait Ratnaparkhi, Jeff Reynar and Salim Roukos 1994: A Maximum Entropy Model for Prepositional Phrase Attachment

Here the problem is extended from judging a triple of words to a quadruple by also looking at the object of the preposition in question. The task is solved with a maximum entropy model, and 81.6% correctness are reported on the WSJ corpus.

Eric Brill and Philip Resnik 1994: A Rule-Based Approach to Prepositional Phrase Structure Attachment Disambiguation

Eric Brill, of course, solves the same problem by transformation-based error-driven learning. He achieves 80.8% accuracy with 471 transformations. Interestingly, he then uses the noun classes from WordNet to create more general rules, and achieves 81.8% accuracy with only 266 rules. The pros and cons are much the same as for transformation-based POS tagging: huge learning time, and the method is supervised-only.

Michael Collins and James Brooks 1995: Prepositional Phrase Attachment through a Backed-Off Model

In a surprising step backward, Collins and Brooks solve the attachment problem even better by simple tuple-counting. When no instance of a 4-tuple is known from training, they back off to triples and pairs, They show that you should back off only to those tuples that contain the preposition. The method achieves 84.5% accuracy. They also note that, contrary to popular wisdom, low-frequency tuples should not be discarded, since they contribute about 3% to the results.

Adwait Ratnaparkhi 1997: A Simple Introduction to Maximum Entropy Models for Natural Language Processing

This is not about pp attachment per se, but is a very good explanation of what maximum entropy means and how to compute it.

Jiri Stetina and Makato Nagao 1997: Corpus Based PP Attachment Ambiguity Resolution with a Semantic Dictionary

This work uses an ID3 decision tree to predict noun or verb attachment on the same now-standard data, and achieves an unsurpassed 88.1% of accuracy. The authors explain that they were inspired by the fact that Collins's backed-off model performs at 92.6% on instances when it does not have to back off (i.e. the entire quadruple has been seen before). Hence, one should try to increase the number of times that full matches happen. For instance, `buy books for children' should be considered as the same as `buy magazines for children'.

To do this, the WordNet hierarchy was employed and a concept of semantic distance was introduced (a mixture of path length between two nodes and their absolute depths). An entire unsupervised algorithm for word sense disambiguation was invented to prepare the training data, since hand-disambiguating word senses in the WSJ corpus is infeasible. The resulting decision tree is extremely efficient and performs almost as well as human readers (when they see only the quadruple).

Paola Merlo, Matthew W. Crocker, and Cathy Berthouzoz 1997: Attaching Multiple Prepositional Phrases: Generalized Backed-off Estimation

This work generalizes Collins's back-off model to deal with more than one PP. (The main differences are that the kernel noun is never used.) If a second PP must be attached, it is estimated under the assumption that the first PP stays where it was already attached. In this way the rich triple counts from the original algorithm can be reused for further decisions rather than the extremely poor 6-tuple and 8-tuple counts one would otherwise have to use. This is justified by observing that the distance of PP to attachment does not change the attachment behaviour very much.

Collins's figure is recreated for the first PP, but the second is only correct 69.6% of the time and the third, 43.6%. This is explained by both the larger space of possibilities and the sparser data for training.

Michael Niemann 1998: Determining PP Attachment through Semantic Associations and Preferences

Niemann employs semantic classes (such as `action' or `artifact') instead of individual words to learn attachment preferences. His Lexass PP parser operates in several stages and makes a decision whenever the criteria of the current stage seem decisive (are above the cutoff). It can resolve multi-way ambiguities rather than the one-bit problem usually posed. On two-way problems, it also achieves 80% when using all available data, but in all other instances much less. The method requires additional hand-annotation of the training corpus for word sense and syntactic function.

Adwait Ratnaparkhi 1998: Statistical Models for Unsupervised Prepositional Phrase Attachment

This approach again uses the bigram count method to achieve 81.9% accuracy. The new technique is to use unannotated text, but about 50 times as much of it as ever before, and select only those cases where the attachment is unambiguous even without annotation (e.g.: `lawyers in other jurisdictions'). With this trick, unsupervised learning of attachment becomes as powerful as supervised learning. The results are presented for English and Spanish, but unfortunately none of the heuristics work for German.

Deniz Yuret 1998: Discovery of Linguistic Relations Using Lexical Attraction

In his PhD thesis, Yuret builds a program that discovers relations between words in free text through lexical attraction. Through feedback from the processor to the memory of the system, even long-distance relations can be found. His representation differs from dependency grammar in that the relations are always undirected. He gives impressive success stories (a perfect dependency tree for the sentence `these people also want more government money for education', with no prior information whatsoever), but the overall accuracy is only 50% even between function words.

Josep M. Sopena; Agusti LLoberas; Joan L. Moliner 1998: A Connectionist Approach to Prepositional Phrase Attachment for Real World Texts

The quadruple task is solved on examples from the WSJ corpus using a feed-forward neural network with one hidden layer. Wordnet is used but simplified to those classes that actually occur in the training ans test sets. All classes of all senses of all words in the quadruple are presented as input simultaneously (and the ouput is a single bit, V or not V). 86.8% are achieved in the best configuration.

The authors ascribe this improvement over the previous result using only class information to a better use of class hierarchies. For instance, Wordnet is called an inadequate hierarchy since it is handcrafted. This approach can afford to present all components of the quadruple simlutaneously because the parameter space is kept small by using classes rather than words themselves.

It sounds very surprising that the identity of verbs should be completely ignored. After all, `live' and `thrive' are fairly close in meaning, but `live' selects `by' while `thrive' prefers `on'. However, `Verbs can be classified on the basis of the kind of prepositions they select', and that is in fact done for the best figures. That would seem to mean that the technique does not use only classes, as I understand that word. It is also possible that English is more logical in pairing senses with prepositions (rather than lexemes with prepositions, as German does); in fact, both `live on' and `thrive by' are perfectly acceptable as well. Interesting.

Patrick Pantel and Dekang Lin 2000: An Unsupervised Approach to Prepositional Phrase Attachment using Contextually Similar Words

This paper treats the same 3097 WSJ quadruples as Rathnaparkhi six years previously and achieves similar results, but with only unsupervised methods: 125,000,000 words of newspaper text were parsed with Minipar, which creates disambiguated dependency trees. A special-purpose helper program finds additional possible attachment sites. A training corpus of unambiguous examples is then constructed by using only those sentences in which the helper fails.

Classification is done via Roth's generalized feature model which allows up to 15 features (namely all possible subsequences of V, N1, P and N2), but only features with P in it are used since Collins found that those are most relevant. Also, features with both V and N1 in them are discarded. To increase the number of matches, contextually similar words are computed by observing which lexical items appear in the same relation to a particular word. This allows the set eat:cook,drink,consume etc. to be extracted. The final result is 84.6% accuracy once obvious errors in the test set (the tagged as noun) are removed.

Dimitrios Kokkinakis 2000: Supervised PP-Attachment for Swedish (Combining Unsupervised and Supervised Training Data)

Kokkinakis solves the quadruple task for Swedish with Memory-Based Learning. Part of the training data comes from the extensively annotated Gothenburg Lexical Database, which includes valency frames for verbs etc. Other training material was extracted from raw text exploiting the heuristic `the attachment is always V if N1 is a pronoun', and yet more by manually reviewing the output of the shallow parser Cass-SWE. From a large number of metrics and algorithms, the best result on 207 test cases is 70.4% when using only GLDB information, and 86.4% when using all information available.

The author claims that the `comparable and even slightly better' results are due to the multiple sources of information and lemmatization. But it is more likely that lemmatization is necessary in Swedish but not in English, and that the higher number is due to extensive parameter tuning and other limiting factors - for instance, the 38 examples that the human arbiters disagreed on were excluded from the test set, which of course makes the task easier. The entire writeup is somewhat confused, the author claiming both that `The MBL method is not sensitive to sparse data' and `a disadvantage with MBL [...] is that the MBL's learning component requires a large number of instances [...] for the good performance of the algorithms.'

Mark McLauchlin 2001: Maximum Entropy Models and Prepositional Phrase Ambiguity

This thesis solves the binary attachment problem once more with maximum entropy techniques, bringing the final performance to 85.5%. The new contribution is to bring in knowledge from latent semantic analysis (finding lexical preferences from large unlabelled corpora). Ratnaparkhi's 1994 data set is used - it seems to have become a standard for this kind of task.

Martin Volk 2001: Exploiting the WWW as a corpus to resolve PP Attachment Ambiguities

This appears to be both the first application of PP attachment techniques to German, and the first to use search engine queries as its corpus. 3000 sentences from the Computerwoche CD were decorated with the usual quintuples as test data. AltaVista's NEAR operator was used to estimate co-occurrence frequencies in the entire German WWW. The work focuses on trading off reliability and coverage, and so not all cases are always solved. In the final configuration, lemmatization, compound analysis and backing off to one data point is used, and all remaining cases are simply assigned to N. This achieves 73.1% accuracy at full coverage.

Martin Volk 2002: Combining unsupervised and supervised methods for PP attachment disambiguation

This paper anticipates my of our own decisions in PP attachment for German. Volk uses most of the same obvious steps as we did: co-occurrence frequencies as probability estimates, de-compounding, lemmatization, and even the very same extraction method (prepositions immediately following a noun or in the same clause as a verb, and the same noun boost factor: 5). 5803 NEGRA quadruples are used for supervised training and Computer-Zeitung for unsupervised training.

Comparisons of both show that supervised learning is better in principle but not actually because of sparse data; so a 9-step back-off algorithm is finally used that tries in turn to use FVG triples (`zur Anwendung kommen'), then supervised quadruples, then supervised triples, then unsupervised triples, etc. The final result is 80.98%, which might once again indicate that German word order makes the task harder than in English.

Eric Gaussier and Nicola Cancedda 2002: Probabilistic models for PP-attachment Resolution and NP Analysis

The authors present a probabilistic model that generalizes several previous approaches and explicitly list the independence assumptions that it implies, e.g. "The possible tree shapes are uniformly distributed", "The POS of a nucleus depends only on its semantic class". A `nucleus' is a word sequence that contains the lexical head whose internal structure is considered unambiguous; nuclei are generalized to safe chains: sequences of nuclei where all following nuclei are attached to other preceding nuclei. A complex generative model of language is constructed that can be estimated via Maximum Likelihood. Using French dictionaries with extensive subcategorization info, PP quadruple attachment achieved 85% accuracy on sentences from Le Monde. But general inter-nucleus attachment achieves only 73.7%, which is not much more than the 72.1% baseline.

Hang Li 2002: Word Clustering and Disambiguation Based on Co-occurrence Data

This work addresses the quadruple task using word counts from the tagged (not the parsed) WSJ corpus (126,000 sentences). An algorithm called 2D-clustering is used which merges two noun classes and two verb classes in alternation, while minimizing the total data description length (this is equivalent with maximizing the mutual information). By using co-occurrence in statistically extracted verb phrases, very plausible groupings are achieved such as "acquire/buy/hold/purchase/sell" or "share/asset". Ten-fold cross-validation is then performed over the parsed WSJ corpus. Unfortunately, the results are very difficult to decipher. Li claims an accuracy of 98.6% at a coverage of 50.8%, and an accuracy of 76.2% at total coverage, but the abstract claims 88.2%. That number, however, only appears in the table "Comparison with previous work", where 88.2% is assigned to the "Back-Off" method, a complex five-step method which involves 2D clustering, previous work by Li and Abe involving both an automatically extracted thesaurus and WordNet, and further back-off techniques. (Presumably this is at a coverage of 100%, and the table caption is just misleading?)

Núria Gala Pavia and Salah Aït-Mokhtar 2003: Lexicalising a robust grammar parser using the WWW

This contribution seems extremely exciting at first glance, since the authors use a handwritten robust dependency-based parser of French with soft rules (XIL). However, no explanation of the actual rules is found here, no accuracy is ever given for full parsing (only for PP attachment), and the test corpus comprises 89 sentences only.

The novelty here is that ambiguously parsed PP attachments are fed to the search engine Altavista in the format X Prep NEAR N, for instance: "(impact sur) NEAR comptes". The resulting documents are then parsed again, and only those in which the parser indicates that this is actually a PP and not a mere nearness are counted. For instance, to estimate how probable the PP (impact+sur+comptes) is, the phrase `impact sur les comptes' are counted as a positive example, but `impact sur la fermeture de vos comptes' is not.

These counts are then used to form a preference lexicon which is used in a second run of the entire parsing system. This increases PP attachment precision from 71% to 83%, but decreases recall from 92% to 85%.

Lee Schwartz, Takako Aikawa and Chris Quirk 2003: Disambiguation of English PP Attachment using Multilingual Aligned Data

A bilingual aligned English-Japanese corpus was used to improve PP attachment in machine translation. The key is that Japanese is syntactically unambiguous with respect to the classical quadruple task, so an aligned bilingual J-E corpus can be used as a corpus that is annotated as far as PP attachment is concerned. The MSR-MT system produces low PP attachments by default, so the corpus was used merely to collect information on when a verb has greater affinity for a PP than a noun does. With an additional dictionary of such V+P pairs, the resulting translations were judged by human informers to be significantly better, whether they were from English to Japanese or to Spanish.

Mark McLauchlan 2004: Thesauruses for Prepositional Phrase Attachment

McLauchlan reimplements the tuple-counting model of Collins and Brooks but with a different smoothing method. Smoothing means adjusting probability estimates to avoid the low scores caused by sparse data; usually this is done by using distributions with less context, which consequently are expected to have less of a problem with sparse data. Instead, probabilities of similar tuples are added to the function in the first place, weighted by their similarity.

Two general-purpose thesauruses were used as sources of similarity measures. In addition, a specialized thesaurus was created by running the extraction algorithm on a corpus consisting solely of PP tuples. The results:

  • using thesaurus smoothing gets the model from 84.3% to 85.2%
  • the general thesauruses actually worked better then the specialized one
  • similarity between entire tuples is better measured by rank counting than by average similarity of its components.
 
This site is powered by FoswikiCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Foswiki? Send feedback