Reference Resolution of Personal Pronouns

Diploma thesis by Carsten Kowalewski

This thesis addresses the automatic resolution of personal pronouns in natural-language texts. The system is based on the well-known C4.5 machine learning algorithm and uses syntax trees produced by the WCDG system. C4.5 is trained on feature vectors derived from the information contained in manually annotated texts and the corresponding syntax trees. To annotate coreferences, a web-based environment was designed and implemented. It was used to annotate several hundred texts of the Heise corpus, a collection of German online news articles covering computing in the broadest sense.
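To illustrate how a C4.5-style learner chooses which feature to test first, the following minimal sketch computes the gain ratio, the split criterion C4.5 is known for, over a handful of toy feature vectors. The feature names (`gender_agree`, `same_sentence`), the labels, and the data are hypothetical and not taken from the thesis. The sketch covers only discrete features and omits pruning and continuous attributes.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(rows, labels, feature):
    """Information gain of `feature`, normalised by its split information."""
    total = len(rows)
    # Group the class labels by the value the feature takes in each row.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature], []).append(label)
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy([row[feature] for row in rows])
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical pronoun/candidate pairs; feature names are illustrative only.
rows = [
    {"gender_agree": True, "same_sentence": True},
    {"gender_agree": True, "same_sentence": False},
    {"gender_agree": False, "same_sentence": True},
    {"gender_agree": False, "same_sentence": False},
]
labels = ["coref", "coref", "not_coref", "not_coref"]

# The feature with the highest gain ratio would become the root test.
best = max(rows[0], key=lambda f: gain_ratio(rows, labels, f))
```

In this toy data `gender_agree` separates the two classes perfectly, so it would be selected as the first test of the decision tree.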

The features extracted from the annotated texts include gender and number agreement, syntactic role, string similarity, and various distance measures. During training, the C4.5 algorithm builds decision trees, which are subsequently used to annotate texts automatically. Finally, the two sets of texts, manually annotated and automatically annotated, are compared using the scoring software of the 7th Message Understanding Conference (MUC-7).
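The MUC coreference score is commonly described as a link-based measure: recall counts how many links of the manually annotated (key) chains are preserved by the automatically annotated (response) chains, and precision swaps the two roles. The sketch below follows that general scheme under the assumption that mentions are already aligned and identified by simple ids. It is not the MUC-7 scoring software itself, and the function names and mention ids are hypothetical.

```python
def muc_recall(key, response):
    """Link-based recall of `response` chains against `key` chains."""
    # Map each mention to the index of the response chain containing it.
    mention_to_chain = {m: i for i, chain in enumerate(response) for m in chain}
    numerator = 0
    denominator = 0
    for chain in key:
        parts = set()
        unresolved = 0  # mentions missing from the response count as singletons
        for mention in chain:
            if mention in mention_to_chain:
                parts.add(mention_to_chain[mention])
            else:
                unresolved += 1
        # A chain of n mentions needs n - 1 links; each extra part is a missing link.
        numerator += len(chain) - (len(parts) + unresolved)
        denominator += len(chain) - 1
    return numerator / denominator if denominator else 0.0


def muc_scores(key, response):
    """Return (precision, recall, F1); precision is recall with roles swapped."""
    recall = muc_recall(key, response)
    precision = muc_recall(response, key)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: one manual chain, split into two by the system.
key = [{"m1", "m2", "m3", "m4"}]
response = [{"m1", "m2"}, {"m3", "m4"}]
precision, recall, f1 = muc_scores(key, response)
```

Here the system's two links are both correct, but one of the three links in the manual chain is missing, so precision is perfect while recall is two thirds.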

-- CarstenKowalewski -- 14 Apr 2006

