XML-based Integration of Natural Language Processing Components

Ulrich Schäfer, DFKI, Language Technology Lab, Saarbrücken, Germany


While XML and its predecessor SGML have long been used for offline corpus annotation, a growing number of natural language processors now produce XML output online. This course will focus on the XML-based integration of NLP components, an approach that can help to increase robustness and reduce ambiguity in natural language processing systems.

After a brief introduction to XML, Unicode, DTD and XML Schema, we will turn to technologies and applications for integrating the XML output of multiple NLP processors. We will study XML formats and integration issues for part-of-speech tagging, morpho-syntax, named entities, chunking, parsing, semantics and ontologies, together with current standardization efforts (e.g. ISO, W3C) and general concepts such as standoff annotation and multi-dimensional markup.
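
To make the idea of standoff annotation concrete, the following minimal sketch keeps the text untouched and stores two annotation layers separately, each pointing into the text by character offsets. The element names, attribute names and sample sentence are assumptions made for illustration only; they do not follow any particular standard or course material.

```python
import xml.etree.ElementTree as ET

# The primary text is never modified; every layer refers to it by offsets.
TEXT = "Peter lives in Saarbrücken."

# Hypothetical token layer (element and attribute names are illustrative).
TOKENS_XML = """<tokens>
  <w start="0" end="5" pos="NNP"/>
  <w start="6" end="11" pos="VBZ"/>
  <w start="12" end="14" pos="IN"/>
  <w start="15" end="26" pos="NNP"/>
  <w start="26" end="27" pos="."/>
</tokens>"""

# Hypothetical named-entity layer over the same text.
ENTITIES_XML = """<entities>
  <ne start="0" end="5" type="person"/>
  <ne start="15" end="26" type="location"/>
</entities>"""


def resolve(layer_xml, text, label_attr):
    """Return (label, surface string) pairs for every span in a layer."""
    root = ET.fromstring(layer_xml)
    spans = []
    for elem in root:
        start, end = int(elem.get("start")), int(elem.get("end"))
        spans.append((elem.get(label_attr), text[start:end]))
    return spans


tokens = resolve(TOKENS_XML, TEXT, "pos")
entities = resolve(ENTITIES_XML, TEXT, "type")
print(tokens)
print(entities)
```

Because both layers address the same offsets, they can overlap or disagree on boundaries without interfering with each other, which is the main motivation for keeping annotation separate from the text.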

We will then introduce XML integration and query languages such as XPath, XSLT and XQuery, and, as a practical exercise, use them to integrate real NLP markup. Finally, existing tools and architecture frameworks for XML NLP markup integration will be presented.
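
As a rough illustration of this kind of exercise, the sketch below merges the output of two hypothetical components, a part-of-speech tagger and a named-entity recognizer, using the XPath subset available in Python's standard library. The XML formats, the `ref` linking attribute and the function name `merge` are assumptions for this example, not the formats of any real tool; a full XPath, XSLT or XQuery processor would offer considerably more expressive power.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagger output: inline tokens with ids and POS tags.
TAGGER_XML = """<tagger>
  <w id="t1" pos="NNP">Peter</w>
  <w id="t2" pos="VBZ">lives</w>
  <w id="t3" pos="IN">in</w>
  <w id="t4" pos="NNP">Saarbrücken</w>
</tagger>"""

# Hypothetical NER output: entities that point at token ids.
NER_XML = """<ner>
  <ne type="person" ref="t1"/>
  <ne type="location" ref="t4"/>
</ner>"""


def merge(tagger_xml, ner_xml):
    """Copy each entity type onto the token it references."""
    tagger = ET.fromstring(tagger_xml)
    ner = ET.fromstring(ner_xml)
    for ne in ner.findall("./ne"):
        # XPath attribute predicate: select the token with the matching id.
        token = tagger.find(f"./w[@id='{ne.get('ref')}']")
        if token is not None:
            token.set("ne", ne.get("type"))
    return tagger


merged = merge(TAGGER_XML, NER_XML)

# Query the merged markup: all proper nouns with their entity type, if any.
proper_nouns = [(w.text, w.get("ne")) for w in merged.findall("./w[@pos='NNP']")]
print(proper_nouns)
```

Linking by token id rather than by position keeps the two outputs independent, so either component can be replaced as long as the ids remain stable.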

