
ICOPOST - Ingo's Collection Of POS Taggers



News

ICOPOST was downloaded 255 times between June 2001 and September 2002. For the last few months I have not had the time to maintain the package, and I definitely won't have it in the future. Therefore, I decided to open the development even further.

As of 2002/9/23 ICOPOST has been renamed ACOPOST (A Collection Of POS Taggers) and has moved to the SourceForge repository of open source projects. Please visit the project page at
http://sourceforge.net/projects/acopost/

I'm looking for developers, maintainers, admins, users etc. who can contribute to the survival of the project!

What it is

Part-of-speech (POS) tagging is the task of assigning grammatical classes to words in a natural language sentence. It's important because subsequent processing stages (such as parsing) become easier if the word class for a word is available.

ICOPOST is a set of freely available POS taggers that I modelled after well-known techniques. The programs are written in C and run under various UNIX flavors (and probably even under Windows). ICOPOST currently consists of four taggers which are based on different frameworks:
  1. Maximum Entropy Tagger MET: This tagger uses an iterative procedure to successively improve parameters for a set of features that help to distinguish between relevant contexts. It's based on a framework suggested by Ratnaparkhi [1998].
  2. Trigram Tagger T3: This kind of tagger is based on Hidden Markov Models in which the states are tag pairs that emit words, i.e., it's based on transitional and lexical probabilities. The technique has been suggested by Rabiner [1990] and the implementation is influenced by Brants [2000].
  3. Error-driven Transformation-based Tagger TBT: Transformation rules, learned from an annotated corpus, change the currently assigned tag depending on triggering context conditions. Both the general approach and its application to POS tagging were proposed by Brill [1993].
  4. Example-based tagger ET: Example-based models (also called memory-based, instance-based or distance-based) rest on the assumption that cognitive behavior can be achieved by looking at past experiences that resemble the current problem rather than by learning and applying abstract rules. They have been suggested for NLP by Daelemans et al. [1996].
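
To make the idea behind the trigram approach (item 2) more concrete, here is a minimal Python sketch of Viterbi decoding over tag-pair states. It is not taken from the T3 sources: the function name viterbi_trigram, the toy probability tables and the fixed floor probability for unseen events are assumptions made only for this illustration. A real tagger would estimate the tables from an annotated corpus and smooth them properly.

```python
import math

def viterbi_trigram(words, tags, trans, emit, floor=1e-6):
    """Return the most probable tag sequence under a second-order HMM.

    trans[(t1, t2, t3)] approximates P(t3 | t1, t2) (transitional part),
    emit[(tag, word)] approximates P(word | tag) (lexical part).
    States are tag pairs (previous tag, current tag); "<s>" pads the start.
    Unseen events get the probability `floor` (a crude stand-in for smoothing).
    """
    def logp(table, key):
        return math.log(table.get(key, floor))

    # best[(u, v)] = log-probability of the best path ending in tags u, v
    best = {("<s>", "<s>"): 0.0}
    backpointers = []
    for word in words:
        new_best, back = {}, {}
        for (u, v), score in best.items():
            for t in tags:
                s = score + logp(trans, (u, v, t)) + logp(emit, (t, word))
                if (v, t) not in new_best or s > new_best[(v, t)]:
                    new_best[(v, t)] = s
                    back[(v, t)] = (u, v)
        best = new_best
        backpointers.append(back)

    # Follow the back-pointers from the best final state.
    state = max(best, key=best.get)
    result = []
    for back in reversed(backpointers):
        result.append(state[1])
        state = back[state]
    result.reverse()
    return result

# Toy model with invented numbers (not estimated from any corpus).
TAGS = ["DET", "NOUN", "VERB"]
TRANS = {
    ("<s>", "<s>", "DET"): 0.9,
    ("<s>", "DET", "NOUN"): 0.9,
    ("<s>", "DET", "VERB"): 0.1,
    ("DET", "NOUN", "VERB"): 0.8,
    ("DET", "NOUN", "NOUN"): 0.2,
}
EMIT = {
    ("DET", "the"): 1.0,
    ("NOUN", "can"): 0.5,
    ("VERB", "can"): 0.5,
    ("NOUN", "rusts"): 0.1,
    ("VERB", "rusts"): 0.9,
}

print(viterbi_trigram(["the", "can", "rusts"], TAGS, TRANS, EMIT))
# prints ['DET', 'NOUN', 'VERB']
```

In this toy model the word "can" is equally likely as a noun or a verb, so the lexical probabilities alone cannot decide. The transitional probabilities of the surrounding tags resolve the ambiguity.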

A detailed description, an extensive evaluation and new suggestions can be found in an accompanying technical report [Schröder 2002].

References

Thorsten Brants. 2000. TnT - a statistical part-of-speech tagger. In Proceedings of the Sixth Applied Natural Language Processing Conference (ANLP-2000), Seattle, WA, USA.

Eric Brill. 1993. Automatic grammar induction and parsing free text: A transformation-based approach. In Proceedings of the 31st Annual Meeting of the ACL.

Walter Daelemans, Jakub Zavrel, Peter Berck & Steven Gillis. 1996. MBT: A memory-based part of speech tagger-generator. In Eva Ejerhed & Ido Dagan, eds., Proceedings of the Fourth Workshop on Very Large Corpora, pages 14-27.

Ingo Schröder. 2002. A Case Study in Part-of-Speech tagging Using the ICOPOST Toolkit. Technical report FBI-HH-M-314/02. Department of Computer Science, University of Hamburg. Available from http://nats-www.informatik.uni-hamburg.de/~ingo/papers/

Lawrence R. Rabiner. 1990. A tutorial on hidden Markov models and selected applications in speech recognition. In Alex Waibel & Kai-Fu Lee, eds., Readings in Speech Recognition. Morgan Kaufmann, San Mateo, CA, USA, pages 267-290. See also Errata at http://www.media.mit.edu/~rahimi/rabiner/rabiner-arrata/

Adwait Ratnaparkhi. 1998. Maximum Entropy Models for Natural Language Ambiguity Resolution. Ph.D. thesis, University of Pennsylvania.
last revised 2002/09/23 Ingo Schröder
http://nats-www.informatik.uni-hamburg.de/~ingo/icopost/