The Hamburg Dependency Treebank

The Hamburg Dependency Treebank (HDT) is, to our knowledge, the largest dependency treebank currently available. Its annotations are genuine dependency annotations, i.e. they have not been converted from phrase structures. The HDT is free for scientific and academic use.

All sentences were sourced from the German news site heise.de, from articles published between 1996 and 2001. The articles range from formulaic periodic updates on new BIOS revisions and processor models or quarterly earnings of tech companies, through features on general trends in the hardware and software market, to general coverage of social, legal, and political issues in cyberspace, sometimes in the form of extensive weekly editorial comments. The mapping from sentences to articles and authors is retained, which allows, for example, analysis of individual style. The manual annotation of the treebank was largely interleaved with the development of a standard for the morphological and syntactic annotation of sentences and of a constraint-based parser.

If you have questions regarding the HDT, send an email to hdt at informatik.uni-hamburg.de

An example sentence

The HDT consists of three parts:
  • manually annotated and checked for consistency with DECCA (part A, 101,999 sentences)
  • manually annotated but not checked with DECCA (part B, 104,795 sentences)
  • automatically parsed with WCDG (part C, 55,027 sentences)
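As a quick orientation, the part sizes listed above can be added up to get the overall size of the treebank and the share of it that was annotated manually. The following is a minimal sketch; the names `PARTS` and `summarize` are illustrative only and are not part of any HDT tool.

```python
# Sentence counts per part, copied from the list above.
PARTS = {
    "A": 101_999,  # manually annotated, checked for consistency with DECCA
    "B": 104_795,  # manually annotated, not checked with DECCA
    "C": 55_027,   # automatically parsed with WCDG
}


def summarize(parts):
    """Return (total sentences, manually annotated sentences, manual share)."""
    total = sum(parts.values())
    manual = parts["A"] + parts["B"]  # parts A and B are the manual ones
    return total, manual, manual / total


total, manual, manual_share = summarize(PARTS)
print(f"total sentences:    {total}")
print(f"manually annotated: {manual} ({manual_share:.1%})")
```

Under these figures the three parts sum to 261,821 sentences, of which 206,794 (parts A and B) were annotated manually.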

Download the HDT from the HZSK

Publications

Software

  • the toolbox, containing all sorts of helper scripts
  • cda_parse, a Python library for parsing cda files
  • cobacose, a web-based treebank search system
  • jwcdg, the successor of the parser used for initial automatic annotation

 