The Hamburg Dependency Treebank (HDT) is, to our knowledge, the largest dependency treebank currently available. Its annotations are genuine dependency annotations, i.e. they were not converted from phrase structures.
The HDT is free for scientific/academic use.
The sentences were all sourced from the German news site heise.de, from articles published between 1996 and 2001.
The content of the articles ranges from formulaic periodic updates on new BIOS revisions and processor models, or on the quarterly earnings of tech companies, through features about general trends in the hardware and software market, to general coverage of social, legal and political issues in cyberspace, sometimes in the form of extensive weekly editorial comments. The mapping from sentences to articles and authors is retained, which allows, for example, the analysis of individual style. The creation of the treebank through manual annotation was largely interleaved with the creation of a standard for morphologically and syntactically annotating sentences, as well as with the development of a constraint-based parser.
If you have questions regarding the HDT, send an email to hdt at informatik.uni-hamburg.de
An example sentence
The HDT consists of three parts:
Part A (101,999 sentences): manually annotated and checked for consistency with DECCA
Part B (104,795 sentences): manually annotated but not checked with DECCA
Part C (55,027 sentences): automatically parsed with WCDG
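
As a quick sanity check, the part sizes listed above can be tallied with a short script. This is only an illustrative sketch: the dictionary and variable names are assumptions made for the example and are not part of any HDT tooling or file format.

```python
# Sentence counts per part, copied from the list above.
# The dictionary layout and all names here are illustrative assumptions.
PART_SIZES = {
    "A": 101_999,  # manually annotated, checked for consistency with DECCA
    "B": 104_795,  # manually annotated, not checked with DECCA
    "C": 55_027,   # automatically parsed with WCDG
}

# Parts A and B together form the manually annotated portion.
manual_total = PART_SIZES["A"] + PART_SIZES["B"]
total = sum(PART_SIZES.values())

for part, size in PART_SIZES.items():
    print(f"Part {part}: {size:,} sentences ({size / total:.1%} of the treebank)")
print(f"Manually annotated (A + B): {manual_total:,} sentences")
print(f"All parts: {total:,} sentences")
```

The manually annotated parts A and B together account for 206,794 of the 261,821 sentences.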