IngosCorporaMail

Dear PAPA member, this summary was posted today. It might be of interest for the project.
Ingo
----------  Forwarded Message  ----------
Dear list members,

As requested by some of the respondents, I'd like to summarize the
responses I got to my inquiry on available syntactically parsed (treebank)
corpora for English, French, German, and other languages.  As reflected
below, there are a few good options for English and German, as well as
Chinese.  However, I did not receive any reply concerning French and could not
locate any such corpus myself.  Since we are about to embark on a project that
would benefit from the availability of such a corpus, I'd really appreciate
any information about French treebanks of any size and style.  And now on
to the summary:

1. ICE-GB corpus (British English)

The ICE-GB corpus is a 1-million-word corpus of British English, fully parsed
for clause and phrase structure. For more info see:
http://www.ucl.ac.uk/english-usage/ice-gb/index.htm

Reply from:    Dr Gerald Nelson,
          Research Assistant Professor,
          Department of English,
          The University of Hong Kong,
          Pokfulam Road,
          Hong Kong SAR.

          Email: ganelson@hkucc.hku.hk
          Phone: (852) 2241-5141
          Fax: (852) 2559-7139
          http://www.hku.hk/english/staff/ganelson.htm

2.  TIGER project (German)

In the TIGER project we are creating a large syntactically annotated
corpus of German newspaper text. A corpus sampler will be released this
month:
http://www.ims.uni-stuttgart.de/projekte/TIGER/

My task is to develop a search tool for syntactically annotated corpora
- a first beta version will be released in October, the final version in
November.

Reply from:    Wolfgang Lezius                 lezius@ims.uni-stuttgart.de
          IMS, University of Stuttgart    Tel.: +49 +711 121-1374
          Azenbergstr. 12                 Fax:  +49 +711 121-1366
          D-70174 Stuttgart
          Germany

3.  NEGRA corpus (German)

The German "NEGRA Corpus" consists of parsed newspaper texts.
See http://www.coli.uni-sb.de/sfb378/negra-corpus/

Reply from:    Thorsten Brants
          brants@parc.xerox.com

4.  Verbmobil treebanks (German, English, Japanese)

We could help you with treebanks for English and German (and to some
degree for Japanese). They were developed in Tuebingen in the framework
of Verbmobil, a speech-to-speech translation project. For this reason,
the treebanks contain spontaneous speech data in the domains of business
appointment scheduling, travel scheduling, and hotel reservations.

The English treebank contains ca. 30,000 sentences and the German treebank
ca. 38,000 sentences. The Japanese treebank is somewhat smaller, with
ca. 18,000 sentences. The annotations for all treebanks cover
the levels of morpho-syntax, syntactic phrase structure, and
function-argument structure. The annotation schemes are purely
context-free, i.e. they do not contain crossing branches or traces.
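
To illustrate what "no crossing branches" means in practice, here is a
minimal sketch in Python. It assumes a toy tree representation that is not
the Verbmobil format: a leaf is a token position (an integer), and an inner
node is a pair of a label and a list of children. The function names are
hypothetical. A tree counts as context-free in this sense if every node
covers a contiguous run of token positions.

```python
# Toy representation (an assumption for illustration, not the Verbmobil format):
#   leaf       = int token position, each position used once per tree
#   inner node = (label, [children])

def leaf_positions(node):
    """Collect the token positions covered by a node."""
    if isinstance(node, int):
        return [node]
    _label, children = node
    positions = []
    for child in children:
        positions.extend(leaf_positions(child))
    return positions


def is_contiguous(positions):
    """True if the positions form one unbroken run, e.g. 2, 3, 4."""
    if not positions:
        return True
    ordered = sorted(positions)
    return ordered[-1] - ordered[0] + 1 == len(ordered)


def has_crossing_branches(node):
    """True if any constituent covers a non-contiguous set of positions."""
    if isinstance(node, int):
        return False
    if not is_contiguous(leaf_positions(node)):
        return True
    _label, children = node
    return any(has_crossing_branches(child) for child in children)


# Every constituent covers an unbroken span of tokens.
context_free_tree = ("S", [("NP", [0, 1]), ("VP", [2, ("NP", [3, 4])])])

# The VP covers positions 0 and 3 while the NP sits in between.
discontinuous_tree = ("S", [("VP", [0, 3]), ("NP", [1, 2])])

print(has_crossing_branches(context_free_tree))
print(has_crossing_branches(discontinuous_tree))
```

The check only covers crossing branches. Traces (empty elements) would have
to be detected separately, for instance as leaves without a token position.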

Additionally, for each treebank, there exists an extensive stylebook,
which describes how different phenomena are annotated.

As the treebanks are only becoming available now (due to project
restrictions), I am not sure what the license conditions for commercial
use will be.

Reply from:    Sandra Kuebler
          University of Tuebingen
          Computational Linguistics
          Wilhelmstr. 113
          D-72074 Tuebingen
          Germany
          phone: +49-7071-2978490
          fax: +49-7071-551335
          email: kuebler@sfs.nphil.uni-tuebingen.de
          URL: http://www.sfs.nphil.uni-tuebingen.de/~kuebler/

5.  BLLIP99 corpora

Are you aware of the BLLIP99 corpora distributed by LDC?  30 million
words of WSJ text, machine parsed and coreferenced.

Reply from:    Eugene Charniak
          ec@bohr.cs.brown.edu

6.  Various links to check

In case no one answers, you may want to check the list archives at:
http://www.hit.uib.no/corpora/

Also, the largest collection of corpora I know of is from The Linguistic
Data Consortium
http://www.ldc.upenn.edu/

Chris Manning also has an extensive list of links to corpus resources
http://www-nlp.stanford.edu/links/statnlp.html#Corpora

Reply from:    Daniel Walker
          Mendez, Inc.
          dwalker@lhsl.com

7.  Chinese Penn Treebank

This one is also available from LDC and contains about 100K words (4185
sentences from 325 articles from Xinhua newswire between 1994 and 1998).
It was parsed following the general methodology of the Penn Treebank.  It
costs $100.
See http://www.ldc.upenn.edu/Catalog/LDC2000T48.html

(I obtained this information by looking through the LDC catalog.)

Again, any information on syntactically parsed French corpora would be
greatly appreciated.
-- MichaelSchulz - 22 Oct 2001