PTB

Description

This CD-ROM contains over 1.6 million words of hand-parsed material from the Dow Jones News Service, plus an additional 1 million words tagged for part-of-speech. This material is a subset of the language model corpus for the DARPA CSR large-vocabulary speech recognition project.

It also contains the first fully parsed version of the Brown Corpus, which has been completely retagged using the Penn Treebank (PTB) tag set. Tagged and parsed data from Department of Energy abstracts, IBM computer manuals, MUC-3, and ATIS are included as well.

In addition, the CD-ROM includes source code for programs that the PTB project used to create portions of the data. Source code is also included for ``tgrep'', a program that lets the user search for specific constituents in tree structures. All software is provided ``as is''. (Since publication, we have learned that the tgrep source code on the CD-ROM is not readily portable: compiling tgrep requires modifying the source files. The CD-ROM does include a precompiled tgrep program file built for Sun SPARC systems.)
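
To give a feel for what "searching for specific constituents in tree structures" means, here is a minimal Python sketch. It is not tgrep and does not use tgrep's pattern syntax. The nested-tuple tree representation, the function names `find_constituents` and `words`, and the example sentence are all assumptions made for illustration; none of them come from the CD-ROM.

```python
# Illustrative only: a tree is (label, child, child, ...) and a leaf is a plain string.
# This representation and the sample sentence are assumptions, not corpus data.

def find_constituents(tree, label):
    """Yield every subtree whose root label equals `label`, in pre-order."""
    if isinstance(tree, str):          # a leaf word contains no constituents
        return
    if tree[0] == label:
        yield tree
    for child in tree[1:]:
        yield from find_constituents(child, label)


def words(tree):
    """Return the leaf words of a subtree, left to right."""
    if isinstance(tree, str):
        return [tree]
    out = []
    for child in tree[1:]:
        out.extend(words(child))
    return out


sample = (
    "S",
    ("NP", ("DT", "the"), ("NN", "board")),
    ("VP", ("VBD", "approved"), ("NP", ("DT", "the"), ("NN", "merger"))),
)

# Collect the text of every noun phrase in the sample tree.
noun_phrases = [" ".join(words(t)) for t in find_constituents(sample, "NP")]
print(noun_phrases)
```

A real search tool also has to express relations between nodes (dominance, precedence, sisterhood); this sketch only matches on a single label.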

Release 2

The PTB Project Release 2 CD-ROM features the new PTB II bracketing style, which is designed to allow the extraction of simple predicate/argument structure. Over one million words of text are provided with this bracketing applied, along with a complete style manual explaining the bracketing and new versions of tools for searching and processing bracketed data. This CD-ROM also contains all the annotated text material from the earlier Treebank Preliminary Release, including the Brown Corpus. While these materials have not all been converted to the newer bracketing style, they have been cleaned up to remove problems that appeared in the earlier release.
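
As a rough illustration of how a bracketed tree can yield simple predicate/argument information, the following Python sketch parses one bracketed string and reads off the subject and main verb of a clause. It assumes that subjects carry an -SBJ function tag and that every node has a label. The sample string is hypothetical and not taken from the corpus, and the helper names (`parse`, `leaves`, `subject_and_verb`) are invented for this example. Input that wraps each tree in an extra unlabeled pair of parentheses would need additional handling.

```python
import re

# Hypothetical example in a Treebank II-like bracketing; not taken from the corpus.
SAMPLE = "(S (NP-SBJ (DT the) (NN board)) (VP (VBD approved) (NP (DT the) (NN merger))))"


def parse(text):
    """Parse one bracketed tree into nested lists: [label, child, ...]; leaves are strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]            # assumes every node is labeled
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                       # consume ")"
        return [label] + children

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing input after the first tree")
    return tree


def leaves(tree):
    """Return the words under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1:] for w in leaves(child)]


def subject_and_verb(clause):
    """Return (subject text, verb) for a clause node, or None for a missing part."""
    subject = verb = None
    for child in clause[1:]:
        if isinstance(child, str):
            continue
        if "SBJ" in child[0].split("-"):       # e.g. NP-SBJ or NP-SBJ-1
            subject = " ".join(leaves(child))
        elif child[0] == "VP" and verb is None:
            for grandchild in child[1:]:
                if not isinstance(grandchild, str) and grandchild[0].startswith("VB"):
                    verb = leaves(grandchild)[0]
                    break
    return subject, verb


result = subject_and_verb(parse(SAMPLE))
print(result)
```

This only looks one level below the clause node, so nested verb phrases (auxiliaries, modals) and empty elements are deliberately ignored here.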

The contents of Treebank Release 2 are as follows:

  • 1 million words of 1989 Wall Street Journal material annotated in Treebank II style.
  • A small sample of ATIS-3 material annotated in Treebank II style.
  • 300-page style manual for Treebank II bracketing, as well as the part-of-speech tagging guidelines.
  • The contents of the previous Treebank CD-ROM (Version 0.5), with cleaner versions of the WSJ, Brown Corpus, and ATIS material (annotated in Treebank I style).
  • Tools for processing Treebank data, including ``tgrep'', a tree-searching and manipulation package. (Note that the usability of this release of tgrep is limited: users of Sun SPARC systems should have no problem, but others may find the software difficult or impossible to port.)

In addition, the PTB Project has provided some updates, announcements and a discussion forum for users. A file of updates and further information is available via anonymous ftp from ftp.cis.upenn.edu, in pub/treebank/doc/update.cd2.

The PTB project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42) releases of PTB. Treebank-2 includes the raw text for each story. Three "map" files are available in a compressed file via ftp; they relate the 2,499 PTB filenames to the corresponding WSJ DOCNO strings in TIPSTER.

Detailed questions about the corpus may be sent to treebank@ldc.upenn.edu, while questions and requests for obtaining Treebank Release 2 should be sent to member-service@ldc.upenn.edu.

Release 3

This CD-ROM contains the following Treebank-2 Material:
  • 1 million words of 1989 Wall Street Journal material annotated in Treebank II style.
  • A small sample of ATIS-3 material annotated in Treebank II style.
  • A fully tagged version of the Brown Corpus.
and the following new material:
  • Switchboard tagged, dysfluency-annotated, and parsed text
  • Brown parsed text

The Treebank bracketing style is designed to allow the extraction of simple predicate/argument structure. Over one million words of text are provided with this bracketing applied.

After publication, it was discovered that not all of the PostScript (*.ps) files had been converted to PDF and that some of the converted PDF files contained errors. The PDF copies of the documentation files that are available are listed in the addenda.

Features

  • Data type: text
  • Data source(s): varied
  • Application(s): natural language processing, parsing, tagging
  • Language(s): English
  • Number of CDs: 1
  • Nonmember price: US$2,500

Documentation

[MarcusEtal94]
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330, 1994.

  • Beatrice Santorini (1990): Part-of-Speech Tagging Guidelines for the Penn Treebank Project. (PS)
  • Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz: Building a Large Annotated Corpus of English: The Penn Treebank. University of Pennsylvania.
  • Marie Meteer et al. (1995): Dysfluency Annotation Stylebook for the Switchboard Corpus. (PS)

-- MichaelDaum - 04 Apr 2002
 