BLLIP

Description

The Brown Laboratory for Linguistic Information Processing (BLLIP) corpus, distributed on two CD-ROMs, contains a complete, Treebank-style parse of the three-year Wall Street Journal (WSJ) collection from the ACL/DCI corpus, approximately 30 million words. The parsing and part-of-speech (POS) tagging for the entire archive were done using statistical methods developed by Eugene Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale and Mark Johnson of BLLIP.
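As a rough illustration of what "Treebank-style" means in practice, the sketch below reads one bracketed parse of the kind used by the Penn Treebank and extracts the (word, POS tag) pairs. The sample sentence, the function names parse_tree and tagged_words, and the assumption that each tree has a labelled root with no extra wrapping parentheses are illustrative choices, not details taken from the corpus documentation.

```python
def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Assumes the root bracket carries a label, e.g. "(S ...)".
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested constituent
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def tagged_words(tree):
    """Return (word, POS tag) pairs in sentence order."""
    label, children = tree
    # A preterminal node has exactly one child, and that child is a word.
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if not isinstance(child, str):
            pairs.extend(tagged_words(child))
    return pairs


# Hypothetical sample sentence, not taken from the corpus.
sample = "(S (NP (DT The) (NN market)) (VP (VBD fell)) (. .))"
print(tagged_words(parse_tree(sample)))
```

Files in the actual distribution may wrap each tree in an additional unlabelled bracket or include other markup, in which case the reader above would need adjusting.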

This corpus both overlaps and supplements the 1-million-word Penn Treebank (PTB) collection of parsed and POS-tagged WSJ texts.

The PTB project selected 2,499 stories from a three-year WSJ collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 (LDC1995T7) and Treebank-3 (LDC1999T42) releases of PTB. Treebank-2 includes the raw text for each story. Three "map" files, available in a compressed file via FTP, give the correspondence between the 2,499 PTB filenames and the WSJ DOCNO strings in TIPSTER.

Features

  • Data type: text
  • Data source(s): newswire
  • Project(s): TIDES
  • Application(s): natural language processing, parsing, tagging
  • Language(s): English
  • Number of CDs: 2
  • Nonmember price: US$100

Contact

  • Authors: Eugene Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale, Mark Johnson
  • Email: ec@bohr.cs.brown.edu

-- MichaelDaum - 04 Apr 2002
 