BLLIP
Description
The Brown Laboratory for Linguistic Information Processing (BLLIP) corpus, distributed on two CD-ROMs, contains a complete, Treebank-style parsing of the three-year Wall Street Journal (WSJ) collection from the ACL/DCI corpus, approximately 30 million words. The parsing and part-of-speech (POS) tagging for the entire archive were done using statistically based methods developed by Eugene Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale and Mark Johnson of BLLIP.
This corpus both overlaps and supplements the 1-million-word Penn Treebank (PTB) collection of parsed and POS-tagged WSJ texts.
The PTB project selected 2,499 stories from a three-year WSJ collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 (LDC95T7) and Treebank-3 (LDC1999T42) releases of PTB. Treebank-2 includes the raw text for each story. Three "map" files are available in a compressed file via ftp; they relate the 2,499 PTB filenames to the corresponding WSJ DOCNO strings in TIPSTER.
Features
- Data type: text
- Data source(s): newswire
- Project(s): TIDES
- Application(s): natural language processing, parsing, tagging
- Language(s): English
- Number of CDs: 2
- Nonmember price: US$100
Contact
- Authors: Eugene Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale, Mark Johnson
- Email: ec@bohr.cs.brown.edu
--
MichaelDaum - 04 Apr 2002