English Gigaword
Description
English Gigaword was produced by the Linguistic Data Consortium (LDC), catalog
number LDC2003T05 and ISBN 1-58563-260-0, and is distributed on DVD.
This is a comprehensive archive of newswire text data in English that has been
acquired over several years by the LDC.
Four distinct international sources of English newswire are represented here:
- Agence France Presse English Service (afe)
- Associated Press Worldstream English Service (apw)
- The New York Times Newswire Service (nyt)
- The Xinhua News Agency English Service (xie)
Much of the content in this collection has been published previously by the LDC
in a variety of other, older corpora, particularly the North American News
text corpora (LDC95T21, LDC98T30), the various TDT corpora and the AQUAINT
text corpus (LDC2002T31). But there is a significant amount of material that is
being released here for the first time: all of the Agence France Presse content,
the 1995 and 2001 Xinhua content, and the portions of NYT and APW dating from
February 2001 forward.
All text data are presented in SGML form, using a very
simple, minimal markup structure; all text consists of printable ASCII
and whitespace. The corpus has been fully validated by a standard SGML parser
utility (nsgmls), using a DTD file which is provided as part of this publication.
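Because the markup is minimal, the files can usually be read without a full SGML parser. The following is a minimal sketch in Python, not part of the publication. It assumes that each story is wrapped in a `<DOC id="..." type="...">` element and that body paragraphs are marked with `<P>` inside `<TEXT>`; these tag names are assumptions here and should be checked against the DTD shipped with the corpus. The sample document, its id, and the names `Doc` and `iter_docs` are made up for illustration.

```python
import re
from typing import Iterator, List, NamedTuple


class Doc(NamedTuple):
    doc_id: str
    doc_type: str
    paragraphs: List[str]


# ASSUMPTION: element and attribute names (DOC, id, type, P) are not stated
# in this page; verify them against the DTD provided with the corpus.
DOC_RE = re.compile(
    r'<DOC\s+id="([^"]+)"\s+type="([^"]+)"\s*>(.*?)</DOC>', re.DOTALL
)
PARA_RE = re.compile(r"<P>(.*?)</P>", re.DOTALL)


def iter_docs(sgml: str) -> Iterator[Doc]:
    """Yield one Doc per <DOC> element found in the given SGML text."""
    for match in DOC_RE.finditer(sgml):
        doc_id, doc_type, body = match.groups()
        # Collapse line breaks and runs of whitespace inside each paragraph.
        paragraphs = [" ".join(p.split()) for p in PARA_RE.findall(body)]
        yield Doc(doc_id, doc_type, paragraphs)


# Hypothetical sample in the assumed layout (not real corpus data).
SAMPLE = """<DOC id="XIE19950101.0001" type="story">
<HEADLINE>
Hypothetical headline
</HEADLINE>
<TEXT>
<P>
First paragraph of a
made-up story.
</P>
<P>
Second paragraph.
</P>
</TEXT>
</DOC>
"""

docs = list(iter_docs(SAMPLE))
for doc in docs:
    print(doc.doc_id, doc.doc_type, len(doc.paragraphs))
```

This sketch does not decode SGML entity references and does not validate anything; for validation, the nsgmls run against the supplied DTD remains the reference.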
Features
- Data Type: text
- Data Source(s): newswire
- Project(s): EARS, TIDES
- Application(s): information retrieval, language modeling, natural language processing
- Language(s): English
- Distribution: 1 DVD(s).
- Membership Year(s): 2003
- Non-member Price: US$2500
- Non-member License: yes
- sample data
-- MichaelDaum -- 11 Nov 2003