English Gigaword

Description

English Gigaword was produced by the Linguistic Data Consortium (LDC). It has catalog number LDC2003T05 and ISBN 1-58563-260-0, and is distributed on DVD. It is a comprehensive archive of English newswire text that the LDC has acquired over several years.

Four distinct international sources of English newswire are represented here:
  • Agence France Presse English Service (afe)
  • Associated Press Worldstream English Service (apw)
  • The New York Times Newswire Service (nyt)
  • The Xinhua News Agency English Service (xie)

Much of the content in this collection has been published previously by the LDC in older corpora, particularly the North American News text corpora (LDC95T21, LDC98T30), the various TDT corpora, and the AQUAINT text corpus (LDC2002T31). However, a significant amount of material is released here for the first time: all of the Agence France Presse content, the 1995 and 2001 Xinhua content, and the portions of NYT and APW dating from February 2001 forward.

All text data are presented in SGML form, using a very simple, minimal markup structure; all text consists of printable ASCII and whitespace. The corpus has been fully validated by a standard SGML parser utility (nsgmls), using a DTD file which is provided as part of this publication.
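
Because the markup is minimal, documents can often be pulled out with a few regular expressions rather than a full SGML parser. The sketch below is only an illustration under stated assumptions: it assumes each document is wrapped in a DOC element carrying id and type attributes, with body text in P elements, and that the first three letters of the id match the source codes listed above. The sample document and its id are made up for the example; check the DTD shipped with the corpus for the authoritative structure.

```python
import re

# Assumed layout (not confirmed by this page; consult the corpus DTD):
#   <DOC id="..." type="..."> ... <TEXT> <P> ... </P> </TEXT> </DOC>
DOC_RE = re.compile(
    r'<DOC\s+id="([^"]+)"\s+type="([^"]+)"\s*>(.*?)</DOC>', re.DOTALL
)
P_RE = re.compile(r"<P>\s*(.*?)\s*</P>", re.DOTALL)

# Source codes as listed in the corpus description.
SOURCES = {
    "afe": "Agence France Presse",
    "apw": "Associated Press Worldstream",
    "nyt": "New York Times",
    "xie": "Xinhua",
}


def iter_docs(sgml_text):
    """Yield one dict per DOC element found in sgml_text."""
    for match in DOC_RE.finditer(sgml_text):
        doc_id, doc_type, body = match.groups()
        # Collapse line breaks inside each paragraph into single spaces.
        paragraphs = [" ".join(p.split()) for p in P_RE.findall(body)]
        yield {
            "id": doc_id,
            "type": doc_type,
            # Assumption: the id starts with the three-letter source code.
            "source": SOURCES.get(doc_id[:3].lower(), "unknown"),
            "paragraphs": paragraphs,
        }


# Hypothetical sample document, invented for illustration only.
SAMPLE = """<DOC id="XIE19950101.0001" type="story">
<HEADLINE>
Sample headline
</HEADLINE>
<TEXT>
<P>
First paragraph
of text.
</P>
<P>
Second paragraph.
</P>
</TEXT>
</DOC>
"""

if __name__ == "__main__":
    for doc in iter_docs(SAMPLE):
        print(doc["id"], doc["source"], len(doc["paragraphs"]))
```

This approach does not decode SGML entity references and does not validate anything; for validation, nsgmls with the supplied DTD remains the appropriate tool.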

Features

  • Data Type: text
  • Data Source(s): newswire
  • Project(s): EARS, TIDES
  • Application(s): information retrieval, language modeling, natural language processing
  • Language(s): English
  • Distribution: 1 DVD(s).
  • Membership Year(s): 2003
  • Non-member Price: US$2500
  • Non-member License: yes
  • sample data

-- MichaelDaum -- 11 Nov 2003
 