Automatic Content Extraction 2
Description
ACE-2 Version 1.0 was produced by the Linguistic Data Consortium (LDC),
catalog number LDC2003T11, ISBN 1-58563-270-8.
This release contains Version 1.0 of the ACE-2 corpus, created and
distributed by the LDC to support the Automatic Content Extraction
(ACE) program. The objective of the ACE program is to develop
extraction technology to support automatic processing of source language
data (in the form of natural text, and as text derived from ASR and OCR).
This includes classification, filtering, and selection based on the
language content of the source data, i.e., based on the meaning conveyed
by the data. Thus the ACE program requires the development of technologies
that automatically detect and characterize this meaning. The ACE research
objectives are viewed as the detection and characterization of Entities,
Relations, and Events. There are three main ACE tasks: Entity Detection and
Tracking, Relation Detection and Characterization, and Event Detection and
Characterization.
Annotations for the ACE-2 corpus were produced by the Linguistic Data Consortium
to support the following two research tasks: Entity Detection and Tracking
(EDT) and Relation Detection and Characterization (RDC).
For information regarding the ACE program and ACE technology evaluations
administered by the National Institute of Standards and Technology (NIST),
please visit the NIST website.
For information about ACE annotation and ongoing ACE corpus development,
including annotation guidelines, task definitions, annotation tools and other
project documentation, please visit the ACE Project page at the LDC.
This publication contains two sets of data: training and devtest.
Each of these sets is further divided by source: broadcast news, newspaper,
and newswire.
The training set contains data originally developed as training material
for the February 2002 evaluation and used again for the September 2002 evaluation.
The devtest set contains data originally developed as test data for the February
2002 evaluation and later used as devtest data for the September 2002 evaluation.
The broadcast and newswire source data is drawn from a subset of the TDT2 Multilanguage
Text Version 4.0 (LDC2001T57); this has been supplemented with additional newspaper
data from the Washington Post. A portion of the training broadcast data was drawn
from the 1997 English Broadcast News Transcripts (Hub-4) corpus (LDC98T28).
All material comes from the first half of 1998.
This publication includes the source data files in .sgm format, the
annotation files in ACE Pilot Format (APF), supporting documentation,
and version 2.0.1 of the ACE DTD, which was used for the September 2002
ACE Evaluation.
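As a rough illustration of working with the release, the sketch below groups the
.sgm source files by the directory they sit in, which is one way to check file
counts per set and source. The directory layout of the release is not described
here, so the function makes no assumption about it beyond the .sgm extension;
the function name index_sgm_files and the example path are hypothetical.

```python
from pathlib import Path
from collections import defaultdict


def index_sgm_files(root):
    """Group every .sgm file under `root` by its parent directory.

    Keys are directory paths relative to `root` (POSIX style); values are
    sorted lists of file names. No particular corpus layout is assumed.
    """
    root = Path(root)
    groups = defaultdict(list)
    for path in sorted(root.rglob("*.sgm")):
        key = path.parent.relative_to(root).as_posix()
        groups[key].append(path.name)
    return dict(groups)


if __name__ == "__main__":
    # "ace2_v1" is a placeholder path, not the actual name of the release directory.
    for directory, names in sorted(index_sgm_files("ace2_v1").items()):
        print(f"{directory}: {len(names)} source files")
```

Summing the per-directory counts gives a quick check against the file total
listed under Features.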
Features
- Data Type: text (179,007 words of source data in 519 files)
- Data Source(s): varied
- Project(s): ACE
- Application(s): automatic content extraction, information detection
- Language(s): English
- Distribution: download
- Membership Year(s): 2003
- Non-member Price: US$500
-- MichaelDaum -- 11 Nov 2003