TIPSTER Complete

Description

LDC93T3A: Complete TIPSTER corpus
LDC93T3B: Volume 1 of the TIPSTER corpus
LDC93T3C: Volume 2 of the TIPSTER corpus
LDC93T3D: Volume 3 of the TIPSTER corpus

The TIPSTER project is sponsored by the Software and Intelligent Systems Technology Office of the Advanced Research Projects Agency (ARPA/SISTO) in an effort to significantly advance the state of the art in effective document detection (information retrieval) and data extraction from large, real-world data collections.

The detection data consists of a new test collection built at NIST for use in both the TIPSTER project and the related TREC project. Many other information retrieval research groups participate in TREC, working on the same task as the TIPSTER groups but meeting once a year in a workshop to compare results (similar to MUC). The test collection consists of three CD-ROMs of SGML-encoded documents distributed by LDC, plus queries and answers (relevant documents) distributed by NIST.

Source (Volume)         Year    Approx. Words (Millions)
---------------------------------------------------------
Associated Press (1)    1989    40
Associated Press (2)    1988    37
Associated Press (3)    1990    37
Wall Street Journal (1) 1987    20
                        1988    17
                        1989    6
Wall Street Journal (2) 1990    11
                        1991    22
                        1992    5
Dept. of Energy (1)             28
Federal Register (1)    1989    38
Federal Register (2)    1988    30
Ziff/Davis (1)                  36
Ziff/Davis (2)          1989-90 26
Ziff/Davis (3)          1991-92 50
San Jose Mercury (3)    1991    45
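
As a rough cross-check of the table above, the per-volume word counts can be tallied with a short script. This is only a sketch: the rows are transcribed from the table, the name `words_per_volume` is made up for illustration, and the totals cover only the sources listed in the table.

```python
from collections import defaultdict

# (source, volume, approx. millions of words), transcribed from the table above.
ROWS = [
    ("Associated Press", 1, 40),
    ("Associated Press", 2, 37),
    ("Associated Press", 3, 37),
    ("Wall Street Journal", 1, 20),
    ("Wall Street Journal", 1, 17),
    ("Wall Street Journal", 1, 6),
    ("Wall Street Journal", 2, 11),
    ("Wall Street Journal", 2, 22),
    ("Wall Street Journal", 2, 5),
    ("Dept. of Energy", 1, 28),
    ("Federal Register", 1, 38),
    ("Federal Register", 2, 30),
    ("Ziff/Davis", 1, 36),
    ("Ziff/Davis", 2, 26),
    ("Ziff/Davis", 3, 50),
    ("San Jose Mercury", 3, 45),
]


def words_per_volume(rows):
    """Sum the approximate word counts (in millions) for each volume."""
    totals = defaultdict(int)
    for _source, volume, millions in rows:
        totals[volume] += millions
    return dict(totals)


totals = words_per_volume(ROWS)
for volume in sorted(totals):
    print(f"Volume {volume}: about {totals[volume]} million words")
print(f"All volumes: about {sum(totals.values())} million words")
```

By this tally the listed sources come to roughly 185, 131 and 132 million words for volumes 1, 2 and 3, or about 448 million words in total.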

The documents in the test collection vary in style, size and subject domain. The first disk contains material from the Wall Street Journal (1986, 1987, 1988, 1989), the AP newswire (1989), the Federal Register (1989), information from Computer Select disks (Ziff-Davis Publishing) and short abstracts from the Department of Energy. The second disk contains information from the same sources, but from different years. The third disk contains more information from the Computer Select disks, plus material from the San Jose Mercury News (1991), more AP newswire (1990) and about 250 megabytes of formatted U.S. Patents. The format of all the documents is relatively clean and easy to use, with SGML-like tags separating documents and document fields. There is no part-of-speech tagging and no breakdown into individual sentences or paragraphs, because the purpose of this collection is to test retrieval against real-world data.
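
Because documents and their fields are delimited by SGML-like tags, a file can be split into documents with a few regular expressions. The following is a minimal sketch, not a description of the actual disk layout: the tag names `<DOC>`, `<DOCNO>` and `<TEXT>` are assumptions (this page does not list the tags), the sample data is invented, and the helper names `field` and `split_documents` are hypothetical.

```python
import re

# Assumption: each document is wrapped in <DOC>...</DOC>. This page only says
# "SGML-like tags", so adjust the tag names to match the actual files.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)


def field(doc_body, tag):
    """Return the stripped contents of the first <tag>...</tag>, or None."""
    name = re.escape(tag)
    match = re.search(rf"<{name}>(.*?)</{name}>", doc_body, re.DOTALL)
    return match.group(1).strip() if match else None


def split_documents(raw):
    """Yield (docno, text) pairs for each <DOC> element found in raw."""
    for match in DOC_RE.finditer(raw):
        body = match.group(1)
        # DOCNO and TEXT are assumed field names, used here for illustration.
        yield field(body, "DOCNO"), field(body, "TEXT")


# Invented sample input; not taken from the corpus.
SAMPLE = """<DOC>
<DOCNO> EXAMPLE-0001 </DOCNO>
<TEXT>
First made-up document.
</TEXT>
</DOC>
<DOC>
<DOCNO> EXAMPLE-0002 </DOCNO>
<TEXT>
Second made-up document.
</TEXT>
</DOC>
"""

docs = list(split_documents(SAMPLE))
for docno, text in docs:
    print(docno, "->", text)
```

Regular expressions are used here rather than a strict SGML or XML parser because the markup is described only as SGML-like, so it may not be well-formed enough for a validating parser.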

The three TIPSTER discs released so far have been reissued with updates and corrections, and all recipients of the earlier versions should have received these replacements free of charge. If you think you have the unrevised originals, contact LDC for confirmation.

Features

  • Data type: text
  • Data source(s): varied
  • Project(s): MUC, TREC, TIPSTER, TIDES
  • Application(s): information retrieval, language modeling
  • Language(s): English
  • Number of CDs: 3
  • Nonmember price: US$250

Contact

-- MichaelDaum - 04 Apr 2002
 