American National Corpus
Description
The completed American National Corpus (ANC) is intended to contain at least 100 million words, with a genre distribution comparable to that of the British National Corpus (BNC). This publication is the First Release, consisting of over 10 million words of written and spoken American English, annotated for lemma and part of speech, in both a "stand-off" and a "merged" format. The texts in this release are those that were received first; as a result, the release is not balanced for genre as the full corpus will be. Neither the XML encoding nor the morpho-syntactic annotation has been validated, and both were produced entirely automatically (see ANC Organization: Encoding). Headers are minimal, although they contain fairly complete information concerning domain, subdomain, subject, audience, and medium. The ANC is encoded in XML and conforms to the XML Corpus Encoding Standard (XCES) schemas for primary data and annotations.
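The difference between the "stand-off" and "merged" formats can be illustrated with a minimal Python sketch. It assumes that stand-off annotations point into the primary text by character offsets and that the merged form wraps each token inline. The `Annotation` class, the `merge` function, the `<tok>` element, and its `lemma` and `pos` attributes are hypothetical names chosen for illustration; they are not taken from the ANC or XCES schemas.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape, quoteattr


@dataclass
class Annotation:
    """One stand-off annotation (hypothetical layout, not the XCES schema)."""
    start: int  # inclusive character offset into the primary text
    end: int    # exclusive character offset into the primary text
    lemma: str
    pos: str


def merge(text: str, annotations: list[Annotation]) -> str:
    """Wrap each annotated span of `text` in an inline <tok> element.

    Assumes the annotated spans do not overlap.
    """
    parts = []
    cursor = 0
    for ann in sorted(annotations, key=lambda a: a.start):
        # Copy any unannotated text (e.g. whitespace) between tokens.
        parts.append(escape(text[cursor:ann.start]))
        token = escape(text[ann.start:ann.end])
        parts.append(
            f"<tok lemma={quoteattr(ann.lemma)} pos={quoteattr(ann.pos)}>{token}</tok>"
        )
        cursor = ann.end
    parts.append(escape(text[cursor:]))
    return "".join(parts)


# Primary data stays untouched; annotations live in a separate structure.
primary = "Dogs bark"
standoff = [
    Annotation(0, 4, "dog", "NNS"),
    Annotation(5, 9, "bark", "VBP"),
]
merged = merge(primary, standoff)
```

In this sketch the primary text is never modified, which is the usual motivation for stand-off annotation: several annotation layers can refer to the same text independently, and a merged view can be generated when a single inline document is more convenient.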
Features
-- MichaelDaum -- 11 Nov 2003