British National Corpus

Description

The British National Corpus (BNC) is a 100-million-word collection of samples of written and spoken language from a wide range of sources, designed to represent a broad cross-section of current British English.

The corpus comprises 100,106,008 words and occupies about 1.5 gigabytes of disk space, the equivalent of more than a thousand high-capacity floppy diskettes. To put these numbers into perspective: the average paperback book has about 250 pages per centimetre of thickness, so at 400 words a page the whole corpus, printed in small type on thin paper, would take up about ten metres of shelf space. Reading the whole corpus aloud at a fairly rapid 150 words a minute, eight hours a day, 365 days a year, would take just under four years (about 3.8).
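The back-of-the-envelope figures above can be checked with a few lines of Python. This is only a sketch: the word count, words per page, pages per centimetre and reading rate come from the text, while the byte sizes (decimal gigabytes, and a "1.44 MB" diskette holding 1,474,560 bytes) are assumptions made here for illustration.

```python
# Figures taken from the description above.
WORDS = 100_106_008
WORDS_PER_PAGE = 400
PAGES_PER_CM = 250
WORDS_PER_MINUTE = 150
HOURS_PER_DAY = 8
DAYS_PER_YEAR = 365

# Assumptions (not stated in the text): 1.5 GB read as 1.5e9 bytes,
# and a high-capacity diskette holding 1,474,560 bytes.
CORPUS_BYTES = 1.5e9
FLOPPY_BYTES = 1_474_560


def shelf_metres(words=WORDS):
    """Shelf space in metres if the corpus were printed as paperbacks."""
    pages = words / WORDS_PER_PAGE
    centimetres = pages / PAGES_PER_CM
    return centimetres / 100


def reading_years(words=WORDS):
    """Years needed to read the corpus aloud at the stated rate."""
    words_per_day = WORDS_PER_MINUTE * 60 * HOURS_PER_DAY
    return words / words_per_day / DAYS_PER_YEAR


def floppies():
    """Number of diskettes needed, under the byte-size assumptions above."""
    return CORPUS_BYTES / FLOPPY_BYTES


if __name__ == "__main__":
    print(f"shelf space:  {shelf_metres():.1f} m")
    print(f"reading time: {reading_years():.1f} years")
    print(f"diskettes:    {floppies():.0f}")
```

Under these assumptions the shelf-space estimate comes out at roughly ten metres and the reading time at a little under four years.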

The corpus contains 4,124 texts, of which 863 are transcribed from spoken conversations or monologues. Each text is segmented into orthographic sentence units, and each word within a unit is assigned a word-class (part-of-speech) code. There are about six and a quarter million sentence units in the whole corpus. Segmentation and word classification were carried out automatically by CLAWS, a stochastic part-of-speech tagger developed at the University of Lancaster. The classification scheme used for the corpus distinguishes some 65 parts of speech, which are described in the accompanying documentation.
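To make the idea of word-class codes concrete, the following sketch tallies the codes in one tagged sentence unit. The `word_TAG` layout, the sample sentence and the helper name `tag_counts` are all assumptions chosen for illustration; they are not the corpus's distribution format, and the codes shown are only examples in the style of the tagset. The accompanying documentation remains the authoritative list.

```python
from collections import Counter


def tag_counts(tagged_sentence: str) -> Counter:
    """Count word-class codes in a sentence written as word_TAG tokens.

    Hypothetical helper: assumes whitespace-separated tokens, each ending
    in an underscore followed by its word-class code.
    """
    counts = Counter()
    for token in tagged_sentence.split():
        # Split on the last underscore so words containing "_" still work.
        word, sep, tag = token.rpartition("_")
        if sep:  # skip tokens that carry no tag
            counts[tag] += 1
    return counts


# Illustrative sentence unit; the tags are examples, not corpus data.
sample = "The_AT0 cat_NN1 sat_VVD on_PRP the_AT0 mat_NN1 ._PUN"

if __name__ == "__main__":
    for tag, n in tag_counts(sample).most_common():
        print(tag, n)
```

Counting codes this way is the simplest kind of summary a tagged corpus supports; anything more specific would depend on the actual file format of the release being used.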

See: "A brief users' guide to the grammatical tagging of the British National Corpus"

Features

-- MichaelDaum -- 11 Nov 2003
 