
The first release of the European Corpus Initiative, Multilingual Corpus 1 (ECI/MCI), contains 46 subcorpora in 27 (mainly European) languages, totalling roughly 92 million (lexical) words. The corpora are marked up in TEI P2-conformant SGML (to varying levels of detail), and the source text is easily accessible without the markup. Twelve of the component corpora are multilingual parallel corpora, each comprising two to nine subcorpora. All the alphabetic corpora (the release also includes some Japanese and Chinese) are encoded in the ISO 8859 family of 8-bit character sets (ISO 8859-1, -5 and -7). The CD-ROM is in High Sierra format (ISO 9660) and is readable on at least UNIX, MS-DOS and Apple systems.
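Because the alphabetic corpora are stored as 8-bit text, they have to be decoded with the matching ISO 8859 part before further processing. The following is a minimal sketch in Python, not part of the corpus distribution. The language-to-charset mapping and the function name `decode_corpus_bytes` are assumptions made for illustration; the description above names only the three character sets, not which subcorpus uses which one.

```python
# Assumed mapping from language to ISO 8859 part (hypothetical; check the
# documentation shipped with each subcorpus before relying on it).
ASSUMED_CHARSET = {
    "French": "iso8859_1",   # Western European
    "Russian": "iso8859_5",  # Cyrillic
    "Greek": "iso8859_7",    # Greek
}


def decode_corpus_bytes(raw: bytes, language: str) -> str:
    """Decode raw 8-bit corpus bytes into a Unicode string."""
    return raw.decode(ASSUMED_CHARSET[language])


# Round-trip three short samples: each character occupies exactly one byte.
samples = {"French": "été", "Russian": "мир", "Greek": "λόγος"}
for language, text in samples.items():
    raw = text.encode(ASSUMED_CHARSET[language])
    assert len(raw) == len(text)
    assert decode_corpus_bytes(raw, language) == text
```

Decoding with the wrong part does not raise an error for most byte values; it silently yields text in the wrong script, so the mapping is worth verifying on a sample of each subcorpus.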

The amount of material per language ranges from about 36 million words (German) to about 5 thousand words (Bulgarian). The majority of sources are journalistic in nature (newspapers, magazines, broadcasts); additional sources include dictionaries (Albanian, Gaelic, Turkish, Japanese/English), literature, technical reports and proceedings or publications of international organizations. The table below lists the languages included, the subcorpus numbers for each language (in parentheses) and the amount of data per language in thousands of lexical words.

Language    (Subcorpus #) Kwords                                  Totals

German      (70)         34291  (09)   191  (65)   20  (28)  187
            (29)         59     (30)    76  (47)   24  (59)   50
            (71)         21    (70A)   999                        35918
French      (31)         4775   (04)  4121  (28)  187  (29)   59
            (30)         76     (47)    24  (51)    6  (59)   50
            (71)         21     (32)  1667                        10986
Spanish     (31)         4500   (13)   830  (14) 1041  (15)  447
            (47)         24     (32)  1667  (59)   50  (71)   21  8580
English     (31)         4222   (36)  1141  (74)  95   (28)  187
            (47)         24     (51)     6  (56)  97   (59)   50
            (71)         21     (32)  1667                        7510
Dutch       (03)         5500   (02)   600  (47)  24   (71)   21  6145
Czech       (44)         4726                                     4726
Italian     (11)         3518   (42)   303  (58)  13   (29)   59
            (30)         76     (47)    24  (71)  21              4014
Chinese     (78)         2895                                     2895
Greek       (10)         2515   (47)    24  (59)  50   (71)   21  2610
Norwegian   (41)         2226                                     2226
Swedish     (37)         1718                                     1718
Serb/Croat/Slov (24)     700    (56)   289                        989
Tibetan     (76)         834                                      834
Portuguese  (60)         675    (47)    24  (71)  21              720
Malay       (80)         563                                      563
Russian     (73)         364                                      364
Japanese    (57)         203                                      203
Turkish     (20)         173   (20A)  110                         283
Albanian    (82)         205                                      205
Gaelic      (55)         141                                      141
Estonian    (39)         100                                      100
Uzbek       (81)          88                                      88
Latin       (74)          75                                      75
Danish      (47)          24    (71)   21                         45
Lithuanian  (89)          20                                      20
Bulgarian   (84)           5                                      5

Total                                                             91969
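Each per-language total in the right-hand column is the sum of the subcorpus figures in that row, so a parallel subcorpus such as (47) or (71) is counted once for every language it contains. A small Python sketch of that arithmetic, using three rows copied from the table (the names `rows`, `listed_totals` and `language_total` are illustrative only):

```python
# Rows copied from the table: language -> {subcorpus number: Kwords}.
rows = {
    "Dutch": {"03": 5500, "02": 600, "47": 24, "71": 21},
    "Greek": {"10": 2515, "47": 24, "59": 50, "71": 21},
    "Turkish": {"20": 173, "20A": 110},
}

# Totals as printed in the right-hand column of the table.
listed_totals = {"Dutch": 6145, "Greek": 2610, "Turkish": 283}


def language_total(subcorpora: dict) -> int:
    """Sum the Kword counts of all subcorpora listed for one language."""
    return sum(subcorpora.values())


for language, subcorpora in rows.items():
    assert language_total(subcorpora) == listed_totals[language]
```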


  • Data type: text
  • Data source(s): varied
  • Application(s): information retrieval, machine translation, language modeling
  • Language(s): Albanian, Bulgarian, Chinese, Czech, Danish, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Norwegian, Portuguese, Russian, Serbian, Spanish, Swedish, Tibetan, Turkish, Uzbek
  • Number of CDs: 1
  • Nonmember price: US$35


-- MichaelDaum - 04 Apr 2002