The Spoken Wikipedia Corpora

The Spoken Wikipedia project unites volunteer readers of Wikipedia articles.
Hundreds of spoken articles in multiple languages are available to users who are – for
one reason or another – unable or unwilling to consume the written version of the article.
We turn this speech resource into a time-aligned corpus, making it accessible
for research and fostering new ways of interacting with the material.

The SWC is a corpus of aligned Spoken Wikipedia articles from the English and German Wikipedia. Dutch will follow next. This corpus has several outstanding characteristics:
  • hundreds of hours of aligned audio
  • from a diverse set of readers
  • about a diverse set of topics
  • in a well-researched textual genre
  • licensed under a free license (CC BY-SA 3.0)

For more information, please read our paper.

NEWS

  • Please participate in our rating experiment on speech quality (http://www.timobaumann.de/temp/beaqle) if you are a German speaker!
  • Have a look at the Wikipedia-Reader, which lets you hyperlisten to the Spoken Wikipedia (and is based on the SWC); also have a look at the paper, which proves the amazingness :-).
  • The SWC has its own ISLRN.

Current Statistics

                       German  English  Dutch
#articles                 960     1347   3169
#speakers                 339      395    145
total audio (h)           358      394    224
aligned hours             240      184     31
… in full sentences       145       --     --
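
To put the table in perspective, the share of audio that is already aligned can be computed per language. The following is a minimal sketch using only the hours listed above; the variable and function names are illustrative and not part of the SWC tooling.

```python
# Hours copied from the statistics table above.
total_audio_h = {"German": 358, "English": 394, "Dutch": 224}
aligned_h = {"German": 240, "English": 184, "Dutch": 31}


def aligned_share(language):
    """Return the fraction of a language's total audio that is aligned."""
    return aligned_h[language] / total_audio_h[language]


for language in total_audio_h:
    print(f"{language}: {aligned_share(language):.0%} of audio aligned")
# German: 67% of audio aligned
# English: 47% of audio aligned
# Dutch: 14% of audio aligned
```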

Download

Original Release (Spring 2016):

  • Aligned text (XML format):
  • corresponding audio:
  • LICENSE: CC BY-SA 3.0. Each folder in the audio contains an info.json with the original URL of both audio and text, and the speaker's name.
  • If you use this data, please cite our paper (BibTeX):
@InProceedings{KHN16.518,
  author = {Arne K{\"o}hn and Florian Stegen and Timo Baumann},
  title = {Mining the Spoken Wikipedia for Speech Data and Beyond},
  booktitle = {Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
  year = {2016},
  month = {may},
  date = {23-28},
  location = {Portoro{\v{z}}, Slovenia},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  address = {Paris, France},
  isbn = {978-2-9517408-9-1},
  islrn = {684-927-624-257-3},
  language = {english}
 }
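
Since each audio folder carries an info.json with its source URLs and speaker name, the attribution details required by the license can be collected in one pass. The sketch below assumes that the folders sit directly beneath one audio root directory and that each info.json holds a JSON object; the key names inside the file are not documented here, so the code returns the parsed contents as they are rather than guessing at them. The function name load_article_metadata is hypothetical.

```python
import json
from pathlib import Path


def load_article_metadata(audio_root):
    """Map each article folder name to the parsed contents of its info.json.

    Assumption: article folders are direct children of audio_root.
    Folders without an info.json are skipped.
    """
    metadata = {}
    for info_path in sorted(Path(audio_root).glob("*/info.json")):
        with info_path.open(encoding="utf-8") as handle:
            metadata[info_path.parent.name] = json.load(handle)
    return metadata
```

Calling load_article_metadata on the unpacked audio directory yields one entry per article, which can then be inspected to see which fields the files actually provide.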

Sample Data

Software

WorkInProgress

This is only an internal link for now.
 