The Spoken Wikipedia Corpora

The Spoken Wikipedia project unites volunteer readers of Wikipedia articles.
Hundreds of spoken articles in multiple languages are available to users who are – for
one reason or another – unable or unwilling to consume the written version of the article.
We turn this speech resource into a time-aligned corpus, making it accessible
for research and fostering new ways of interacting with the material.

The SWC is a corpus of aligned Spoken Wikipedia articles from the English and German Wikipedia. Dutch will follow next. This corpus has several outstanding characteristics:
  • hundreds of hours of aligned audio
  • from a diverse set of readers
  • about a diverse set of topics
  • in a well-researched textual genre
  • released under a free license (CC BY-SA 3.0)


  • The SWC has its own ISLRN.


Current Statistics

                       German  English  Dutch
#articles                 960     1347   3169
#speakers                 339      395    145
total audio (h)           358      394    224
aligned hours             240      184     31
… in full sentences       145       --     --
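
As a quick reading of the table, the share of recorded audio that has been aligned can be computed per language. The following is a minimal sketch that uses only the hour figures listed above; the names HOURS and aligned_share are illustrative and not part of the corpus tooling.

```python
# Hours copied from the statistics table: (total audio, aligned audio).
HOURS = {
    "German": (358, 240),
    "English": (394, 184),
    "Dutch": (224, 31),
}


def aligned_share(total_hours: float, aligned_hours: float) -> float:
    """Return the fraction of the recorded audio that is aligned."""
    return aligned_hours / total_hours


if __name__ == "__main__":
    for language, (total, aligned) in HOURS.items():
        print(f"{language}: {aligned_share(total, aligned):.0%} of audio aligned")
```

By this measure roughly two thirds of the German audio is aligned, a little under half of the English, and about one seventh of the Dutch.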


Original Release (Spring 2016):

  • Aligned text (XML format):
  • corresponding audio:
  • License: CC BY-SA 3.0. Each folder in the audio data contains an info.json file with the original URLs of both the audio and the text, and the speaker's name.
  • If you use this data, please cite our paper: bibtex
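
The per-folder info.json files make it straightforward to gather provenance for attribution. The sketch below collects them into one dictionary. It assumes the article folders sit directly beneath a single root directory, and the path "audio" is a hypothetical placeholder for wherever the download was unpacked. It makes no assumption about the key names inside info.json and simply returns whatever each file contains.

```python
import json
from pathlib import Path


def load_article_infos(audio_root):
    """Map each article folder name to the parsed content of its info.json."""
    infos = {}
    # Assumption: one folder per article, directly beneath audio_root.
    for info_path in sorted(Path(audio_root).glob("*/info.json")):
        with info_path.open(encoding="utf-8") as handle:
            infos[info_path.parent.name] = json.load(handle)
    return infos


if __name__ == "__main__":
    # "audio" is a hypothetical path to the unpacked audio download.
    for folder, info in load_article_infos("audio").items():
        print(folder, info)
```

Inspecting one of the returned dictionaries is the easiest way to see which fields hold the audio URL, the text URL, and the speaker's name.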
Sample Data



