A multilingual corpus of linguistic descriptions of the world's natural languages.

This resource is an openly available, multilingual, digitized collection of thousands of documents describing the natural languages of the world. The corpus is annotated with various meta-, word-, and text-level attributes. More details about the data and annotations can be found in the reference given below.

There is also a password-protected part of the corpus, which can be found here.

Standard reference:
Shafqat Virk, Harald Hammarström, Markus Forsberg, Søren Wichmann (2020): The DReaM Corpus: A Multilingual Annotated Corpus of Grammars for the World’s Languages. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), Marseille, 11–16 May 2020. Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis.


Resource type: Corpus
Tokens: 75,027,790
Sentences: 5,740,264


