
MultiGEC

Data citation

Masciolini, Arianna, Caines, Andrew, De Clercq, Orphée, Kruijsbergen, Joni, Kurfali, Murathan, Muñoz Sánchez, Ricardo, Volodina, Elena, Östling, Robert, Allkivi-Metsoja, Kais, Arhar Holdt, Špela, Auzina, Ilze, Darģis, Roberts, Drakonaki, Elena, Frey, Jennifer-Carmen, Glišić, Isidora, Kikilintza, Pinelopi, Nicolas, Lionel, Romanyshyn, Mariana, Rosen, Alexandr, Rozovskaya, Alla, Suluste, Kristjan, Syvokon, Oleksiy, Tantos, Alexandros, Touriki, Despoina-Ourania, Tsiotskas, Konstantinos, Tsourilla, Eleni, Varsamopoulos, Vassilis, Wisniewski, Katrin, Žagar, Aleš, & Zesch, Torsten (2025). MultiGEC (updated: 2025-01-19). [Data set]. Språkbanken Text. https://doi.org/10.23695/h9f5-8143
MultiGEC is a dataset for Grammatical Error Correction containing parallel data for 12 languages and 17 subcorpora. Each subcorpus contains two or more parallel versions of the same texts (typically, full learner essays), where one version (orig) is the one that the author originally wrote, and the others (ref1, ref2, ...) are corrected versions of the same text. Languages included: Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Russian, Slovene, Swedish and Ukrainian (English and Russian are available on request). Texts come from different original corpora, but are reformatted to a unified format.
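To make the parallel structure concrete, here is a minimal Python sketch that compares an original text with each of its corrected versions and lists the word-level differences. It is an illustration only: the in-memory dictionary, the example sentences and the helper names `word_edits` and `edits_per_reference` are assumptions made for this sketch and do not reflect the dataset's actual file format or any official tooling. Splitting on whitespace is likewise a simplification.

```python
from difflib import SequenceMatcher

# Hypothetical in-memory representation of one text. The keys mirror the
# version names used in the description (orig, ref1, ref2, ...); the
# sentences are invented for illustration and are not taken from MultiGEC.
essay = {
    "orig": "I has two cat .",
    "ref1": "I have two cats .",
    "ref2": "I have got two cats .",
}


def word_edits(orig: str, ref: str) -> list[tuple[str, str, str]]:
    """Return (operation, original words, corrected words) for each differing span.

    Words are obtained by splitting on whitespace, which is an assumption
    of this sketch rather than a property of the dataset.
    """
    a, b = orig.split(), ref.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def edits_per_reference(versions: dict[str, str]) -> dict[str, list[tuple[str, str, str]]]:
    """Compare the 'orig' version against every other version of the same text."""
    orig = versions["orig"]
    return {
        name: word_edits(orig, text)
        for name, text in versions.items()
        if name != "orig"
    }


if __name__ == "__main__":
    for name, edits in edits_per_reference(essay).items():
        print(name, edits)
```

Because each text may have several references, the sketch keeps the edits per reference separate instead of merging them, so that alternative corrections of the same text stay distinguishable.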


Dataset description

MultiGEC is a dataset for Multilingual Grammatical Error Correction in 12 European languages (Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Russian, Slovene, Swedish and Ukrainian) compiled by the CompSLA working group and over 20 external data providers in the context of MultiGEC-2025, the first text-level GEC shared task.

The MultiGEC dataset is divided into 17 subcorpora covering different languages, domains and correction styles, summarized below. More detailed information about each subcorpus is available as machine-readable metadata, whose format is described .

Annotation

Each text is accompanied by at least one manually normalized (i.e. corrected) version. No additional annotation has been performed or preserved from the source corpora. For three languages (Icelandic, German and Russian), the first version of the dataset consists of pre-tokenized texts, which will be detokenized in future releases.

Caveats

The data is relatively homogeneous, mostly consisting of full-text second language essays and their corrections. For some languages, however, native or heterogeneous data is used, and for certain languages the dataset contains text fragments rather than full-text essays. Details on these aspects are provided at spraakbanken.gu.se/en/compsla/multigec-dataset.

Intended uses

Grammatical Error Correction, (Second) Language Acquisition studies, Learner Corpus Research, Noisy User-produced Data, pedagogical cases

Accessible through

Access Platform Licence
subject to Terms of Use
attribution, no redistribution, no use with proprietary models, no commercial use, personal access

Type

  • Corpus
  • Training and evaluation data

Language

Czech
German
Modern Greek (1453-)
English
Estonian
Icelandic
Italian
Latvian
Russian
Slovenian
Swedish
Ukrainian

Size

Keywords

  • grammatical error correction
  • language learning
  • essays
  • multilinguality

Creators

  • Masciolini, Arianna
  • Caines, Andrew
  • De Clercq, Orphée
  • Kruijsbergen, Joni
  • Kurfali, Murathan
  • Muñoz Sánchez, Ricardo
  • Volodina, Elena
  • Östling, Robert
  • Allkivi-Metsoja, Kais
  • Arhar Holdt, Špela
  • Auzina, Ilze
  • Darģis, Roberts
  • Drakonaki, Elena
  • Frey, Jennifer-Carmen
  • Glišić, Isidora
  • Kikilintza, Pinelopi
  • Nicolas, Lionel
  • Romanyshyn, Mariana
  • Rosen, Alexandr
  • Rozovskaya, Alla
  • Suluste, Kristjan
  • Syvokon, Oleksiy
  • Tantos, Alexandros
  • Touriki, Despoina-Ourania
  • Tsiotskas, Konstantinos
  • Tsourilla, Eleni
  • Varsamopoulos, Vassilis
  • Wisniewski, Katrin
  • Žagar, Aleš
  • Zesch, Torsten

Updated

2025-01-19

Contact

Språkbanken Text, Sweden | Ghent University, Belgium
sb-info@svenska.gu.se