SweLL

Standard reference

Elena Volodina (2024): On two SweLL learner corpora – SweLL-pilot and SweLL-gold, in Proceedings of the Huminfra Conference (HiC 2024), 10-11 January, 2024, Gothenburg, Sweden / edited by Elena Volodina, Gerlof Bouma, Markus Forsberg, Dimitrios Kokkinakis, David Alfter, Mats Fridlund, Christian Horn, Lars Ahrenberg, Anna Blåder, pages 83-94

Data citation

Språkbanken (2025). SweLL (updated: 2025-01-19). [Data set]. Enriched and distributed by Språkbanken. https://doi.org/10.23695/b4wj-b251

SweLL -- Swedish Learner Language -- is a collection of SweLL corpora and resources derived from these corpora. The SweLL corpora consist of texts written by learners whose mother tongue is not Swedish. All texts were collected in test situations; none come from home-written tasks.

Included resources

SweLL-gold is a second language learner corpus, featuring pseudonymization, normalization and correction-annotation
SweLL-pilot is a second language learner corpus, featuring CEFR labeling
DaLAJ resources are a collection of sentence pairs (original - corrected) containing one error each
MultiGED -- Multilingual Grammatical Error Detection -- is a dataset for grammatical error detection, featuring five languages (Czech, German, English, Italian, Swedish). The data is organized by sentences, where each token is annotated as correct or incorrect (c or i). The corrected version is not provided. MultiGED has been used for a shared task (https://spraakbanken.github.io/multiged-2023/)
MuClaGED -- Multi-Class Grammatical Error Detection -- is a dataset for Swedish only, organized by sentences, with each incorrect token associated with the type of correction (Orthography, Syntax, Morphology, etc.) and the type of edit (Addition, Deletion, Replacement)
MultiGEC -- Multilingual Grammatical Error Correction -- is a dataset for grammatical error correction, featuring twelve languages (Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Russian, Slovene, Swedish and Ukrainian). The data is organized by essay pairs (original - corrected). MultiGEC has been used for a shared task (https://spraakbanken.github.io/multigec-2025/)
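The token-level c/i annotation described for MultiGED can be pictured with a small reading routine. This is only a sketch under stated assumptions: it assumes one token per line with a tab between the token and its label, and a blank line between sentences. That layout is not specified on this page and should be checked against the distributed files. The function names and the sample lines are hypothetical and made up for illustration; they are not taken from the dataset.

```python
from typing import Iterable, List, Tuple

# A sentence is a list of (token, label) pairs; the label is "c" or "i".
Sentence = List[Tuple[str, str]]


def read_token_labels(lines: Iterable[str]) -> List[Sentence]:
    """Group token/label lines into sentences.

    Assumed layout (not confirmed by this page): "token<TAB>label" per line,
    with a blank line marking the end of a sentence.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the sentence collected so far.
            if current:
                sentences.append(current)
                current = []
            continue
        token, label = line.rsplit("\t", 1)
        if label not in ("c", "i"):
            raise ValueError(f"unexpected label {label!r} for token {token!r}")
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences


def has_error(sentence: Sentence) -> bool:
    """Return True if at least one token in the sentence is labelled incorrect."""
    return any(label == "i" for _, label in sentence)


# Made-up sample lines, for illustration only (not real corpus data).
sample = [
    "Jag\tc\n",
    "tycker\tc\n",
    "om\tc\n",
    "hundarna\ti\n",
    "\n",
    "Det\tc\n",
    "regnar\tc\n",
]

sentences = read_token_labels(sample)
flagged = [s for s in sentences if has_error(s)]
```

Because no corrected version is provided in MultiGED, a reader of this kind can only report where errors are marked, not what the correction should be.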

Intended uses

Research, development and pedagogical applications within second language acquisition and intelligent computer-assisted language learning

References

Elena Volodina, Ildikó Pilán, Ingegerd Enström, Lorena Llozhi, Peter Lundkvist, Gunlög Sundberg, Monica Sandell (2016): SweLL on the rise: Swedish Learner Language corpus for European Reference Level studies, in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), May 23-28, 2016, Portorož, Slovenia
Elena Volodina, Lena Granstedt, Arild Matsson, Beáta Megyesi, Ildikó Pilán, Julia Prentice, Dan Rosén, Lisa Rudebeck, Carl-Johan Schenström, Gunlög Sundberg, Mats Wirén (2019): The SweLL Language Learner Corpus: From Design to Annotation, in Northern European Journal of Language Technology, volume 6, pages 67-104
Elena Volodina, Yousuf Ali Mohammed, Aleksandrs Berdicevskis, Gerlof Bouma, Joey Öhman (2023): DaLAJ-GED – a dataset for Grammatical Error Detection tasks on Swedish, in Proceedings of the 12th Workshop on Natural Language Processing for Computer Assisted Language Learning (NLP4CALL 2023) / edited by David Alfter, Elena Volodina, Thomas François, Arne Jönsson and Evelina Rennes, pages 94-101
Elena Volodina, Christopher Bryant, Andrew Caines, Orphée De Clercq, Jennifer-Carmen Frey, Elizaveta Ershova, Alexandr Rosen, Olga Vinogradova (2023): MultiGED-2023 shared task at NLP4CALL: Multilingual Grammatical Error Detection, in Proceedings of the 12th Workshop on Natural Language Processing for Computer Assisted Language Learning (NLP4CALL 2023)
Judit Casademont Moner, Elena Volodina (2022): Swedish MuClaGED: A new dataset for Grammatical Error Detection in Swedish, in Proceedings of the 11th Workshop on Natural Language Processing for Computer-Assisted Language Learning (NLP4CALL 2022)

Datasets in this collection

Number of hits: 12

Resource	Type	Language	Access
DaLAJ v.1.0 Dataset for Linguistic Acceptability Judgments (and more), v.1.0., is a collection of sentences from SweLL (Swedish Learner Language) essays. Each DaLAJ sentence contains one error only.	Corpus	Swedish	Dataset: datasetDaLAJsplit.csv 2021-06-21 – 1.46 MB – CC-BY-4.0 Dataset: dalaj_documentation.tsv 2021-06-21 – 7.48 KB – CC-BY-4.0
DaLAJ-GED-SuperLim 2.0 Dataset for Linguistic Acceptability Judgments (and more), v.2.0	Corpus	Swedish	Dataset: dalaj-ged-superlim.zip 2023-04-03 – 1.41 MB – CC-BY-4.0 Dataset: dalaj-ged-tsv.zip 2023-05-20 – 1.15 MB – CC-BY-4.0 Dataset: liuep197-11.pdf 2024-01-25 – 463.74 KB – CC-BY-4.0
MuClaGED MuClaGED is a dataset for multi-class Grammatical Error Detection for Swedish. The dataset is based on the SweLL-gold corpus.	Corpus	Swedish	Explore in:
MultiGEC MultiGEC is a dataset for Grammatical Error Correction containing parallel data for 12 languages and 17 subcorpora. Each subcorpus contains two or more parallel versions of the same texts (typically, full learner essays), where one version (orig) is the one that the author originally wrote, and the others (ref1, ref2, ...) are corrected versions of the same text. Languages included: Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Russian, Slovene, Swedish and Ukrainian (English and Russian are available on request). Texts come from different original corpora, but are reformatted to a unified format.	Corpus	Czech, German, Modern Greek (1453-), English, Estonian, Icelandic, Italian, Latvian, Russian, Slovenian, Swedish, Ukrainian	Explore in:
MultiGED MultiGED is a dataset for Grammatical Error Detection (a task within NLP) containing data for 5 languages (Czech, English, German, Italian and Swedish).	Corpus	Czech, German, English, Italian, Swedish	Dataset: multiged-2023.tar.bz2 2025-01-22 – 3.82 MB – Other Explore in:
SweLL-gold Essays written by adult learners of Swedish, manually pseudonymized and correction annotated. The corpus contains both the original learner text and a corrected version of each essay. Collection period 2017-2020.	Corpus	Swedish	Word statistics: stats_SWELLV1-ORIGINAL.txt.zip 2025-04-22 – 147.52 KB – CC-BY-4.0 Word statistics: stats_SWELLV1-TARGET.txt.zip 2025-04-22 – 132.13 KB – CC-BY-4.0 Explore in:
SweLL-gold target SweLL-gold target is one of the two versions of SweLL-gold. It consists of the corrected learner-written texts. See SweLL-gold original for the original version of these.	Corpus	Swedish	Word statistics: stats_SWELL-TARGET.txt.zip 2025-04-22 – 30.08 KB – CC-BY-4.0 Explore in:
Collection SweLL-pilot Essays written by adult learners of Swedish, manually labeled with the CEFR levels (a European scale of language proficiency levels within language learning). Collection period 2006-2015.	Corpus	Swedish	See 3 collected resources Explore in:
SweLLex SweLLex is a lexicon of productive vocabulary for Swedish as a second language	Lexicon	Swedish	Dataset: SweLLex_v1_xlsx.tar.bz2 2025-01-24 – 3.21 MB – CC-BY-4.0 Dataset: SweLLex_v1_tsv.tar.bz2 2025-01-24 – 213.59 KB – CC-BY-4.0 Explore in:
TISUS texts Essays written by L2 Swedish learners as part of a TISUS exam	Corpus	Swedish	Explore in:
UD2.17_Swedish-SweLL A parallel Universal Dependencies treebank based on SweLL, the Swedish Learner Language corpus.	Corpus	Swedish	Dataset: ud217_swedish-swell.xml.bz2 2026-01-19 – 207.45 KB – CC-BY-4.0 Dataset: ud217_swedish-swell-target.xml.bz2 2026-01-19 – 212.09 KB – CC-BY-4.0 Dataset: ud217_swedish-swell.zip 2025-11-19 – 218.8 KB – CC-BY-SA-4.0 Word statistics: stats_ud217_swedish-swell.csv.zip 2026-01-19 – 27.59 KB – CC-BY-4.0 Word statistics: stats_ud217_swedish-swell-target.csv.zip 2026-01-19 – 27.04 KB – CC-BY-4.0 Explore in:
UD2.18_Swedish-SweLL A parallel Universal Dependencies treebank based on SweLL, the Swedish Learner Language corpus.	Corpus	Swedish	Dataset: ud218_swedish-swell.xml.bz2 2026-06-15 – 257.02 KB – CC-BY-4.0 Dataset: ud218_swedish-swell-target.xml.bz2 2026-06-15 – 262.28 KB – CC-BY-4.0 Dataset: ud218_swedish-swell.zip 2026-05-20 – 1.64 MB – CC-BY-SA-4.0 Word statistics: stats_ud218_swedish-swell.csv.zip 2026-06-15 – 30.73 KB – CC-BY-4.0 Word statistics: stats_ud218_swedish-swell-target.csv.zip 2026-06-15 – 30.23 KB – CC-BY-4.0 Explore in:
