Learner Language

Data citation

Språkbanken (2025). Learner Language (updated: 2025-01-19). [Data set]. Enriched and distributed by Språkbanken. https://doi.org/10.23695/svn8-rt31

Learner Language is a collection of corpora and lexicons that describe learner language. The corpora include texts and audio produced by language learners, as well as the language that learners are exposed to through reading or listening (e.g. course book texts). Some derivative resources based on these corpora are also included in the collection.

Included resources

COCTAILL is a corpus of course books used for teaching Swedish as a second language at CEFR levels A1, A2, B1, B2, and C1.
SweLL-gold is a second language learner corpus, featuring pseudonymization, normalization and correction-annotation.
SweLL-pilot is a second language learner corpus, featuring CEFR labeling.
DaLAJ resources are a collection of sentence pairs (original - corrected) containing one error each.
MultiGED (Multilingual Grammatical Error Detection) is a dataset for grammatical error detection, featuring five languages (Czech, German, English, Italian, Swedish). The data is organized by sentences, where each token is annotated as correct or incorrect (c or i). The corrected version is not provided. MultiGED has been used for a shared task (https://spraakbanken.github.io/multiged-2023/).
MuClaGED (Multi-Class Grammatical Error Detection) is a dataset for Swedish only, organized by sentences, with each incorrect token associated with the type of correction (Orthography, Syntax, Morphology, etc.) and the type of edit (Addition, Deletion, Replacement).
MultiGEC (Multilingual Grammatical Error Correction) is a dataset for grammatical error correction, featuring twelve languages (Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Russian, Slovene, Swedish and Ukrainian). The data is organized by essay pairs (original - corrected). MultiGEC has been used for a shared task (https://spraakbanken.github.io/multigec-2025/).
SVALex is a wordlist generated from the COCTAILL corpus.
SweLLex is a wordlist generated from the SweLL-pilot essays.
Sen*Lex is a sense-based wordlist, combining SVALex and SweLLex in one.
CoDeRooMor is a morphologically annotated list based on Sen*Lex, featuring annotations for word-building morphemes (roots, prefixes, suffixes etc) and word-building mechanisms (affixation, compounding, etc.).
Swe-MWELex is a list of multi-word expressions based on Sen*Lex, with CEFR labels and subcategorizations of MWEs into several types.
L2Lex-Adj is a list of adjectives based on Sen*Lex, with CEFR labels, information on adjectival declensions and frequencies.
L2Lex-AdjAdv is a list of adjectives and adverbs based on Sen*Lex, with CEFR labels, information on patterns of comparative degrees and frequencies.
Kelly is a wordlist covering the ca. 8,000 most frequent words in web texts, assigned to the six CEFR levels based on frequency information.
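
The token-level annotation described for MultiGED above (each token marked c or i, organized by sentences) can be read with a few lines of Python. The sketch below assumes a plain-text layout with one token and its label per line, separated by a tab, and a blank line between sentences. That layout, the function names `read_token_labels` and `error_rate`, and the sample sentence are assumptions made for illustration; check the documentation shipped with the dataset for the actual file format.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, label) pairs; label is "c" or "i"


def read_token_labels(lines: Iterable[str]) -> List[Sentence]:
    """Group tab-separated token/label lines into sentences.

    Assumed (hypothetical) layout: "token<TAB>label" per line,
    with a blank line between sentences.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        token, label = line.split("\t")
        if label not in ("c", "i"):
            raise ValueError(f"unexpected label: {label!r}")
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences


def error_rate(sentences: List[Sentence]) -> float:
    """Share of tokens labeled incorrect ("i")."""
    labels = [label for sent in sentences for _, label in sent]
    return labels.count("i") / len(labels) if labels else 0.0


# Invented sample, not taken from the dataset.
sample = "I\tc\nlikes\ti\ncats\tc\n\nOK\tc\n"
parsed = read_token_labels(sample.splitlines())
print(len(parsed), error_rate(parsed))  # 2 0.25
```

The same grouping logic would carry over to a multi-class setting such as MuClaGED by relaxing the label check, again assuming a comparable one-token-per-line layout.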

Intended uses

Research, development and pedagogical applications within (second) language acquisition and intelligent computer-assisted language learning.

References

Elena Volodina, Ildikó Pilán, Stian Rødven-Eide, Hannes Heidarsson (2014): You get what you annotate: a pedagogically annotated corpus of coursebooks for Swedish as a Second Language, in NEALT Proceedings Series, volume 22, pages 128-144
Elena Volodina (2024): On two SweLL learner corpora – SweLL-pilot and SweLL-gold, in Proceedings of the Huminfra Conference (HiC 2024), 10-11 January, 2024, Gothenburg, Sweden / edited by Elena Volodina, Gerlof Bouma, Markus Forsberg, Dimitrios Kokkinakis, David Alfter, Mats Fridlund, Christian Horn, Lars Ahrenberg, Anna Blåder, pages 83-94
Elena Volodina, Ildikó Pilán, Ingegerd Enström, Lorena Llozhi, Peter Lundkvist, Gunlög Sundberg, Monica Sandell (2016): SweLL on the rise: Swedish Learner Language corpus for European Reference Level studies, in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), May 23-28, 2016, Portorož, Slovenia
Elena Volodina, Lena Granstedt, Arild Matsson, Beáta Megyesi, Ildikó Pilán, Julia Prentice, Dan Rosén, Lisa Rudebeck, Carl-Johan Schenström, Gunlög Sundberg, Mats Wirén (2019): The SweLL Language Learner Corpus: From Design to Annotation, in Northern European Journal of Language Technology, volume 6, pages 67-104
Elena Volodina, Yousuf Ali Mohammed, Aleksandrs Berdicevskis, Gerlof Bouma, Joey Öhman (2023): DaLAJ-GED – a dataset for Grammatical Error Detection tasks on Swedish, in Proceedings of the 12th Workshop on Natural Language Processing for Computer Assisted Language Learning (NLP4CALL 2023) / edited by David Alfter, Elena Volodina, Thomas François, Arne Jönsson and Evelina Rennes, pages 94-101
Elena Volodina, Christopher Bryant, Andrew Caines, Orphée De Clercq, Jennifer-Carmen Frey, Elizaveta Ershova, Alexandr Rosen, Olga Vinogradova (2023): MultiGED-2023 shared task at NLP4CALL: Multilingual Grammatical Error Detection, in Proceedings of the 12th Workshop on Natural Language Processing for Computer Assisted Language Learning (NLP4CALL 2023)
Judit Casademont Moner, Elena Volodina (2022): Swedish MuClaGED: A new dataset for Grammatical Error Detection in Swedish, in Proceedings of the 11th Workshop on Natural Language Processing for Computer-Assisted Language Learning (NLP4CALL 2022)
Elena Volodina, Ildikó Pilán, Lorena Llozhi, Baptiste Degryse, Thomas François (2016): SweLLex: second language learners' productive vocabulary, in Linköping Electronic Conference Proceedings. Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition at SLTC, Umeå, 16th November 2016
Thomas François, Elena Volodina, Ildikó Pilán, Anaïs Tack (2016): SVALex: a CEFR-graded lexical resource for Swedish foreign and second language learners, in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), May 23-28, 2016, Portorož, Slovenia
Elena Volodina, Sofie Johansson Kokkinakis (2012): Introducing Swedish Kelly-list, a new free e-resource for Swedish, in LREC 2012 Proceedings, volume 2012
Elena Volodina, Yousuf Ali Mohammed, Therese Lindström Tiedemann (2021): CoDeRooMor: A new dataset for non-inflectional morphology studies of Swedish, in Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), May 31 - June 2, 2021, Reykjavik, Iceland (online) / edited by Simon Dobnik, Lilja Øvrelid

Datasets in this collection

Number of hits: 23

Resource	Type	Language	Access
COCTAILL: Corpus of coursebooks used for teaching L2 Swedish. Annotated manually for text structure and pedagogical/didactical categories; automatically linguistically annotated. See more here: https://spraakbanken.gu.se/forskning/teman/icall/icall-l2-projects/l2-data	Corpus	Swedish	Dataset: coctaill.xml.bz2 2017-10-30 – 16.57 MB – CC-BY-4.0; Word statistics: stats_COCTAILL.txt.zip 2025-04-22 – 621.39 KB – CC-BY-4.0
COCTAILL activities & examples: Corpus of coursebooks used for teaching L2 Swedish. Annotated manually for text structure and pedagogical/didactical categories; automatically linguistically annotated.	Corpus	Swedish	Word statistics: stats_COCTAILL-AE.txt.zip 2025-04-22 – 352 KB – CC-BY-4.0
COCTAILL lesson text: Corpus of coursebooks used for teaching L2 Swedish. Annotated manually for text structure and pedagogical/didactical categories; automatically linguistically annotated.	Corpus	Swedish	Word statistics: stats_COCTAILL-LT.txt.zip 2025-04-22 – 379.61 KB – CC-BY-4.0
CoDeRooMor, v.01: Morphological dataset (word-building morphology), Swedish L2 profiles project	Lexicon	Swedish	Dataset: CodeRoomor_v01_lemgramView.csv 2021-04-13 – 1.96 MB – CC-BY-4.0; Dataset: CodeRoomor_v01_morphemeView.csv 2021-04-13 – 856.29 KB – CC-BY-4.0; Dataset: CodeRoomor_v01_lemgramView.xlsx 2021-04-13 – 1.72 MB – CC-BY-4.0; Dataset: CodeRoomor_v01_morphemeView.xlsx 2021-04-13 – 699.46 KB – CC-BY-4.0
DaLAJ v.1.0: Dataset for Linguistic Acceptability Judgments (and more), v.1.0, is a collection of sentences from SweLL (Swedish Learner Language) essays. Each DaLAJ sentence contains one error only.	Corpus	Swedish	Dataset: datasetDaLAJsplit.csv 2021-06-21 – 1.46 MB – CC-BY-4.0; Dataset: dalaj_documentation.tsv 2021-06-21 – 7.48 KB – CC-BY-4.0
DaLAJ-GED-SuperLim 2.0: Dataset for Linguistic Acceptability Judgments (and more), v.2.0	Corpus	Swedish	Dataset: dalaj-ged-superlim.zip 2023-04-03 – 1.41 MB – CC-BY-4.0; Dataset: dalaj-ged-tsv.zip 2023-05-20 – 1.15 MB – CC-BY-4.0; Dataset: liuep197-11.pdf 2024-01-25 – 463.74 KB – CC-BY-4.0
Kelly: Keywords for Language Learning for Young and adults alike	Lexicon	Swedish	Dataset: kelly.xml 2017-09-15 – 5.56 MB – CC-BY-4.0; Dataset: Swedish-Kelly_M3_CEFR.xls 2012-02-15 – 1.28 MB – CC-BY-4.0
L2Lex-Adj: L2Lex-Adj is a sense-based word list of adjectives for Swedish as a second language, featuring frequencies from essays (productive vocabulary, based on SweLL-pilot) and from course books (receptive vocabulary, based on COCTAILL). For each adjective, the grammatical paradigms for degrees of comparison are provided.	Lexicon	Swedish	Dataset: l2lex-adj.xlsx 2025-02-20 – 85 bytes – CC-BY-4.0; Dataset: l2lex-adj.csv 2025-02-20 – 504.11 KB – CC-BY-NC-SA-4.0
L2Lex-AdjAdv: L2Lex-AdjAdv is a sense-based word list of adjectives and adverbs for Swedish as a second language, featuring frequencies from essays (productive vocabulary, based on SweLL-pilot) and from course books (receptive vocabulary, based on COCTAILL). For each item in the list, the grammatical structures for building degrees of comparison are provided (e.g. morphological, periphrastic, etc.).	Lexicon	Swedish	Dataset: l2lex-adjadv.xlsx 2025-02-20 – 85 bytes – CC-BY-4.0; Dataset: l2lex-adjadv.csv 2025-02-20 – 675.07 KB – CC-BY-4.0
MuClaGED: MuClaGED is a dataset for multi-class Grammatical Error Detection for Swedish. The dataset is based on the SweLL-gold corpus.	Corpus	Swedish	
MultiGEC: MultiGEC is a dataset for Grammatical Error Correction containing parallel data for 12 languages and 17 subcorpora. Each subcorpus contains two or more parallel versions of the same texts (typically, full learner essays), where one version (orig) is the one that the author originally wrote, and the others (ref1, ref2, ...) are corrected versions of the same text. Languages included: Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Russian, Slovene, Swedish and Ukrainian (English and Russian are available on request). Texts come from different original corpora, but are reformatted to a unified format.	Corpus	Czech, German, Modern Greek (1453-), English, Estonian, Icelandic, Italian, Latvian, Russian, Slovenian, Swedish, Ukrainian	
MultiGED: MultiGED is a dataset for Grammatical Error Detection (a task within NLP) containing data for 5 languages (Czech, English, German, Italian and Swedish).	Corpus	Czech, German, English, Italian, Swedish	Dataset: multiged-2023.tar.bz2 2025-01-22 – 3.82 MB – Other
SenLex: SenLex is a sense-based lexicon of vocabulary for Swedish as a second language, featuring frequencies from essays (productive vocabulary, based on SweLL-pilot) and from course books (receptive vocabulary, based on COCTAILL).	Lexicon	Swedish	Dataset: sen-lex.xlsx 2025-02-19 – 85 bytes – CC-BY-4.0; Dataset: sen-lex.csv 2025-02-19 – 5.08 MB – CC-BY-NC-SA-4.0
SVALex: SVALex is a lexicon of receptive vocabulary for Swedish as a second language.	Lexicon	Swedish	Dataset: svalex_xlsx.tar.bz2 2025-01-24 – 2.16 MB – CC-BY-NC-SA-4.0; Dataset: svalex_tsv.tar.bz2 2025-01-24 – 203.25 KB – CC-BY-NC-SA-4.0
Svenska MWELex: Swe-MWELex is a sense-based word list of multi-word expressions that learners of Swedish as a second language can handle at the different levels of proficiency (according to the CEFR scale). The word list features MWE items and their frequencies from essays (productive vocabulary, based on SweLL-pilot) and from course books (receptive vocabulary, based on COCTAILL). In addition, each MWE has been classified by its type (based on its syntactic and lexical characteristics), as well as by subgroup within the group of verbal MWEs.	Lexicon	Swedish	Dataset: swe-mwelex.xlsx 2025-03-12 – 184.75 KB – CC-BY-4.0; Dataset: swe-mwelex.csv 2025-02-20 – 414.88 KB – CC-BY-NC-SA-4.0
SweLL-gold: Essays written by adult learners of Swedish, manually pseudonymized and correction annotated. The corpus contains both the original learner text and a corrected version of each essay. Collection period 2017-2020.	Corpus	Swedish	Word statistics: stats_SWELLV1-ORIGINAL.txt.zip 2025-04-22 – 147.52 KB – CC-BY-4.0; Word statistics: stats_SWELLV1-TARGET.txt.zip 2025-04-22 – 132.13 KB – CC-BY-4.0
SweLL-gold target: SweLL-gold target is one of the two versions of SweLL-gold. It consists of the corrected learner-written texts. See SweLL-gold original for the original version of these.	Corpus	Swedish	Word statistics: stats_SWELL-TARGET.txt.zip 2025-04-22 – 30.08 KB – CC-BY-4.0
SweLL-pilot (collection): Essays written by adult learners of Swedish, manually labeled with the CEFR levels (a European scale of language proficiency levels within language learning). Collection period 2006-2015.	Corpus	Swedish	See 3 collected resources
SweLLex: SweLLex is a lexicon of productive vocabulary for Swedish as a second language.	Lexicon	Swedish	Dataset: SweLLex_v1_xlsx.tar.bz2 2025-01-24 – 3.21 MB – CC-BY-4.0; Dataset: SweLLex_v1_tsv.tar.bz2 2025-01-24 – 213.59 KB – CC-BY-4.0
TISUS texts: Essays written by L2 Swedish learners as part of a TISUS exam	Corpus	Swedish	
UD2.17_Swedish-SweLL: A parallel Universal Dependencies treebank based on SweLL, the Swedish Learner Language corpus.	Corpus	Swedish	Dataset: ud217_swedish-swell.xml.bz2 2026-01-19 – 207.45 KB – CC-BY-4.0; Dataset: ud217_swedish-swell-target.xml.bz2 2026-01-19 – 212.09 KB – CC-BY-4.0; Dataset: ud217_swedish-swell.zip 2025-11-19 – 218.8 KB – CC-BY-SA-4.0; Word statistics: stats_ud217_swedish-swell.csv.zip 2026-01-19 – 27.59 KB – CC-BY-4.0; Word statistics: stats_ud217_swedish-swell-target.csv.zip 2026-01-19 – 27.04 KB – CC-BY-4.0
UD2.18_Swedish-SweLL: A parallel Universal Dependencies treebank based on SweLL, the Swedish Learner Language corpus.	Corpus	Swedish	Dataset: ud218_swedish-swell.xml.bz2 2026-06-15 – 257.02 KB – CC-BY-4.0; Dataset: ud218_swedish-swell-target.xml.bz2 2026-06-15 – 262.28 KB – CC-BY-4.0; Dataset: ud218_swedish-swell.zip 2026-05-20 – 1.64 MB – CC-BY-SA-4.0; Word statistics: stats_ud218_swedish-swell.csv.zip 2026-06-15 – 30.73 KB – CC-BY-4.0; Word statistics: stats_ud218_swedish-swell-target.csv.zip 2026-06-15 – 30.23 KB – CC-BY-4.0
Written production in learner French: This corpus contains student texts written by Swedish learners of French	Corpus	French
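
Several corpus files in the table are distributed as bz2-compressed XML (for example coctaill.xml.bz2). A minimal sketch of streaming such a file with the Python standard library follows. It only counts element tags, so it makes no assumption about the corpus schema. The function name `count_tags` and the tiny demo document are hypothetical and stand in for a real download.

```python
import bz2
import os
import tempfile
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(path: str) -> Counter:
    """Stream a bz2-compressed XML file and count how often each tag occurs."""
    counts: Counter = Counter()
    with bz2.open(path, "rb") as handle:
        # iterparse reads incrementally, so the file is never fully loaded.
        for _event, element in ET.iterparse(handle, events=("end",)):
            counts[element.tag] += 1
            element.clear()  # drop the element's children and text to limit memory use
    return counts


# Demo on an invented document; the tag names are placeholders,
# not the schema of any corpus in this collection.
demo_xml = b"<corpus><text><s>Hej</s><s>Tack</s></text></corpus>"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.xml.bz2")
    with bz2.open(demo_path, "wb") as out:
        out.write(demo_xml)
    demo_counts = count_tags(demo_path)

print(demo_counts["s"])  # 2
```

Counting tags first is a cheap way to discover which elements a downloaded file actually contains before writing schema-specific extraction code.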
