Språkbanken Text is a department within Språkbanken.

Learner Language

Data citation

Språkbanken Text (2025). Learner Language (updated: 2025-01-19). [Data set]. Språkbanken Text. https://doi.org/10.23695/svn8-rt31
Learner Language is a collection of corpora and lexical resources that describe learner language. The corpora include both texts and audio produced by language learners and the texts and speech that learners are exposed to (what they read or hear, e.g. course books). Some resources derived from these corpora are also part of the collection.

Included resources

  • COCTAILL is a corpus of course books used for teaching Swedish as a second language at CEFR levels A1, A2, B1, B2, and C1.
  • SweLL-gold is a second language learner corpus, featuring pseudonymization, normalization and correction-annotation.
  • SweLL-pilot is a second language learner corpus, featuring CEFR labeling.
  • DaLAJ resources are a collection of sentence pairs (original - corrected) containing one error each.
  • MultiGED -- Multilingual Grammatical Error Detection - is a dataset for grammatical error detection, featuring five languages (Czech, German, English, Italian, Swedish). The data is organized by sentences, where each token is annotated as correct or incorrect (c or i). The corrected version is not provided. MultiGED has been used for a shared task (https://spraakbanken.github.io/multiged-2023/).
  • MuClaGED -- Multi-Class Grammatical Error Detection - is a dataset for Swedish only, organized by sentences, with each incorrect token associated with the type of correction (Orthography, Syntax, Morphology, etc.) and the type of edit (Addition, Deletion, Replacement).
  • MultiGEC -- Multilingual Grammatical Error Correction - is a dataset for grammatical error correction, featuring twelve languages (Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Russian, Slovene, Swedish and Ukrainian). The data is organized by essay pairs (original - corrected). MultiGEC has been used for a shared task (https://spraakbanken.github.io/multigec-2025/).
  • SVALex is a wordlist generated from the COCTAILL corpus.
  • SweLLex is a wordlist generated from the SweLL-pilot essays.
  • Sen*Lex is a sense-based wordlist, combining SVALex and SweLLex in one.
  • CoDeRooMor is a morphologically annotated list based on Sen*Lex, featuring annotations for word-building morphemes (roots, prefixes, suffixes, etc.) and word-building mechanisms (affixation, compounding, etc.).
  • Kelly is a wordlist covering approximately 8,000 of the most frequent words in web texts, assigned to the six CEFR levels based on frequency information.
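
The token-level labeling described for MultiGED (each token marked c or i, grouped by sentence) can be illustrated with a short Python sketch. The file layout assumed here, one `token<TAB>label` pair per line with a blank line between sentences, is an assumption and is not stated on this page; the function names and the sample sentences are hypothetical and made up for illustration.

```python
from typing import Iterable, List, Tuple

# A sentence is a list of (token, label) pairs; label is "c" (correct) or "i" (incorrect).
Sentence = List[Tuple[str, str]]


def read_token_labels(lines: Iterable[str]) -> List[Sentence]:
    """Group 'token<TAB>label' lines into sentences separated by blank lines.

    The two-column, blank-line-separated layout is an assumption made for
    this sketch, not a documented property of the released files.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        token, label = line.split("\t")
        if label not in {"c", "i"}:
            raise ValueError(f"unexpected label {label!r} for token {token!r}")
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences


def incorrect_tokens(sentence: Sentence) -> List[str]:
    """Return the tokens labeled as incorrect in one sentence."""
    return [token for token, label in sentence if label == "i"]


# Invented sample data, not taken from the dataset.
SAMPLE = "Jag\tc\ngår\tc\nhem\tc\n\nHan\tc\ngå\ti\nhem\tc\n"

if __name__ == "__main__":
    for sentence in read_token_labels(SAMPLE.splitlines()):
        print(incorrect_tokens(sentence))
```

Since no corrected version is provided for MultiGED, a reader like this can only report which tokens are flagged, not what they should be.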

Intended use

Research, development and pedagogical applications within (second) language acquisition and intelligent computer-assisted language learning

References

Datasets in the collection

Type

  • Corpus
  • Collection

Language

Swedish
multiple languages

Size

Resources: 14

Keywords

  • second language
  • learner language
  • language learning
  • essays
  • course books
  • word lists

Updated

2025-01-19

Contact

Språkbanken
sb-info@svenska.gu.se