SweLLex description

SweLL stands for Swedish Learner Language. In this project, we have created a vocabulary list generated from the SweLL corpus, an electronic collection of second language (L2) learner essays. Each entry in SweLLex consists of a lemma and part-of-speech (POS) combination together with its frequency counts. For each entry, raw and normalized frequencies are given both for the corpus as a whole and for each CEFR level (Council of Europe, 2001). As such, the list is descriptive and shows the distribution of vocabulary across CEFR levels.
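
To make the entry structure concrete, the following is a minimal sketch of how such a list could be derived from annotated tokens. It is an illustration only, not the procedure used to build SweLLex: the function name `build_frequency_list`, the field names, the toy tokens and the choice of normalizing to occurrences per million tokens are all assumptions.

```python
from collections import Counter, defaultdict


def build_frequency_list(tokens):
    """Build (lemma, POS) entries with raw and normalized frequencies.

    tokens: iterable of (lemma, pos, cefr_level) tuples (hypothetical input format).
    Normalization to "per million tokens" is an assumption, not taken from SweLLex.
    """
    raw = defaultdict(Counter)   # (lemma, pos) -> Counter mapping level -> count
    level_sizes = Counter()      # level -> number of tokens at that level
    for lemma, pos, level in tokens:
        raw[(lemma, pos)][level] += 1
        level_sizes[level] += 1

    total = sum(level_sizes.values())
    entries = {}
    for key, per_level in raw.items():
        raw_total = sum(per_level.values())
        entries[key] = {
            "raw_total": raw_total,
            "per_million_total": raw_total / total * 1_000_000,
            "raw_by_level": dict(per_level),
            # Each level is normalized by the size of that level's subcorpus.
            "per_million_by_level": {
                lvl: count / level_sizes[lvl] * 1_000_000
                for lvl, count in per_level.items()
            },
        }
    return entries


# Toy data, invented for illustration only.
toy_tokens = [
    ("hus", "NN", "A1"), ("hus", "NN", "A1"), ("vara", "VB", "A1"),
    ("hus", "NN", "B1"), ("vara", "VB", "B1"), ("vara", "VB", "B1"),
    ("bil", "NN", "B1"),
]
toy_entries = build_frequency_list(toy_tokens)
```

Normalizing each level by the size of its own subcorpus keeps the per-level figures comparable even when some levels contain far more essays than others.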

The availability of SweLLex makes it possible to analyze the productive vocabulary that L2 learners of Swedish demonstrate in essay writing. It also makes it possible to observe the relation between the receptive vocabulary at different levels of L2 development, as captured in the SVALex list, and the productive vocabulary, as captured in SweLLex. In addition, we could observe spelling deviations in L2 writing and experiment with Levenshtein distance as an approach to word-level normalization.
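
As an illustration of word-level normalization with Levenshtein distance, the sketch below maps a learner spelling to the closest form in a reference vocabulary. It is a generic example, not the project's implementation: the function names, the `max_distance` threshold of 2 and the toy vocabulary are assumptions.

```python
def levenshtein(a, b):
    """Return the edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalize(word, vocabulary, max_distance=2):
    """Return the closest vocabulary form, or the word itself if none is close enough.

    max_distance is a hypothetical threshold; ties are broken alphabetically.
    """
    if not vocabulary:
        return word
    best = min(vocabulary, key=lambda cand: (levenshtein(word, cand), cand))
    return best if levenshtein(word, best) <= max_distance else word


# Toy vocabulary, invented for illustration only.
toy_vocabulary = ["skriva", "skola", "springa"]
normalized = normalize("skrifva", toy_vocabulary)
```

A distance threshold guards against mapping an unknown word to an unrelated vocabulary item; in practice the threshold would likely need tuning, for example relative to word length.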

The immediate intended uses of the list are essay grading algorithms that classify Swedish L2 essays by the proficiency level reached, and automated exercise generation, where the scope of the target vocabulary is needed to generate appropriate exercise items.

Future refinement of the list consists of enriching its items with additional grammatical, topical and lexicographic information, and of assigning each word a single target level at which it is expected to become active. In addition, a user-friendly website will be implemented where the list can be searched and downloaded, and where the distributions of the same lexical item in SVALex and SweLLex can be compared.

SweLLex is part of a collection of CEFRLex resources.



2. L2 receptive list - SVALex



This project builds upon the work done in the Kelly project during 2009-2011, an EU-funded project on building learner-oriented, frequency-based monolingual and bilingual word lists for 9 languages for use in a commercial language learning tool (Kilgarriff 2014). During the post-Kelly period we identified a number of weaknesses in the Swedish list, which we set out to address in this project:

  1. validity of the list as far as streaming of vocabulary into CEFR levels is concerned;
  2. presence of relevant vocabulary per CEFR level, i.e. which vocabulary should be added, removed or relocated in the list with regard to CEFR guidelines;
  3. domain-specific vocabulary according to CEFR themes – which words, which levels, how many per level?

Browse/download the list here

Project description and financing

To address these issues, our main steps have up to now included:

  • Compiling a corpus of CEFR-based reading comprehension texts, the COCTAILL corpus (2013-2014)
  • Generating SVALex, a word list graded for CEFR levels, from COCTAILL. This list reflects the distribution of vocabulary across CEFR levels (2015-2016)

The steps that will be addressed in the future include:

  • Streaming SVALex lexical items into target/peripheral vocabulary at each level
  • Comparing SVALex items with Kelly for overlaps and consistency in levels, work that should produce a merged Kelly-SVALex list
  • Enriching the merged Kelly-SVALex resource with domain information, etc.

This work is financed by the Department of Swedish (UGOT) through a pilot project on "Kelly validation", as well as by the Center for Language Technology (CLT, UGOT) and the Swedish Language Bank (Språkbanken, UGOT).

Research issues

This project will help us identify a concrete lexical curriculum for CEFR-based courses in Swedish, both in terms of WHAT words and HOW MANY per level a student at each level should acquire. The resulting list can be used as an instrument for training vocabulary, e.g. in exercise/test generators like Lärka; for testing authentic examples (texts and sentences) for appropriateness for learners of different proficiency levels; for assessment of language proficiency in L2 learner language production, etc.


3. Kelly word lists


