L2 data

1. COCTAILL - a corpus of L2 Swedish coursebooks


Since the introduction of the Common European Framework of Reference for Languages (CEFR) in 2001 (Council of Europe, 2001), many countries inside and outside Europe have abandoned previous practices in second language (L2) teaching and assessment in favour of the CEFR. The CEFR scale, consisting of 6 proficiency levels, is intentionally described in vague terms so that it can accommodate the diversity of languages. As a consequence, researchers and educators have called for an explicit interpretation of each proficiency level for each individual language in terms of required vocabulary scope, grammatical competence, etc. (Byrnes 2007; Little 2007; Little 2011; Milton 2009; North 2007; Westhoff 2007).

CEFR “can-do” statements are known to offer flexibility in how they are interpreted for different languages and target groups. However, they are non-specific, so it is difficult to tell which competencies and which levels of accuracy learners need in order to perform the communicative tasks at each CEFR level. To address this problem, a systematic study needs to be performed for each individual language, covering both “input” normative texts and “output” learner-produced texts. In this project we take the first step by collecting and studying normative texts for Swedish, which we define as texts used for reading comprehension, for example those found in coursebooks.

The research agenda behind compiling such a corpus has two parts. First, studying normative “input” texts can reveal what is being taught in terms of explicit grammar, receptive vocabulary, and text and sentence readability. Second, insights into the linguistic characteristics of normative texts can help anticipate learner performance in terms of active vocabulary, grammatical competence, etc. in classroom and testing settings.

The main research questions are the following:

  • Which linguistic aspects are important at each particular CEFR level, and why? Which of them are the most reliable “predictors” of level complexity? These aspects will be studied separately at the text and sentence levels.
  • How can texts from different thematic domains be identified automatically? Which linguistic parameters are the most reliable for topic identification?
  • Which receptive vocabulary are students mostly exposed to, and hence which words, and how many per level, are important to learn?
  • Which grammar are students mostly exposed to, and which should therefore become the focus of extra training at each particular level?
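
As an illustration of the receptive vocabulary question, the following is a minimal sketch of how new words per level could be counted. It is not the project's actual method: the function name, the toy lemma lists and the assumption that texts arrive already lemmatized and tagged with a CEFR level are all hypothetical.

```python
from collections import defaultdict

# The five CEFR levels covered by the corpus, from lowest to highest.
LEVELS = ["A1", "A2", "B1", "B2", "C1"]

def new_vocabulary_per_level(texts):
    """Return {level: set of lemmas that first appear at that level}.

    `texts` is an iterable of (level, lemmas) pairs, where `lemmas`
    is a list of already lemmatized tokens (a hypothetical input format).
    """
    # Collect every lemma observed at each level.
    seen_by_level = defaultdict(set)
    for level, lemmas in texts:
        seen_by_level[level].update(lemmas)

    # Walk the levels in order and keep only lemmas not seen at lower levels.
    known = set()
    first_seen = {}
    for level in LEVELS:
        first_seen[level] = seen_by_level[level] - known
        known |= seen_by_level[level]
    return first_seen

# Toy data invented for illustration only.
toy_corpus = [
    ("A1", ["jag", "vara", "student"]),
    ("A2", ["jag", "läsa", "bok"]),
    ("B1", ["student", "diskutera", "bok"]),
]

for level, words in new_vocabulary_per_level(toy_corpus).items():
    print(level, len(words), sorted(words))
```

A real study would also need frequency thresholds and a decision on how to treat words that occur only once; the sketch leaves both out.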

During 2013-2014 a corpus containing approximately 4 coursebooks per level at 5 of the CEFR levels (A1, A2, B1, B2 and C1) was collected, scanned, OCR-ed, manually checked and annotated for pedagogical and textual features, and automatically annotated for linguistic features. The Center for Language Technology (CLT), the Swedish Language Bank (Språkbanken) and the Department of Swedish, all at the University of Gothenburg (UGOT), have financed this work.


Publications, COCTAILL-related

  • Thomas François, Elena Volodina, Ildikó Pilán, Anaïs Tack. 2016. SVALex: a CEFR-graded lexical resource for Swedish foreign and second language learners. To appear in Proceedings of LREC 2016, Slovenia. [pdf]
  • Elena Volodina, Ildikó Pilán, Thomas François & Anaïs Tack. 2016. SVALex: en andraspråksordlista graderad enligt CEFR nivåer. Proceedings of Svenskans beskrivning 35, Göteborg 2016. [pdf]
  • Ildikó Pilán. 2016. Detecting Context Dependence in Exercise Item Candidates Selected from Corpora. Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications 2016, NAACL, San Diego. [pdf]
  • Pilán, Ildikó, Sowmya Vajjala, Elena Volodina. 2015. A readable read: Automatic Assessment of Language Learning Materials based on Linguistic Complexity. To appear in International Journal of Computational Linguistics and Applications (IJCLA). [pdf]
  • Elena Volodina, Ildikó Pilán, Stian Rødven Eide and Hannes Heidarsson. 2014. You get what you annotate: a pedagogically annotated corpus of coursebooks for Swedish as a Second Language. Proceedings of the third workshop on NLP for computer-assisted language learning. NEALT Proceedings Series 22 / Linköping Electronic Conference Proceedings 107: 128–144. [pdf]
  • Elena Volodina, Ildikó Pilán. 2014. Annotation guide for the Course book editor in Lärka. Inst. för svenska språket. [pdf]
  • Ildikó Pilán, Elena Volodina and Richard Johansson. 2014. Rule-based and machine learning approaches for second language sentence-level readability. Proceedings of the 9th workshop on Building Educational Applications Using NLP, ACL 2014. [pdf]
  • Pilán, I. Volodina, E. and Johansson, R. 2013. Automatic selection of suitable sentences for language learning exercises. In: 20 Years of EUROCALL: Learning from the Past, Looking to the Future. 2013 EUROCALL Conference, Évora, Portugal, Proceedings [pdf]
  • Volodina, E. & Johansson Kokkinakis, S. 2013. Compiling a corpus of CEFR-related texts. Proceedings of the Language Testing and CEFR conference, Antwerpen, Belgium, May 27-29, 2013. [pdf, p.248-259]


2. SweLL-pilot, a corpus of Swedish L2 learner essays


The need to study developmental stages of learner language is evident in language learning, language assessment, L2 material development, etc. However, there is a lack of L2 Swedish essays that carry any metadata on levels of language development. In this project, we set out to collect, digitize and linguistically annotate a collection of L2 essays that have been linked to the CEFR levels reached (A1, A2, B1, B2, C1). This is planned to be the first step in creating an electronic infrastructure for research in Swedish as a Second Language.

Availability of learner essays linked to CEFR levels will facilitate answering the following questions:

  • Which linguistic aspects most reliably disclose the student's level? The answer will result in an instrument for automatically linking learner-produced texts to the relevant CEFR level, i.e. estimating the connection between learner performance and the proficiency level reached (competence).
  • Which productive vocabulary can students at each level be expected to demonstrate? The estimated vocabulary scope will make it possible to predict how many words per level (good) students of Swedish can be expected to acquire, and thus to influence test practices.
  • Which productive grammar are students able to demonstrate?
  • Which features disclose sentence and text readability at different proficiency levels? This is the other strand of research.

The collection of essays has been ongoing since 2013, with several schools contributing essays to the SweLL L2 collection.

If you are willing to assist in collecting (CEFR-related) L2 learner essays, please contact Elena Volodina at email address elena dot volodina at svenska dot gu dot se

Students need to sign permission forms before their essays can be collected. The forms are available in English and Swedish.

General principles for research ethics in the social sciences are described in a paper from the Swedish Research Council (in Swedish).


Publications, SweLL-pilot-related

  • Elena Volodina, Ildikó Pilán, Ingegerd Enström, Lorena Llozhi, Peter Lundkvist, Gunlög Sundberg, Monica Sandell. 2016. SweLL on the rise: Swedish Learner Language corpus for European Reference Level studies. Proceedings of LREC 2016, Slovenia. [pdf]
  • Elena Volodina, Ildikó Pilán, Ingegerd Enström, Peter Lundkvist, Gunlög Sundberg, Lorena Llozhi, Monica Sandell. 2016. SweLL – en korpus med L2 uppsatser för CEFR studier. Proceedings of Svenskans beskrivning 35, Göteborg 2016. [pdf]
  • Lorena Llozhi. 2016. SWELL LIST. A list of productive vocabulary generated from second language learners' essays. Master Thesis in Language Technology, University of Gothenburg. [pdf]