L2 data

1. COCTAILL - a corpus of L2 Swedish coursebooks


Since the Common European Framework of Reference for Languages (CEFR) was adopted in 2001 (Council of Europe, 2001), many countries inside and outside Europe have abandoned previous practices in second language (L2) teaching and assessment in favour of CEFR. The CEFR scale, consisting of 6 proficiency levels, is intentionally described in vague terms so that it can accommodate different languages. As a consequence, researchers and educators have called for an explicit interpretation of each proficiency level for each individual language in terms of required vocabulary scope, grammatical competence, etc. (Byrnes 2007; Little 2007; Little 2011; Milton 2009; North 2007; Westhoff 2007).

CEFR “can-do” statements are known to offer flexibility in interpreting them for different languages and target groups. However, they are non-specific, which makes it difficult to link the competencies and levels of accuracy that learners need in order to perform the communicative tasks to particular CEFR levels. To address this problem, a systematic study needs to be performed for each individual language, covering both “input” normative texts and “output” learner-produced texts. In this project we take the first step by collecting and studying normative texts for Swedish, which we define as texts used for reading comprehension, for example those found in coursebooks.

The research agenda behind compiling such a corpus is twofold. First, the study of normative “input” texts can reveal what is being taught in terms of explicit grammar, receptive vocabulary, and text and sentence readability. Second, it can build insights into the linguistic characteristics of normative texts, which can help anticipate learner performance in terms of active vocabulary, grammatical competence, etc. in classroom and testing settings.

The main research questions are the following:

  • Which linguistic aspects are important at each particular CEFR level, and why? Which of them are the most reliable “predictors” of level complexity? These aspects will be studied separately at the text level and the sentence level.
  • How can texts from different thematic domains be identified automatically? Which linguistic parameters are the most reliable for topic identification?
  • Which receptive vocabulary are students most exposed to, and hence which words, and how many per level, are important to learn?
  • Which grammar are students most exposed to, and which should therefore become the focus of extra training at each particular level?
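
The vocabulary question above can be illustrated with a minimal sketch. It assumes the coursebook texts have already been lemmatised and labelled with a CEFR level; the function name, the input format and the sample lemmas are hypothetical and are not part of the COCTAILL tooling. For each level, it lists the lemmas that do not occur at any lower level.

```python
from collections import defaultdict

# CEFR levels covered by the corpus, ordered from lowest to highest.
LEVELS = ["A1", "A2", "B1", "B2", "C1"]


def new_vocabulary_per_level(texts):
    """Return, for each level, the lemmas first encountered at that level.

    `texts` is an iterable of (level, lemmas) pairs, where `lemmas` is a
    list of lemmatised tokens from one text (hypothetical input format).
    """
    # Collect the set of lemmas observed at each level.
    lemmas_by_level = defaultdict(set)
    for level, lemmas in texts:
        lemmas_by_level[level].update(lemmas)

    seen = set()          # lemmas already introduced at a lower level
    new_words = {}
    for level in LEVELS:
        fresh = lemmas_by_level[level] - seen
        new_words[level] = sorted(fresh)
        seen |= fresh
    return new_words


# Toy data, invented for illustration only.
sample = [
    ("A1", ["jag", "heta", "bo", "i"]),
    ("A2", ["jag", "arbeta", "i", "stad"]),
    ("B1", ["samhälle", "arbeta", "utveckling"]),
]
result = new_vocabulary_per_level(sample)
for level in LEVELS:
    print(level, len(result[level]), result[level])
```

Counting only first occurrences is one possible design choice; a frequency threshold per level would be a natural refinement, since a single occurrence in one coursebook says little about how important a word is to learn.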

During 2013-2014, a corpus containing approximately 4 coursebooks per level at 5 of the CEFR levels (A1, A2, B1, B2 and C1) was collected, scanned, OCR-ed, manually checked and annotated for pedagogical and textual features, and automatically annotated for linguistic features. The Center for Language Technology (CLT), the Swedish Language Bank (Språkbanken) and the Department of Swedish, all at the University of Gothenburg (UGOT), have financed this work.


Publications, COCTAILL-related

2. SweLL-pilot, a corpus of Swedish L2 learner essays


The need to study the developmental stages of learner language is evident in language learning, language assessment, L2 material development, etc. However, there is a lack of L2 Swedish essays with metadata on levels of language development. In this project, we set out to collect, digitize and linguistically annotate a collection of L2 essays that have been linked to the CEFR levels reached (A1, A2, B1, B2, C1). This is planned as the first step in creating an electronic infrastructure for research on Swedish as a Second Language.

Availability of learner essays linked to CEFR levels will facilitate answering the following questions:

  • Which linguistic aspects most reliably disclose the student's level? This will result in an instrument for automatically linking learner-produced texts to the relevant CEFR level, i.e. estimating the connection between learner performance and the proficiency level reached (competence).
  • Which productive vocabulary can students at a given level be expected to demonstrate? The estimated vocabulary scope will make it possible to predict how many words per level (good) students of Swedish can be expected to acquire, and thus to influence test practices.
  • Which productive grammar are students able to demonstrate?
  • Which features disclose sentence and text readability at different proficiency levels? This is the other strand of research.

The collection of essays has been ongoing since 2013, with several schools contributing essays to the SweLL L2 collection.

If you are willing to assist in collecting (CEFR-related) L2 learner essays, please contact Elena Volodina at email address elena dot volodina at svenska dot gu dot se

Students need to sign permission forms before their essays can be collected. The forms are available in English and Swedish.

General principles for research ethics in Social Sciences are described in the paper from the Swedish Research Council (in Swedish)


Publications, SweLL-pilot-related

