
L2 linguistic complexity

A number of larger and smaller projects are run under the heading of Linguistic Complexity in a Second Language Learning context, among them:


1. L2 Lexical Complexity


(PhD project by David Alfter, 2016-2021; PhD supervisors: Elena Volodina and Lars Borin)

Partly overlaps with the L2 profiles project and its experiment on ranking Multi-Word Expressions.


When automatically creating exercises aimed at learners of different proficiency levels, it is important to understand what vocabulary learners can and cannot deal with.

Graded word lists are also useful for coursebook writers, learner dictionaries and readability assessment, among other applications.

Research issues

The aim of this research is to link words and expressions to target levels of the Common European Framework of Reference (CEFR). An assigned CEFR level is to be understood as the level a learner needs to have reached in order to understand a word or expression.

However, it is not enough to simply take a list of words and link them to CEFR levels, as such a list cannot be exhaustive: there will always be items missing from it. To address this issue, we have trained machine learning algorithms on the graded lists so that they learn to assign CEFR levels to unseen words.

Furthermore, it is important to distinguish between the different senses of a word or expression. It is implausible that a learner will learn all senses of a word at the same level. Thus, different senses can (and in most cases should) be assigned different CEFR levels.
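
The idea of generalizing from a graded word list to unseen words can be sketched as follows. This is a minimal illustration only: the feature set (word length, log frequency rank), the toy training data and the 1-nearest-neighbour classifier are all invented for this sketch; the actual project uses much richer features and models.

```python
# Toy sketch: predicting CEFR levels for unseen words from a graded list.
# Features, training data and classifier are illustrative assumptions.

CEFR = ["A1", "A2", "B1", "B2", "C1"]

# Hypothetical "graded word list": (word length, log frequency rank) -> level index
TRAIN = [
    ((3, 1.0), 0),   # very short, very frequent word -> A1
    ((5, 2.0), 1),
    ((7, 3.0), 2),
    ((9, 4.0), 3),
    ((12, 5.0), 4),  # long, rare word -> C1
]

def predict_level(features):
    """Assign a CEFR level via 1-nearest-neighbour over the graded list."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    best = min(TRAIN, key=lambda item: dist(item[0], features))
    return CEFR[best[1]]

print(predict_level((4, 1.5)))   # a short, frequent unseen word
print(predict_level((11, 4.8)))  # a long, rare unseen word
```

Any classifier trained on (features, level) pairs would fit the same slot; the point is only that a model, unlike a fixed list, can produce a level for items the list does not cover.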


  • David Alfter, Therese Lindström Tiedemann and Elena Volodina. 2019. LEGATO: A flexible lexicographic annotation tool. Nodalida 2019, Turku, Finland. LiUP Press. [pdf]
  • David Alfter, Lars Borin, Ildikó Pilán, Therese Lindström Tiedemann, Elena Volodina. 2019. From Language Learning Platform to Infrastructure for Research on Language Learning. CLARIN-2018 post-conference volume. LiUP Press. [pdf]
  • David Alfter & Elena Volodina. (2018). Is the whole greater than the sum of its parts? A corpus-based pilot study of the lexical complexity in multi-word expressions. Proceedings of SLTC-2018, Stockholm, Sweden
  • David Alfter & Elena Volodina. 2018. Towards Single Word Lexical Complexity Prediction. Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 79-88) at NAACL-2018. [pdf]
  • David Alfter, Yuri Bizzoni, Anders Agebjörn, Elena Volodina, & Ildikó Pilán. (2016). From distributions to labels: A lexical proficiency analysis using learner corpora. In Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition at SLTC, Umeå, 16th November 2016 (No. 130, pp. 1-7). Linköping University Electronic Press. [pdf]


2. L2 Sentence and Text Readability

(aka Linguistic Complexity)


Linguistic complexity on sentence and text levels (PhD project by Ildikó Pilán, 2013-2018; PhD supervisors: Elena Volodina and Lars Borin)


Selecting authentic examples that appropriately demonstrate vocabulary items of interest is a vital question for lexicographers and second/foreign language (L2) teachers. At present it is often unknown, for instance, on what principles dictionary examples are selected or where the examples illustrating new vocabulary for L2 learners come from. One way of providing examples is to make them up; such examples are then as typical as their author believes them to be, but they lack authenticity. Another way is to use a source of authentic texts, e.g. a linguistic corpus, and select examples using concordance software. The only constraint on the corpus hits is then the occurrence of the target word in the text span (as opposed to the sentence), which often makes the number of hits overwhelming. In this case the examples are authentic, but the selection process can be very tedious and the quality of "candidate" examples varies widely. A third option is to pre-select sentences automatically, using a number of constraints that downgrade inappropriate samples. The user is then offered the top candidate samples to choose from. The resulting list of ranked candidate sentences can be used for further manual or automatic selection (or editing) of high-quality sentences, reducing the cost and time of manual pre-selection. The candidate examples can be used:

  • for dictionary entries;
  • to illustrate language features for students of Linguistics;
  • to exemplify vocabulary for language learners;
  • to create test items for L2 learners;
  • to accompany electronic texts (e.g. by clicking on an unknown word, the user can see another example of its usage).

The ranking algorithm can eventually also be used to test web texts for their appropriateness for inclusion in a corpus.
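
The constraint-based pre-selection described above can be sketched as a penalty function over candidate sentences, where hard constraints (the target word must occur) and soft constraints (length, well-formedness) contribute to a score. The specific constraints and weights below are invented for illustration; the project's actual selection criteria are far richer (see the publications below).

```python
# Illustrative sketch of constraint-based candidate sentence ranking.
# Constraints and penalty weights are hypothetical, not the project's.

import re

def penalty(sentence, target_word):
    """Score a candidate example sentence; lower is better."""
    tokens = re.findall(r"\w+", sentence.lower())
    score = 0.0
    if target_word.lower() not in tokens:
        score += 100.0                  # hard constraint: target word must occur
    if not (5 <= len(tokens) <= 20):
        score += 10.0                   # downgrade too short / too long sentences
    if sentence and not sentence[0].isupper():
        score += 5.0                    # downgrade likely sentence fragments
    score += sum(1 for t in tokens if len(t) > 12)  # penalize very long words
    return score

def rank(candidates, target_word):
    """Return candidates ordered from best to worst example."""
    return sorted(candidates, key=lambda s: penalty(s, target_word))

candidates = [
    "och sedan gick han",
    "Han läser en intressant bok varje kväll.",
    "Bok.",
]
print(rank(candidates, "bok")[0])
```

The user-facing system would show only the top of this ranked list, leaving the final choice (or light editing) to the lexicographer or teacher.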

The target user groups are therefore lexicographers, L2 teachers, teachers of Linguistics, test item creators, designers of electronic course materials and corpus linguists.

Research issues

The question arising in this connection is whether we can comprehensively describe and model "good examples". This question has been addressed in different studies (Kilgarriff 2008, Husák 2008, Kosem 2011, Segler 2007, etc.), and lately also for Swedish as a target language (Borin 2012a; Volodina 2012; 2013). Our starting point is that the parameters of good examples are language dependent and need to be tested for each language separately.

The issue of sentence readability, as opposed to text readability, has so far not been the topic of systematic research. Within lexicography the quality of examples has been well documented, but the parameters described there are often difficult to model in computer applications. In this research we plan to single out the parameters defining sentence readability for three user groups (lexicographers, L2 teachers, teachers of Linguistics) and to suggest a readability measure for testing the appropriateness of sentences for these user groups.



  • Ildikó Pilán, Elena Volodina, Lars Borin. 2017. Candidate sentence selection for language learning exercises: from a comprehensive framework to an empirical evaluation. TAL Journal: Special issue NLP for learning and Teaching. Volume 57, Number 3. [pre-print]
  • Ildikó Pilán. 2016. Detecting Context Dependence in Exercise Item Candidates Selected from Corpora. Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications 2016, NAACL, San Diego. [pdf]
  • Ildikó Pilán, Sowmya Vajjala, Elena Volodina. 2015. A readable read: Automatic Assessment of Language Learning Materials based on Linguistic Complexity. To appear in International Journal of Computational Linguistics and Applications (IJCLA). [pdf]
  • Ildikó Pilán, Elena Volodina and Richard Johansson. 2014. Rule-based and machine learning approaches for second language sentence-level readability. Proceedings of the 9th workshop on Building Educational Applications Using NLP, ACL 2014. [pdf]
  • Ildikó Pilán, Elena Volodina and Richard Johansson. 2013. Automatic selection of suitable sentences for language learning exercises. In: 20 Years of EUROCALL: Learning from the Past, Looking to the Future. Proceedings of the 2013 EUROCALL Conference, Évora, Portugal. [pdf]
  • Elena Volodina, Richard Johansson, Sofie Johansson Kokkinakis. 2012. Semi-automatic selection of best corpus examples for Swedish: Initial algorithm evaluation. Workshop on NLP in Computer-Assisted Language Learning. Proceedings of the SLTC 2012 workshop on NLP for CALL. Linköping Electronic Conference Proceedings 80: 59–70. [pdf]

3. Automatic L2 Essay Grading and Assessment


A project by Ildikó Pilán (part of PhD), David Alfter, Elena Volodina - in collaboration with Torsten Zesch


Learner essay grading presents many challenges, especially in terms of manual assessment time and assessor qualification. Human assessment is precise and reliable provided that assessors are well trained. However, their judgements can also be subject to outside factors, such as hunger, bad mood, a negative attitude towards a learner, or boredom. The same essay can be graded differently depending on such outside influences on an assessor. To avoid misjudgements and to ensure objectivity, certain institutions have started to complement human grading with automatic assessment as a reference point, e.g. ETS (Burstein 2003, Burstein & Chodorow 2010).


Developing an automatic essay grading (AEG) system is a non-trivial task: it needs to rely on data consisting of essays manually graded by human assessors, a set of rules and specific features that can be used to predict grades or levels, and a classification algorithm. AEG tasks have been addressed previously in a number of projects, e.g. by Östling et al. (2013) for Swedish, Hancke & Meurers (2013) for German, Burstein & Chodorow (2010) for English, and Vajjala & Lõo (2014) for Estonian. Östling et al. (2013) looked at Swedish upper secondary school essays (mostly L1) and assessed them in terms of performance grades (VG, G, IG), as opposed to the attained L2 Swedish proficiency levels that we target. However, only in a few cases do such systems go beyond prototype development and get used for real-life assessment.

Project description

The availability of data is critical for AEG experiments. We are using SweLL-pilot (Volodina et al., 2016), a corpus of second language (L2) Swedish essays linked to attained levels as defined by the Common European Framework of Reference (CoE 2001). Our experiments cover automatic ranking of SweLL essays to predict at which CEFR level (A1, A2, B1, B2, C1) an essay is written. The CEFR levels have been selected since the CEFR is very influential in Europe and beyond, with numerous projects targeting the interpretation of CEFR scales (e.g. Hancke & Meurers, 2013; Vajjala & Lõo, 2014); however, very little work has been done on CEFR-based L2 Swedish.

Selection of features is the most important and time-consuming part of AEG projects. Features can be language independent, such as n-grams or sentence and word length, or language specific, such as language models or out-of-vocabulary words (where the vocabulary is defined by some lexicon or word list). Our experiments include empirical analysis of the data, extraction of relevant features for machine learning experiments or heuristic rules, and experimentation with those features to select the most predictive ones. Our intention is to test language independent versus language specific models to see how language specific features change the quality of predictions. In this project we collaborate with the University of Duisburg-Essen, where Prof. Zesch's group is testing their language independent model on our data, while we develop language specific approaches.
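
The distinction between language independent and language specific features can be sketched as below. The three features and the tiny vocabulary list are invented for illustration; the project's actual feature set is much larger.

```python
# Illustrative sketch of essay feature extraction for AEG.
# The vocabulary list and feature choice are hypothetical examples.

import re

A1_VOCAB = {"jag", "har", "en", "bok", "och", "det", "är"}  # invented word list

def extract_features(essay):
    """Return a small mix of language independent and specific features."""
    tokens = re.findall(r"\w+", essay.lower())
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    n = max(len(tokens), 1)
    return {
        "avg_word_len": sum(len(t) for t in tokens) / n,         # language independent
        "avg_sent_len": n / max(len(sentences), 1),              # language independent
        "oov_ratio": sum(t not in A1_VOCAB for t in tokens) / n, # language specific
    }

feats = extract_features("Jag har en bok. Det är bra.")
print(feats)
```

Vectors like these, extracted per essay, are what a classifier would consume to predict a CEFR level; swapping the word list or language model changes only the language specific entries, which is what makes the independent-versus-specific comparison possible.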

For users, we intend to set up an interface for assessing new essays and for providing feedback on certain groups of features (e.g. lexical, grammatical, readability). The first prototype is already under development (EuroCALL article + Pilán/Zesch article).


The Department of Swedish (UGOT), the Swedish Language Bank (Språkbanken, UGOT) and Swe-CLARIN are co-financing work on this project.


  • Ildikó Pilán, Elena Volodina and Torsten Zesch. 2016. Predicting proficiency levels in learner writings by transferring a linguistic complexity model from expert-written coursebooks. To appear in Proceedings of the 26th International Conference on Computational Linguistics (COLING), 2016, Osaka, Japan. [pdf]
  • Ildikó Pilán, Elena Volodina and David Alfter. 2016. Coursebook texts as a helping hand for classifying linguistic complexity in language learners' writings. To appear in Proceedings of the workshop on Computational Linguistics for Linguistic Complexity (CL4LC), COLING 2016, Osaka, Japan.
  • Ildikó Pilán, Elena Volodina. Classification of Language Proficiency Levels in Swedish Learners' Texts. 2016. Proceedings of SLTC 2016, Umeå, Sweden
  • Elena Volodina, Ildikó Pilán, David Alfter. 2016. Classification of Swedish learner essays by CEFR levels. To appear in Proceedings of EuroCALL 2016, Cyprus.