1. L2 productive lexicon - SweLLex
SweLL stands for Swedish Learner Language. In this project, we have created a vocabulary list from the SweLL corpus, an electronic collection of second language (L2) learner essays. Each SweLLex entry consists of a lemma and part-of-speech (POS) combination together with its frequency counts. For each entry, raw and normalized frequencies are given both for the corpus as a whole and for each CEFR level (Council of Europe, 2001). As such, the list is descriptive and shows the distribution of vocabulary across CEFR levels.
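As a rough illustration of the entry structure described above, the following Python sketch builds per-level raw and normalized counts from (lemma, POS, CEFR level) triples. The function name, the toy tokens and the per-million-token normalization are assumptions made for this example; the normalization actually used in SweLLex may differ.

```python
from collections import Counter, defaultdict


def build_frequency_list(tokens, per=1_000_000):
    """Build a toy frequency list from (lemma, pos, cefr_level) triples.

    Assumption: "normalized" means occurrences per `per` tokens, computed
    against the token total of each level (or of the whole corpus).
    """
    raw = defaultdict(Counter)   # (lemma, pos) -> Counter of level -> count
    level_totals = Counter()     # level -> number of tokens at that level
    for lemma, pos, level in tokens:
        raw[(lemma, pos)][level] += 1
        level_totals[level] += 1

    corpus_total = sum(level_totals.values())
    entries = {}
    for key, counts in raw.items():
        entry_total = sum(counts.values())
        entries[key] = {
            "raw": dict(counts),
            "normalized": {
                level: count * per / level_totals[level]
                for level, count in counts.items()
            },
            "raw_total": entry_total,
            "normalized_total": entry_total * per / corpus_total,
        }
    return entries


# Invented toy data, not taken from the SweLL corpus.
sample = [
    ("hus", "NN", "A1"), ("hus", "NN", "A1"),
    ("vara", "VB", "A1"), ("vara", "VB", "A1"),
    ("hus", "NN", "B1"), ("samhälle", "NN", "B1"),
    ("vara", "VB", "B1"), ("vara", "VB", "B1"),
]
entries = build_frequency_list(sample)
```

Keying entries on the (lemma, POS) pair keeps homographs with different parts of speech apart, as the list description requires.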
SweLLex makes it possible to analyze the productive vocabulary that L2 learners of Swedish demonstrate in essay writing, and to observe the relation between receptive vocabulary at different levels of L2 development, as captured in the SVALex list, and productive vocabulary, as captured in SweLLex. In addition, we were able to observe spelling deviations in L2 writing and to experiment with Levenshtein distance as an approach to word-level normalization.
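To make the idea of Levenshtein-based word-level normalization concrete, here is a minimal Python sketch. It maps a possibly misspelled word to the closest form in a reference lexicon if the edit distance stays under a threshold. The function names, the threshold of 2 and the toy lexicon are assumptions for illustration only and do not describe the procedure used in the project.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # deletion from a
                curr[j - 1] + 1,     # insertion into a
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def normalize(word, lexicon, max_distance=2):
    """Return the closest lexicon form, or None if nothing is close enough.

    Ties are broken alphabetically, which is an arbitrary choice here.
    """
    if word in lexicon:
        return word
    if not lexicon:
        return None
    best = min(lexicon, key=lambda form: (levenshtein(word, form), form))
    return best if levenshtein(word, best) <= max_distance else None


# Hypothetical reference lexicon and learner spellings.
lexicon = ["skola", "hus", "samhälle"]
corrected = [normalize(w, lexicon) for w in ["skolla", "samhäle", "hus", "xyzzy"]]
```

A plain distance threshold cannot decide between equally distant candidates, so a real system would likely also weigh frequency or context.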
The immediate intended use of the list is in essay grading algorithms that classify Swedish L2 essays by the proficiency level reached, and in automated exercise generation, where the scope of the target vocabulary is needed to generate appropriate exercise items.
Future refinement of the list will consist of enriching its items with additional grammatical, topical and lexicographic information, and of assigning a single target level at which each word is expected to become active. In addition, a user-friendly website will be implemented where the list can be searched and downloaded, and where the distributions of the same lexical item in SVALex and SweLLex can be compared.
Browse/download the list here
SweLLex is a part of a collection of CEFRLex resources
- Elena Volodina, Ildikó Pilán, Lorena Llozhi, Baptiste Degryse, Thomas François. SweLLex: second language learners' productive vocabulary. To appear in Proceedings of the workshop on NLP4CALL&LA. NEALT Proceedings Series / Linköping Electronic Conference Proceedings [pdf]
- David Alfter, Yuri Bizzoni, Anders Agebjörn, Elena Volodina and Ildikó Pilán. From Distributions to Labels: A Lexical Proficiency Analysis using Learner Corpora. To appear in Proceedings of the workshop on NLP4CALL&LA. NEALT Proceedings Series / Linköping Electronic Conference Proceedings
- Lorena Llozhi. 2016. SWELL LIST. A list of productive vocabulary generated from second language learners' essays. Master Thesis in Language Technologies. Department of Philosophy, Linguistics and Theory of Science, University of Gothenburg.
2. L2 receptive list - SVALex
This project builds upon the work done in the Kelly project during 2009-2011 (https://spraakbanken.gu.se/eng/kelly), an EU-funded project that built learner-oriented, frequency-based monolingual and bilingual word lists for 9 languages for use in a commercial language learning tool (Kilgarriff et al. 2014). During the post-Kelly period we identified a number of weaknesses in the Swedish list, which we intend to address in this project:
- validity of the list as far as streaming of vocabulary into CEFR levels is concerned;
- presence of relevant vocabulary per CEFR level, namely questions such as: which vocabulary should be added, removed or relocated in the list with regard to the CEFR guidelines?
- domain-specific vocabulary according to CEFR themes – which words, which levels, how many per level?
Browse/download the list here
SVALex is a part of a collection of CEFRLex resources
Project description and financing
To address these issues, our main steps have up to now included:
- Compiling a corpus of CEFR-based reading comprehension texts, the COCTAILL corpus (2013-2014)
- Generating SVALex, a word list graded for CEFR levels, from COCTAILL. This list reflects the distribution of vocabulary across CEFR levels (2015-2016)
The steps that will be addressed in the future include:
- Streaming SVALex lexical items into target/peripheral vocabulary at each level
- Comparing SVALex items with Kelly for overlaps and consistency in levels, work that should produce a merged Kelly-SVALex list
- Enriching the merged Kelly-SVALex resource with domain and other information
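The planned comparison of SVALex and Kelly could be sketched roughly as follows. The sketch assumes that each list has already been reduced to a mapping from (lemma, POS) to a single CEFR level, which for SVALex presupposes the streaming step listed above. The function name and the toy entries are hypothetical and are not taken from either resource.

```python
def compare_levels(svalex, kelly):
    """Compare two dicts mapping (lemma, pos) -> CEFR level label."""
    shared = svalex.keys() & kelly.keys()
    return {
        "agree": sorted(k for k in shared if svalex[k] == kelly[k]),
        "disagree": sorted(k for k in shared if svalex[k] != kelly[k]),
        "only_svalex": sorted(svalex.keys() - kelly.keys()),
        "only_kelly": sorted(kelly.keys() - svalex.keys()),
    }


# Invented toy entries for illustration only.
svalex_toy = {("hus", "NN"): "A1", ("samhälle", "NN"): "B1", ("läsa", "VB"): "A1"}
kelly_toy = {("hus", "NN"): "A1", ("samhälle", "NN"): "A2", ("bil", "NN"): "A1"}
report = compare_levels(svalex_toy, kelly_toy)
```

Items in the "disagree" group are the natural candidates for manual review before the two lists are merged.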
This work is financed by the Department of Swedish (UGOT) through a pilot project on "Kelly validation", as well as by the Center for Language Technology (CLT, UGOT) and the Swedish Language Bank (Språkbanken, UGOT).
This project will help us identify a concrete lexical curriculum for CEFR-based courses in Swedish, in terms of both WHICH words and HOW MANY words a student should acquire at each level. The resulting list can be used as an instrument for training vocabulary, e.g. in exercise/test generators like Lärka (https://spraakbanken.gu.se/l); for testing whether authentic examples (texts and sentences) are appropriate for learners at different proficiency levels; for assessing language proficiency in L2 learner production, etc.
- Elena Volodina, Ildikó Pilán, Thomas François & Anaïs Tack. 2016. SVALex: en andraspråksordlista graderad enligt CEFR nivåer. Proceedings of Svenskans beskrivning 35, Göteborg 2016. [pdf]
- Thomas François, Elena Volodina, Ildikó Pilán, Anaïs Tack. SVALex: a CEFR-graded lexical resource for Swedish foreign and second language learners. Proceedings of LREC 2016, Slovenia. [pdf]
- Ildikó Pilán, Sowmya Vajjala, Elena Volodina. 2015. A readable read: Automatic Assessment of Language Learning Materials based on Linguistic Complexity. To appear in International Journal of Computational Linguistics and Applications (IJCLA). [pdf]
- Elena Volodina, Ildikó Pilán, Stian Rødven Eide and Hannes Heidarsson 2014. You get what you annotate: a pedagogically annotated corpus of coursebooks for Swedish as a Second Language. Proceedings of the third workshop on NLP for computer-assisted language learning. NEALT Proceedings Series 22 / Linköping Electronic Conference Proceedings 107: 128–144. [pdf]
3. Kelly word lists
The Kelly project is described here.