SweLL - Infrastructure for L2 Swedish

Umbrella project

ICALL - Intelligent Computer-Assisted Language Learning

Full name: SweLL - research infrastructure for Swedish as a second language, RJ, 2017-2020

(SweLL - Swedish Learner Language)

Background

With a growing number of people seeking asylum in Sweden, second language (L2) teaching and the development of such a practice are of great importance to Swedish society. The government has recently initiated a project on learning among new arrivals. One focus of that project is producing tools for the evaluation of L2 Swedish, an aim to which the SweLL project contributes directly.

General description

The purpose of SweLL is to set up an infrastructure for collection, digitization, normalization, and annotation of learner production, as well as to make available a linguistically annotated corpus of approx. 600 L2 learner texts. Such a corpus would make it possible to search for various types of linguistic structures, without the researcher having to guess what such a structure might look like, since there is a parallel normalized version available. L2 corpora are available for many other languages, but for Swedish such a resource is lacking.

Aims of the project

To fill the needs of the L2 research field, SweLL will create an infrastructure consisting of:

a data collection portal that accepts data through file import and via online exercises
methods and tools for L2 analysis
an annotated corpus of L2 production
dedicated search tools for L2 material that support filtering, e.g. for texts written by male writers or by writers at a certain proficiency level.

The material and tools will be made accessible for research. Apply for access.

See an interview about SweLL data with Elena Volodina (April 2018)

See an interview about the importance of interoperability of L2 resources and tools with Elena Volodina (October 2018)

Institutions/organisations

Project leader: Elena Volodina, Språkbanken, University of Gothenburg

Four universities participate in the project:

University of Gothenburg, coordinating partner: Julia Prentice, Monica Reichenberg, Elena Volodina
Uppsala University: Beata Megyesi
Stockholm University: Lisa Rudebeck, Gunlög Sundberg, Mats Wirén
Umeå University: Lena Granstedt

Financing

The project is financed by Riksbankens Jubileumsfond during the years 2017-2019 through grant IN16-0464:1.

Co-financing comes from

University of Gothenburg, Department of Swedish, Multilingualism, Language Technology, as a contribution to development of university infrastructure activities
Nationella Språkbanken -- jointly funded by its 10 partner institutions and the Swedish Research Council (2018--2024; dnr 2017-00626).
Department of Swedish Language and Multilingualism at Stockholm University - through a networking grant

Presentations

Elena Volodina (Spring 2021, Department of Swedish, UGOT) SweLL learner corpus: statistics, Korp access and more.
Elena Volodina, Yousuf Ali Mohammed, Sandra Derbring, Arild Matsson and Beata Megyesi (December 2020), COLING-2020, poster presentation. Towards privacy by design in learner corpora research: A case of on-the-fly pseudonymization of Swedish learner essays.
Elena Volodina (2020-11-25), NLP4CALL 2020, replacement keynote talk. Pseudonymization of learner corpora. [Slides]
Elena Volodina (2020-09-23), Baltic HLT 2020, keynote talk. Learner corpora – overcoming challenges with building and sharing the data. [Slides]
Elena Volodina (September 2019). NLP4CALL, organizer talk. SVALA - pseudonymization service for L2 Swedish. [Slides]
Elena Volodina, Arild Matsson (Dan Rosén and Mats Wirén) (September 2019). SVALA: an Annotation Tool for Learner Corpora generating parallel texts. Learner Corpus Research Conference-2019. Warsaw, Poland. [Slides]
Rudebeck, Lisa, Sundberg, Gunlög & Wirén, Mats. (12 February 2019). SweLL: En forskningsinfrastruktur för svenska som andraspråk. Higher Seminar at the Department of Language Education, Stockholm University.
Rudebeck, Lisa, Sundberg, Gunlög & Wirén, Mats. (10 April 2019). SweLL: En forskningsinfrastruktur för svenska som andraspråk. Higher Seminar at the Department of Swedish Language and Multilingualism, Stockholm University. (see pdf)
Lena Granstedt, Julia Prentice, Lisa Rudebeck & Gunlög Sundberg (August 2019). Annotating Swedish Learner Language. Insights from designing and implementing the SweLL correction taxonomy. The 29th Conference of the European Second Language Association (EuroSLA), Lund, 29-31 August 2019. (see abstract in pdf)
Elena Volodina, Lena Granstedt, Beáta Megyesi, Julia Prentice, Dan Rosén, Carl-Johan Schenström, Gunlög Sundberg & Mats Wirén. (2018). Annotation of learner corpora: first SweLL insights. Proceedings of SLTC-2018, Stockholm, Sweden [pdf]
Ildikó Pilán (and Elena Volodina). Exploring word embeddings and phonological similarity for the unsupervised correction of language learner errors. Poster presentation at the COLING 2018 SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH).
Elena Volodina. Towards a research infrastructure for Second Language Acquisition and teaching: case of L2 Swedish. Guest talk at the University of Ljubljana, Slovenia, June 7 2018. [Slides]
Elena Volodina. Annotation of L2 corpora for NLP and SLA studies: case of SweLL. Keynote talk at INDUS network meeting, Tübingen, Germany, 28 Feb-1 March 2018. [Slides]
Mats Wirén. SweLL - an upcoming infrastructure for Swedish as a Second Language. CLARIN L2 workshop, Gothenburg, 6-8 Dec 2017. [Slides]
Elena Volodina. Legal issues in learner essay collection. CLARIN L2 workshop, Gothenburg, 6-8 Dec 2017. [Slides]
Julia Prentice. Error taxonomy and other considerations in the SweLL project. CLARIN L2 workshop, Gothenburg, 6-8 Dec 2017. [Slides]
Dan Rosén. The SweLL normalization editor for learner texts. CLARIN L2 workshop, Gothenburg, 6-8 Dec 2017. [Slides] [More]
Prentice, Julia & Volodina, Elena. SweLL - Forskningsinfrastruktur för svenska som andraspråk. Conference contribution at Svenskans beskrivning 36, Uppsala University, 25-27 October 2017. [Slides]
"Infrastruktur för svensk andraspråksforskning (och annan svensk forskning). Möten mellan andraspråksforskning och datalingvistik". Arbeitstagung der Skandinavistik (ATDS). Kiel, 27-29 September 2017. (by Julia Prentice) [slides]
"Legal and ethical issues when dealing with learner essays" - Presentation at an NCN workshop (Nordic CLARIN Network), September 2017. (by Elena Volodina)
"Situation and legal problems with collecting learner texts" - Presentation at a meeting of enet-COLLECT, September 2017, Bolzano, Italy. (by Elena Volodina) [slides]
"SweLL in a nutshell" - Presentation at Språkbanken's internal talk series, August, 2017, Gothenburg, Sweden. (by Elena Volodina) [slides]
"Crowdsourcing Second Language learner data: experiences and prospects" - Presentation at a meeting of a European Network of e-Lexicography, February 2017, Budapest, Hungary (by Elena Volodina). [slides]
"SweLL - forskningsinfrastruktur för svenska som andraspråk" - Presentation at the Swedish Language Council (Språkrådet), February 2017, Stockholm, Sweden. (by Elena Volodina) [slides]
"A Friend in Need? Research agenda for electronic Second Language infrastructure" - Presentation at SLTC, November 2016, Umeå, Sweden. (by Elena Volodina) [slides]

Blogs

Pseudonymization of learner essays as a way to meet GDPR requirements: https://spraakbanken.gu.se/blogg/index.php/2020/10/27/pseudonymization-… (October 2020)
Korp searches in Second Language data: https://spraakbanken.gu.se/blogg/index.php/2020/06/17/korp-searches-in-… (June 2020)
Interoperability of second language resources and tools: https://www.clarin.eu/news/blog-post-elena-volodina-clarin-workshop-int… (2018-01-24)

Special events

National Swe-CLARIN workshop on searches in digital L2 resources. May, 2018, Stockholm, Sweden. [Website]
International CLARIN workshop on Interoperability of L2 resources and tools. December, 2017, Gothenburg, Sweden. [Website]

Visions and plans

2017

Drafting guidelines (transcription, normalization, correction annotation, code taxonomy, pseudonymization and pseudonymization taxonomy, data handling flow, step-by-step involvement of schools).
Development of tool prototypes (kiosk, SVALA, portal).
Setting up collaboration with schools.
Settling legal and ethical issues.

2018

Testing guidelines and tools internally within the project group, with iterative improvements until finalized.
Translation of metadata forms.
Initiating essay collection from schools according to the agreed flow and metadata forms, including use of the translated forms.

2019

Full-scale essay collection
Initial essay annotation.
Tool maintenance.
Further development of tool functionalities.

2020

Full-scale annotation (pseudonymization, normalization, correction annotation).
(Minimal) search and visualization in Korp (and possibly in Strix).
Import from the SweLL portal to Korp.
Download of data from the SweLL portal.
Upload of new data to the SweLL portal.
Release in spring 2021.

Future plans:

extension of the SVALA tool to other languages
collection of new essays via Lärka (with on-the-fly pseudonymization) to bypass the kiosk steps (transcription, pseudonymization, filling in metadata forms)

Publications

2019

David Alfter, Lars Borin, Ildikó Pilán, Therese Lindström Tiedemann, Elena Volodina (2019): Lärka: From Language Learning Platform to Infrastructure for Research on Language Learning, in Linköping Electronic Conference Proceedings
Elena Volodina, Lena Granstedt, Arild Matsson, Beáta Megyesi, Ildikó Pilán, Julia Prentice, Dan Rosén, Lisa Rudebeck, Carl-Johan Schenström, Gunlög Sundberg, Mats Wirén (2019): The SweLL Language Learner Corpus: From Design to Annotation, in Northern European Journal of Language Technology, volume 6, pages 67-104
Elena Volodina, Arild Matsson, Dan Rosén, Mats Wirén (2019): SVALA: an Annotation Tool for Learner Corpora generating parallel texts, in Learner Corpus Research conference (LCR-2019), Warsaw, 12-14 September 2019, Book of abstracts

2018

Dan Rosén, Mats Wirén, Elena Volodina (2018): Error Coding of Second-Language Learner Texts Based on Mostly Automatic Alignment of Parallel Corpora, in Proceedings of CLARIN-2018 conference, 8-10 October 2018, Pisa, Italy
Beata Megyesi, Lena Granstedt, Sofia Johansson, Julia Prentice, Dan Rosén, Carl-Johan Schenström, Gunlög Sundberg, Mats Wirén, Elena Volodina (2018): Learner Corpus Anonymization in the Age of GDPR: Insights from the Creation of a Learner Corpus of Swedish, in Proceedings of the 7th Workshop on NLP for Computer Assisted Language Learning (NLP4CALL 2018) at SLTC, Stockholm, 7th November 2018 / edited by Ildikó Pilán, Elena Volodina, David Alfter and Lars Borin
Elena Volodina, Maarten Janssen, Therese Lindström Tiedemann, Nives Mikelic Preradovic, Silje Karin Ragnhildstveit, Kari Tenfjord, Koenraad de Smedt (2018): Interoperability of Second Language Resources and Tools, in Proceedings of CLARIN-2018 conference
David Alfter, Lars Borin, Ildikó Pilán, Therese Lindström Tiedemann, Elena Volodina (2018): From Language Learning Platform to Infrastructure for Research on Language Learning, in Proceedings of CLARIN-2018 conference, Pisa, Italy
Elena Volodina, Lena Granstedt, Beáta Megyesi, Julia Prentice, Dan Rosén, Carl-Johan Schenström, Gunlög Sundberg, Mats Wirén (2018): Annotation of learner corpora: first SweLL insights, in Proceedings of SLTC 2018, Stockholm, October 7-9, 2018
Ildikó Pilán, Elena Volodina (2018): Exploring word embeddings and phonological similarity for the unsupervised correction of language learner errors, in Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, COLING, Santa Fe, New Mexico, USA, August 25, 2018
Mats Wirén, Arild Matsson, Dan Rosén, Elena Volodina (2018): SVALA: Annotation of Second-Language Learner Text Based on Mostly Automatic Alignment of Parallel Corpora, in Selected papers from the CLARIN Annual Conference 2018, Pisa, 8-10 October 2018 / edited by Inguna Skadina, Maria Eskevich

2016

Elena Volodina, Beata Megyesi, Mats Wirén, Lena Granstedt, Julia Prentice, Monica Reichenberg, Gunlög Sundberg (2016): A Friend in Need? Research agenda for electronic Second Language infrastructure, in Proceedings of the Swedish Language Technology Conference