SweLL release, 26-27 August 2021, online


Release: SweLL - research infrastructure for Swedish as a Second Language

The event is supported by Nationella Språkbanken, Swe-Clarin, and the Learner Corpus Association. The major project funding came from Riksbankens Jubileumsfond.

Additional financial support came from Språkbanken Text (VR) and Stockholm University. 


The SweLL corpus is a newly compiled and manually annotated corpus of essays written by adult learners of Swedish as a second language. The essays have been manually pseudonymized, normalized, and annotated for corrections. Various tools have been developed to support annotation, annotation management, and search. Welcome to the release of both the corpus and the tools, which will be presented by the SweLL project team!

Registration information: please register for the event. (Registration closed)

Zoom webinar: link mailed to the registered participants

Preliminary program

Please note that all times are given in CET.

  AUGUST 26, 2021
13:00 - 13:35 Elena Volodina. SweLL: introductory overview [slides] [video, 26 min]
13:35 - 14:05 Beáta Megyesi. Pseudonymization in SweLL [slides]
14:05 - 14:30 Coffee break
14:30 - 15:00 Gunlög Sundberg. Selection of data and metadata: balance and representativity [slides] [video, 11 min]
15:00 - 15:30 Julia Prentice and Mats Wirén. Tutorial 1. Searches in L2 Korp and SVALA: demo, access, registration [video, 23 min. Note: skip the first 10 seconds]
15:30 - 16:00 Coffee break
16:00 - 17:00 Brian MacWhinney. (Invited talk) TalkBank Learner Corpora and Programs [slides] [video, 50 min]


  AUGUST 27, 2021
09:00 - 10:00 Tove Larsson. (Invited talk) Manual data coding in Learner Corpus Research: Challenges and solutions. [video, 37 min]
10:00 - 10:15 Coffee break
10:15 - 10:45 Gunlög Sundberg. Normalization of the SweLL essays [slides] [video, 21 min]
10:45 - 11:15 Lisa Rudebeck. Correction annotation of the SweLL essays [slides] [video, 29 min]
11:15 - 11:30 Coffee break
11:30 - 12:00 Samir Ali Mohammed. SweLL portal, statistics and derivative datasets [slides] [video, 20 min]
12:00 - 12:30 Mats Wirén and Lena Granstedt. Tutorial 2. Searches in L2 Korp and SVALA [video, 26 min]


Invited speakers

Brian MacWhinney, Carnegie Mellon University, USA

Brian MacWhinney is Teresa Heinz Professor of Psychology, Computational Linguistics, and Modern Languages at Carnegie Mellon University.  He received his Ph.D. in psycholinguistics in 1974 from the University of California at Berkeley.  With Elizabeth Bates, he developed a model of first and second language processing and acquisition based on competition between item-based patterns. In 1984, he and Catherine Snow co-founded the CHILDES (Child Language Data Exchange System) Project for the computational study of child language transcript data.  This system has extended to 13 additional research areas such as aphasiology, second language learning, TBI, Conversation Analysis, developmental disfluency and others in the shape of the TalkBank Project. MacWhinney’s recent work includes studies of online learning of second language vocabulary and grammar, situationally embedded second language learning, neural network modeling of lexical development, fMRI studies of children with focal brain lesions, and ERP studies of between-language competition. He also explores the role of grammatical constructions in the marking of perspective shifting, the determination of linguistic forms across contrasting time frames, and the construction of mental models in scientific reasoning.  Recent edited books include The Handbook of Language Emergence (Wiley) and Competing Motivations in Grammar and Usage (Oxford).

Title: TalkBank Learner Corpora and Programs


The TalkBank system is designed to make spoken language data available to the general research community. The system now includes 14 separate databanks. The database that deals most directly with second language learning is SLABank. However, data in other banks such as BilingBank, CHILDES, ClassBank, and CABank may also be of some interest to SLA researchers. Corpora in all of the banks are transcribed in a uniform format called CHAT, which is also defined by a JSON Schema. The TalkBankDB search engine uses this schema to permit a wide variety of searches across and within all 14 databases. The CLAN programs rely on this format to compute automatic morphosyntactic analyses, which are then fed into profiling commands to judge speaker fluency, vocabulary diversity, and syntactic complexity. This presentation will demonstrate the contents of the database through the TalkBank Browser, which allows for direct playback of interactions over the web. A series of language learning modules illustrates further ways in which corpora can be joined with instructional modules. Possible integrations with other CLARIN databases such as SweLL and Språkbanken will be discussed.


Tove Larsson, Uppsala University (Sweden) & Northern Arizona University (USA)

Tove Larsson received her Ph.D. in English linguistics from Uppsala University and is currently an Assistant Professor at Northern Arizona University in the US. She is also affiliated with Uppsala University (Sweden) and the Centre for English Corpus Linguistics at the University of Louvain (Belgium). She specializes in learner corpus research (specifically register variation and lexicogrammar in L2 writing) and research methods. Her work appears in journals such as the International Journal of Corpus Linguistics, Corpus Linguistics and Linguistic Theory, and the International Journal of Learner Corpus Research. Her most recent book is a co-authored volume titled Doing linguistics with a corpus: Methodological considerations for the everyday user (Egbert, Larsson, & Biber, 2020; Cambridge University Press). She is also the Principal Investigator for an international research project on research ethics (Larsson, Plonsky, Sterling, Kytö, & Hirschi, 2021-2023), funded by the Bank of Sweden Tercentenary Foundation and the Swedish Research Council.

Title: Manual data coding in Learner Corpus Research: Challenges and solutions


Researchers in Learner Corpus Research (LCR) commonly carry out manual coding of their data. We may, for example, manually classify our tokens into syntactic and/or functional categories. One example of categories that are often coded manually is predictors of particle placement (e.g., type of direct object, animacy and semantics of the verb; Paquot, Grafmiller, & Szmrecsanyi, 2019). Another example is semantic classifications of constructions (e.g., subject extraposition), using semantic categories such as ‘Hedges’ and ‘Attitude markers’ (e.g., Hedge: it seems that; Attitude marker: it is interesting that; see Larsson, 2018). However, such coding will inevitably introduce measurement error stemming from inconsistencies and inaccuracies that we, as fallible human researchers, introduce, thereby posing a threat to the internal validity of our research. Manual coding might be compromised for a variety of reasons, both systematic (e.g., due to ambiguity of the coding scheme, inadequate coder expertise or training and/or coder bias) and random (e.g., due to coder fatigue and/or typing errors); all of these will introduce inaccuracies into our results.
In this talk, I will discuss the ubiquity of manual coding in the field and how we can work toward increased reliability of our results. Specifically, using examples from a collaborative LCR project on adverb placement, I will focus on how tests of inter-rater reliability can be used to identify issues and initiate discussion of how to improve instrument design, piloting, and evaluation. I also suggest some ways forward to encourage increased transparency in reporting practices.
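As a rough illustration of what a test of inter-rater reliability can look like, the sketch below computes Cohen's kappa for two coders who have labelled the same tokens. This is one commonly used statistic and is not necessarily the measure used in the talk; the function name and the example labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same items.

    Illustrative sketch only; the abstract does not say which
    reliability statistic the speaker uses.
    """
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must label the same non-empty set of items")

    n = len(coder_a)

    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n

    # Chance agreement: product of each coder's marginal label proportions,
    # summed over labels.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both coders used one identical label throughout; kappa is
        # undefined here, so perfect agreement is reported by convention.
        return 1.0

    return (observed - expected) / (1 - expected)


# Hypothetical codings of four tokens by two coders.
coder_a = ["Hedge", "Hedge", "Attitude marker", "Attitude marker"]
coder_b = ["Hedge", "Attitude marker", "Attitude marker", "Attitude marker"]
kappa = cohen_kappa(coder_a, coder_b)
print(round(kappa, 2))
```

A low kappa on a pilot sample can point to categories in the coding scheme that need clearer definitions before the full data set is coded.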

