Swedish derivational morphology with CoDeRooMor

This blog post is based on joint work by Elena Volodina, Therese Lindström Tiedemann and Yousuf Ali Mohammed within the RJ-funded project L2 profiles. Three annotators have contributed to this work: Stellan Petersson (University of Gothenburg), Beatrice Silén (University of Helsinki) and Maisa Lauriala (University of Helsinki). Do you know how many prefixes or suffixes the Swedish language has? Which ones? Different sources state different numbers, e.g. Thorell (1984) lists approx. 90 derivational suffixes and about 50 derivational prefixes; Hultman (2003) …

Reflections from SLTC 2020

On 25-27 November, the eighth edition of SLTC, the Swedish Language Technology Conference, took place at Humanisten here in Gothenburg. Or rather, it would have, had a certain virus not put a stop to it. Instead, like everyone else, we had to switch to a fully digital edition, but that worked too. We set a record for the number of registrations: 193 participants from 34 different countries! (The majority, 60%, came from Sweden, though.) Not everyone showed up, of course: for one thing, registration was free, and for another, …

Pseudonymization of learner essays as a way to meet GDPR requirements

This blog post is based on the author’s (Elena Volodina’s) joint research with Yousuf (Samir) Ali Mohammed, Arild Matsson, Beáta Megyesi and Sandra Derbring.

Access to language data is an obvious prerequisite for research in digital humanities in general, and for the development of NLP-based tools in particular. However, making data accessible becomes a challenge where personal data is involved. This is especially true of language learner data, where tasks are often phrased so that they, directly or indirectly, elicit explicit personal information, …

Korp searches in Second Language data

Korp offers many different corpus collections for various types of search (and research). Swedish as a Second Language (L2) is one of the subcategories of the language that can be studied with the help of Korp. At the moment, Korp provides access to five L2 corpora through its interface: ASU – Andraspråksutveckling; SpIn – texts from the centre for Språkintroduktion; SW1203 – texts from a preparatory course for university students; SweLL – Swedish Learner Language – adult-written essays from a …

Common Pitfalls in the Development of ICALL Applications

This blog post is an opinion piece in which I sketch the process of developing NLP-based applications for second language learning and look at that process from the point of view of typical (mis)conceptions and challenges, as I have experienced them. Are we over-trusting the potential of NLP? Are teachers by definition reluctant to use NLP-based solutions in classrooms? How, if at all, can universities ensure the sustainability of the developed applications?

1 Introduction

Natural Language Processing (NLP) and Language Technology (LT) deal …

A multilingual annotated corpus of world’s natural language descriptions

Shafqat Mumtaz Virk, Harald Hammarström, Markus Forsberg, Søren Wichmann

The diversity of the world's 7000 languages represents an irreplaceable and abundant resource for understanding the unique communication system of our species (Evans and Levinson, 2009). All comparison and analysis of languages starts from language descriptions: publications that contain facts about particular languages. Typical examples of this genre are grammars and dictionaries (Hammarström and Nordhoff, 2011). Until recently, language descriptions were available in paper form only, with indexes as the …

Analyzing data from the Swedish Parliament

The Swedish Parliament (Riksdagen) continuously releases open data on its website, including the documents approved and used during parliamentary sessions as well as how each member of parliament votes in each roll call (voting session). This data can be used to gain insight into which topics members of parliament and parties discuss and vote on. In this post, I will provide some example analyses that were performed with Python, but they could be done similarly with many other programming languages with data …

What are probing tasks in NLP?

In recent years, neural-network-based approaches (i.e. deep learning) have been the main models for state-of-the-art systems in natural language processing, whether in machine translation, natural language inference, language modeling or sentiment analysis. At the same time, researchers have asked themselves what kind of linguistic information these neural networks are able to capture. Answering this question is not a trivial undertaking: state-of-the-art models are usually multiple layers deep, with non-linear transformations learned through billions of mathematical operations. The benchmarks …

Using Språkbanken corpora in NLTK

At Språkbanken we collect resources, mainly lexica and corpora, most of them in Swedish. So far we have collected Swedish corpora totalling 13 billion words, in all kinds of genres and from all time periods. Most of our corpora are not manually annotated, and the ones that are annotated usually have only one kind of annotation (e.g., part of speech, lemmas, dependency structures, constituent structure, etc.). To be able to use the same tools to analyse any corpus, we have devised …