The five lives of Talbanken

A dependency tree from Talbanken

This post is about Talbanken, one of the most widely used and important Swedish corpora. At least five versions of this treebank exist, and the purpose of this post is to reduce the ambiguity of the name "Talbanken", which sometimes leads to confusion. I am going to list the five versions, explain the basic differences between them, and suggest unambiguous version names.

On parts of speech for the Swedish language

Part-of-speech classification is used in many language technology tools because it is a way to distinguish between different uses of a word. With parts of speech, it becomes easier to search for similar words and expressions in large amounts of text, or to create a new text with a similar form. Automatically assigning the words in a text to parts of speech is therefore one of the fundamental methods in artificial intelligence for getting computers to understand human language. Humans have long divided words into different classes or categories, depending on …

Common Pitfalls in the Development of ICALL Applications

This blog post is an opinion piece in which I sketch the process of developing NLP-based applications for second language learning and examine it from the point of view of typical (mis)conceptions and challenges, as I have experienced them. Are we over-trusting the potential of NLP? Are teachers by definition reluctant to use NLP-based solutions in classrooms? How, if at all, can universities ensure the sustainability of the applications they develop? 1 Introduction Natural Language Processing (NLP) and Language Technology (LT) deal …

A data-intensive research methodology 1

In a world where AI is taking up more and more space, data-driven research has become the phrase on everyone's lips. In this blog post I want to talk a little about what it means to do research using large amounts of text data, primarily in the humanities. This post is the first in a series on the different parts of a data-intensive research methodology. I, the author of this post, am Nina Tahmasebi, an associate professor (docent) in Language Technology and a data scientist with over 12 years of experience of …

A multilingual annotated corpus of world’s natural language descriptions

Shafqat Mumtaz Virk, Harald Hammarström, Markus Forsberg, Søren Wichmann. The diversity of the world's 7000 languages represents an irreplaceable and abundant resource for understanding the unique communication system of our species (Evans and Levinson, 2009). All comparison and analysis of languages departs from language descriptions, that is, publications that contain facts about particular languages. Typical examples of this genre are grammars and dictionaries (Hammarström and Nordhoff, 2011). Until recently, language descriptions were available in paper form only, with indexes as the …

Meaning through sensory data

Recently, we have seen a surge of methods that claim to embed meaning from textual corpora. But is that possible? Can text really reveal meaning, and if so, can current NLP methods detect it? Can our methods, as they sometimes claim, understand? Perhaps the larger question is the following: can we bring meaning to words using only the information stored in text? This question is essential for any Artificial Intelligence (AI) system that uses text as a basis. Let us take …

The Gothenburg H70 birth cohort studies and the digital assessment of neuropsychological tests

A comment often received from reviewers of manuscripts submitted to scientific conferences and journals concerns how representative the sample under scrutiny is, and whether there are solid arguments for accepting that the population characteristics, and particularly the features extracted from the empirical data acquired from such a population (e.g. from speech production), provide sufficient or accurate enough information to use in various algorithmic approaches (e.g. in machine learning). State-of-the-art studies on computational methods to identify signs of cognitive deterioration in language …