Koala – Korp's linguistic annotations

The project Koala -- Korp's linguistic annotations -- was aimed at developing an infrastructure for text-based research with high-quality annotations.

The corpus infrastructure Korp at Språkbanken (http://spraakbanken.gu.se) contains large amounts of Swedish texts, of many different types and ages, which are used by a wide range of researchers and the general public. The texts are enriched with linguistic annotations, such as word classes and syntactic roles, which help the user filter the search results. They allow us to find "am", "is", and "are" when looking for "be", to find all mentions of Caesar as the object of the verb defeat while ignoring those where he is the subject, and to distinguish verbal uses of bend ("to bend the iron") from nominal ones ("a bend in the road"). The quality of these annotations is crucial for getting good search results, in particular for researchers who may otherwise have to look at thousands of irrelevant sentences.
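As a rough illustration of how such annotations narrow a search, the sketch below filters tokens by lemma, word class, and syntactic role. It is not Korp's query interface: the `Token` fields, the tag names (`VERB`, `NOUN`, `subj`, `obj`), the helper functions, and the English sample sentences are all hypothetical and chosen only to mirror the examples in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Token:
    """One annotated token. All field names and tag values are illustrative assumptions."""
    word: str    # surface form as it appears in the text
    lemma: str   # dictionary form, e.g. "be" for "are"
    pos: str     # word class, e.g. "VERB" or "NOUN"
    deprel: str  # syntactic role relative to the head, e.g. "subj" or "obj"
    head: int    # index of the head token in the sentence, -1 for the root


def words_with_lemma(sentence: List[Token], lemma: str,
                     pos: Optional[str] = None) -> List[str]:
    """Return surface forms whose lemma (and optionally word class) match."""
    return [t.word for t in sentence
            if t.lemma == lemma and (pos is None or t.pos == pos)]


def is_object_of(sentence: List[Token], word: str, verb_lemma: str) -> bool:
    """True if `word` occurs as the object of a verb with lemma `verb_lemma`."""
    return any(
        t.word == word
        and t.deprel == "obj"
        and t.head >= 0
        and sentence[t.head].lemma == verb_lemma
        for t in sentence
    )


# Hand-annotated toy sentences (hypothetical data, not taken from any corpus).
BRUTUS_WINS = [
    Token("Brutus", "Brutus", "PROPN", "subj", 1),
    Token("defeated", "defeat", "VERB", "root", -1),
    Token("Caesar", "Caesar", "PROPN", "obj", 1),
]
CAESAR_WINS = [
    Token("Caesar", "Caesar", "PROPN", "subj", 1),
    Token("defeated", "defeat", "VERB", "root", -1),
    Token("Pompey", "Pompey", "PROPN", "obj", 1),
]
THEY_ARE_HERE = [
    Token("They", "they", "PRON", "subj", 1),
    Token("are", "be", "VERB", "root", -1),
    Token("here", "here", "ADV", "adv", 1),
]
A_BEND = [
    Token("a", "a", "DET", "det", 1),
    Token("bend", "bend", "NOUN", "root", -1),
    Token("in", "in", "ADP", "case", 4),
    Token("the", "the", "DET", "det", 4),
    Token("road", "road", "NOUN", "mod", 1),
]

if __name__ == "__main__":
    print(words_with_lemma(THEY_ARE_HERE, "be"))
    print(is_object_of(BRUTUS_WINS, "Caesar", "defeat"))
    print(is_object_of(CAESAR_WINS, "Caesar", "defeat"))
    print(words_with_lemma(A_BEND, "bend", pos="VERB"))
```

With annotations of this kind, a search for the lemma "be" matches "are", a search for Caesar as the object of defeat skips the sentence where he is the subject, and a search restricted to verbs ignores the nominal "bend". Each of these filters is only as reliable as the annotation it rests on.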

The Koala project aims to enhance the annotations, which have been automatically created through well-known language technology methods. This is done by adding linguistic knowledge to the system through the many resources available at Språkbanken, and by combining the various annotation tools for lexical analysis, part-of-speech tagging, sense disambiguation, and syntactic analysis into a high-quality system where word-level and sentence-level annotations inform each other, and the system does not make decisions until it has all available information. The resulting data and tools will be freely available.

The project was financed 2014-2016 by Riksbankens jubileumsfond.

Publications

2019

Yvonne Adesam, Gerlof Bouma (2019): The Koala Part-of-Speech Tagset, in Northern European Journal of Language Technology, volume 6, pages 5-41

2018

Yvonne Adesam, Gerlof Bouma, Richard Johansson, Lars Borin, Markus Forsberg (2018): The Eukalyptus Treebank of Written Swedish, in Seventh Swedish Language Technology Conference (SLTC), Stockholm, 7–9 November 2018
Yvonne Adesam, Gerlof Bouma, Richard Johansson (2018): The Koala Part-of-Speech and Morphological Tagset for Swedish, in Seventh Swedish Language Technology Conference (SLTC), Stockholm, 7–9 November 2018

2016

Fabienne Cap, Yvonne Adesam, Lars Ahrenberg, Lars Borin, Gerlof Bouma, Markus Forsberg, Viggo Kann, Robert Östling, Aaron Smith, Mats Wirén, Joakim Nivre (2016): SWORD: Towards Cutting-Edge Swedish Word Processing, in Proceedings of the Sixth Swedish Language Technology Conference (SLTC), Umeå University, 17-18 November 2016
Gerlof Bouma, Yvonne Adesam (2016): Multiword Annotation in the Eukalyptus Treebank of Written Swedish, in PARSEME, 6th general meeting, 7-8 April 2016, Struga, FYR Macedonia
Richard Johansson, Yvonne Adesam, Gerlof Bouma, Karin Hedberg (2016): A Multi-domain Corpus of Swedish Word Sense Annotation, in 10th edition of the Language Resources and Evaluation Conference, 23-28 May 2016, Portorož (Slovenia)

2015

Yvonne Adesam, Gerlof Bouma, Richard Johansson (2015): Multiwords, Word Senses and Multiword Senses in the Eukalyptus Treebank of Written Swedish, in Proceedings of the Fourteenth International Workshop on Treebanks and Linguistic Theories (TLT14), 11–12 December 2015, Warsaw, Poland, pages 3-12
Yvonne Adesam, Gerlof Bouma, Richard Johansson (2015): Defining the Eukalyptus forest – the Koala treebank of Swedish, in Proceedings of the 20th Nordic Conference of Computational Linguistics, NODALIDA 2015, May 11-13, 2015, Vilnius, Lithuania. Edited by Beáta Megyesi, pages 1-9

2014

Yvonne Adesam, Lars Borin, Gerlof Bouma, Markus Forsberg, Richard Johansson (2014): Koala – Korp’s Linguistic Annotations: Developing an infrastructure for text-based research with high-quality annotations, in Proceedings of the Fifth Swedish Language Technology Conference, Uppsala, 13-14 November 2014
