The project Koala -- Korp's linguistic annotations -- aims to develop an infrastructure for text-based research with high-quality annotations.
The corpus infrastructure Korp at Språkbanken (http://spraakbanken.gu.se) contains large amounts of Swedish text of many different types and ages, used by a wide range of researchers and by the general public. The texts are enriched with linguistic annotations, such as word classes and syntactic roles, which help filter the search results for the user. They allow us to find "am", "is", and "are" when looking for "be", to find all mentions of Caesar as the object of the verb "defeat" while ignoring those where he is the subject, and to distinguish verbal uses of "bend" ("to bend the iron") from nominal ones ("a bend in the road"). The quality of these annotations is crucial for good search results, in particular for researchers who may otherwise have to look through thousands of irrelevant sentences.
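As a rough illustration of how such annotations can narrow a search, the three examples above can be sketched in a few lines of Python. This is a minimal sketch and not Korp's actual data model or query language: the `Token` fields, the tag names (`VERB`, `obj`, and so on), and the helper functions `match_tokens` and `has_dependent` are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Token:
    """One annotated token (hypothetical layout, not Korp's format)."""
    word: str             # surface form as it appears in the text
    lemma: str            # base form, e.g. "be" for "are"
    pos: str              # word class, e.g. "VERB" or "NOUN"
    deprel: str           # syntactic role relative to the head
    head: Optional[int]   # index of the head token, None for the root


Sentence = List[Token]

# A tiny hand-annotated sample corpus (illustrative data only).
SENTENCES: List[Sentence] = [
    [Token("Brutus", "Brutus", "PROPN", "subj", 1),
     Token("defeated", "defeat", "VERB", "root", None),
     Token("Caesar", "Caesar", "PROPN", "obj", 1)],
    [Token("Caesar", "Caesar", "PROPN", "subj", 1),
     Token("defeated", "defeat", "VERB", "root", None),
     Token("Pompey", "Pompey", "PROPN", "obj", 1)],
    [Token("They", "they", "PRON", "subj", 1),
     Token("are", "be", "VERB", "root", None),
     Token("here", "here", "ADV", "adv", 1)],
    [Token("A", "a", "DET", "det", 1),
     Token("bend", "bend", "NOUN", "root", None),
     Token("ahead", "ahead", "ADV", "adv", 1)],
    [Token("They", "they", "PRON", "subj", 1),
     Token("bend", "bend", "VERB", "root", None),
     Token("iron", "iron", "NOUN", "obj", 1)],
]


def match_tokens(sentence: Sentence, **constraints: str) -> List[Token]:
    """Return the tokens whose annotations equal every given constraint."""
    return [t for t in sentence
            if all(getattr(t, key) == value for key, value in constraints.items())]


def has_dependent(sentence: Sentence, word: str, deprel: str, head_lemma: str) -> bool:
    """True if `word` occurs with role `deprel` under a head with lemma `head_lemma`."""
    for t in sentence:
        if t.word == word and t.deprel == deprel and t.head is not None:
            if sentence[t.head].lemma == head_lemma:
                return True
    return False


# Lemma search: inflected forms of "be" are found via the lemma annotation.
be_forms = [t.word for s in SENTENCES for t in match_tokens(s, lemma="be")]

# Syntactic search: Caesar as the object of "defeat", not as its subject.
caesar_as_object = [s for s in SENTENCES
                    if has_dependent(s, "Caesar", "obj", "defeat")]

# Word-class search: verbal uses of "bend" only.
verbal_bend = [s for s in SENTENCES if match_tokens(s, lemma="bend", pos="VERB")]
```

In this sketch, a wrong tag on a single token (for instance, "bend" marked as a noun where it is a verb) is enough to make a sentence disappear from, or wrongly appear in, the results, which is why annotation quality matters so much for search.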
The Koala project aims to improve these annotations, which have been created automatically with well-known language technology methods. This is done by adding linguistic knowledge to the system from the many resources available at Språkbanken, and by combining the various annotation tools for lexical analysis, part-of-speech tagging, sense disambiguation, and syntactic analysis into a high-quality system where word-level and sentence-level annotations inform each other and the system does not make decisions until it has all available information. The resulting data and tools will be freely available.
The project is funded by Riksbankens jubileumsfond for 2014-2016.
- Yvonne Adesam, Gerlof Bouma, Richard Johansson 2015. Defining the Eukalyptus forest – the Koala treebank of Swedish
- Richard Johansson, Luis Nieto Piña 2015. Combining Relational and Distributional Knowledge for Word Sense Disambiguation