The CoDeRooMor dataset (version 1.0) contains 16 230 lemgrams derived from COCTAILL (a course book corpus) and SweLL-pilot (a learner essay corpus). It is intended to represent vocabulary relevant for learners of Swedish as a second language and, hypothetically, covers the most frequent vocabulary in Swedish. The lemgrams in CoDeRooMor have been manually analysed for roots, prefixes, suffixes, infixes/binding morphemes (sv: fogemorfem) and other morpheme types, e.g. o-är-lig:
- "o" prefix,
- "är" root,
- "lig" suffix
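As a minimal sketch, the analysis of o-är-lig above could be held in memory as an ordered list of labelled morphemes. The `Morpheme` class and its field names are assumptions made for illustration; they are not the dataset's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    form: str  # surface form of the morpheme
    kind: str  # label such as "prefix", "root" or "suffix"


# Hypothetical in-memory representation of the example o-är-lig.
analysis = [
    Morpheme("o", "prefix"),
    Morpheme("är", "root"),
    Morpheme("lig", "suffix"),
]

# Rebuild the hyphen-segmented form and pick out the root(s).
segmented = "-".join(m.form for m in analysis)
roots = [m.form for m in analysis if m.kind == "root"]
print(segmented, roots)
```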
The dataset represents 4 429 unique roots, 259 unique derivational suffixes, 155 unique prefixes, 12 unique binding morphemes (infixes), and a few inflectional morphemes that have been analysed as part of lexicalized forms or similar cases.
Each lemgram has an associated word formation mechanism, such as derivation, compounding or root lexeme.
The morphological annotation scheme follows the principles outlined in the Swedish Academy Grammar (SAG) and SAOL/SO.
Given the "gold" nature of the resource, it can be used for empirical studies as well as for developing linguistically aware algorithms for morpheme segmentation and labeling.
For details about the list, the annotation process and reasoning around it, see the article on CoDeRooMor.
For a short summary, see this blog.
The dataset can be downloaded in CSV or Excel format. Two versions are available, organized either by morpheme or by lemgram.
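The relationship between the two organizations can be sketched as follows: a lemgram-oriented table is read with Python's standard `csv` module and re-indexed by morpheme. The column names (`lemgram`, `morphemes`), the `form:label` encoding and the sample row are assumptions for illustration only; the downloaded files may be laid out differently, so check their headers before adapting this.

```python
import csv
import io
from collections import defaultdict

# Illustrative stand-in for a by-lemgram CSV file. The column names and the
# "form:label" encoding are assumptions, not the real file layout.
SAMPLE = """lemgram,morphemes
oärlig,o:prefix|är:root|lig:suffix
"""


def index_by_morpheme(handle):
    """Map each (form, label) pair to the lemgrams that contain it."""
    index = defaultdict(list)
    for row in csv.DictReader(handle):
        for item in row["morphemes"].split("|"):
            form, label = item.split(":")
            index[(form, label)].append(row["lemgram"])
    return dict(index)


by_morpheme = index_by_morpheme(io.StringIO(SAMPLE))
print(by_morpheme[("är", "root")])
```

With a real file, the `io.StringIO` object would be replaced by a handle opened with `open(path, newline="", encoding="utf-8")`, assuming the file is UTF-8 encoded.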
This work has been supported by a grant from the Swedish Riksbankens Jubileumsfond (Development of lexical and grammatical competences in immigrant Swedish, project P17-0716:1).
To cite this resource: Volodina, Elena, Yousuf Ali Mohammed, and Therese Lindström Tiedemann. 2021. CoDeRooMor: A new dataset for non-inflectional morphology studies of Swedish. In The 23rd Nordic Conference on Computational Linguistics. 2021.