CoDeRooMor, v.01

Standard reference

Elena Volodina, Yousuf Ali Mohammed, Therese Lindström Tiedemann (2021): CoDeRooMor: A new dataset for non-inflectional morphology studies of Swedish. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), 31 May–2 June 2021, Reykjavik, Iceland (Online). Simon Dobnik, Lilja Øvrelid (Editors)

Data citation

Volodina, Elena, Ali Mohammed, Yousuf, & Lindström Tiedemann, Therese (2021). CoDeRooMor, v.01 (updated: 2021-04-13). [Data set]. Enriched and distributed by Språkbanken. https://doi.org/10.23695/0t3q-jw74

Morphological dataset (word-building morphology), Swedish L2 profiles project

The CoDeRooMor dataset (version 1.0) contains 16 230 lemgrams generated from COCTAILL (a course book corpus) and SweLL-pilot (a learner essay corpus). The lemgrams represent vocabulary relevant for learners of Swedish as a second language and are assumed to cover the most frequent vocabulary in Swedish. They have been manually analysed for roots, prefixes, suffixes, infixes/binding morphemes (Swedish: fogemorfem) and other morpheme types, e.g. o-är-lig:

"o" prefix,
"är" root ,
"lig" suffix

The dataset represents 4 429 unique roots, 259 unique derivational suffixes, 155 unique prefixes, 12 unique binding morphemes (infixes), and a few inflectional morphemes that have been analysed as part of lexicalized forms or similar.

Each lemgram has an associated word formation mechanism, such as derivation, compounding, or root lexeme.

The morphological annotation scheme follows the principles outlined in the Swedish Academy Grammar (SAG) and SAOL/SO.

Given the "gold" nature of the resource, it can be used for empirical studies as well as for developing linguistically aware algorithms for morpheme segmentation and labeling.

For details about the list, the annotation process and reasoning around it, see the article on CoDeRooMor.

For a short summary, see this blog.

The dataset can be downloaded in CSV or Excel file format. Two versions are available: organized either by morpheme or by lemgram.
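
Since the column layout is not described on this page, a reasonable first step after downloading a CSV file is to look at its header and first few rows. The sketch below uses only the Python standard library. The function name `peek_csv` is hypothetical, and the comma delimiter and UTF-8 encoding are assumptions that may need adjusting for the actual files.

```python
import csv
from itertools import islice
from typing import List, Tuple


def peek_csv(
    path: str,
    delimiter: str = ",",     # assumption: adjust if the file uses ";" or tabs
    encoding: str = "utf-8",  # assumption: adjust if decoding fails
    n: int = 5,
) -> Tuple[List[str], List[List[str]]]:
    """Return the header row and up to n data rows of a CSV file."""
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader, [])       # empty list if the file is empty
        rows = list(islice(reader, n))  # read at most n further rows
    return header, rows


if __name__ == "__main__":
    # File name taken from the download list on this page; it must be
    # present in the working directory for this to run.
    header, rows = peek_csv("CodeRoomor_v01_lemgramView.csv")
    print(header)
    for row in rows:
        print(row)
```

Inspecting the header first avoids hard-coding column names that this description does not document.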

This work has been supported by a grant from the Swedish Riksbankens Jubileumsfond (Development of lexical and grammatical competences in immigrant Swedish, project P17-0716:1).

To cite this resource:
Volodina, Elena, Yousuf Ali Mohammed, and Therese Lindström Tiedemann (2021) CoDeRooMor: A new dataset for non-inflectional morphology studies of Swedish. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa). NEALT Proceedings Series, no. 45, Northern European Association for Language Technology (NEALT), Linköping, pp. 178-189, Nordic Conference on Computational Linguistics, Reykjavík, Iceland, 31/05/2021. https://www.aclweb.org/anthology/2021.nodalida-main.18.pdf

Accessible through

Access	Platform	Licence
https://spraakbanken.gu.se/larkalabb/svlp		CC-BY-4.0

Download

File	Size	Modified	Licence
CodeRoomor_v01_lemgramView.csv	1.96 MB	2021-04-13	CC-BY-4.0
CodeRoomor_v01_morphemeView.csv	856.29 KB	2021-04-13	CC-BY-4.0
CodeRoomor_v01_lemgramView.xlsx	1.72 MB	2021-04-13	CC-BY-4.0
CodeRoomor_v01_morphemeView.xlsx	699.46 KB	2021-04-13	CC-BY-4.0
