Språkbanken Text is a department within Språkbanken.

CoDeRooMor, v.01

Citation Information

Språkbanken Text (2021). CoDeRooMor, v.01 (updated: 2021-04-13). [Data set]. Språkbanken Text. https://doi.org/10.23695/0t3q-jw74
Morphological dataset (word-building morphology), Swedish L2 profiles project

The CoDeRooMor dataset (version 1.0) contains 16 230 lemgrams drawn from COCTAILL (a course book corpus) and SweLL-pilot (a learner essay corpus). The selection is meant to represent vocabulary relevant to learners of Swedish as a second language, and is assumed to cover most of the high-frequency vocabulary of Swedish. The lemgrams in CoDeRooMor have been manually analysed for roots, prefixes, suffixes, infixes/binding morphemes (sv: fogemorfem) and other morpheme types, e.g. o-är-lig:

  • "o" prefix,
  • "är" root,
  • "lig" suffix
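
As an illustration only, an analysis like the one above could be held in memory as a list of labelled morphemes. The class and function names below (`Morpheme`, `segmented`, `surface`, `forms_of_kind`) are hypothetical and do not reflect the column layout of the dataset files; the values are typed in by hand from the o-är-lig example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    form: str  # the morpheme as written, e.g. "lig"
    kind: str  # its label, e.g. "prefix", "root", "suffix"


# Hand-written from the o-är-lig example; NOT read from the dataset files.
OARLIG = [
    Morpheme("o", "prefix"),
    Morpheme("är", "root"),
    Morpheme("lig", "suffix"),
]


def segmented(morphemes):
    """Join morpheme forms with hyphens, as in the notation used above."""
    return "-".join(m.form for m in morphemes)


def surface(morphemes):
    """Concatenate morpheme forms into the unsegmented word."""
    return "".join(m.form for m in morphemes)


def forms_of_kind(morphemes, kind):
    """Return the forms of all morphemes carrying the given label."""
    return [m.form for m in morphemes if m.kind == kind]


print(segmented(OARLIG))              # o-är-lig
print(surface(OARLIG))                # oärlig
print(forms_of_kind(OARLIG, "root"))  # ['är']
```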

The dataset represents 4 429 unique roots, 259 unique derivational suffixes, 155 unique prefixes, 12 unique binding morphemes (infixes), and a few inflectional morphemes that have been analysed as part of lexicalised forms or similar.

Each lemgram has an associated word formation mechanism, such as derivation, compounding, or root lexeme.

The morphological annotation scheme follows the principles outlined in the Swedish Academy Grammar (SAG) and SAOL/SO.

Because the resource is a manually annotated "gold" standard, it can be used for empirical studies as well as for developing linguistically aware algorithms for morpheme segmentation and labelling.

For details about the list, the annotation process and reasoning around it, see the article on CoDeRooMor.

For a short summary, see this blog post.

The dataset can be downloaded in CSV or Excel format. Two versions are available: one organized by morpheme and one organized by lemgram.
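
Since this page does not list the column names of the files, a reasonable first step after downloading is to inspect the header and the first few rows. The following is a minimal sketch using only the Python standard library. The helper name `peek_csv` is hypothetical, the sample content is invented stand-in data, and the comma delimiter and UTF-8 encoding are assumptions that should be checked against the real files.

```python
import csv
import io


def peek_csv(stream, n_rows=3):
    """Return (header, first n_rows data rows) from an open text stream."""
    reader = csv.reader(stream)  # assumes comma-separated values
    header = next(reader, [])
    rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows


# Stand-in content for demonstration; these column names are invented.
sample = io.StringIO("lemgram,analysis\noärlig,o-är-lig\n")
header, rows = peek_csv(sample)
print(header)
print(rows)

# With a downloaded file (encoding assumed to be UTF-8):
# with open("CodeRoomor_v01_lemgramView.csv", encoding="utf-8", newline="") as f:
#     header, rows = peek_csv(f)
```

If the header comes back as a single long string, the file most likely uses a different delimiter, which can be passed to `csv.reader` through its `delimiter` argument.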

This work has been supported by a grant from the Swedish Riksbankens Jubileumsfond (Development of lexical and grammatical competences in immigrant Swedish, project P17-0716:1).

To cite this resource:
Volodina, Elena, Yousuf Ali Mohammed, and Therese Lindström Tiedemann (2021). CoDeRooMor: A new dataset for non-inflectional morphology studies of Swedish. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), Reykjavík, Iceland, 31/05/2021, pp. 178-189. NEALT Proceedings Series, no. 45. Northern European Association for Language Technology (NEALT), Linköping. https://www.aclweb.org/anthology/2021.nodalida-main.18.pdf

File                                Size        Modified     Licence
CodeRoomor_v01_lemgramView.csv      1.96 MB     2021-04-13   CC BY 4.0
CodeRoomor_v01_morphemeView.csv     856.29 KB   2021-04-13   CC BY 4.0
CodeRoomor_v01_lemgramView.xlsx     1.72 MB     2021-04-13   CC BY 4.0
CodeRoomor_v01_morphemeView.xlsx    699.46 KB   2021-04-13   CC BY 4.0

Type

  • Lexicon
  • Training and evaluation data

Language

Swedish

Size

Other: 4,986
Words: 16,230

Updated

2021-04-13

Contact

Språkbanken Text
sb-info@svenska.gu.se