CoDeRooMor, v.01

Standard reference

Elena Volodina, Yousuf Ali Mohammed, Therese Lindström Tiedemann (2021): CoDeRooMor: A new dataset for non-inflectional morphology studies of Swedish, in Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), 31 May–2 June 2021, Reykjavik, Iceland (online). Editors: Simon Dobnik, Lilja Øvrelid.

Data citation

Volodina, Elena, Ali Mohammed, Yousuf, & Lindström Tiedemann, Therese (2021). CoDeRooMor, v.01 (updated: 2021-04-13). [Data set]. Processed and distributed by Språkbanken. https://doi.org/10.23695/0t3q-jw74

Additional ways to cite the dataset.

Dataset for morphology studies (word-formation morphology), the Swedish L2 Profile project

The CoDeRooMor dataset (version 1.0) contains 16 230 lemgrams generated from COCTAILL (a course book corpus) and SweLL-pilot (a learner essay corpus). The lemgrams represent vocabulary relevant for learners of Swedish as a second language and are hypothesized to cover the most frequent vocabulary in Swedish. They have been manually analysed for roots, prefixes, suffixes, infixes/binding morphemes (Swedish: fogemorfem) and other morpheme types, e.g. o-är-lig:

"o" prefix,
"är" root,
"lig" suffix

The dataset represents 4 429 unique roots, 259 unique derivational suffixes, 155 unique prefixes, 12 unique binding morphemes (infixes), and a few inflectional morphemes that have been analysed as part of lexicalized forms or similar.
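
As an illustration of the kind of analysis described above, the o-är-lig example can be held as an ordered list of labelled morphemes. This is only a minimal sketch: the `Morpheme` type and the helper functions are hypothetical names introduced here, not part of the dataset or its file format, and simple concatenation happens to reproduce the word for this example but is not claimed to hold for every entry.

```python
from typing import List, NamedTuple


class Morpheme(NamedTuple):
    """One labelled segment of a word (hypothetical structure for illustration)."""
    form: str
    kind: str  # e.g. "prefix", "root", "suffix"


# The segmentation given in the description: o-är-lig.
segments: List[Morpheme] = [
    Morpheme("o", "prefix"),
    Morpheme("är", "root"),
    Morpheme("lig", "suffix"),
]


def surface(parts: List[Morpheme]) -> str:
    """Concatenate the morpheme forms into a single string."""
    return "".join(part.form for part in parts)


def hyphenated(parts: List[Morpheme]) -> str:
    """Join the morpheme forms with hyphens, as in the description."""
    return "-".join(part.form for part in parts)


if __name__ == "__main__":
    print(hyphenated(segments))
    print([part.kind for part in segments])
```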

Each lemgram has an associated word formation mechanism, such as derivation, compounding, or root lexeme.

The morphological annotation scheme follows the principles outlined in the Swedish Academy Grammar (SAG) and SAOL/SO.

Because the resource is manually annotated ("gold" standard), it can be used for empirical studies as well as for developing linguistically aware algorithms for morpheme segmentation and labeling.

For details about the list, the annotation process and reasoning around it, see the article on CoDeRooMor.

For a short summary, see this blog.

The dataset can be downloaded in CSV or Excel format. Two versions are available, organized either by morpheme or by lemgram.
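
A downloaded CSV file could be inspected along the following lines. This is a hedged sketch rather than an official loader: the column names, the delimiter and the encoding are not documented on this page, so the code assumes UTF-8 and a comma delimiter (both adjustable) and only reports whatever header it finds. The function name `load_rows` is hypothetical; the file name is the lemgram view from the download list.

```python
import csv
from pathlib import Path
from typing import Dict, List, Tuple


def load_rows(
    path: str, delimiter: str = ",", encoding: str = "utf-8"
) -> Tuple[List[str], List[Dict[str, str]]]:
    """Read a delimited file and return (column names, rows as dicts).

    The delimiter and encoding defaults are assumptions, not documented facts.
    """
    with Path(path).open(newline="", encoding=encoding) as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        rows = list(reader)
        header = list(reader.fieldnames or [])
    return header, rows


if __name__ == "__main__":
    # File name taken from the download list; adjust the path as needed.
    csv_path = Path("CodeRoomor_v01_lemgramView.csv")
    if csv_path.exists():
        header, rows = load_rows(str(csv_path))
        print("columns:", header)
        print("rows:", len(rows))
    else:
        print(f"{csv_path} not found; download it first.")
```

If the header comes back as a single long column name, the file probably uses a different delimiter (for example a tab or semicolon), which can be passed as `delimiter`.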

This work has been supported by a grant from the Swedish Riksbankens Jubileumsfond (Development of lexical and grammatical competences in immigrant Swedish, project P17-0716:1).

To cite this resource:
Volodina, Elena, Yousuf Ali Mohammed, and Therese Lindström Tiedemann (2021). CoDeRooMor: A new dataset for non-inflectional morphology studies of Swedish. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa). NEALT Proceedings Series, no. 45, Northern European Association for Language Technology (NEALT), Linköping, pp. 178-189. Nordic Conference on Computational Linguistics, Reykjavík, Iceland, 31/05/2021. https://www.aclweb.org/anthology/2021.nodalida-main.18.pdf

Available via

Access	Platform	License
https://spraakbanken.gu.se/larkalabb/svlp		CC-BY-4.0

Download

File	Size	Modified	License
CodeRoomor_v01_lemgramView.csv	1.96 MB	2021-04-13	CC-BY-4.0
CodeRoomor_v01_morphemeView.csv	856.29 KB	2021-04-13	CC-BY-4.0
CodeRoomor_v01_lemgramView.xlsx	1.72 MB	2021-04-13	CC-BY-4.0
CodeRoomor_v01_morphemeView.xlsx	699.46 KB	2021-04-13	CC-BY-4.0
