Språkbanken Text is a department within Språkbanken.

CoDeRooMor, v.01

Citation

Språkbanken Text (2021). CoDeRooMor, v.01 (updated: 2021-04-13). [Data set]. Språkbanken Text. https://doi.org/10.23695/0t3q-jw74
Dataset for morphology studies (word-formation morphology), the Swedish L2 Profile project

The CoDeRooMor dataset (version 1.0) contains 16 230 lemgrams drawn from COCTAILL (a course book corpus) and SweLL-pilot (a learner essay corpus). It is meant to represent vocabulary relevant for learners of Swedish as a second language and is assumed to cover the most frequent vocabulary in Swedish. The lemgrams in CoDeRooMor have been manually analysed for roots, prefixes, suffixes, infixes/binding morphemes (sv: fogemorfem) and other morpheme types, e.g. o-är-lig:

  • "o" prefix,
  • "är" root,
  • "lig" suffix
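
The analysis above can be pictured as an ordered list of labelled morphemes. The following is a minimal sketch only: the `Morpheme` class, the `parse_segmentation` helper and the hyphen-separated input are illustrative assumptions based on the notation used in this description, not the actual column layout of the downloadable files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    form: str  # surface form, e.g. "är"
    kind: str  # morpheme type, e.g. "prefix", "root", "suffix"


def parse_segmentation(segmented: str, kinds: list[str]) -> list[Morpheme]:
    """Pair each hyphen-separated morpheme with its label (hypothetical helper)."""
    forms = segmented.split("-")
    if len(forms) != len(kinds):
        raise ValueError("need exactly one label per morpheme")
    return [Morpheme(form, kind) for form, kind in zip(forms, kinds)]


# The example from the description: o-är-lig
analysis = parse_segmentation("o-är-lig", ["prefix", "root", "suffix"])

# Joining the forms gives back the unsegmented word.
word = "".join(m.form for m in analysis)
```

Keeping the morphemes in order makes it possible to recover the original word by concatenation, while the labels can be counted or filtered separately.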

The dataset represents 4 429 unique roots, 259 unique derivational suffixes, 155 unique prefixes, 12 unique binding morphemes (infixes), and a few inflectional morphemes that have been analysed as part of lexicalized forms or similar.

Each lemgram has an associated word formation mechanism, such as derivation, compounding, or root lexeme.

The morphological annotation scheme follows the principles outlined in the Swedish Academy Grammar (SAG) and SAOL/SO.

Given the "gold" nature of the resource, it can be used for empirical studies as well as for developing linguistically aware algorithms for morpheme segmentation and labeling.
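
As an illustration of how gold segmentations might be used when developing a segmenter, the sketch below compares a predicted segmentation with a gold one using boundary precision and recall. This is one common evaluation approach, not a method described for CoDeRooMor itself. The function names and the hyphen-separated notation are assumptions, and the sketch does not handle words that contain a literal hyphen.

```python
def boundaries(segmented: str) -> set[int]:
    """Character offsets of morpheme boundaries in a hyphen-segmented form."""
    offsets: set[int] = set()
    position = 0
    # Every morpheme except the last one is followed by a boundary.
    for morpheme in segmented.split("-")[:-1]:
        position += len(morpheme)
        offsets.add(position)
    return offsets


def boundary_scores(gold: str, predicted: str) -> tuple[float, float]:
    """Return (precision, recall) of predicted boundaries against gold ones."""
    gold_set, predicted_set = boundaries(gold), boundaries(predicted)
    hits = len(gold_set & predicted_set)
    # With no boundaries on one side, the corresponding score is set to 1.0.
    precision = hits / len(predicted_set) if predicted_set else 1.0
    recall = hits / len(gold_set) if gold_set else 1.0
    return precision, recall


# Gold analysis from the description versus a hypothetical system output
# that misses the prefix boundary.
precision, recall = boundary_scores("o-är-lig", "oär-lig")
```

Here the single predicted boundary is correct, but only one of the two gold boundaries is found.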

For details about the list, the annotation process and reasoning around it, see the article on CoDeRooMor.

For a short summary, see this blog.

The dataset can be downloaded in CSV or Excel format. Two versions are available: organized either by morpheme or by lemgram.
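
A minimal sketch for inspecting one of the CSV views with the Python standard library follows. The file name is taken from the download list. The comma delimiter and UTF-8 encoding are assumptions that may need adjusting, and no column names are assumed: the header is read from the file itself. `read_view` is a hypothetical helper name.

```python
import csv
from pathlib import Path


def read_view(path, delimiter=","):
    """Return (header, rows) from one CSV view; delimiter and encoding are assumed."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader)
        rows = [row for row in reader if row]  # skip completely empty lines
    return header, rows


if __name__ == "__main__":
    path = Path("CodeRoomor_v01_lemgramView.csv")
    if path.exists():
        header, rows = read_view(path)
        print(header)             # column names as they appear in the file
        print(len(rows), "rows")  # number of data rows
```

Printing the header first is a simple way to check the actual column layout before writing any code that depends on it.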

This work has been supported by a grant from the Swedish Riksbankens Jubileumsfond (Development of lexical and grammatical competences in immigrant Swedish, project P17-0716:1).

To cite this resource:
Volodina, Elena, Yousuf Ali Mohammed, and Therese Lindström Tiedemann (2021). CoDeRooMor: A new dataset for non-inflectional morphology studies of Swedish. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa). NEALT Proceedings Series, no. 45, Northern European Association for Language Technology (NEALT), Linköping, pp. 178-189, Nordic Conference on Computational Linguistics, Reykjavík, Iceland, 31/05/2021. https://www.aclweb.org/anthology/2021.nodalida-main.18.pdf

Available via

Download

File                               Size        Modified     License
CodeRoomor_v01_lemgramView.csv     1.96 MB     2021-04-13   CC BY 4.0
CodeRoomor_v01_morphemeView.csv    856.29 KB   2021-04-13   CC BY 4.0
CodeRoomor_v01_lemgramView.xlsx    1.72 MB     2021-04-13   CC BY 4.0
CodeRoomor_v01_morphemeView.xlsx   699.46 KB   2021-04-13   CC BY 4.0

Type

  • Lexicon
  • Training and evaluation data

Language

Swedish

Size

Other: 4 986
Words: 16 230

Updated

2021-04-13

Contact

Språkbanken Text
sb-info@svenska.gu.se