DaLAJ-GED-Superlim 2.0

Dataset for Linguistic Acceptability Judgments (and more), v.2.0
I. IDENTIFYING INFORMATION
Title* Dalaj-ged-superlim v2.0
Subtitle
Created by* Elena Volodina, Yousuf Ali Mohammed, Språkbanken Text -- University of Gothenburg (name.surname@svenska.gu.se)
Publisher(s)* Språkbanken Text -- University of Gothenburg
Link(s) / permanent identifier(s)* https://spraakbanken.gu.se/resurser/superlim
License(s)* CC-BY 4.0
Abstract* Dalaj v2.0 is an extension of Dalaj v1 [4], covering 30 error categories used in the SweLL-gold corpus [3]. Dalaj v2 is prepared for several sentence-level classification tasks, including linguistic acceptability (whether a sentence is grammatically correct or incorrect). The dataset contains ca. 20K sentences written by non-native learners of Swedish, manipulated so that each contains exactly one error; a sentence with several errors is repeated once for each error. Each learner-written sentence is associated with the writer's proficiency level, as assessed at the essay level, and with the writer's mother tongue. Each incorrect (learner-written) sentence is paired with its manually corrected version, which, due to the one-error-per-sentence principle, amounts to ca. 6.5K unique correct sentences, each repeated multiple times. To balance the incorrect sentences with an equivalent number of correct ones, sentences from the coursebook corpus COCTAILL [2] have been extracted, keeping the same distribution across beginner, intermediate and advanced levels as among the incorrect sentences. Each COCTAILL sentence carries information about the (approximate) level of the coursebook in which the text is used for teaching. The dataset is split into training, validation and test sets. The test split has been manually proofread. Note that Dalaj-ged-superlim may differ from other versions of Dalaj v2; see Section III for a description of changes.
Funded by* Vinnova (grants no. 2020-02523, 2021-04165), Språkbanken Text
Cite as Currently: [1]
Related datasets Dalaj v1. Part of the SuperLim collection
II. USAGE
Key applications
Intended task(s)/usage(s) 1. Determine whether a sentence is grammatically correct (the official SuperLim task)
2. Find a text span in need of correction, if there is one
3. Determine the error type
4. Find a text span in need of correction, if there is one, and suggest a correction.
Recommended evaluation measures Krippendorff's Alpha (the official SuperLim measure), F0.5, accuracy
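The three measures can be sketched for binary sentence labels as follows. This is a minimal illustration, not the official SuperLim evaluation script. The assumptions are: gold labels and predictions are treated as two coders for Krippendorff's Alpha (nominal level, no missing values), `incorrect` is taken as the positive class for F0.5, and all function names are hypothetical.

```python
from collections import Counter


def accuracy(gold, pred):
    """Share of sentences whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f_beta(gold, pred, positive="incorrect", beta=0.5):
    """F-beta for one class; beta=0.5 weights precision higher than recall.

    Taking "incorrect" as the positive class is an assumption.
    """
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def krippendorff_alpha_nominal(gold, pred):
    """Nominal Krippendorff's Alpha with gold and pred as two coders."""
    n = 2 * len(gold)                       # total number of label values
    totals = Counter(gold) + Counter(pred)  # how often each label occurs
    # Observed disagreement: each mismatching sentence counts twice
    # in the (symmetric) coincidence matrix.
    observed = 2 * sum(g != p for g, p in zip(gold, pred)) / n
    # Expected disagreement from the pooled label frequencies.
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    # If only one label ever occurs, disagreement is impossible;
    # returning 1.0 in that case is a convention chosen for this sketch.
    return 1.0 if expected == 0 else 1 - observed / expected


gold = ["correct", "incorrect", "incorrect", "correct"]
pred = ["correct", "incorrect", "correct", "correct"]
scores = (accuracy(gold, pred), f_beta(gold, pred),
          krippendorff_alpha_nominal(gold, pred))
```

Unlike accuracy, Alpha corrects for agreement expected by chance, so a classifier that always predicts the majority label scores close to zero rather than close to the majority share.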
Dataset function(s) Training, testing
Recommended split(s) Train, dev, test (provided): 80:10:10. The test set has been manually proofread; train and dev have not.
III. DATA
Primary data* Text
Language* Swedish
Dataset in numbers* train: 35,581 sentences
dev: 4,702 sentences
test: 4,371 sentences
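A quick check of how the split sizes above relate to the recommended 80:10:10 ratio (the variable names are illustrative only):

```python
# Sentence counts per split, as listed in this sheet.
SPLIT_SIZES = {"train": 35581, "dev": 4702, "test": 4371}

total = sum(SPLIT_SIZES.values())  # 44654 sentences in all
# Percentage share of each split, rounded to one decimal.
shares = {name: round(100 * n / total, 1) for name, n in SPLIT_SIZES.items()}
# shares == {"train": 79.7, "dev": 10.5, "test": 9.8}
```

The provided split is therefore approximately, not exactly, 80:10:10.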
Nature of the content* Sentences written by second language learners and corrected by experts + correct sentences from course books
Format* JSONL file with one item per line. Each item contains the following objects: sentence, label (correct or incorrect) and metadata. Metadata include: error span (start and stop; numbering starts at 0; the range is half-open; start=stop denotes an empty span, i.e. a token has been omitted; empty if the sentence is correct); confusion pair (incorrect span and correction; empty if the sentence is correct); error label (empty if the sentence is correct); education level; L1 (native language); data source.
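The half-open span convention can be illustrated with a short sketch. The key names (`sentence`, `label`, `meta`, `error_span`, `start`, `stop`, `confusion_pair`), the use of character offsets, and the sample record itself are assumptions made for illustration; the actual keys and offset unit should be checked against the distributed file.

```python
import json

# Hypothetical record: key names, character-based offsets and the sentence
# are invented for illustration and are not taken from the dataset.
SAMPLE_LINE = json.dumps({
    "sentence": "Jag bor på Göteborg.",
    "label": "incorrect",
    "meta": {
        "error_span": {"start": 8, "stop": 10},
        "confusion_pair": {"incorrect_span": "på", "correction": "i"},
    },
}, ensure_ascii=False)


def read_jsonl(lines):
    """Parse an iterable of JSONL lines (one JSON item per line)."""
    return [json.loads(line) for line in lines if line.strip()]


def error_span_text(item):
    """Return the flagged substring.

    Returns "" when start == stop (a token was omitted) and
    None when the sentence is correct or carries no span.
    """
    span = item.get("meta", {}).get("error_span")
    if item["label"] == "correct" or not span:
        return None
    # Half-open range: start is included, stop is excluded,
    # which matches Python slicing directly.
    return item["sentence"][span["start"]:span["stop"]]


items = read_jsonl([SAMPLE_LINE])
flagged = error_span_text(items[0])
```

With a real file, `read_jsonl` could be given an open file handle, for example `read_jsonl(open(path, encoding="utf-8"))`, where `path` points to one of the split files.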
Data source(s)* SweLL-gold essays
Data selection and filtering* All SweLL-gold sentences are used, except those containing "unintelligible" markup. Sentences with "consequence" (C) labels were partly deleted and partly converted to descriptive error labels. When preparing the Superlim version, further filtering was applied: all sentences containing * (the origin of this token was unclear), @ at the beginning of the sentence (denotes an omitted token) or $ (unintelligible symbol) were deleted. If @ occurred elsewhere in the sentence, the symbol itself was removed but the sentence was preserved. The annotation was adjusted accordingly.
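The Superlim-specific filtering rules described above can be sketched as follows. This is not the script used to build the dataset: the function names are hypothetical, the span adjustment assumes character offsets, and the handling of any whitespace left around a removed @ is not specified in this sheet and is not attempted here.

```python
def filter_sentence(sentence):
    """Apply the filtering rules; return None if the sentence is dropped."""
    # Drop sentences containing * or $, or starting with @.
    if "*" in sentence or "$" in sentence or sentence.lstrip().startswith("@"):
        return None
    # An @ elsewhere is removed, but the sentence itself is kept.
    return sentence.replace("@", "")


def shift_offset(sentence, offset):
    """Map a character offset in the original sentence to the cleaned one.

    Every @ removed before the offset moves it one position to the left.
    Assumes character-based offsets.
    """
    return offset - sentence[:offset].count("@")


original = "Jag bor @här."           # invented example sentence
cleaned = filter_sentence(original)  # "Jag bor här."
new_start = shift_offset(original, original.index("h"))
```

Shifting the span offsets in this way is one reading of "the annotation was adjusted accordingly".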
Data preprocessing* Sentence order has been randomized, so that full essays cannot be restored. Learner metadata was dropped (except mother tongue and proficiency level). Essay metadata was dropped. In Dalaj2-ged, all punctuation marks had spaces added both before and after; these extra spaces are removed in the Superlim version ("detokenization").
Data labeling* Acceptability judgment; error identification; error correction; error tags (30 detailed categories), manually assigned
Annotator characteristics second language experts / linguists
IV. ETHICS AND CAVEATS
Ethical considerations The SweLL-gold corpus is under GDPR restrictions. Randomizing the sentence order and removing metadata eliminates the risk of re-identification and therefore allows the data to be shared freely.
Things to watch out for
V. ABOUT DOCUMENTATION
Data last updated* 20230122
Which changes have been made, compared to the previous version* Extensive changes, see I and III.
Access to previous versions NA
This document created* 20230123, Elena Volodina
This document last updated* 20230208, Aleksandrs Berdicevskis
Where to look for further details forthcoming
Documentation template version* v1.1
VI. OTHER
Related projects SweLL
References [1] Julia Klezl, Yousuf Ali Mohammed, Elena Volodina. (2022). Exploring Linguistic Acceptability in Swedish Learners’ Language. Proceedings of the 11th Workshop on Natural Language Processing for Computer-Assisted Language Learning (NLP4CALL 2022), Belgium. NEALT Proceedings Series 47. [url]
[2] Elena Volodina, Ildikó Pilán, Stian Rødven Eide and Hannes Heidarsson (2014). You get what you annotate: a pedagogically annotated corpus of coursebooks for Swedish as a Second Language. Proceedings of the third workshop on NLP for computer-assisted language learning. NEALT Proceedings Series 22 / Linköping Electronic Conference Proceedings 107: 128–144. [pdf]
[3] Elena Volodina, Lena Granstedt, Arild Matsson, Beáta Megyesi, Ildikó Pilán, Julia Prentice, Dan Rosén, Lisa Rudebeck, Carl-Johan Schenström, Gunlög Sundberg and Mats Wirén (2019). The SweLL Language Learner Corpus: From Design to Annotation. Northern European Journal of Language Technology, Special Issue. [pdf]
[4] Elena Volodina, Yousuf Ali Mohammed and Julia Klezl (2021). DaLAJ - a dataset for linguistic acceptability judgments for Swedish. Proceedings of the 10th NLP4CALL workshop. Linköping University Electronic Press, Vol. 177:3. [pdf] [an extended version on arXiv]
File Size Modified License
dalaj-ged-superlim.zip
an archive with the dataset in JSONL format and the documentation sheet (zip)
1.41 MB 2023-04-03 CC BY 4.0
dalaj-ged-tsv.zip
NB! The tsv package is not an official part of the SuperLim collection. The dataset, however, is identical. (zip)
1.15 MB 2023-05-20 CC BY 4.0
liuep197-11.pdf
Elena Volodina, Yousuf Ali Mohammed, Aleksandrs Berdicevskis, Gerlof Bouma, Joey Öhman. 2023. DaLAJ-GED - a dataset for Grammatical Error Detection tasks on Swedish (pdf)
463.74 KB 2024-01-25 CC BY 4.0

Part of collection

SuperLim 2

Type

  • Corpus
  • Training and evaluation data

Language

Swedish

Size

Created by

  • Volodina, Elena
  • Mohammed, Yousuf Ali

Contact

Språkbanken Text
sb-info@svenska.gu.se