SweLL-gold

Standard reference

Elena Volodina, Lena Granstedt, Arild Matsson, Beáta Megyesi, Ildikó Pilán, Julia Prentice, Dan Rosén, Lisa Rudebeck, Carl-Johan Schenström, Gunlög Sundberg, Mats Wirén (2019): The SweLL Language Learner Corpus: From Design to Annotation, in Northern European Journal of Language Technology, volume 6, pages 67-104

Data citation

Volodina, Elena, Granstedt, Lena, Megyesi, Beáta, Prentice, Julia, Rudebeck, Lisa, Sundberg, Gunlög, & Wirén, Mats. SweLL-gold [Data set]. Processed and distributed by Språkbanken. https://doi.org/10.23695/2k47-y432

Additional ways to cite the dataset.

Essays written by adult learners of Swedish, manually pseudonymized and annotated with error categories. The corpus contains both the original text and a normalized version of each essay. Collection period: 2017-2020.

The SweLL-gold corpus consists of essays written by adult learners of Swedish. It was collected in 2017-2020 within the SweLL project and contains 502 essays that have been pseudonymized, normalized, and annotated for corrections.

Two SweLL-gold versions -- original and corrected

The corpus comes in two parallel versions: the original learner-written texts (SweLL-gold-original) and corrected versions of these (SweLL-gold-target). Because of the introduced corrections, the statistics differ slightly between the two versions, so they are offered as separate downloads for SweLL-gold-original and SweLL-gold-target.

To get access to the full SweLL-gold texts, fill in an Application for Access (see under Links).

Annotation

Essays are manually pseudonymized, normalized, and correction-annotated for orthographic, lexical, morphological, syntactic, and punctuation errors according to the guidelines available at [1]. Essays are also linguistically annotated (POS tagging, lemmatization, dependency annotation) with Sparv. Personal learner metadata is also available.

Caveats

Data collection is limited to a small geographical area and a short period of time. Although several language backgrounds are represented, the corpus is very unbalanced in this respect and is consequently not well suited for native language identification tasks.

Intended use

The corpus is primarily intended for Second Language Acquisition studies and for the development of Grammatical Error Correction and automatic pseudonymization systems.

References

[1] Elena Volodina, Lena Granstedt, Arild Matsson, Beáta Megyesi, Ildikó Pilán, Julia Prentice, Dan Rosén, Lisa Rudebeck, Carl-Johan Schenström, Gunlög Sundberg, Mats Wirén (2019): The SweLL Language Learner Corpus: From Design to Annotation, in Northern European Journal of Language Technology, volume 6, pages 67-104 (paper describing the corpus)
[2] Lisa Rudebeck, Gunlög Sundberg (2021): SweLL correction annotation guidelines (correction annotation guidelines)
[3] Elena Volodina (2021). SweLL-gold metadata: https://spraakbanken.github.io/swell-release-v1/Metadata-SweLL (readme)

Available via

Access	Platform	License
https://sunet.artologik.net/gu/swell		CLARIN-ID, -PRIV, -NORED, -BY (https://www.kielipankki.fi/support/clarin-eula/#res)
https://spraakbanken.gu.se/korp/?mode=default#?corpus=swellv1-original		CLARIN-ID, -PRIV, -NORED, -BY (https://www.kielipankki.fi/support/clarin-eula/#res)
https://spraakbanken.gu.se/korp/?mode=default#?corpus=swellv1-target		CLARIN-ID, -PRIV, -NORED, -BY (https://www.kielipankki.fi/support/clarin-eula/#res)

Download

File	Size	Modified	License
stats_SWELLV1-ORIGINAL.txt.zip (CSV; tokens: 147842; sentences: 7807)	147.52 KB	2025-04-22	CC-BY-4.0
stats_SWELLV1-TARGET.txt.zip (CSV; tokens: 151851; sentences: 8137)	132.13 KB	2025-04-22	CC-BY-4.0
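
As a rough illustration of how the two versions differ in size, the token and sentence counts listed above can be compared directly. The sketch below uses only the figures from this page; the helper names (`summarize`, `difference`) are hypothetical, and the per-essay average assumes that each of the 502 essays appears in both versions, which follows from the parallel design but is not stated as a count per version.

```python
# Counts copied from the download table on this page.
STATS = {
    "SweLL-gold-original": {"tokens": 147842, "sentences": 7807},
    "SweLL-gold-target": {"tokens": 151851, "sentences": 8137},
}

# Essay count from the corpus description; assumed to hold for both versions.
ESSAYS = 502


def summarize(counts, essays=ESSAYS):
    """Return simple averages derived from token and sentence counts."""
    return {
        "tokens_per_sentence": counts["tokens"] / counts["sentences"],
        "tokens_per_essay": counts["tokens"] / essays,
    }


def difference(original, target):
    """Return target minus original for each count."""
    return {key: target[key] - original[key] for key in original}


if __name__ == "__main__":
    for name, counts in STATS.items():
        summary = summarize(counts)
        print(
            f"{name}: {summary['tokens_per_sentence']:.2f} tokens/sentence, "
            f"{summary['tokens_per_essay']:.1f} tokens/essay"
        )
    delta = difference(STATS["SweLL-gold-original"], STATS["SweLL-gold-target"])
    print(f"target - original: {delta}")
```

On these figures the corrected version is larger in both tokens and sentences, which is consistent with the note that corrections change the statistics.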
