
SweNLI 1.0

A Swedish NLI dataset
I. IDENTIFYING INFORMATION
Title* SweNLI
Subtitle
Created by* Felix Morger (felix.morger@gu.se), Lars Borin, Aleksandrs Berdicevskis (Gothenburg University)
Publisher(s)* Språkbanken Text (sb-info@svenska.gu.se)
Link(s) / permanent identifier(s)* https://spraakbanken.gu.se/en/resources/superlim
License(s)* CC BY 4.0
Abstract* A Swedish NLI dataset. Train and dev are machine-translated from the English MNLI dataset, test is manually translated and adapted from the English Fracas dataset.
Funded by* Vinnova (grants no. 2020-02523, 2021-04165)
Cite as
Related datasets Part of the SuperLim collection. Similar to the SuperGLUE diagnostic dataset.
II. USAGE
Key applications Machine Learning, Inference, Entailment, Evaluation of language models, Diagnostics
Intended task(s)/usage(s) Natural language inference.
Recommended evaluation measures Krippendorff's Alpha (the official SuperLim measure), Accuracy
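The two recommended measures can be sketched in a few lines of Python. This is a minimal illustration, not the official SuperLim scorer: it assumes that Krippendorff's Alpha is computed at the nominal level, with the gold labels and the model predictions treated as two coders and no missing values. The function names and the sample labels are hypothetical.

```python
from collections import Counter


def accuracy(gold, pred):
    """Share of items where the predicted label equals the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def krippendorff_alpha_nominal(gold, pred):
    """Nominal Krippendorff's Alpha for two coders without missing values.

    Assumption: gold labels and predictions are treated as the two coders.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    n = 2 * len(gold)  # total number of pairable values
    disagreements = sum(g != p for g, p in zip(gold, pred))
    # Observed disagreement: each disagreeing item contributes two
    # off-diagonal entries to the coincidence matrix.
    observed = 2 * disagreements / n
    # Expected disagreement from the label frequencies pooled over both coders.
    counts = Counter(gold) + Counter(pred)
    expected = (n * n - sum(c * c for c in counts.values())) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label value occurs")
    return 1.0 - observed / expected


# Hypothetical labels, for illustration only.
gold = ["entailment", "entailment", "neutral", "neutral"]
pred = ["entailment", "entailment", "neutral", "entailment"]
print(accuracy(gold, pred))                    # 0.75
print(krippendorff_alpha_nominal(gold, pred))  # 8/15, roughly 0.533
```

Unlike accuracy, Alpha corrects for the agreement expected from the label distribution alone, so a model that always predicts the most frequent label scores low even when its accuracy looks acceptable.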
Dataset function(s) Training, testing
Recommended split(s) Train, dev, test (provided)
III. DATA
Primary data* Text
Language* Swedish. Train and dev: machine-translated
Dataset in numbers* Train: 392704 items, dev: 9815 items, test: 305 items
Nature of the content* Inference problems, where a relation between a premise and a hypothesis has to be detected: entailment, neutral or contradiction.
Format* JSON Lines, with one item per line. Each item contains an id, a premise (in test, the premise may contain several sentences, but is still represented as a single item), a hypothesis and a label. The dataset is also available as a TSV file with self-explanatory column names. For test, an additional file is provided that allows the items to be matched with the original Fracas items.
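A minimal sketch of reading the JSON Lines format described above. The key names ("id", "premise", "hypothesis", "label") and the string-valued labels are assumptions based on the field description and should be checked against the actual files. The sample sentences are made up for illustration and are not taken from the dataset.

```python
import json

# Assumed key names; verify them against the distributed files.
REQUIRED_KEYS = {"id", "premise", "hypothesis", "label"}


def parse_jsonl(lines):
    """Parse an iterable of JSON Lines strings into a list of dicts."""
    items = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        item = json.loads(line)
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            raise ValueError(f"line {line_number}: missing keys {sorted(missing)}")
        items.append(item)
    return items


# Made-up items in the assumed format.
sample = [
    '{"id": "0", "premise": "En katt sover.", "hypothesis": "Ett djur sover.", "label": "entailment"}',
    '{"id": "1", "premise": "En katt sover.", "hypothesis": "Katten jagar.", "label": "contradiction"}',
]
items = parse_jsonl(sample)
print(len(items), items[0]["label"])
```

An open file object can be passed in the same way, since iterating over it yields one line at a time; opening the file with encoding="utf-8" keeps the Swedish characters intact.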
Data source(s)* Train and dev: see [1]. Machine translated from English to Swedish using OPUS-MT. Test: see [2] and 'Data collection methods'.
Data collection method(s)* Train and dev: see [1]. Test: SweFracas (part of SuperLim 1.0). The original English Fracas [2] was converted to HTML and edited by Bill MacCartney [3], and then automatically translated to Swedish by Peter Ljunglöf and Magdalena Siverbo [4]. The current form of the set was created by Aleksandrs Berdicevskis by merging the Swedish and English versions and removing some of the problems. Finally, Lars Borin went through all the translations, correcting and Swedifying them manually. As a result, many translations are rather liberal and diverge noticeably from the English original.
Data selection and filtering* Train and dev: We keep only the mismatched validation set as a dev set and do not include the matched version. We also do not include the MNLI test sets. Test: 41 problems in the original set did not have a definite answer (different answers were possible depending on the interpretation). They were excluded.
Data preprocessing* Train and dev: see [1]. All extra columns except hypothesis (sentence1) and premise (sentence2) have been removed for this data source. Test: SweFracas used questions (with the answers Ja/Nej/Vet ej/Jo) instead of hypotheses. The questions were semi-automatically converted to hypotheses by Aleksandrs Berdicevskis to fit the train and dev format.
Data labeling* Train and dev: see [1]. Test: Most of the labels map straightforwardly onto the original English labels, with one exception: 108 (No => Neutral).
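The label mapping for the test set could look roughly like the sketch below. Several things here are assumptions rather than documented facts: the English answer strings ("yes", "no", "unknown"), the NLI label strings, the default mapping itself (yes to entailment, no to contradiction, unknown to neutral), and the reading of "108" as a problem identifier. The function and variable names are hypothetical.

```python
# Assumed default mapping from original answers to NLI labels.
ANSWER_TO_LABEL = {
    "yes": "entailment",
    "no": "contradiction",
    "unknown": "neutral",
}

# Assumption: "108" identifies the one problem whose "no" answer
# was relabeled as neutral.
LABEL_OVERRIDES = {"108": "neutral"}


def to_nli_label(problem_id, answer):
    """Map an original answer to an NLI label, applying known exceptions."""
    if problem_id in LABEL_OVERRIDES:
        return LABEL_OVERRIDES[problem_id]
    return ANSWER_TO_LABEL[answer.strip().lower()]


print(to_nli_label("1", "yes"))
print(to_nli_label("108", "no"))
```

Keeping the exceptions in a separate table makes the deviation from the default mapping explicit and easy to audit.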
Annotator characteristics Train and dev: see [1]. Test: PhD in linguistics; native speaker of Swedish
IV. ETHICS AND CAVEATS
Ethical considerations Train and dev: see [1].
Things to watch out for Train and dev: see [1]. Remember that the data were machine-translated. Test: In the original dataset, all examples were classified by the linguistic phenomena they represent. The Swedish translations do not necessarily follow exactly the same classification (most of them probably do, but this has not been checked).
V. ABOUT DOCUMENTATION
Data last updated* 2023-01-25
Which changes have been made, compared to the previous version* The translated MNLI and SweFracas were merged to create a complete dataset.
Access to previous versions
This document created* 2023-01-25, Felix Morger.
This document last updated* 2023-02-08, Aleksandrs Berdicevskis.
Where to look for further details
Documentation template version* v1.1
VI. OTHER
Related projects
References [1] Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.
[2] Robin Cooper, Dick Crouch, Jan Van Eijck, Chris Fox, Johan Van Genabith, Jan Jaspars, Hans Kamp, David Milward, Manfred Pinkal, Massimo Poesio, et al. 1996. Using the framework. Technical report, Technical Report LRE 62-051 D-16, The FraCaS Consortium. ftp://ftp.cogsci.ed.ac.uk/pub/FRACAS/del16.ps.gz
[3] https://nlp.stanford.edu/~wcmac/downloads/fracas.xml
[4] Peter Ljunglöf and Magdalena Siverbo. 2012. A bilingual treebank for the FraCas test suite. In SLTC 2012, page 53. https://gup.ub.gu.se/publication/168965?lang=en
File Size Modified Licence
swenli.zip (an archive with the dataset in JSONL and TSV formats and the documentation sheet) 55.13 MB 2023-03-30 CC BY 4.0

Collection

SuperLim 2

Type

  • Corpus
  • Training and evaluation data

Language

Swedish

Size

Contact

Språkbanken
sb-info@svenska.gu.se