SweParaphrase 2.0

Data citation

Dannélls, Dana (2023). SweParaphrase 2.0 (updated: 2023-03-30). [Data set]. Enriched and distributed by Språkbanken. https://doi.org/10.23695/hxhx-1167

Semantic Textual Similarity reference data (STS Benchmark).

I. IDENTIFYING INFORMATION
Title*	SweParaphrase v2.0
Subtitle	Sentence-level semantic similarity dataset (a subset of the Swedish STS Benchmark).
Created by*	Dana Dannélls (dana.dannells@svenska.gu.se)
Publisher(s)*	Språkbanken Text (sb-info@svenska.gu.se)
Link(s) / permanent identifier(s)*	https://spraakbanken.gu.se/en/resources/sweparaphrase
License(s)*	The text of each dataset has a license of its own, as specified here
Abstract*	SweParaphrase is a sentence similarity test and training set containing sentence pairs and their similarity scores, which range from 0 (no semantic overlap) to 5 (semantic equivalence). The sentences were automatically translated from the English STS-B data and manually corrected by a native speaker of Swedish with a background in linguistics.
Funded by*	Vinnova (grants no. 2020-02523, 2021-04165)
Cite as
Related datasets	Part of the SuperLim collection. Created from the development version of the automatically translated Swedish STS Benchmark, which was translated from the English source. Similarity scores were kept unchanged.

II. USAGE
Key applications	Machine Translation, Question Answering, Information Retrieval, Text classification, Semantic parsing, Evaluation of language models.
Intended task(s)/usage(s)	Given two sentences, determine how similar they are.
Recommended evaluation measures	Krippendorff's alpha (the official SuperLim measure), Pearson or Spearman correlation coefficients
Dataset function(s)	Training, testing and development
Recommended split(s)	Train, dev and test (provided)
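
Two of the recommended evaluation measures, the Pearson and Spearman correlation coefficients, can be sketched in plain Python as follows. This is a minimal illustration, not the official SuperLim evaluation code: the function names are hypothetical, the gold and predicted scores are invented, and Krippendorff's alpha is not covered here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented gold and predicted scores on the 0-5 scale, for illustration only.
gold = [0.0, 1.5, 2.5, 4.0, 5.0]
predicted = [0.4, 1.0, 3.0, 3.5, 4.8]
print(round(pearson(gold, predicted), 3), round(spearman(gold, predicted), 3))
```

Both functions raise ZeroDivisionError if one of the sequences is constant, since the correlation is undefined in that case.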

III. DATA
Primary data*	Text
Language*	Swedish
Dataset in numbers*	8592 sentence pairs; 3 genres; 9 sources.
Nature of the content*	Each pair belongs to one genre (e.g. news, forums or captions) and is linked to a file from source (e.g. headlines, answers-forums, images). The English pairs from which the Swedish sentences were translated are also included.
Format*	JSONL and TSV with the following columns/objects:
	(1) Sentence ID from the automatically translated Swedish dataset;
	(2) Genre from source (captions, news, forum);
	(3) File from source (images, headlines, answers);
	(4) and (5) manually corrected Swedish sentence pairs;
	(6) Similarity score from source (assigned to the English sentence pairs through crowdsourcing on Mechanical Turk).
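
The six-column layout listed above could be read from the TSV file roughly as follows. This is a sketch under stated assumptions: the field names in `SentencePair` are hypothetical (the actual column headers are not given here), it is not known from this sheet whether the file has a header row, and the sample row is an invented placeholder rather than data from the set.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class SentencePair:
    # Field names are hypothetical; they mirror columns (1) to (6) above.
    pair_id: str      # (1) sentence ID
    genre: str        # (2) genre from source
    source_file: str  # (3) file from source
    sentence_1: str   # (4) first Swedish sentence
    sentence_2: str   # (5) second Swedish sentence
    score: float      # (6) similarity score, 0 to 5


def read_pairs(handle):
    """Yield one SentencePair per six-column, tab-separated row."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 6:
            continue  # skip blank or malformed rows
        try:
            score = float(row[5])
        except ValueError:
            continue  # skip a header row, if the file has one
        yield SentencePair(*row[:5], score)


# Invented placeholder row, not taken from the dataset.
sample = "id-0001\tnews\theadlines\tEn mening.\tEn annan mening.\t2.5\n"
pairs = list(read_pairs(io.StringIO(sample)))
print(pairs[0].genre, pairs[0].score)
```

For a real file, `read_pairs` would be given a handle opened with `encoding="utf-8"` and `newline=""` instead of the in-memory sample. The JSONL variant could be handled the same way by calling `json.loads` on each line, once the actual key names are known.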

Data source(s)*	The original STS benchmark comprises 8628 sentence pairs, collected from SemEval 2012 (task 6), 2014 (task 10), 2015 (task 2), 2016 (task 1), 2017 (task 1) and *SEM 2013.
Data collection method(s)*	The original English STS-B dataset, taken from the SemEval shared tasks [2], was automatically translated in 2022 by a master's student in the MLT programme at GU using the Google Translate API [3]. This automatically translated version can be downloaded from Språkbanken Text [4].
Data selection and filtering*	The automatically translated STS-B from 2022 [4] was manually corrected by a graduate student with a background in linguistics.
Data preprocessing*	English sentence pairs were tab-separated. Scores with more than four decimal places were shortened.
Data labeling*	No additional labeling was added. In the English version, each sentence pair is annotated with a score (0-5); the annotation was done by crowdsourcing through Mechanical Turk. In the Swedish version, we kept the scores that were assigned to the source English pairs.
Annotator characteristics	Native speaker of Swedish; fluent non-native speaker of Swedish

IV. ETHICS AND CAVEATS
Ethical considerations
Things to watch out for	The similarity scores are based on the English data and are not necessarily representative of their Swedish counterparts.

V. ABOUT DOCUMENTATION
Data last updated*	2022-08-25, v2.0, Dana Dannélls
Which changes have been made, compared to the previous version*	Train and dev sets have been added
Access to previous versions	Work in progress
This document created*	2023-02-08, Dana Dannélls
This document last updated*	2023-02-03, Aleksandrs Berdicevskis
Where to look for further details	[1], [2], [3], [4]
Documentation template version*	v1.1

VI. OTHER
Related projects	Language models for Swedish authorities, Vinnova (grant no. 2019-02996).
References	[1] Isbister, T. and Sahlgren, M. (2020): Why not simply translate? A first Swedish evaluation benchmark for semantic similarity. Proceedings of the Eighth Swedish Language Technology Conference (SLTC), University of Gothenburg.

Download

File	Size	Modified	Licence
sweparaphrase.zip an archive with the dataset in JSONL and TSV formats and the documentation sheet (zip)	750.9 KB	2023-03-30	CC-BY-4.0
