I. IDENTIFYING INFORMATION | |
Title* | SweParaphrase v1.0 |
Subtitle | Sentence-level semantic similarity dataset (a subset of the Swedish STS Benchmark). |
Created by* | Dana Dannélls (dana.dannells@svenska.gu.se) |
Publisher(s)* | Språkbanken Text (sb-info@svenska.gu.se) |
Link(s) / permanent identifier(s)* | https://spraakbanken.gu.se/en/resources/sweparaphrase |
License(s)* | CC BY 4.0 |
Abstract* | SweParaphrase is a subset of the automatically translated Swedish Semantic Textual Similarity dataset (Isbister and Sahlgren, 2020). It consists of 165 manually corrected Swedish sentence pairs, together with the original English sentences and their similarity scores, which range from 0 (no meaning overlap) to 5 (meaning equivalence). The scores were taken from the English data, where they had been assigned through crowdsourcing on Amazon Mechanical Turk. Each sentence pair belongs to one genre (e.g. news, forums or captions). The task is to determine how similar two sentences are. |
Funded by* | Vinnova (grant no. 2020-02523) |
Cite as | Språkbanken Text (2022). SweParaphrase (updated: 2022-03-16). [Data set]. Språkbanken Text. https://doi.org/10.23695/6t6h-ss96 |
Related datasets | Part of the SuperLim collection. Created from the development set of the automatically translated Swedish STS Benchmark: https://github.com/timpal0l/sts-benchmark-swedish. The English source: http://ixa2.si.ehu.eus/stswiki/index.php/STSbenchmark. |
II. USAGE | |
Key applications | Machine translation, question answering, information retrieval, text classification, semantic parsing, evaluation of language models. |
Intended task(s)/usage(s) | Given two sentences, determine how similar they are. |
Recommended evaluation measures | Pearson correlation coefficient; Spearman's rank correlation is a common alternative. |
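A minimal evaluation sketch in Python, assuming predictions have already been produced by some model. The gold-score column index and the presence of a header row are assumptions, not part of the documented format, so inspect the file before relying on them:

```python
import csv
import random

from scipy.stats import pearsonr, spearmanr

# Load gold similarity scores from the dataset file. The column index (here 4)
# and the header row are assumptions -- check the file layout before use.
gold = []
with open("sweparaphrase-dev-165.tsv", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t")
    next(reader)  # skip the header row, if the file has one
    for row in reader:
        gold.append(float(row[4]))

# Placeholder predictions; replace with your model's similarity scores (0-5).
predictions = [random.uniform(0.0, 5.0) for _ in gold]

r, _ = pearsonr(gold, predictions)
rho, _ = spearmanr(gold, predictions)
print(f"Pearson r = {r:.4f}, Spearman rho = {rho:.4f}")
```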
Dataset function(s) | Testing |
Recommended split(s) | Test data only. |
III. DATA | |
Primary data* | Text |
Language* | Swedish |
Dataset in numbers* | 165 sentence pairs; 3 genres; 9 sources. |
Nature of the content* | Each pair belongs to one genre (e.g. news, forums or captions) and is linked to a source file (e.g. headlines, answers-forums, images). The English pairs from which the Swedish sentences were translated are also included. |
Format* | The downloadable 'sweparaphrase-dev-165.tsv' file contains 8 tab-separated columns. |
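For illustration, a hedged sketch of loading the file with pandas; whether the file ships with a header row is not stated in this documentation, so verify before use:

```python
import csv

import pandas as pd

# Read the tab-separated file. QUOTE_NONE is a precaution against stray quote
# characters inside sentences; drop it if the file parses cleanly without it.
# Use header=None as well if the file turns out to have no header row.
df = pd.read_csv("sweparaphrase-dev-165.tsv", sep="\t", quoting=csv.QUOTE_NONE)

print(df.shape)   # expected (165, 8), per the "Dataset in numbers" field above
print(df.head())
```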
Data source(s)* | The original STS Benchmark comprises 8628 sentence pairs, collected from SemEval 2012 (task 6), 2014 (task 10), 2015 (task 2), 2016 (task 1), 2017 (task 1) and *SEM 2013. |
Data collection method(s)* | Isbister and Sahlgren (2020) [1] automatically translated the complete English STS Benchmark (http://ixa2.si.ehu.eus/stswiki/index.php/STSbenchmark#Reference) into Swedish. The original English set was collected from datasets from the SemEval shared tasks. |
Data selection and filtering* | This subset was taken from the automatically translated version of the STS Benchmark. First, we focused only on the development set. Second, we selected only sentences whose translations were deemed accurate. |
Data preprocessing* | English sentence pairs were tab-separated. Large chunks of text appearing after the sentence-final full stop were removed. Scores with more than four decimal places were shortened. |
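As one plausible reading of the score-shortening step (the documentation does not say whether scores were rounded or truncated), a small illustration:

```python
def shorten_score(score: float, places: int = 4) -> float:
    """Truncate a similarity score to at most `places` decimal places.

    Truncation is an assumption; round(score, places) would be the
    rounding variant, if that is what the original pipeline did.
    """
    factor = 10 ** places
    return int(score * factor) / factor

print(shorten_score(3.799999952316284))  # -> 3.7999
```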
Data labeling* | No additional labeling was added. In the English version, each sentence pair is annotated with a score (0–5); this annotation was done through crowdsourcing on Amazon Mechanical Turk, and the scores were assigned to the source English pairs. |
Annotator characteristics | Native speaker of Swedish; fluent non-native speaker of Swedish. |
IV. ETHICS AND CAVEATS | |
Ethical considerations | |
Things to watch out for | The similarity scores are based on the English data and are not necessarily representative of their Swedish counterparts. |
V. ABOUT DOCUMENTATION | |
Data last updated* | 2021-05-31, v1.0 |
Which changes have been made, compared to the previous version* | This is the first official version. |
Access to previous versions | |
This document created* | 2021-05-31, Dana Dannélls |
This document last updated* | 2021-06-07, Dana Dannélls |
Where to look for further details | [1], [2], [3], [4] |
Documentation template version* | v1.0 |
VI. OTHER | |
Related projects | Language models for Swedish authorities, Vinnova (grant no. 2019-02996) |
References | [1] Isbister, T. and Sahlgren, M. (2020): Why not simply translate? A first Swedish evaluation benchmark for semantic similarity. Proceedings of the Eighth Swedish Language Technology Conference (SLTC), University of Gothenburg. https://gubox.box.com/v/SLTC-2020-paper-15. The automatically translated dataset: https://svn.spraakbanken.gu.se/sb-arkiv/pub/sweparaphrase/stsb-mt-sv.zip
[2] Adesam, Y., Berdicevskis, A. and Morger, F. (2020): SwedishGLUE – Towards a Swedish Test Set for Evaluating Natural Language Understanding Models. University of Gothenburg. https://gupea.ub.gu.se/bitstream/2077/67179/1/gupea_2077_67179_1.pdf
[3] Wang, A., Singh, A., Michael, J., Hill, F., Levy, O. and Bowman, S. R. (2018): GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461. https://arxiv.org/pdf/1804.07461.pdf
[4] Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I. and Specia, L. (2017): SemEval-2017 Task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). https://www.aclweb.org/anthology/S17-2001.pdf |
Download
File | Size | Modified | Licence |
---|---|---|---|
sweparaphrase-dev-165.tsv | 28.75 KB | 2022-03-16 | CC BY 4.0 (attribution) |
sweparaphrase_documentation.tsv | 5.16 KB | 2021-09-03 | CC BY 4.0 (attribution) |