Swedish analogy test set v1.0

Data citation

Språkbanken (2021). Swedish analogy test set v1.0 (updated: 2021-05-23). [Data set]. Enriched and distributed by Språkbanken. https://doi.org/10.23695/qvy8-2076

Swedish semantic and syntactic similarity: test set

I. IDENTIFYING INFORMATION
Title*	Swedish analogy test set v1.0
Subtitle	Swedish semantic and syntactic similarity test set
Created by*	Tosin Adewumi (tosin.adewumi@ltu.se), ML Group, LTU
Publisher(s)*	Språkbanken Text (sb-info@svenska.gu.se)
Link(s) / permanent identifier(s)*	https://spraakbanken.gu.se/en/resources/analogy
License(s)*	CC BY 4.0
Abstract*	The Swedish analogy test set follows the format of the original Google version. However, it is larger and balanced across the 2 major categories: it has 20,638 samples in total, made up of 10,381 semantic and 10,257 syntactic samples. It is also roughly balanced across the syntactic subsections. There are 5 semantic subsections and 6 syntactic subsections. The dataset was constructed partly from the samples in the English version, with the help of tools dedicated to Swedish translation, and it was proof-read for corrections by two native speakers (with a percentage agreement of 98.93%).
Funded by*	Vinnova (grant no. 2019-02996)
Cite as	[1]
Related datasets	Part of the SuperLim collection (https://spraakbanken.gu.se/en/resources/superlim).

II. USAGE
Key applications	Intrinsic evaluation of Swedish word embeddings
Intended task(s)/usage(s)
Recommended evaluation measures
Dataset function(s)	Testing
Recommended split(s)	Test set only
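
The sheet leaves "Recommended evaluation measures" empty. As one illustration only, a common way to use Google-style analogy sets for intrinsic evaluation is accuracy under the vector-offset method (often called 3CosAdd): for a sample "a b c d", the word closest to b - a + c, excluding a, b and c, counts as correct when it equals d. The sketch below is a minimal version of that idea, not a measure endorsed by the dataset authors. The function name `analogy_accuracy`, the choice to skip out-of-vocabulary samples, and the toy vectors are assumptions made for illustration; the Swedish words are not claimed to be taken from the file.

```python
import numpy as np


def analogy_accuracy(quads, vectors):
    """Accuracy of the vector-offset (3CosAdd) method on (a, b, c, d) samples.

    quads:   iterable of 4-tuples of words, read as "a is to b as c is to d".
    vectors: dict mapping word -> 1-D numpy array (hypothetical embedding table).
    Samples containing an out-of-vocabulary word are skipped (an assumption).
    """
    words = list(vectors)
    matrix = np.stack([vectors[w] for w in words]).astype(float)
    # Unit-normalise rows so that dot products are cosine similarities.
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
    index = {w: i for i, w in enumerate(words)}

    correct = total = 0
    for a, b, c, d in quads:
        if not all(w in index for w in (a, b, c, d)):
            continue
        target = matrix[index[b]] - matrix[index[a]] + matrix[index[c]]
        scores = matrix @ target
        # The three query words may not be returned as the answer.
        scores[[index[a], index[b], index[c]]] = -np.inf
        predicted = words[int(np.argmax(scores))]
        correct += predicted == d
        total += 1
    return correct / total if total else 0.0


# Toy 3-dimensional vectors, invented for this example only.
toy_vectors = {
    "man": np.array([1.0, 0.0, 0.0]),
    "kvinna": np.array([0.0, 1.0, 0.0]),
    "kung": np.array([1.0, 0.0, 1.0]),
    "drottning": np.array([0.0, 1.0, 1.0]),
    "pojke": np.array([1.0, 0.0, -1.0]),
}
print(analogy_accuracy([("man", "kvinna", "kung", "drottning")], toy_vectors))  # prints 1.0
```

With real embeddings, accuracy would typically be reported separately for the semantic and syntactic categories as well as overall, since the test set is balanced across the two.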

III. DATA
Primary data*	Text
Language*	Swedish
Dataset in numbers*	Total of 20,638 samples; 10,381 semantic samples and 10,257 syntactic samples
Nature of the content*	Each sample contains 2 pairs of words, so there are 4 related words per line.
Format*	Plain text file (sv-analogi.txt) in the format of the original Google version, with one sample of 4 words per line.
Data source(s)*	Partly based on the English version by: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. New additions were made using the following online tools: https://bab.la and https://en.wiktionary.org/wiki/
Data collection method(s)*	The set was compiled by translating part of the English version (Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781), with en.wiktionary.org/wiki as an additional data source. Two Swedish native speakers then proof-read the finished version, and the inter-annotator agreement score was calculated.
Data selection and filtering*	Does not apply
Data preprocessing*	Does not apply
Data labeling*	Does not apply
Annotator characteristics	Two Swedish native speakers
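
The format described above can be read with a few lines of code. The sketch below assumes the layout of the original Google analogy file, in which a line starting with ":" opens a subsection and every other non-empty line holds four whitespace-separated words. Whether sv-analogi.txt uses exactly this layout (and UTF-8 encoding) is an assumption that should be checked against the file. The function name `read_analogies`, the sample subsection names, and the sample words are hypothetical.

```python
from collections import defaultdict


def read_analogies(lines):
    """Group four-word samples by subsection.

    lines: any iterable of text lines (an open file or a list of strings).
    Returns a dict mapping subsection name -> list of (a, b, c, d) tuples.
    Assumes Google-style ": name" header lines; samples that appear before
    any header are stored under "unlabelled".
    """
    sections = defaultdict(list)
    current = "unlabelled"
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line[1:].strip()
            continue
        words = line.split()
        if len(words) != 4:
            raise ValueError(f"line {number}: expected 4 words, got {len(words)}")
        sections[current].append(tuple(words))
    return dict(sections)


# Invented sample in the assumed layout, not copied from the dataset.
sample = """\
: exempel-semantisk
man kvinna kung drottning
: exempel-syntaktisk
stor större liten mindre
""".splitlines()

sections = read_analogies(sample)
print({name: len(quads) for name, quads in sections.items()})
```

For the real file, something like `with open("sv-analogi.txt", encoding="utf-8") as f: sections = read_analogies(f)` should work under the same assumptions, and the per-subsection counts can then be compared with the totals reported in this sheet (20,638 samples across 11 subsections).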

IV. ETHICS AND CAVEATS
Ethical considerations
Things to watch out for

V. ABOUT DOCUMENTATION
Data last updated*	2021-05-12
Which changes have been made, compared to the previous version*	Some linguistic errors and typos in the previous version have been corrected by Lars Borin and Aleksandrs Berdicevskis
Access to previous versions	None
This document created*	2021-05-20, Tosin Adewumi
This document last updated*	2021-05-20, Tosin Adewumi
Where to look for further details	[2],[1]
Documentation template version*	v1.0

VI. OTHER
Related projects

References	[1] Adewumi, T. P., Liwicki, F., & Liwicki, M. (2020). Corpora compared: The case of the Swedish Gigaword & Wikipedia corpora. arXiv preprint arXiv:2011.03281. [2] Adewumi, T. P., Liwicki, F., & Liwicki, M. (2020). Exploring Swedish & English fastText embeddings with the Transformer. arXiv preprint arXiv:2007.16007.

Download

File	Size	Modified	Licence
sv-analogi.txt	607.03 KB	2021-05-23	CC-BY-4.0
analogy_documentation_sheet.tsv	3.41 KB	2021-05-23	CC-BY-4.0
