Swedish semantic and syntactic similarity: test set
I. IDENTIFYING INFORMATION | |
Title* | Swedish analogy test set v1.0 |
Subtitle | Swedish semantic and syntactic similarity test set |
Created by* | Tosin Adewumi (tosin.adewumi@ltu.se), ML Group, LTU |
Publisher(s)* | Språkbanken Text (sb-info@svenska.gu.se) |
Link(s) / permanent identifier(s)* | https://spraakbanken.gu.se/en/resources/analogy |
License(s)* | CC BY 4.0 |
Abstract* | The Swedish analogy test set follows the format of the original Google analogy test set. However, it is larger and balanced across the 2 major categories: it has a total of 20,638 samples, made up of 10,381 semantic and 10,257 syntactic samples. There are 5 semantic subsections and 6 syntactic subsections, and the set is also roughly balanced across the syntactic subsections. The dataset was constructed partly from the samples in the English version, with the help of tools dedicated to Swedish translation, and it was proofread by two native speakers (with a percentage agreement of 98.93%). |
Funded by* | Vinnova (grant no. 2019-02996) |
Cite as | [1] |
Related datasets | Part of the SuperLim collection (https://spraakbanken.gu.se/en/resources/superlim). |
II. USAGE | |
Key applications | Intrinsic evaluation of Swedish word embeddings |
Intended task(s)/usage(s) | |
Recommended evaluation measures | |
Dataset function(s) | Testing |
Recommended split(s) | Test set only |
III. DATA | |
Primary data* | Text |
Language* | Swedish |
Dataset in numbers* | Total of 20,638 samples; 10,381 semantic samples and 10,257 syntactic samples |
Nature of the content* | Each sample contains 2 pairs of words. Hence, there are 4 similar words per line. |
Format* | One sample per line. Each sample contains 2 pairs of words, so there are 4 words per line. |
Data source(s)* | Partly based on the English version by: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. New additions were made using the following online tools: https://bab.la and https://en.wiktionary.org/wiki/ |
Data collection method(s)* | Part of the English version (Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.) was translated into Swedish, and additional samples were compiled using en.wiktionary.org/wiki. Two Swedish native speakers then proofread the finished version, and the inter-annotator agreement score was calculated. |
Data selection and filtering* | Does not apply |
Data preprocessing* | Does not apply |
Data labeling* | Does not apply |
Annotator characteristics | Two Swedish native speakers |
IV. ETHICS AND CAVEATS | |
Ethical considerations | |
Things to watch out for | |
V. ABOUT DOCUMENTATION | |
Data last updated* | 2021-05-12 |
Which changes have been made, compared to the previous version* | Some linguistic errors and typos in the previous version have been corrected by Lars Borin and Aleksandrs Berdicevskis |
Access to previous versions | None |
This document created* | 2021-05-20, Tosin Adewumi |
This document last updated* | 2021-05-20, Tosin Adewumi |
Where to look for further details | [2],[1] |
Documentation template version* | v1.0 |
VI. OTHER | |
Related projects | |
References | [1] Adewumi, T. P., Liwicki, F., & Liwicki, M. (2020). Corpora compared: The case of the Swedish Gigaword & Wikipedia corpora. arXiv preprint arXiv:2011.03281. [2] Adewumi, T. P., Liwicki, F., & Liwicki, M. (2020). Exploring Swedish & English fastText embeddings with the Transformer. arXiv preprint arXiv:2007.16007. |
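
As an illustration of the key application named above (intrinsic evaluation of Swedish word embeddings), here is a minimal Python sketch of how a file in this format might be read and scored. It is not the authors' evaluation code. It assumes that subsections are introduced by lines starting with `:` (as in the original Google file) and that each remaining line holds 4 whitespace-separated words. The scoring rule is the common vector-offset (3CosAdd) method, chosen here for illustration. The function names, the `familj` subsection, the sample line, and the 2-dimensional vectors are all invented toy values, not taken from the dataset.

```python
import math

def parse_analogies(lines):
    """Group 4-word samples by subsection.
    Assumption: a line starting with ':' names a new subsection."""
    sections = {}
    current = "unnamed"
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line[1:].strip()
            sections.setdefault(current, [])
            continue
        words = line.split()
        if len(words) == 4:  # lines that are not 4 words long are skipped
            sections.setdefault(current, []).append(tuple(words))
    return sections

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def predict(a, b, c, vectors):
    """Vector-offset prediction: the word closest to b - a + c,
    excluding the three query words themselves."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

def accuracy(samples, vectors):
    """Share of correctly completed samples among those fully in the vocabulary."""
    covered = [s for s in samples if all(w in vectors for w in s)]
    if not covered:
        return 0.0
    correct = sum(predict(a, b, c, vectors) == d for a, b, c, d in covered)
    return correct / len(covered)

# Toy input (invented for illustration, not taken from the dataset).
toy_lines = [": familj", "man kvinna kung drottning"]
toy_vectors = {
    "man": [1.0, 0.0],
    "kvinna": [1.0, 1.0],
    "kung": [3.0, 0.0],
    "drottning": [3.0, 1.0],
    "pojke": [0.5, 0.0],  # distractor word
}

sections = parse_analogies(toy_lines)
score = accuracy(sections["familj"], toy_vectors)
print(score)
```

With real embeddings, `toy_vectors` would be replaced by a word-to-vector mapping loaded from a trained model, and accuracy would typically be reported per subsection as well as separately for the semantic and syntactic categories.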