
Svensk analogi 2.0


Adewumi, Tosin. (2023-03-30). Svensk analogi 2.0 [Data set]. Språkbanken Text. https://doi.org/10.23695/b2m4-5y87
Swedish semantic and syntactic similarity
Title* Swedish analogy test set v1.1
Subtitle Swedish semantic and syntactic similarity test set
Created by* Tosin Adewumi (tosin.adewumi@ltu.se), ML Group, LTU
Publisher(s)* Språkbanken Text (sb-info@svenska.gu.se)
Link(s) / permanent identifier(s)* https://spraakbanken.gu.se/en/resources/superlim
License(s)* CC BY 4.0
Abstract* The Swedish analogy test set follows the format of the original Google version. However, it is larger and balanced across the two major categories: it has 20,638 samples in total, made up of 10,381 semantic and 10,257 syntactic samples. There are 5 semantic subsections and 6 syntactic subsections, and the set is roughly balanced across the syntactic subsections. The dataset was constructed partly from the samples in the English version, with the help of tools dedicated to Swedish translation, and was proof-read for corrections by two native speakers (with a percentage agreement of 98.93%).
Funded by* Vinnova (grant no. 2019-02996)
Cite as [1]
Related datasets Part of the SuperLim collection (https://spraakbanken.gu.se/en/resources/superlim).
Key applications Intrinsic evaluation of Swedish word embeddings
Intended task(s)/usage(s) Given a word pair A and B and a word C, find a word D such that A is to B as C is to D (A:B::C:D)
Recommended evaluation measures Accuracy
Dataset function(s) Few-shot training ('prompting'), testing
Recommended split(s) A few-shot training set (aka 'prompt', 10%) and a test set (90%). The prompt was added with GPT-like models in mind; models that do not need a prompt can ignore it.
Primary data* Text
Language* Swedish
Dataset in numbers* Total of 20,638 samples: 10,381 semantic samples and 10,257 syntactic samples. These are split into 2,045 train samples and 18,593 test samples. The split was random; no effort was made to control the balance of syntactic and semantic samples in train and test.
Nature of the content* Each sample contains 2 pairs of words, so there are 4 related words per line.
Format* TSV/JSONL with 5 columns/objects: four words and a category. The word to be predicted is called 'label'; the given words are called 'pair1_element1', 'pair1_element2', and 'pair2_element1'.
Data source(s)* Partly based on the English version by: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. New additions were made using the following online tools: https://bab.la and https://en.wiktionary.org/wiki/
Data collection method(s)* The dataset was compiled by translating part of the English version (Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781). An additional data source is en.wiktionary.org/wiki. Two Swedish native speakers then proof-read the finished version, and the inter-annotator agreement score was calculated.
Data selection and filtering* The dataset was postprocessed and corrected by Lars Borin and Aleksandrs Berdicevskis.
Data preprocessing* Does not apply
Data labeling* Does not apply
Annotator characteristics Two Swedish native speakers
Ethical considerations
Things to watch out for
Data last updated* 2023-03-05, Gerlof Bouma
Which changes have been made, compared to the previous version* Minor format changes
Access to previous versions Work in progress
This document created* 2021-05-20, Tosin Adewumi
This document last updated* 2023-03-05, Gerlof Bouma
Where to look for further details [1],[2]
Documentation template version* v1.1
Related projects
References [1] Adewumi, T. P., Liwicki, F., & Liwicki, M. (2020). Corpora compared: The case of the Swedish Gigaword & Wikipedia corpora. arXiv preprint arXiv:2011.03281.
[2] Adewumi, T. P., Liwicki, F., & Liwicki, M. (2020). Exploring Swedish & English fastText Embeddings with the Transformer. arXiv preprint arXiv:2007.16007.
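The task under "Intended task(s)/usage(s)" (A:B::C:D) and the recommended measure (accuracy) can be illustrated with a small sketch. It uses the vector-offset approach commonly applied to analogy test sets, which is an assumption here: the documentation sheet does not prescribe a solving method. The helper names (`cosine`, `solve_analogy`, `accuracy`) and the two-dimensional toy embeddings are hypothetical and hand-made for illustration; they are not taken from the dataset or from any trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(emb, a, b, c):
    """Pick the word D whose vector is closest to emb[b] - emb[a] + emb[c].

    The three given words are excluded from the candidates, as is usual
    for this kind of evaluation (an assumption, not stated in the sheet).
    """
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, emb[w]))


def accuracy(emb, samples):
    """Fraction of (A, B, C, D) samples for which the predicted word equals D."""
    correct = sum(solve_analogy(emb, a, b, c) == d for a, b, c, d in samples)
    return correct / len(samples)


# Hand-made toy vectors, for illustration only.
toy_emb = {
    "man": [1.0, 0.0],
    "kvinna": [1.0, 1.0],
    "kung": [3.0, 0.0],
    "drottning": [3.0, 1.0],
    "pojke": [0.5, 0.0],
    "flicka": [0.5, 1.0],
}

# Illustrative samples in (A, B, C, D) order; not copied from the dataset.
toy_samples = [
    ("man", "kvinna", "kung", "drottning"),
    ("man", "kvinna", "pojke", "flicka"),
]

print(solve_analogy(toy_emb, "man", "kvinna", "kung"))
print(accuracy(toy_emb, toy_samples))
```

With real embeddings, `emb` would map every vocabulary word to its vector, and words missing from the vocabulary would need separate handling (for example, counting the sample as incorrect).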
File An archive with the dataset in JSONL and TSV formats and the documentation sheet (zip)
Size 178.63 KB
Modified 2023-03-30
License CC BY 4.0
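As a rough sketch of how records in the two formats might be read, the following uses only the Python standard library. The four word field names come from the "Format" entry of the documentation sheet; the key name `category`, the presence of a header row in the TSV file, and the sample record itself are assumptions made for illustration and should be checked against the actual files in the archive.

```python
import csv
import io
import json

# Field names as given in the documentation sheet; 'label' is the word to predict.
WORD_FIELDS = ("pair1_element1", "pair1_element2", "pair2_element1", "label")


def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def read_tsv(stream):
    """Yield one dict per row of a TSV stream (assumes a header row)."""
    return csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)


def to_analogy(record):
    """Return the sample as an (A, B, C, D) tuple, with D being the label."""
    return tuple(record[field] for field in WORD_FIELDS)


# Made-up sample data standing in for the real files; 'category' is an assumed key.
jsonl_text = (
    '{"pair1_element1": "man", "pair1_element2": "kvinna", '
    '"pair2_element1": "kung", "label": "drottning", "category": "example"}\n'
)
tsv_text = (
    "pair1_element1\tpair1_element2\tpair2_element1\tlabel\tcategory\n"
    "man\tkvinna\tkung\tdrottning\texample\n"
)

jsonl_samples = [to_analogy(r) for r in read_jsonl(io.StringIO(jsonl_text))]
tsv_samples = [to_analogy(r) for r in read_tsv(io.StringIO(tsv_text))]
print(jsonl_samples)
print(tsv_samples)
```

For the real files, the `io.StringIO` objects would be replaced by files opened with `encoding="utf-8"`, since the Swedish words contain non-ASCII characters.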

Part of collection

SuperLim 2


  • Corpus
  • Training and evaluation data




Created by

  • Adewumi, Tosin