
SweWiC

A Swedish word-in-context evaluation set.
I. IDENTIFYING INFORMATION
Title* SweWiC v1.0
Subtitle A Swedish Word-in-Context dataset
Created by* Gerlof Bouma (gerlof.bouma@gu.se)
Publisher(s)* Språkbanken Text
Link(s) / permanent identifier(s)* https://spraakbanken.gu.se/en/resources/swewic
License(s)* CC BY 4.0
Abstract* The Swedish Word-in-Context dataset provides a benchmark for evaluating distributional models of word meaning, in particular context-sensitive/dynamic models. Constructed following the principles of the (English) Word-in-Context dataset, SweWiC consists of 1000 sentence pairs, where each sentence in a pair contains an occurrence of a potentially ambiguous focus word specific to that pair. The question posed to the tested system is whether these two occurrences represent instances of the same word sense. There are 500 same-sense pairs and 500 different-sense pairs.
Funded by* Vinnova (grant no. 2019-02996)
Cite as
Related datasets Part of the SuperLim collection (https://spraakbanken.gu.se/en/resources/superlim)
II. USAGE
Key applications Evaluation of (preferably dynamic) representations of word meaning
Intended task(s)/usage(s) For each test pair, predict if the uses of the focus word in two different contexts constitute the same sense.
Recommended evaluation measures Accuracy
Dataset function(s) Testing
Recommended split(s) Test split only
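Because accuracy is the recommended measure and the test set is balanced (500 same-sense and 500 different-sense pairs), a system that always gives the same answer would score 0.5. The following is a minimal sketch of the computation, assuming gold labels and predictions are available as parallel lists of booleans. The function name `accuracy` and the toy labels are illustrative and not part of the dataset distribution.

```python
from typing import Sequence


def accuracy(gold: Sequence[bool], predicted: Sequence[bool]) -> float:
    """Fraction of test items where the prediction matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy labels for illustration only; these are not taken from SweWiC.
gold_labels = [True, False, True, False]
always_same = [True] * len(gold_labels)
baseline = accuracy(gold_labels, always_same)
print(f"always-same baseline on the toy labels: {baseline:.2f}")
```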
III. DATA
Primary data* Text
Language* Swedish
Dataset in numbers* 1000 test items, 500 positive, 500 negative. Constructed from 560 focus word types: 263 types occurring in one test item, 156 in two, and 142 in three. The focus words are of the following parts of speech according to SALDO: 462 nouns, 351 verbs, 143 adjectives, 31 adverbs, 9 prepositions, 3 pronouns, and 1 interjection.
Nature of the content* Pairs of sentences with one highlighted word form in each sentence, such that these highlighted forms are linked to the same base form (but not necessarily in the same paradigm!). Each pair is accompanied by an indication of whether the highlighted forms have the same sense (meaning) in their contexts or not. The lexical resource SALDO is used to supply the senses and sense distinctions.
Format* JSON Lines, with 1 test item per line. Each item is given as a pair (first word in context, second word in context) and a boolean saying whether the same sense is used. A word in context is given as a string for the context, together with a combination of a string and string indices to locate the focus word. Indices start at 0 and refer to the NFKC-normalized Unicode string. Metadata included for each item is intended for analysis, and not for use by the sense disambiguation system.
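As an illustration of the format description, the sketch below reads JSON Lines rows and extracts the focus word after NFKC normalization. All key names (`first`, `second`, `context`, `word`, `text`, `start`, `end`, `same_sense`) are assumptions made for this example and should be checked against the actual file. The sketch also assumes that indices count Unicode code points and that the end index is exclusive. The sample item is made up and not taken from SweWiC.

```python
import json
import unicodedata
from typing import Iterable, Iterator


def read_items(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed test item per non-empty JSON Lines row."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def focus_word(word_in_context: dict) -> str:
    """Return the focus word located by the item's string indices."""
    # Indices refer to the NFKC-normalized context, so normalize before slicing.
    context = unicodedata.normalize("NFKC", word_in_context["context"])
    span = word_in_context["word"]
    return context[span["start"]:span["end"]]  # assumed: end is exclusive


# Made-up toy item with hypothetical key names, not taken from SweWiC.
# The first context is stored in decomposed (NFD) form on purpose, to show
# why normalization matters before applying the indices.
sample_line = json.dumps({
    "first": {
        "context": unicodedata.normalize("NFD", "Hon satt på en bänk i parken."),
        "word": {"text": "bänk", "start": 15, "end": 19},
    },
    "second": {
        "context": "Laget hade en stark bänk.",
        "word": {"text": "bänk", "start": 20, "end": 24},
    },
    "same_sense": False,
})

for item in read_items([sample_line]):
    print(focus_word(item["first"]), focus_word(item["second"]), item["same_sense"])
```

Slicing the unnormalized first context with the same indices would return the wrong substring, because each decomposed letter such as "å" occupies two code points until it is normalized.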
Data source(s)* SALDO v2.3 (CC BY 4.0, https://spraakbanken.gu.se/en/resources/saldo, see also [1]) is used to provide the sense inventory.
SALDO’s morphology (CC BY 4.0, https://spraakbanken.gu.se/en/resources/saldom) is used to ensure that the word forms in the sentences are possible word forms for the sense(s) involved in the test item.
SALDO examples (CC BY 4.0, https://spraakbanken.gu.se/en/resources/saldoe) and Eukalyptus v0.2.0 (mixed CC licenses, inclusion of sentences in SweWiC under CC BY 4.0 with permission, https://spraakbanken.gu.se/en/resources/eukalyptus; see [2] for the sense annotation in this corpus) are used as sources of sense annotated words in context.
Data collection method(s)* (See selection and filtering.)
Data selection and filtering* In the spirit of the design principles given in [3], the test items adhere to the following restrictions:
- all focus words are potentially ambiguous, even in the same-sense test items
- a focus word type occurs in at most three items in the test set, and no combination of a focus word and a context is repeated
- the instances in both contexts in a test item are of the same part of speech (SALDO in principle allows for semantic base forms that cross part of speech), and SALDO’s morphology lists the word forms used in the contexts as possible realizations of the involved senses.
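The first two restrictions lend themselves to a simple automated check. The sketch below is one possible way to verify them, assuming each test item has been reduced to a tuple of (focus word type, first context, second context). How the focus word type is identified in the released file is not specified here, so that reduction step and the name `check_restrictions` are assumptions for illustration only.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Each entry is (focus word type, first context, second context).
Item = Tuple[str, str, str]


def check_restrictions(items: Iterable[Item], max_items_per_type: int = 3) -> List[str]:
    """Return a list of human-readable violations; empty if none are found."""
    problems: List[str] = []
    type_counts: Counter = Counter()
    seen_pairs = set()
    for focus_type, first_context, second_context in items:
        type_counts[focus_type] += 1
        for context in (first_context, second_context):
            key = (focus_type, context)
            if key in seen_pairs:
                problems.append(f"repeated focus word/context combination: {key!r}")
            seen_pairs.add(key)
    for focus_type, count in type_counts.items():
        if count > max_items_per_type:
            problems.append(f"{focus_type!r} occurs in {count} items")
    return problems


# Toy data for illustration only; these are not SweWiC items.
toy_items = [
    ("bänk", "context A", "context B"),
    ("bänk", "context C", "context D"),
]
print(check_restrictions(toy_items))
```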
Data preprocessing* None.
Data labeling* Judgements about word senses are taken from resources with manual annotation of word senses, and therefore constitute gold-standard data.
Annotator characteristics (No additional annotation, that is, beyond the annotation done in the projects creating the data sources, was done in the compilation of SweWiC.)
IV. ETHICS AND CAVEATS
Ethical considerations None to report.
Things to watch out for -
V. ABOUT DOCUMENTATION
Data last updated* 20210615, v1.0
Which changes have been made, compared to the previous version* First release of the data.
Access to previous versions First release of the data.
This document created* 20210615 Gerlof Bouma (gerlof.bouma@gu.se)
This document last updated* 20210615 Gerlof Bouma (gerlof.bouma@gu.se)
Where to look for further details -
Documentation template version* v1.0
VI. OTHER
Related projects The task and the design principles of the dataset were taken from / heavily inspired by the original (English) Word-in-Context benchmark described in [3]. See also the companion website https://pilehvar.github.io/wic/.
A description of a collection of WiCs for 12 languages (but not Swedish) is given in [4]. See also https://pilehvar.github.io/xlwic/.
References [1] Borin, Forsberg and Lönngren (2013): SALDO: a touch of yin to WordNet's yang. Language Resources and Evaluation 47(4), pp. 1191–1211. https://doi.org/10.1007/s10579-013-9233-4
[2] Johansson, Adesam, Bouma and Hedberg (2016): A Multi-domain Corpus of Swedish Word Sense Annotation. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pp. 3019–3022. https://www.aclweb.org/anthology/L16-1482.pdf
[3] Pilehvar and Camacho-Collados (2019): WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). http://dx.doi.org/10.18653/v1/N19-1128
[4] Raganato, Pasini, Camacho-Collados and Pilehvar (2020): XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). http://dx.doi.org/10.18653/v1/2020.emnlp-main.584

Contact

Språkbanken (sb-info@svenska.gu.se)