# Swedish Test Data for SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection

- - -

### Type

Corpus, Dataset

### Authors

Nina Tahmasebi, Simon Hengchen, Dominik Schlechtweg, Barbara McGillivray, Haim Dubossarsky

### Description

This data collection contains the Swedish test data for [SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection](https://competitions.codalab.org/competitions/20948):

- a lemmatized Swedish text corpus pair (`corpus1/`, `corpus2/`)
- 31 lemmas (targets) which have been annotated for their lexical semantic change between the two corpora (`targets.txt`)
- the annotated binary change scores of the targets for subtask 1, and their annotated graded change scores for subtask 2 (`truth/`)

__Corpus 1__ (lemma version)

- based on: [Kubhist2](https://spraakbanken.gu.se/korp/?mode=kubhist)
- language: Swedish
- time covered: 1790-1830
- size: ~71 million tokens
- format: lemmatized, sentence length > 9 (before removal of punctuation), no punctuation, sentences randomly shuffled
- encoding: UTF-8
- note: contains very frequent OCR errors and spelling variations

__Corpus 2__ (lemma version)

- based on: [Kubhist2](https://spraakbanken.gu.se/korp/?mode=kubhist) 
- language: Swedish
- time covered: 1895-1903
- size: ~111 million tokens 
- format: lemmatized, sentence length > 9 (before removal of punctuation), no punctuation, sentences randomly shuffled
- encoding: UTF-8
- note: contains frequent OCR errors
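
As a quick sanity check on the corpus format described above, the sketch below counts how often given target lemmas occur in a set of sentences. It assumes one sentence per line with lemmas separated by whitespace, which the description suggests but does not state outright. The function name `count_targets` and the sample sentences are illustrative only and are not part of the dataset.

```python
from collections import Counter
from typing import Iterable


def count_targets(lines: Iterable[str], targets: Iterable[str]) -> Counter:
    """Count occurrences of each target lemma in an iterable of sentences.

    Assumption: each element of `lines` is one sentence whose lemmas are
    separated by whitespace.
    """
    wanted = set(targets)
    # Start every target at zero so absent lemmas still show up in the result.
    counts = Counter({t: 0 for t in wanted})
    for line in lines:
        for lemma in line.split():
            if lemma in wanted:
                counts[lemma] += 1
    return counts


# Tiny in-memory stand-in for a corpus file (invented sentences, not real data).
sample = [
    "den gammal man gå till stad i dag med sin häst",
    "stad vara stor och den ha många hus vid väg i stad",
]
example_counts = count_targets(sample, ["stad", "häst", "krita"])
```

With the real data, an open file handle (opened with `encoding="utf-8"`) can be passed as `lines`, and the lemmas read from `targets.txt` as `targets`. The file names inside `corpus1/` and `corpus2/` are not listed here, so check the download for the actual paths and for whether the files are compressed.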

Besides the official lemma version of the corpora for SemEval-2020 Task 1, we also provide the raw token version (`corpus1/token/`, `corpus2/token/`). The token version contains the raw sentences in the same order as the lemma version. More information on the data and on SemEval-2020 Task 1 is available in the paper referenced below.
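
Because the token version keeps the sentences in the same order as the lemma version, the two can be read in parallel to recover the raw context of a lemma. The sketch below assumes one sentence per line in both versions and relies only on sentence-level alignment, since the number of tokens per line may differ between the two. The helper names and sample sentences are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def aligned_sentences(
    lemma_lines: Iterable[str], token_lines: Iterable[str]
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (lemmas, tokens) pairs for sentences at the same line position."""
    for lemma_line, token_line in zip(lemma_lines, token_lines):
        yield lemma_line.split(), token_line.split()


def contexts_for(
    target: str, lemma_lines: Iterable[str], token_lines: Iterable[str]
) -> List[str]:
    """Return the raw sentences whose lemmatized counterpart contains `target`."""
    return [
        " ".join(tokens)
        for lemmas, tokens in aligned_sentences(lemma_lines, token_lines)
        if target in lemmas
    ]


# Invented example sentences standing in for the two file versions.
lemma_sample = ["den gammal stad vara stor", "en häst gå på väg"]
token_sample = ["Den gamla staden var stor .", "En häst gick på vägen ."]
stad_contexts = contexts_for("stad", lemma_sample, token_sample)
```

Note that `zip` stops silently at the shorter input, so it is worth comparing the line counts of the two files before relying on the pairing.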

The creation of the data was supported by the project Towards Computational Lexical Semantic Change Detection, funded by a project grant from the Swedish Research Council (2019–2022; dnr 2018-01184).
It has also been created as part of the effort to construct and develop a Swedish national research infrastructure in support of research based on language data. This infrastructure -- Nationella språkbanken (the Swedish National Language Bank) -- is jointly funded for the period 2018--2024 by the Swedish Research Council (grant number 2017-00626) and its 10 partner institutions.


### Reference

Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi. 2020. [SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection](https://competitions.codalab.org/competitions/20948). To appear in SemEval@COLING2020.

### Download

The resources are available under a CC BY (attribution) license and can be downloaded here:
[https://zenodo.org/record/3730550](https://zenodo.org/record/3730550)


- - -
