Argumentation sentences 1.0

A translated corpus for classifying sentence stance in relation to a topic.

I. IDENTIFYING INFORMATION
Title*	Argumentation sentences
Subtitle	A translated corpus for classifying sentence stance in relation to a topic.
Created by*	Anna Lindahl (anna.lindahl@svenska.gu.se)
Publisher(s)*	Språkbanken Text (sb-info@svenska.gu.se)
Link(s) / permanent identifier(s)*	https://spraakbanken.gu.se/en/resources/superlim
License(s)*	CC BY 4.0
Abstract*	Argumentation sentences is a translated corpus for the task of identifying stance in relation to a topic. It consists of sentences labeled pro, con or non in relation to one of six topics. The original dataset [1] is available at https://github.com/trtm/AURC. The test set consists of manually corrected translations; the training set is machine translated.
Funded by*	Vinnova (grant no. 2021-04165)
Cite as
Related datasets	Part of the SuperLim collection (https://spraakbanken.gu.se/en/resources/superlim)

II. USAGE
Key applications	Machine learning, argumentation mining, stance classification
Intended task(s)/usage(s)	Evaluate models on the following task: Given a sentence and a topic, determine if the sentence is for, against or neutral in relation to the topic.
Recommended evaluation measures	Krippendorff’s alpha (the official SuperLim measure), MCC, F
Dataset function(s)	Training, testing
Recommended split(s)	Train, dev, test (provided)
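
The recommended measures above are computed from gold and predicted labels. The following is a minimal sketch of Krippendorff's alpha, assuming the nominal variant with the gold labels and the model predictions treated as two coders. This sheet does not specify the variant, and the function name is hypothetical.

```python
from collections import Counter


def krippendorff_alpha_nominal(gold, pred):
    """Nominal Krippendorff's alpha for two 'coders' without missing values.

    Assumption: gold labels and predictions are treated as the two coders.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # Coincidence matrix: each unit contributes both ordered pairs of its values.
    coincidences = Counter()
    for g, p in zip(gold, pred):
        coincidences[(g, p)] += 1
        coincidences[(p, g)] += 1

    n = 2 * len(gold)  # total number of paired values

    # Marginal totals per label.
    totals = Counter()
    for (c, _k), count in coincidences.items():
        totals[c] += count

    # Sums over mismatching label pairs (nominal distance is 0 or 1).
    observed = sum(count for (c, k), count in coincidences.items() if c != k)
    expected = sum(totals[c] * totals[k] for c in totals for k in totals if c != k)

    if expected == 0:
        # Only one label occurs at all, so alpha is undefined.
        return float("nan")
    return 1.0 - (n - 1) * observed / expected


# Toy labels for illustration only, not taken from the corpus.
gold_labels = ["pro", "con", "non", "non"]
predicted_labels = ["pro", "non", "non", "non"]
print(krippendorff_alpha_nominal(gold_labels, predicted_labels))
```

A value of 1 indicates perfect agreement between predictions and gold labels, and values around 0 indicate agreement no better than chance.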

III. DATA
Primary data*	Text
Language*	Swedish
Dataset in numbers*	5265 sentences split over 6 topics, 3450 train, 750 dev and 1065 test
Nature of the content*	Topics: Abortion, Death penalty, Nuclear power, Marijuana legalization, Minimum wage, Cloning. Each topic has a set of associated sentences, labeled with pro, con or non in relation to the topic.
Format*	JSONL with the following keys: sentence_id = the id of each sentence; topic = the topic of each sentence; label = the label of each sentence (pro, con or non); sentence = the sentence itself
	Tab-separated (TSV) with 4 columns: the id of each sentence, the topic of each sentence, the label of each sentence (pro, con or non), and the sentence itself
Data source(s)*	The original data comes from the AURC dataset [1] (https://github.com/trtm/AURC). For this corpus, only the in-domain topics were used.
Data collection method(s)*	Collected from the Common Crawl archive. See [1]
Data selection and filtering*	A subset of the original data: only the in-domain topics are used.
Data preprocessing*	Sentences were machine translated. The test set was then manually corrected.
Data labeling*	The sentences are labeled with pro, con or non, signifying their stance in relation to a topic.
Annotator characteristics
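
The JSONL format described above can be read with the standard library. The following is a minimal sketch, assuming one JSON object per line with the four documented keys. The sample records, their id and topic values, and the function name are made up for illustration and are not taken from the corpus.

```python
import io
import json
from collections import Counter

# Keys and labels as documented in the Format field.
EXPECTED_KEYS = {"sentence_id", "topic", "label", "sentence"}
VALID_LABELS = {"pro", "con", "non"}


def read_jsonl(handle):
    """Parse JSONL records and check them against the documented keys and labels."""
    records = []
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = EXPECTED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_number}: missing keys {sorted(missing)}")
        if record["label"] not in VALID_LABELS:
            raise ValueError(f"line {line_number}: unexpected label {record['label']!r}")
        records.append(record)
    return records


# Hypothetical in-memory sample standing in for a file from the archive.
sample = io.StringIO(
    '{"sentence_id": "s1", "topic": "nuclear power", "label": "pro", "sentence": "Placeholder sentence 1."}\n'
    '{"sentence_id": "s2", "topic": "nuclear power", "label": "con", "sentence": "Placeholder sentence 2."}\n'
    '{"sentence_id": "s3", "topic": "cloning", "label": "non", "sentence": "Placeholder sentence 3."}\n'
)

records = read_jsonl(sample)
label_counts = Counter(record["label"] for record in records)
print(label_counts)
```

To read a real file, pass an open file handle (for example from open(path, encoding="utf-8")) instead of the in-memory sample.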

IV. ETHICS AND CAVEATS
Ethical considerations
Things to watch out for

V. ABOUT DOCUMENTATION
Data last updated*	20221215
Which changes have been made, compared to the previous version*	First version
Access to previous versions
This document created*	20221215 by Anna Lindahl
This document last updated*	20220203 by Anna Lindahl
Where to look for further details
Documentation template version*	v1.1

VI. OTHER
Related projects
References	[1] Trautmann, D., Daxenberger, J., Stab, C., Schütze, H., & Gurevych, I. (2020, April). Fine-grained argument unit recognition and classification. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 05, pp. 9048-9056).

File	Size	Modified	Licence
argumentation-sentences.zip: an archive with the dataset in JSONL and TSV formats and the documentation sheet (zip)	827.04 KB	2023-03-30	CC BY 4.0 attribution
