A translated corpus for classifying sentence stance in relation to a topic.
| I. IDENTIFYING INFORMATION | |
| Title* | Argumentation sentences |
| Subtitle | A translated corpus for classifying sentence stance in relation to a topic. |
| Created by* | Anna Lindahl (anna.lindahl@svenska.gu.se) |
| Publisher(s)* | Språkbanken Text (sb-info@svenska.gu.se) |
| Link(s) / permanent identifier(s)* | https://spraakbanken.gu.se/en/resources/superlim |
| License(s)* | CC BY 4.0 |
| Abstract* | Argumentation sentences is a translated corpus for the task of identifying stance in relation to a topic. It consists of sentences labeled pro, con or non in relation to one of six topics. The original dataset [1] is available at https://github.com/trtm/AURC. The test set consists of manually corrected translations; the training set is machine translated. |
| Funded by* | Vinnova (grant no. 2021-04165) |
| Cite as | |
| Related datasets | Part of the SuperLim collection (https://spraakbanken.gu.se/en/resources/superlim) |
| II. USAGE | |
| Key applications | Machine learning, argumentation mining, stance classification |
| Intended task(s)/usage(s) | Evaluate models on the following task: Given a sentence and a topic, determine if the sentence is for, against or neutral in relation to the topic. |
| Recommended evaluation measures | Krippendorff’s alpha (the official SuperLim measure), MCC, F |
| Dataset function(s) | Training, testing |
| Recommended split(s) | Train, dev, test (provided) |
| III. DATA | |
| Primary data* | Text |
| Language* | Swedish |
| Dataset in numbers* | 5265 sentences split over 6 topics, 3450 train, 750 dev and 1065 test |
| Nature of the content* | Topics: Abortion, Death penalty, Nuclear power, Marijuana legalization, Minimum wage, Cloning. Each topic has a set of associated sentences, labeled pro, con or non in relation to the topic. |
| Format* | JSONL with the following keys: sentence_id = the id of the sentence; topic = the topic of the sentence; label = the label of the sentence (pro, con or non); sentence = the sentence itself |
| | Tab-separated with 4 columns: the id of the sentence, the topic of the sentence, the label of the sentence (pro, con or non), and the sentence itself |
| Data source(s)* | The original data comes from the AURC dataset [1] (https://github.com/trtm/AURC). For this corpus, only the in-domain topics were used. |
| Data collection method(s)* | Collected from the Common Crawl archive. See [1]. |
| Data selection and filtering* | A subset of the original data: only the in-domain topics are used. |
| Data preprocessing* | Sentences were machine translated. The test set was then manually corrected. |
| Data labeling* | The sentences are labeled with pro, con or non, signifying their stance in relation to a topic. |
| Annotator characteristics | |
| IV. ETHICS AND CAVEATS | |
| Ethical considerations | |
| Things to watch out for | |
| V. ABOUT DOCUMENTATION | |
| Data last updated* | 20221215 |
| Which changes have been made, compared to the previous version* | First version |
| Access to previous versions | |
| This document created* | 20221215 by Anna Lindahl |
| This document last updated* | 20220203 by Anna Lindahl |
| Where to look for further details | |
| Documentation template version* | v1.1 |
| VI. OTHER | |
| Related projects | |
| References | [1] Trautmann, D., Daxenberger, J., Stab, C., Schütze, H., & Gurevych, I. (2020, April). Fine-grained argument unit recognition and classification. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 05, pp. 9048-9056). |
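
The JSONL format described in the Format row can be read with a few lines of Python. The following is a minimal sketch, not part of the official distribution: it assumes only the four keys and the three label values listed in the table. The helper names `parse_jsonl` and `label_counts_by_topic` and the sample lines are hypothetical, written for illustration.

```python
import json
from collections import Counter

# Keys and label values as listed in the Format row of the table.
EXPECTED_KEYS = {"sentence_id", "topic", "label", "sentence"}
VALID_LABELS = {"pro", "con", "non"}


def parse_jsonl(lines):
    """Parse an iterable of JSONL lines into a list of validated records."""
    records = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        missing = EXPECTED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_number}: missing keys {sorted(missing)}")
        if record["label"] not in VALID_LABELS:
            raise ValueError(f"line {line_number}: unexpected label {record['label']!r}")
        records.append(record)
    return records


def label_counts_by_topic(records):
    """Count labels per topic, returned as {topic: Counter}."""
    counts = {}
    for record in records:
        counts.setdefault(record["topic"], Counter())[record["label"]] += 1
    return counts


# Made-up example lines (NOT taken from the corpus); the topic strings and the
# id format used in the real files may differ.
sample = [
    '{"sentence_id": "1", "topic": "nuclear power", "label": "pro", "sentence": "Exempelmening ett."}',
    '{"sentence_id": "2", "topic": "nuclear power", "label": "non", "sentence": "Exempelmening två."}',
]
records = parse_jsonl(sample)
counts = label_counts_by_topic(records)
```

Because an open file is also an iterable of lines, the same function can be applied to a file handle opened with `encoding="utf-8"`; the actual file names of the train, dev and test splits are not stated here and would need to be checked in the download.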
