This is a small corpus of 10 pairs of texts in Swedish and English annotated with personal information categories. The annotation largely follows that of the TAB corpus (https://aclanthology.org/2022.cl-4.19/). The 20 texts were sourced from the Parallel Global Voices corpus (https://nlp.ilsp.gr/pgv/, CC BY 4.0) and manually annotated. That corpus, in turn, collected its texts from the Global Voices websites (https://globalvoices.org/, CC BY 3.0).
Standard reference
Maria Irena Szawerna, Jacob Lee Suchardt (2026): Fill-in-the-Blanks: Automatic Generation and Evaluation of Language Models' Pseudonyms for English and Swedish Texts, in Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), 11–16 May 2026, Palma, Mallorca, Spain, pages 1155–1169.
Data citation
Szawerna, Maria Irena, & Suchardt, Jacob Lee. PGV-PII [Data set]. Enriched and distributed by Språkbanken. https://doi.org/10.23695/qcqg-3613
A small collection of 10 pairs of parallel texts in Swedish and English annotated with personal information categories.
Annotation
The texts are annotated with personal information categories following the TAB guidelines (https://aclanthology.org/2022.cl-4.19/).
Intended uses
This corpus can be used to test personal information detection and labeling or generation of pseudonyms.
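As an illustration of the detection and labeling use case, the following is a minimal sketch of exact-match span scoring. It assumes that gold and predicted annotations have already been loaded as character-offset spans with a category label. The `Span` type, the `span_prf` function, and the example labels are hypothetical and do not reflect the corpus's actual file format.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical span representation; not the corpus's own schema."""
    start: int  # character offset where the span begins (inclusive)
    end: int    # character offset where the span ends (exclusive)
    label: str  # personal information category (illustrative values below)


def span_prf(gold: set[Span], pred: set[Span]) -> tuple[float, float, float]:
    """Return exact-match precision, recall and F1 over labelled spans."""
    # A prediction counts only if offsets and label all match a gold span.
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_pos else 0.0
    return precision, recall, f1


# Toy spans for illustration only; they are not taken from the corpus.
gold = {Span(0, 4, "PERSON"), Span(14, 20, "LOC")}
pred = {Span(0, 4, "PERSON"), Span(25, 29, "ORG")}
precision, recall, f1 = span_prf(gold, pred)
```

Exact matching is the strictest choice. A partial-overlap or token-level criterion may be more appropriate, depending on how the system under test segments text.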
Download
| File | Size | Modified | Licence |
|---|---|---|---|
| | 49.75 KB | 2026-02-27 | CC-BY-4.0 |