PGV-PII

Standard reference

Maria Irena Szawerna, Jacob Lee Suchardt (2026): Fill-in-the-Blanks: Automatic Generation and Evaluation of Language Models' Pseudonyms for English and Swedish Texts, in Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), 11–16 May 2026, Palma, Mallorca, Spain, pages 1155–1169

Data citation

Szawerna, Maria Irena, & Suchardt, Jacob Lee. PGV-PII [Data set]. Enriched and distributed by Språkbanken. https://doi.org/10.23695/qcqg-3613
PGV-PII is a small corpus of 10 pairs of parallel texts in Swedish and English (20 texts in total), manually annotated with personal information categories. The annotation largely follows that of the TAB corpus (https://aclanthology.org/2022.cl-4.19/). The texts were sourced from the Parallel Global Voices corpus (https://nlp.ilsp.gr/pgv/, CC BY 4.0), which in turn collected them from the Global Voices websites (https://globalvoices.org/, CC BY 3.0).

Annotation

The texts are annotated with personal information categories following the TAB guidelines (https://aclanthology.org/2022.cl-4.19/).

Intended uses

This corpus can be used to test personal information detection and labeling, or the generation of pseudonyms.
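As an illustration of the detection-and-labeling use case, the following is a minimal sketch of exact-match span scoring. It assumes that gold and predicted annotations have already been converted to (start, end, label) tuples; that representation, the function name span_scores, and the label strings in the usage note are hypothetical and are not taken from the corpus documentation.

```python
from typing import Dict, Iterable, Tuple

# Assumed representation: (start offset, end offset, category label).
Span = Tuple[int, int, str]


def span_scores(gold: Iterable[Span], predicted: Iterable[Span]) -> Dict[str, float]:
    """Exact-match precision, recall and F1 over labeled spans.

    A predicted span counts as correct only if its offsets and its
    label are identical to those of a gold span.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

For example, span_scores([(0, 5, "PERSON"), (10, 16, "LOC")], [(0, 5, "PERSON"), (20, 24, "ORG")]) gives a precision, recall and F1 of 0.5 each, because one of the two predictions matches one of the two gold spans. Exact matching is a strict choice; partial-overlap or token-level scoring may be more appropriate depending on the evaluation goal.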

Download

File: gv-pii.bz2 (corpus, jsonl)
Size: 49.75 KB
Modified: 2026-02-27
Licence: CC-BY-4.0
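Since the download is listed as a bz2-compressed JSONL file, a minimal sketch for reading it with the Python standard library could look as follows. The local path and the helper name read_jsonl_bz2 are assumptions. The record schema is not described on this page, so the sketch only prints the top-level keys of the first record rather than assuming any field names.

```python
import bz2
import json
from pathlib import Path


def read_jsonl_bz2(path):
    """Yield one parsed JSON value per non-empty line of a bz2-compressed JSONL file."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


if __name__ == "__main__":
    corpus_path = Path("gv-pii.bz2")  # assumed local download location
    if corpus_path.exists():
        records = list(read_jsonl_bz2(corpus_path))
        print(f"{len(records)} records")
        # Inspect the structure instead of guessing field names.
        if records and isinstance(records[0], dict):
            print(sorted(records[0].keys()))
```

Reading line by line keeps memory use low, although at under 50 KB the whole file could equally be loaded at once.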

Type

  • Corpus
  • Training and evaluation data

Language

Swedish
English

Size

Tokens: 22,589
Sentences: 1,117

Keywords

  • pseudonymization
  • anonymization
  • parallel
  • news

Creators

  • Szawerna, Maria Irena
  • Suchardt, Jacob Lee

Created

2025-10-07

Contact

sb-info@svenska.gu.se