PGV-PII

Data citation

Szawerna, Maria Irena, & Suchardt, Jacob Lee. PGV-PII [Data set]. Språkbanken Text. https://doi.org/10.23695/qcqg-3613
A small collection of 10 pairs of parallel texts in Swedish and English annotated with personal information categories.

This is a small corpus of 10 pairs of parallel texts in Swedish and English, annotated with personal information categories. The annotation largely follows that of the TAB corpus (https://aclanthology.org/2022.cl-4.19/). All twenty texts were sourced from the Parallel Global Voices corpus (https://nlp.ilsp.gr/pgv/, CC BY 4.0) and manually annotated. That corpus, in turn, collected its texts from the Global Voices websites (https://globalvoices.org/, CC BY 3.0).

Annotation

The texts are manually annotated with personal information categories, largely following the TAB guidelines (https://aclanthology.org/2022.cl-4.19/).

Intended uses

This corpus can be used to test systems that detect and label personal information, or systems that generate pseudonyms.
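
As an illustration of the detection-and-labeling use case, the sketch below scores predicted spans against gold spans by exact match. It is a minimal example under stated assumptions, not the evaluation procedure of the dataset authors: the `Span` type, the `span_scores` function, the use of character offsets, and the label strings are all hypothetical and would need to be adapted to the actual annotation format in the corpus file.

```python
from typing import Dict, Iterable, NamedTuple


class Span(NamedTuple):
    """One annotated mention. The field layout is an assumption for this sketch."""
    start: int  # offset where the mention begins
    end: int    # offset just past the end of the mention
    label: str  # personal information category (placeholder strings below)


def span_scores(gold: Iterable[Span], predicted: Iterable[Span]) -> Dict[str, float]:
    """Exact-match precision, recall and F1 over (start, end, label) triples."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example with made-up spans: one of two predictions matches a gold span.
gold = [Span(0, 5, "PERSON"), Span(10, 16, "LOC")]
predicted = [Span(0, 5, "PERSON"), Span(20, 24, "ORG")]
scores = span_scores(gold, predicted)
```

Exact matching is strict: a prediction with the right label but a boundary that is off by one character counts as both a false positive and a false negative. Partial-overlap or token-level scoring may be more appropriate depending on the application.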

Download

File: gv-pii.bz2 (corpus, jsonl)
Size: 49.75 KB
Modified: 2026-02-27
Licence: CC-BY-4.0
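
The download is listed as a bzip2-compressed JSONL file, so it can probably be read with the Python standard library alone. The sketch below assumes that the archive is a single bzip2 stream (not a tar archive), that the text is UTF-8, and that each non-empty line holds one JSON object. The helper name `read_jsonl_bz2` is hypothetical, and the field names inside each record are not documented on this page, so they should be inspected rather than assumed.

```python
import bz2
import json
from pathlib import Path
from typing import Any, Iterator, Union


def read_jsonl_bz2(path: Union[str, Path]) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of a bz2-compressed file."""
    # Text mode ("rt") makes bz2.open decompress and decode in one step.
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                yield json.loads(line)
```

Calling `list(read_jsonl_bz2("gv-pii.bz2"))` and printing the keys of the first record is a quick way to discover the actual schema before writing any processing code.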

Type

  • Corpus
  • Training and evaluation data

Language

Swedish
English

Size

Tokens: 22,589
Sentences: 1,117

Keywords

  • pseudonymization
  • anonymization
  • parallel
  • news

Creators

  • Szawerna, Maria Irena
  • Suchardt, Jacob Lee

Created

2025-10-07

Contact

sb-info@svenska.gu.se