Project description
Supervisors: Elena Volodina and Simon Dobnik
This PhD project is situated within the VR-funded research environment project Mormor Karl and focuses on the algorithmic part of detection and labeling of personally identifiable information (PII) in research data, and automatic pseudonym generation to replace PII.
The context for the project is set by these two papers:
- Research agenda for the field of pseudonymization and for the Mormor Karl project: Elena Volodina, Simon Dobnik, Therese Lindström Tiedemann and Xuan-Son Vu. 2023. Grandma Karl is 27 Years old – Research Agenda for Pseudonymization of Research Data. Proceedings of the 2023 IEEE Ninth International Conference on Big Data Computing Service and Applications (BigDataService), Workshop on Big Data and Machine Learning with Privacy Enhancing Tech. Athens, Greece.
- Setting standards within the field of pseudonymization: Elena Volodina, Simon Dobnik, Therese Lindström Tiedemann, Ricardo Muñoz Sánchez, Maria Irena Szawerna, Lisa Södergård and Xuan-Son Vu. 2025. Towards shared standards for pseudonymization of research data. In Proceedings of Huminfra Conference 2025 (HiC 2025), Stockholm, 12–13 November 2025.
Half-way seminar
Date: April 20, 2026
Discussant: Niklas Zechner
Included publications (preliminary):
- Maria Irena Szawerna, Simon Dobnik, Ricardo Muñoz Sánchez, Therese Lindström Tiedemann and Elena Volodina. 2024. Detecting Personal Identifiable Information in Swedish Learner Essays. In Proceedings of the EACL workshop Computational Approaches to Language Data Pseudonymization (CALD-pseudo-2024). EACL, St. Julian’s, Malta, 2024.
- Maria Irena Szawerna, Simon Dobnik, Therese Lindström Tiedemann, Ricardo Muñoz Sánchez, Xuan-Son Vu and Elena Volodina. 2024. Pseudonymization Categories across Domain Boundaries. In Proceedings of LREC-Coling 2024. Turin, Italy.
- Maria Irena Szawerna, Simon Dobnik, Ricardo Muñoz Sánchez, and Elena Volodina. 2025. The Devil’s in the Details: the Detailedness of Classes Influences Personal Information Detection and Labeling. In Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025). Tallinn, Estonia.
- Maria Irena Szawerna, David Alfter, Elena Volodina. 2025. Annotating Personal Information in Swedish Texts with SPARV. In Proceedings of the First Workshop on Natural Language Processing and Language Models for Digital Humanities. RANLP, Varna, Bulgaria.
- *Maria Irena Szawerna, Jacob Lee Suchardt. 2026 [forthcoming]. Fill-in-the-Blanks: Automatic Generation and Evaluation of Language Models’ Pseudonyms for English and Swedish Texts. In Proceedings of the Fifteenth Language Resources and Evaluation Conference. Palma de Mallorca, Spain.
- *Maria Irena Szawerna, Simon Dobnik. 2026 [forthcoming]. Birds of a Feather: Do Embedding Representations of Personal Information Flock Together? In Proceedings of the Joint Workshop on Legal and Ethical Issues in Human Language Technologies (LEGAL2026) and Computational Approaches to Language Data Pseudonymization, Anonymization, De-identification, and Data Privacy (CALD-pseudo 2026). LREC, Palma de Mallorca, Spain.
(The texts marked with an asterisk are drafts and will be replaced with the final camera-ready versions by the end of March.)