Grandma Karl is 27 years old: Automatic pseudonymization of research data
Official project website: https://mormor-karl.github.io/
Background
Accessibility of research data is critical for advances in many research fields, but textual data often cannot be shared due to the presence of personal and sensitive information, e.g., names or political opinions. The GDPR suggests pseudonymization as a solution, but we need to learn more about it before adopting it for manipulation of research data.
Project description
This environment targets several aspects of pseudonymization, aiming to advance Sweden's work on open access to research data:
Vision and planning
2023 -- employment of PhD students, contracts, data access, reannotation of SweLL-pilot to SweLL-gold.
2024 -- employment of PhD students, workshop at EACL 2024, CEFR-annotating SweLL-gold, development of personal information detection models.
2025 -- further model development, research on the role of LLMs for pseudonymization, reidentification experiments, collection of fictive texts.
2026 --
2027 --
2028 --
Departments/organizations
- University of Gothenburg, Dpt. of Swedish, Multilinguality, Language Technology
- University of Gothenburg, Dpt. of Philosophy, Linguistics, and Theory of Science
- Lund University, Dpt. of Computer Science
- University of Helsinki, Dpt. of Nordic languages
Publications
- Nikolai Ilinykh, Maria Irena Szawerna. 2025. “I Need More Context and an English Translation”: Analysing How LLMs Identify Personal Information in Komi, Polish, and English. In Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025). [pdf]
- Maria Irena Szawerna, Simon Dobnik, Ricardo Muñoz Sánchez, and Elena Volodina. 2025. The Devil’s in the Details: the Detailedness of Classes Influences Personal Information Detection and Labeling. In Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025). [pdf]
- Maria Irena Szawerna, Simon Dobnik, Ricardo Muñoz Sánchez, Elena Volodina. 2024. Swedish Learner Essays Revisited: Further Insights into Detecting Personal Information. An abstract at the Tenth Swedish Language Technology Conference (SLTC), Linköping, Sweden. [pdf]
- Ricardo Muñoz Sánchez, Simon Dobnik, Therese Lindström Tiedemann, Maria Irena Szawerna and Elena Volodina. 2024. Name Biases in Automated Essay Assessment. An abstract at the 28th International Congress of Onomastic Sciences - University of Helsinki, Helsinki, Finland. [link]
- Ricardo Muñoz Sánchez, Simon Dobnik, Maria Irena Szawerna, Therese Lindström Tiedemann and Elena Volodina. 2024. Did the Names I Used within My Essay Affect My Score? Diagnosing Name Biases in Automated Essay Scoring. In Proceedings of the EACL workshop Computational Approaches to Language Data Pseudonymization (CALD-pseudo-2024). EACL, Malta, 2024. Association for Language Technology. [pdf]
- Maria Irena Szawerna, Simon Dobnik, Ricardo Muñoz Sánchez, Therese Lindström Tiedemann and Elena Volodina. 2024. Detecting Personal Identifiable Information in Swedish Learner Essays. In Proceedings of the EACL workshop Computational Approaches to Language Data Pseudonymization (CALD-pseudo-2024). EACL, Malta, 2024. Association for Language Technology. [pdf]
- Elena Volodina, Simon Dobnik, Therese Lindström Tiedemann, Xuan-Son Vu, David Alfter, Maria Irena Szawerna, Ricardo Muñoz Sánchez. 2024. Proceedings of the EACL workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo), Editors. EACL, Malta, 2024. Association for Language Technology. [pdf]
- Maria Irena Szawerna, Simon Dobnik, Therese Lindström Tiedemann, Ricardo Muñoz Sánchez, Xuan-Son Vu and Elena Volodina. 2024. Pseudonymization Categories across Domain Boundaries. In Proceedings of LREC-Coling 2024. [pdf]
- Elena Volodina, Simon Dobnik, Therese Lindström Tiedemann and Xuan-Son Vu. 2023. Grandma Karl is 27 Years old – Research Agenda for Pseudonymization of Research Data. Proceedings of the 2023 IEEE Ninth International Conference on Big Data Computing Service and Applications (BigDataService), Workshop on Big Data and Machine Learning with Privacy Enhancing Tech. Athens, Greece. [pdf]