
Grandma Karl is 27 years old: Automatic pseudonymization of research data

Background

Access to research data is critical for progress in many research fields, but textual data often cannot be shared because it contains personal and sensitive information, e.g. names or political opinions. The GDPR suggests pseudonymization as a solution, but we need to learn more about it before adopting it for processing research data.

Project description

This environment targets several aspects of pseudonymization, aiming to advance Sweden's work on open access to research data:  

1. algorithms to automatically detect, label and pseudonymize personal identifiers in freely written texts (essays/blogs), focusing on linguistic challenges such as spelling errors, ambiguous entities and semantic constraints

2. analysis of the type and number of personal identifiers versus acceptable protection, followed by reidentification tests to ensure that pseudonymization is effective

3. analysis of the effects of pseudonymization on research data, e.g. on the readability of the resulting texts, their utility for answering the intended research questions and their applicability to practical scenarios (e.g. language assessment)
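
As a rough illustration of the first aim, the sketch below detects two kinds of personal identifiers (first names and e-mail addresses) and replaces them consistently. It is not the project's method: the name lexicon, the replacement pools, the `pseudonymize` function and the `email1@example.org` placeholder format are all hypothetical, and a real system would rely on trained models rather than a word list. The gender attribute stands in for the kind of semantic constraint hinted at in the project title: a replacement name should not clash with the surrounding text.

```python
import re

# Hypothetical toy lexicon (assumption): first name -> gender attribute.
FIRST_NAMES = {"Karl": "male", "Erik": "male", "Anna": "female", "Maria": "female"}

# Hypothetical replacement pools (assumption), one per gender.
REPLACEMENTS = {"male": ["Johan", "Lars"], "female": ["Eva", "Sara"]}

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
NAME_RE = re.compile(r"\b(" + "|".join(map(re.escape, FIRST_NAMES)) + r")\b")


def pseudonymize(text):
    """Return (pseudonymized text, mapping from original to pseudonym)."""
    mapping = {}
    used = {"male": 0, "female": 0}

    def replace_email(match):
        original = match.group(0)
        if original not in mapping:
            # Number the e-mail placeholders in order of first appearance.
            n = sum(1 for key in mapping if "@" in key) + 1
            mapping[original] = f"email{n}@example.org"
        return mapping[original]

    def replace_name(match):
        original = match.group(1)
        if original not in mapping:
            gender = FIRST_NAMES[original]
            pool = REPLACEMENTS[gender]
            # Cycles through the pool; with more names than pool entries,
            # two originals would share a pseudonym (a limitation of this toy).
            mapping[original] = pool[used[gender] % len(pool)]
            used[gender] += 1
        return mapping[original]

    # E-mail addresses first, so names inside them are not touched separately.
    text = EMAIL_RE.sub(replace_email, text)
    text = NAME_RE.sub(replace_name, text)
    return text, mapping


if __name__ == "__main__":
    sample = "Karl wrote to Anna. Karl's email is karl@example.com."
    result, table = pseudonymize(sample)
    print(result)  # Johan wrote to Eva. Johan's email is email1@example.org.
```

Keeping the mapping makes every mention of the same identifier receive the same pseudonym, which preserves coreference in the text. The same mapping is also what a reidentification test would try to recover, so in practice it would have to be stored separately from the shared data or discarded.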

We will use Swedish learner-written essays, collected and manually annotated by us, and generalize to the social media domain (through available corpora). Natural Language Processing, machine learning, neural networks and word embeddings are some of the methods we will work with.

Tools and datasets will be openly shared; theoretical and methodological insights will be discussed in articles. 

Vision and planning

2023 -- employment of PhD students

2024 --

2025 --

2026 --

2027 --

2028 --

 

Departments/organizations

  • University of Gothenburg, Dept. of Swedish, Multilingualism, Language Technology
  • University of Gothenburg, Dept. of Philosophy, Linguistics, and Theory of Science
  • Umeå University, Dept. of Computer Science
  • University of Helsinki, Dept. of Nordic Languages

 

Project duration

Project members

  • Elena Volodina (Project leader)
    elena.volodina@svenska.gu.se
  • Simon Dobnik (Researcher)
    simon.dobnik@gu.se
    Department of Philosophy, Linguistics and Theory of Science
  • Xuan-Son Vu (Researcher)
    xuan-son.vu@umu.se
    Department of Computer Science, Umeå University
  • Therese Lindström Tiedemann (Researcher)
    therese.lindstromtiedemann@helsinki.fi
    Department of Finnish, Finno-Ugrian and Scandinavian Studies, Faculty of Arts, University of Helsinki

Funding

  • Vetenskapsrådet, the Swedish Research Council (2022-02311)

Research topics

  • pseudonymization
  • research data
  • language technology
  • general linguistics
  • Swedish as a second language
  • data privacy

Project type

  • Research project
  • Externally funded