This blog post is based on the author's (Elena Volodina's) joint research with Yousuf (Samir) Ali Mohammed, Arild Matsson, Beáta Megyesi and Sandra Derbring.
Access to language data is an obvious prerequisite for research in digital humanities in general, and for the development of NLP-based tools in particular. However, making data accessible becomes challenging where personal data is involved. This is especially true of language learner data, where tasks are often phrased so that they directly or indirectly elicit explicit personal information, e.g. "Describe your school" or "Introduce yourself".
The recent public debate on personal integrity has led to important changes in European legislation (Encinas et al., 2015; ENISA, 2017, 2018), as witnessed by the General Data Protection Regulation (GDPR). The GDPR is a European Union regulation restricting the use of digital data containing personal information. Among other things, the GDPR focuses on handing back the ownership of personal data from software providers to private individuals (data subjects). However, the risk of misuse may linger despite the legislation, and the best protection would be either not to provide any personal information at all or to make sure that the software implementation encrypts, masks, hides or entirely prevents personal information from reaching servers (thus preventing its potential unauthorised exploitation). Technologies that could safeguard data subjects in that respect (that is, various de-identification/pseudonymization techniques, along with encryption, authorization, data minimization etc.) are recommended to be built into the software from the start, ensuring that the software complies with the requirement of data protection by design and by default (EU Commission, 2016, Art. 25).
Since the introduction of the GDPR, the NLP community has shown increased interest in anonymization/pseudonymization through automatic de-identification of personal information; see e.g. the recently held Workshop on NLP and Pseudonymisation (Ahrenberg and Megyesi, 2019). Still, most of the literature on the topic deals with medical data, e.g. Marimon et al. (2019), with available GDPR guidelines such as "Identifiability, anonymisation and pseudonymisation" published by the Medical Research Council (MRC, 2019).
Some influential pre-GDPR work in the field has been carried out by Rock (2001) and Medlock (2006). Medlock (2006) defined anonymisation as "the task of identifying and neutralising sensitive references within a given document or set of documents". Following Medlock's definition, anonymisation in most studies involves two distinct steps: first the text sequence containing personal information is identified, and then it is neutralized. Neutralization can be performed by replacing the personal information with a placeholder, with the category type of the personal information, or with another similar token belonging to the same category type.
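The three neutralization options can be illustrated with a minimal Python sketch. It assumes that the identification step has already produced character spans with a category label. The function name `neutralize`, the `SURROGATES` table, the `XXX` placeholder and the bracketed label format are all hypothetical choices made for illustration, not part of any tool discussed here.

```python
# Hypothetical surrogate values, one per category (illustration only).
SURROGATES = {"firstname": "Maria", "city": "Uppsala"}


def neutralize(text, spans, mode):
    """Replace each (start, end, category) span in `text` according to `mode`.

    mode="placeholder": a fixed dummy string
    mode="category":    the category label of the span
    mode="surrogate":   another token of the same category
    """
    pieces, last = [], 0
    for start, end, category in sorted(spans):
        pieces.append(text[last:start])  # keep the text before the span
        if mode == "placeholder":
            pieces.append("XXX")
        elif mode == "category":
            pieces.append(f"[{category}]")
        elif mode == "surrogate":
            pieces.append(SURROGATES[category])
        else:
            raise ValueError(f"unknown mode: {mode}")
        last = end
    pieces.append(text[last:])  # keep the text after the last span
    return "".join(pieces)


# Spans are assumed to come from a separate identification step.
text = "Anna lives in Stockholm."
spans = [(0, 4, "firstname"), (14, 23, "city")]

for mode in ("placeholder", "category", "surrogate"):
    print(mode, "->", neutralize(text, spans, mode))
```

The surrogate option keeps the text readable and linguistically natural, which matters for learner data, while the placeholder and category options make it obvious where something was removed.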
To ensure that the second language essays that we are collecting for future research in the SweLL project can in fact be used in other research projects, we manually pseudonymized approximately 600 essays following a set of SweLL guidelines for pseudonymization, with principles described in Megyesi et al. (2018). Recently, we have also started to explore the possibility of automating this process. Below is a summary of the progress we have made so far with automatic pseudonymization. A full description of the experiment is available in Volodina et al. (Accepted).
We have implemented rules for detecting, labeling and pseudonymizing the categories listed in the table below. For the detection and pseudonymization steps, we prepared special lists for matching the categories (names, surnames, cities, countries, languages, etc.), openly available in a repository under a CC BY-NC-SA license.
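As a rough illustration of list-based detection and labeling, the Python sketch below looks up each token in a set of category lists. The tiny lists, the label names and the function `label_tokens` are hypothetical stand-ins; they are not the lists or rules from the repository mentioned above.

```python
import re

# Hypothetical miniature category lists (illustration only).
CATEGORY_LISTS = {
    "firstname": {"Anna", "Hans"},
    "city": {"Stockholm", "Uppsala"},
    "country": {"Sweden", "Syria"},
}

# Invert the lists into a token -> label table for fast lookup.
TOKEN_TO_LABEL = {
    token: label
    for label, tokens in CATEGORY_LISTS.items()
    for token in tokens
}


def label_tokens(text):
    """Return (token, label) pairs; label is None if no list contains the token."""
    return [(tok, TOKEN_TO_LABEL.get(tok)) for tok in re.findall(r"\w+", text)]


print(label_tokens("Anna moved from Syria to uppsala"))
```

Matching in this sketch is exact and case-sensitive, so the lowercase "uppsala" receives no label. A case-insensitive lookup would catch it, at the cost of matching more ordinary words.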
To check the performance of our rules in real life, we used 280 SweLL essays that had been pseudonymized manually, and compared the manual annotation against the automatic one. We consider manual pseudonymization a gold standard that we try to replicate automatically. The results are very encouraging, with the automatic pseudonymizer reaching an accuracy of 0.89, as can be seen from the table below. The most important measure in this table is the F1-score, the harmonic mean of precision and recall, which takes into account the correctly labeled personal data points (true positives), the missed ones (false negatives) and the wrongly labeled ones (false positives). We list F1-scores for each category that we targeted in the experiment.
Date_digits, place (e.g. the name of a bus stop), and surname have rather low F1-scores. There are different reasons for that. Among the linguistic reasons are misspellings and lack of capitalization, both quite typical of second language writing. Conversely, common words that happen to be capitalized (e.g. the Swedish pronoun hans, "his", versus the name Hans) can mistakenly get labeled as personal data points.
There are also genre- and topic-specific reasons. For example, in evaluative or argumentative texts students are often asked to write a book review or a response to a debate article. In both cases, names and places are often used; however, they do not present any risk of identifying the author of the text, and thus need not be pseudonymized. The automatic pseudonymizer nevertheless labels and replaces them as well, which generates false positives. To put it simply, there is room for improvement.
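For readers less familiar with the F1-score, the Python sketch below shows how precision, recall and F1 can be computed when manual (gold) labels are compared with automatic ones. The function name `precision_recall_f1` and the toy annotations are made up for illustration; they are not data or code from the experiment.

```python
def precision_recall_f1(gold, predicted):
    """Compare two sets of (token_position, label) annotations."""
    tp = len(gold & predicted)   # labeled in both: true positives
    fp = len(predicted - gold)   # labeled only automatically: false positives
    fn = len(gold - predicted)   # labeled only manually: false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Toy annotations (hypothetical): token position plus category label.
gold = {(0, "firstname"), (3, "city"), (7, "surname")}
predicted = {(0, "firstname"), (3, "city"), (5, "firstname")}

p, r, f1 = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy case the automatic labels miss one surname (a false negative) and wrongly label one token as a first name (a false positive), so both precision and recall drop below 1.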
To summarize, pseudonymization offers two strong benefits in the research context: compliance with the GDPR and permission to use data beyond the original purposes of collection. In the future, we would like to test machine learning for this problem and to test crowdsourcing for correcting automatic pseudonymization, all with the ultimate goal of collecting learner essays online with secure on-the-fly pseudonymization.
====================
REFERENCES
- Ahrenberg, L. and Megyesi, B. (2019). Proceedings of the Workshop on NLP and Pseudonymisation, pages 1–341.
- Encinas, L. H., Muñoz, A. M., Martínez, V. G., Espigares, J. N., García, J. I. S., Castelluccia, C., and Bourka, A. (2015). Online privacy tools for the general public. Towards a methodology for the evaluation of PETs for internet & mobile users. https://www.enisa.europa.eu/publications/privacy-tools-for-the-general-… (Accessed 2019-11-17)
- ENISA (2017). Privacy Enhancing Technologies: Evolution and State of the Art. A Community Approach to PETs Maturity Assessment. https://www.enisa.europa.eu/publications/pets-evolution-and-state-of-th… (Accessed 2019-11-17)
- ENISA (2018). A tool on Privacy Enhancing Technologies (PETs) knowledge management and maturity assessment. https://www.enisa.europa.eu/publications/pets-maturity-tool (Accessed 2019-11-17)
- EU Commission (2016). General data protection regulation. Official Journal of the European Union, 59, 1–88. https://gdpr-info.eu/ (Accessed 2019-11-19)
- Marimon, M., Gonzalez-Agirre, A., Intxaurrondo, A., Rodríguez, M., Martin, A. L., Villegas, M., and Krallinger, M. (2019). Automatic de-identification of medical texts in Spanish: the Meddocan track, corpus, guidelines, methods and evaluation of results. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019).
- Medlock, B. (2006). An introduction to NLP-based textual anonymisation. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC).
- Megyesi, B., Granstedt, L., Johansson, S., Prentice, J., Rosén, D., Schenström, C.-J., Sundberg, G., Wirén, M., and Volodina, E. (2018). Learner Corpus Anonymization in the Age of GDPR: Insights from the Creation of a Learner Corpus of Swedish. In Proceedings of the 7th NLP4CALL, Swedish Language Technology Conference, SLTC 2018, pages 47–56.
- MRC (2019). GDPR Guidance note 5: Identifiability, anonymisation and pseudonymisation. (Accessed 2019-11-22)
- Rock, F. E. (2001). Policy and practice in the anonymisation of linguistic data. International Journal of Corpus Linguistics, 6(1): 1–26.
- Volodina, E., Ali Mohammed, Y., Matsson, A., Derbring, S., and Megyesi, B. (Accepted). Towards Privacy by Design in Learner Corpora Research: A Case of On-the-fly Pseudonymization of Swedish Learner Essays. Proceedings of COLING-2020.