Evaluating the Efficacy of AI Techniques in Textual Anonymization: A Comparative Study
Dimitris Asimopoulos, Ilias Siniosoglou, Vasileios Argyriou, Sotirios K. Goudos, Konstantinos E. Psannis, Nikoleta Karditsioti, Theocharis Saoulidis, Panagiotis Sarigiannidis
TL;DR
This paper addresses text anonymisation for privacy-preserving data sharing by comparing CRF, LSTM, ELMo, and Transformers on NER-based anonymisation tasks. Using a dedicated NER dataset with per-token annotations, the study evaluates each architecture's ability to identify and mask sensitive entities while preserving data utility. Results indicate CRF and Transformers achieve the highest overall performance (F1 ~0.93), with LSTM close behind and ELMo lagging on strict anonymisation criteria. The work highlights the potential of transformer-based approaches in contemporary anonymisation settings and suggests integrating strengths of multiple models for robust protection.
Abstract
In the digital era, with escalating privacy concerns, it is imperative to devise robust strategies that protect private data while maintaining the intrinsic value of textual information. This research undertakes a comprehensive examination of text anonymisation methods, focusing on Conditional Random Fields (CRF), Long Short-Term Memory (LSTM), Embeddings from Language Models (ELMo), and the Transformers architecture. Each model has distinct strengths: LSTM models long-term dependencies, CRF captures dependencies within word sequences, ELMo delivers contextual word representations using deep bidirectional language models, and Transformers introduce self-attention mechanisms that provide enhanced scalability. Our study is positioned as a comparative analysis of these models, emphasising their synergistic potential in addressing text anonymisation challenges. Preliminary results indicate that CRF, LSTM, and ELMo individually outperform traditional methods. The inclusion of Transformers, compared alongside the other models, offers a broader perspective on achieving optimal text anonymisation in contemporary settings.
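To make the NER-based anonymisation setting concrete, the following minimal sketch shows how per-token entity predictions could be turned into masked text. It is an illustration only, not the authors' pipeline: the function name `mask_entities`, the BIO tagging scheme, the entity labels (`PER`, `LOC`), and the bracketed placeholder format are all assumptions. Any of the compared models (CRF, LSTM, ELMo, Transformers) would play the role of the tagger that produces the `tags` list.

```python
from typing import List


def mask_entities(tokens: List[str], tags: List[str]) -> List[str]:
    """Replace each tagged entity span with a single placeholder.

    Assumes BIO-style tags such as "B-PER", "I-PER", and "O"
    (a hypothetical label set chosen for illustration).
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    masked: List[str] = []
    prev_type = None  # entity type of the span currently being masked, if any
    for token, tag in zip(tokens, tags):
        if tag == "O":
            # Non-sensitive token: keep it to preserve data utility.
            masked.append(token)
            prev_type = None
            continue
        prefix, _, ent_type = tag.partition("-")
        # Start a new placeholder on "B-" or when the entity type changes;
        # continuation tokens of the same span are dropped.
        if prefix == "B" or ent_type != prev_type:
            masked.append(f"[{ent_type}]")
        prev_type = ent_type
    return masked


if __name__ == "__main__":
    tokens = ["Maria", "Papadopoulou", "visited", "Athens", "in", "May"]
    tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "O"]
    print(" ".join(mask_entities(tokens, tags)))
```

Collapsing a multi-token span into one placeholder is a design choice in this sketch: it hides the length of the original entity while keeping its type, which is one way to trade off privacy against utility.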
