Benchmarking Advanced Text Anonymisation Methods: A Comparative Study on Novel and Traditional Approaches

Dimitris Asimopoulos, Ilias Siniosoglou, Vasileios Argyriou, Thomai Karamitsou, Eleftherios Fountoukidis, Sotirios K. Goudos, Ioannis D. Moscholios, Konstantinos E. Psannis, Panagiotis Sarigiannidis

TL;DR

This paper tackles the problem of anonymising text while preserving utility by benchmarking multiple approaches for PII/NER tasks on the CoNLL-2003 dataset. It compares traditional sequence-labelling methods (CRF, LSTM, ELMo), transformer-based models (BERT, ELECTRA, a custom Transformer), the Microsoft Presidio tool, and a fine-tuned LLM (GPT-2). The study finds that the custom Transformer and CRF deliver the strongest performance, with transformer models generally outperforming traditional baselines, while Presidio and GPT-2 provide robust but slightly weaker results. The results offer practical guidance on model selection for anonymisation and suggest ensemble strategies as a promising future direction.

Abstract

In the realm of data privacy, the ability to effectively anonymise text is paramount. With the proliferation of deep learning and, in particular, transformer architectures, there is a burgeoning interest in leveraging these advanced models for text anonymisation tasks. This paper presents a comprehensive benchmarking study comparing the performance of transformer-based models and Large Language Models (LLMs) against traditional architectures for text anonymisation. Utilising the CoNLL-2003 dataset, known for its robustness and diversity, we evaluate several models. Our results showcase the strengths and weaknesses of each approach, offering a clear perspective on the efficacy of modern versus traditional methods. Notably, while modern models exhibit advanced capabilities in capturing contextual nuances, certain traditional architectures still achieve high performance. This work aims to guide researchers in selecting the most suitable model for their anonymisation needs, while also shedding light on potential paths for future advancements in the field.
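
To make the NER-based framing concrete, the following is a minimal sketch of how token-level entity tags could be turned into anonymised text. It is not the authors' pipeline: the function name `anonymise`, the bracketed placeholder format, and the assumption that tags arrive in BIO form with CoNLL-2003-style types (PER, ORG, LOC, MISC) are all illustrative choices. Any of the benchmarked taggers could in principle supply the `tags` list.

```python
from typing import List


def anonymise(tokens: List[str], tags: List[str]) -> str:
    """Replace each tagged entity span with a single placeholder such as [PER].

    Hypothetical helper: assumes BIO-style tags ("B-PER", "I-PER", "O").
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    out: List[str] = []
    prev_type = None  # entity type of the span currently being consumed
    for token, tag in zip(tokens, tags):
        if tag == "O":
            # Non-entity tokens are kept verbatim.
            out.append(token)
            prev_type = None
            continue
        prefix, _, ent_type = tag.partition("-")
        # Emit one placeholder at the start of a span: either an explicit
        # B- tag or a change of entity type relative to the previous token.
        if prefix == "B" or ent_type != prev_type:
            out.append(f"[{ent_type}]")
        prev_type = ent_type
    return " ".join(out)


tokens = ["John", "Smith", "works", "at", "Acme", "Corp", "in", "Berlin", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "O", "B-LOC", "O"]
print(anonymise(tokens, tags))  # [PER] works at [ORG] in [LOC] .
```

Collapsing a multi-token span into one placeholder hides the length of the original entity; a per-token replacement would be an equally valid design if token alignment must be preserved.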

Paper Structure (16 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Anonymisation Pipeline
  • Figure 2: Presidio Anonymisation
  • Figure 3: Performance of Traditional Models
  • Figure 4: Performance of Transformers Models
  • Figure 5: Overall Performance of models