A Survey on Current Trends and Recent Advances in Text Anonymization

Tobias Deußer, Lorenz Sparrenberg, Armin Berger, Max Hahnbück, Christian Bauckhage, Rafet Sifa

TL;DR

This survey addresses the privacy risks and regulatory pressures that come with sharing textual data, and offers a comprehensive view spanning foundational NER-based anonymization to modern LLM-based approaches. It synthesizes domain-specific challenges (healthcare, legal, finance, education), advanced privacy-preserving methodologies (including differential privacy), and authorship anonymization, and it details evaluation frameworks and practical toolkits. Highlights include coverage of TAB as a standardized benchmark, analyses of LLMs as both anonymizers and potential attackers, and a roadmap toward robust, resource-efficient, and attacker-aware anonymization across domains. The work emphasizes the evolving privacy-utility trade-off and outlines practical directions for researchers and practitioners deploying effective, trustworthy text anonymization in real-world settings.

Abstract

The proliferation of textual data containing sensitive personal information across various domains requires robust anonymization techniques to protect privacy and comply with regulations, while preserving data usability for diverse and crucial downstream tasks. This survey provides a comprehensive overview of current trends and recent advances in text anonymization techniques. We begin by discussing foundational approaches, primarily centered on Named Entity Recognition, before examining the transformative impact of Large Language Models, detailing their dual role as sophisticated anonymizers and potent de-anonymization threats. The survey further explores domain-specific challenges and tailored solutions in critical sectors such as healthcare, law, finance, and education. We investigate advanced methodologies incorporating formal privacy models and risk-aware frameworks, and address the specialized subfield of authorship anonymization. Additionally, we review evaluation frameworks, comprehensive metrics, benchmarks, and practical toolkits for real-world deployment of anonymization solutions. This review consolidates current knowledge, identifies emerging trends and persistent challenges, including the evolving privacy-utility trade-off, the need to address quasi-identifiers, and the implications of LLM capabilities, and aims to guide future research directions for both academics and practitioners in this field.

Paper Structure

This paper contains 24 sections.