Self-Refining Language Model Anonymizers via Adversarial Distillation
Kyuyoung Kim, Hyunjun Jeon, Jinwoo Shin
TL;DR
SEAL is a distillation-based framework that trains small language models to both anonymize text and critique their own outputs, enabling fully local, privacy-preserving anonymization without external models at inference time. Training data is collected by simulating adversarial interactions, and the small models are trained with supervised fine-tuning followed by direct preference optimization, which enables iterative self-refinement at inference time. On SynthPAI, 8B SEAL models achieve privacy-utility trade-offs comparable to GPT-4 anonymizers and can surpass them after self-refinement, while maintaining high readability. The approach promises practical privacy protection with low latency and local control over data, and the authors release a high-quality dataset to support further research on local, privacy-aware NLP systems.
Abstract
Large language models (LLMs) are increasingly used in sensitive domains, where their ability to infer personal data from seemingly benign text introduces emerging privacy risks. While recent LLM-based anonymization methods help mitigate such risks, they often rely on proprietary models (e.g., GPT-4), raising concerns about cost and the potential exposure of sensitive data to untrusted external systems. To address this, we introduce SElf-refining Anonymization with Language model (SEAL), a novel distillation framework for training small language models (SLMs) to perform effective anonymization without relying on external models at inference time. SEAL leverages adversarial interactions between an LLM anonymizer and an inference model to collect trajectories of anonymized texts and inferred attributes, which are then used to distill anonymization and critique capabilities into SLMs through supervised fine-tuning and preference learning. The resulting models learn both to anonymize text and to evaluate their outputs, enabling iterative improvement of anonymization quality via self-refinement. Experiments on SynthPAI, a dataset of synthetic personal profiles and text comments, demonstrate that SLMs trained with SEAL achieve substantial improvements in anonymization capabilities. Notably, 8B models attain a privacy-utility trade-off comparable to that of the GPT-4 anonymizer and, with self-refinement, even surpass it in terms of privacy protection. These results highlight the effectiveness of our adversarial distillation framework for training SLMs as efficient anonymizers.
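The self-refinement loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `self_refine`, `Step`, `toy_anonymize`, and `toy_critique` are hypothetical names, the keyword-based stubs merely stand in for calls to a trained SLM, and the stopping rule (stop once the critique infers no attribute, or after `max_rounds`) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical interfaces (assumptions, not from the paper):
# an anonymizer rewrites text given the attributes a critique could infer,
# and a critique maps text to the attributes it can still infer.
Anonymize = Callable[[str, Dict[str, str]], str]
Critique = Callable[[str], Dict[str, str]]


@dataclass
class Step:
    text: str                 # text at this round
    inferred: Dict[str, str]  # attribute -> cue the critique still finds


def self_refine(text: str, anonymize: Anonymize, critique: Critique,
                max_rounds: int = 3) -> List[Step]:
    """Alternate critique and anonymization until nothing is inferable."""
    trajectory = [Step(text, critique(text))]
    for _ in range(max_rounds):
        current = trajectory[-1]
        if not current.inferred:  # assumed stopping rule
            break
        revised = anonymize(current.text, current.inferred)
        trajectory.append(Step(revised, critique(revised)))
    return trajectory


# Toy stand-ins for the model calls: a fixed table of leaking cues.
LEAKS = {"Zurich": "location", "nurse": "occupation"}


def toy_critique(text: str) -> Dict[str, str]:
    return {attr: word for word, attr in LEAKS.items() if word in text}


def toy_anonymize(text: str, inferred: Dict[str, str]) -> str:
    # Generalize one leaked cue per round to mimic gradual refinement.
    attr, word = next(iter(inferred.items()))
    return text.replace(word, f"[{attr}]")


if __name__ == "__main__":
    steps = self_refine("I work as a nurse in Zurich.",
                        toy_anonymize, toy_critique)
    for i, step in enumerate(steps):
        print(i, step.text, step.inferred)
```

In a real system both callables would be prompts to the same fine-tuned model, and the recorded trajectory is the kind of sequence of anonymized texts and inferred attributes that the abstract says is collected for distillation.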
