Self-Refining Language Model Anonymizers via Adversarial Distillation

Kyuyoung Kim, Hyunjun Jeon, Jinwoo Shin

TL;DR

SEAL presents a distillation-based framework that trains small language models to both anonymize text and critique their outputs, enabling fully local privacy-preserving anonymization without external models. By simulating adversarial interactions and applying supervised fine-tuning followed by direct preference optimization, SEAL enables iterative self-refinement at inference time. Empirical results on SynthPAI show that 8B SEAL models achieve privacy-utility trade-offs comparable to GPT-4 anonymizers and can surpass them after self-refinement, while maintaining high readability. The approach promises practical privacy protection with low latency and data-control benefits, and the authors release a high-quality dataset to catalyze further research in local, privacy-aware NLP systems.

Abstract

Large language models (LLMs) are increasingly used in sensitive domains, where their ability to infer personal data from seemingly benign text introduces emerging privacy risks. While recent LLM-based anonymization methods help mitigate such risks, they often rely on proprietary models (e.g., GPT-4), raising concerns about cost and the potential exposure of sensitive data to untrusted external systems. To address this, we introduce SElf-refining Anonymization with Language model (SEAL), a novel distillation framework for training small language models (SLMs) to perform effective anonymization without relying on external models at inference time. SEAL leverages adversarial interactions between an LLM anonymizer and an inference model to collect trajectories of anonymized texts and inferred attributes, which are then used to distill anonymization and critique capabilities into SLMs through supervised fine-tuning and preference learning. The resulting models learn both to anonymize text and to evaluate their outputs, enabling iterative improvement of anonymization quality via self-refinement. Experiments on SynthPAI, a dataset of synthetic personal profiles and text comments, demonstrate that SLMs trained with SEAL achieve substantial improvements in anonymization capabilities. Notably, 8B models attain a privacy-utility trade-off comparable to that of the GPT-4 anonymizer and, with self-refinement, even surpass it in terms of privacy protection. These results highlight the effectiveness of our adversarial distillation framework for training SLMs as efficient anonymizers.
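The self-refinement loop described in the abstract (anonymize, infer what still leaks, rewrite again) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function `self_refine`, the `Step` record, and the three toy callables are hypothetical names introduced here, and the keyword-matching stand-ins replace what in SEAL would be calls to a single trained SLM acting as anonymizer, attribute-inference critic, and utility evaluator.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    """One refinement step: the rewritten text plus the critique of it."""
    text: str
    inferred: Dict[str, str]  # attribute name -> cue that still leaks it
    utility: float            # rough closeness to the original text


def self_refine(
    text: str,
    anonymize: Callable[[str, Dict[str, str]], str],
    infer_attributes: Callable[[str], Dict[str, str]],
    score_utility: Callable[[str, str], float],
    max_iters: int = 3,
) -> List[Step]:
    """Rewrite `text` until the critic infers nothing or the budget runs out."""
    original, current = text, text
    inferred = infer_attributes(current)
    trajectory: List[Step] = []
    for _ in range(max_iters):
        if not inferred:  # nothing left to infer, so stop early
            break
        current = anonymize(current, inferred)
        inferred = infer_attributes(current)
        trajectory.append(Step(current, dict(inferred), score_utility(original, current)))
    return trajectory


# --- Toy stand-ins (assumptions for illustration only) ---
CUES = {"Zurich": "location", "nurse": "occupation"}
GENERIC = {"Zurich": "a city", "nurse": "healthcare worker"}


def toy_infer(text: str) -> Dict[str, str]:
    # "Infer" an attribute whenever its cue word appears verbatim.
    return {attr: cue for cue, attr in CUES.items() if cue in text}


def toy_anonymize(text: str, inferred: Dict[str, str]) -> str:
    # Generalize only one leaked cue per call so several rounds are needed.
    attr = sorted(inferred)[0]
    cue = inferred[attr]
    return text.replace(cue, GENERIC[cue])


def toy_utility(original: str, rewritten: str) -> float:
    # Fraction of original words that survive in the rewritten text.
    orig_words = original.split()
    kept_words = set(rewritten.split())
    return sum(1 for w in orig_words if w in kept_words) / len(orig_words)


trajectory = self_refine(
    "I work as a nurse in Zurich", toy_anonymize, toy_infer, toy_utility
)
for step in trajectory:
    print(step.text, step.inferred, round(step.utility, 2))
```

Keeping the three roles as separate callables makes the loop easy to test in isolation. In the setting the abstract describes, all three would be served by the same distilled model, which is what removes the need for an external model at inference time.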

Paper Structure

This paper contains 55 sections, 7 equations, 5 figures, 9 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of SEAL. We simulate adversarial anonymization with LLMs to generate trajectories of anonymized texts, inferred private attributes, and utility feedback. These trajectories are used to distill anonymization and critique capabilities into SLMs via supervised fine-tuning and preference learning, enabling effective anonymization through iterative self-refinement.
  • Figure 2: Sample instructions for supervised fine-tuning. Self-refining anonymizers are trained on anonymization (left), attribute inference (middle), and utility evaluation (right) tasks. By learning to both anonymize and evaluate their outputs, the models iteratively improve anonymization quality.
  • Figure 3: Privacy-utility comparison with adversarial anonymization. Llama3-8B with SEAL outperforms adversarial anonymization baselines, achieving stronger trade-offs across self-refinement iterations on the main dataset.
  • Figure 4: Privacy-utility comparison across models. Larger models, e.g., the two 8B models, achieve the best trade-offs, while the 3B and 4B models attain reasonable performance but plateau earlier in self-refinement.
  • Figure 5: Number of texts with inferable PII across anonymization iterations. The slower decline in the hard eval dataset illustrates its increased difficulty. After six iterations, only 14.3% of texts in the main set remain inferable, compared to a much larger 40.4% in the hard set.