Towards the Anonymization of the Language Modeling

Antoine Boutet; Lucas Magnana; Juliette Sénéchal; Helain Zimmermann

TL;DR

The paper tackles the privacy risks of sharing language models trained on sensitive medical data by proposing privacy-by-design approaches that explicitly avoid memorizing direct and indirect identifiers. It introduces two privacy-preserving fine-tuning schemes: PPmlm-bert for masked language modeling and PPclm-gpt for causal language modeling. Both are guided by a blacklist of identifiers, built with named entity recognition (NER) for direct identifiers and a bipartite graph analysis for indirect identifiers, to enforce $k=2$ anonymity. In experiments on n2c2 medical datasets, the authors report favorable privacy-utility tradeoffs compared to baselines including pseudonymization and differential privacy, and they quantify resilience to membership inference attacks. The work argues that protecting both direct and indirect identifiers enables safer sharing of specialized clinical language models, with practical implications for GDPR/EDPB guidance and real-world medical AI deployment.
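The blacklist construction mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the whitespace tokenization, and the toy data are all assumptions, and the direct identifiers are passed in as a plain list where the paper uses an NER model. The only rule taken from the source is that a word pointed to by a single individual in the bipartite graph counts as an indirect identifier (here generalized to "fewer than `k` individuals", with `k=2`).

```python
from collections import defaultdict


def indirect_identifiers(docs_by_person, k=2):
    """Return the words used by fewer than k distinct individuals.

    docs_by_person maps a (hypothetical) person id to a list of documents.
    Tokenization is a naive lowercase whitespace split, for illustration only.
    """
    # One side of the bipartite graph: word -> set of individuals using it.
    people_by_word = defaultdict(set)
    for person, docs in docs_by_person.items():
        for doc in docs:
            for word in doc.lower().split():
                people_by_word[word].add(person)
    return {word for word, people in people_by_word.items() if len(people) < k}


def build_blacklist(docs_by_person, direct_identifiers, k=2):
    """Union of direct identifiers (assumed to come from NER) and indirect ones."""
    direct = {word.lower() for word in direct_identifiers}
    return direct | indirect_identifiers(docs_by_person, k)


# Toy corpus (invented for this sketch).
docs = {
    "p1": ["fever after trip to Zanzibar"],
    "p2": ["fever and cough"],
    "p3": ["cough after surgery"],
}
blacklist = build_blacklist(docs, direct_identifiers=["Smith"])
```

With `k=2`, every word left outside the blacklist appears in the documents of at least two individuals. A real pipeline would need proper tokenization and stop-word handling, since common words such as "to" end up blacklisted in a corpus this small.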

Abstract

Rapid advances in Natural Language Processing (NLP) have revolutionized many fields, including healthcare. However, these advances raise significant privacy concerns, especially when pre-trained models that are fine-tuned and specialized on sensitive data memorize and later regurgitate personal information. This paper presents a privacy-preserving language modeling approach that addresses the anonymization of language models and thus promotes their sharing. Specifically, we propose both a Masked Language Modeling (MLM) methodology to specialize a BERT-like language model and a Causal Language Modeling (CLM) methodology to specialize a GPT-like model, each of which prevents the model from memorizing direct and indirect identifying information present in the training data. We comprehensively evaluated our approaches on a medical dataset and compared them against several baselines. Our results indicate that, by avoiding the memorization of both direct and indirect identifiers during model specialization, our masked and causal language modeling schemes offer a good tradeoff between high privacy and high utility.

Paper Structure (32 sections, 1 equation, 17 figures)

Introduction
Background and related work
Natural Language Processing
Leveraging NLP for Privacy
Privacy Leakages
Extractable memorization
Membership inference
Mitigation strategies
Threat Model
Privacy-Preserving Language Modeling
Preprocessing: Building a Blacklist
Privacy-Preserving Masked Language Modeling
Privacy-Preserving Causal Language Modeling
Experimental Setup
Datasets
...and 17 more sections

Figures (17)

Figure 1: Masked language modeling (MLM) involves predicting words anywhere in the text and in any order, while causal language modeling (CLM) involves predicting words sequentially, from left to right.
Figure 2: Workflow used by the Hospices Civils de Lyon (HCL) for de-identifying clinical reports.
Figure 3: Detection of the identifiers: an off-the-shelf NER model detects the direct identifiers, while a bipartite graph between individuals and the words used in their documents makes it easy to identify indirect identifiers (i.e., words pointed to by only one individual).
Figure 4: During specialization, the model mainly memorizes the words that were masked for language modeling; an adversary can exploit this memorization to infer membership.
Figure 5: Cumulative distribution of the number of identifiers (both direct and indirect) and of words per patient: half of the patients have more than 20 indirect identifiers, and almost all patients have at least 3 indirect identifiers.
...and 12 more figures
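
The captions of Figures 1 and 4 suggest one way a blacklist could guide masked language modeling: if the model mainly memorizes the words chosen for masking, then never choosing a blacklisted word as a masking target should limit what it memorizes about identifiers. The sketch below illustrates that idea only. It is an assumption about the mechanism, not the paper's PPmlm-bert procedure, and the function names, the `[MASK]` string, and the example tokens are hypothetical. The default rate of 0.15 is the masking rate conventionally used for BERT.

```python
import random


def choose_mask_positions(tokens, blacklist, mask_rate=0.15, rng=None):
    """Pick positions to mask, skipping any token found in the blacklist."""
    rng = rng or random.Random()
    # Only non-blacklisted tokens are eligible as prediction targets.
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() not in blacklist]
    wanted = max(1, round(mask_rate * len(tokens)))
    return sorted(rng.sample(candidates, min(wanted, len(candidates))))


def apply_mask(tokens, positions, mask_token="[MASK]"):
    """Return (masked tokens, labels); labels are None where no prediction is asked."""
    chosen = set(positions)
    masked = [mask_token if i in chosen else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in chosen else None for i, tok in enumerate(tokens)]
    return masked, labels


# Invented example sentence and blacklist.
tokens = "patient Smith reports fever after trip to Zanzibar".split()
blacklist = {"smith", "trip", "zanzibar"}
positions = choose_mask_positions(tokens, blacklist, mask_rate=0.5, rng=random.Random(0))
masked, labels = apply_mask(tokens, positions)
```

In this sketch the blacklisted words stay visible in the input and are simply never used as labels. Whether identifiers should also be hidden from the context is a separate design decision that this example does not address.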
