DP-MLM: Differentially Private Text Rewriting Using Masked Language Models

Stephen Meisenbacher; Maulik Chevli; Juraj Vladika; Florian Matthes

TL;DR

This work addresses privacy-preserving text processing under differential privacy (DP) by framing text privatization as a rewriting task. It introduces DP-MLM, a framework that uses encoder-only masked language models to rewrite a text one token at a time, with the original sentence supplied as context. Each token replacement satisfies $\varepsilon$-DP, so a sentence of $n$ tokens satisfies $n\varepsilon$-DP by sequential composition. Empirical results on GLUE show that DP-MLM often yields higher utility than the prior state-of-the-art paraphrasing and prompting methods, while privacy evaluations on Trustpilot and Yelp show meaningful protection against adversarial attribute inference. The approach offers a practical, customizable alternative to decoder-based generation; the implementation is publicly released, and the authors identify variable-length outputs and a broader range of base models as directions for extension.

Abstract

The task of text privatization using Differential Privacy has recently taken the form of $\textit{text rewriting}$, in which an input text is obfuscated via the use of generative (large) language models. While these methods have shown promising results in the ability to preserve privacy, these methods rely on autoregressive models which lack a mechanism to contextualize the private rewriting process. In response to this, we propose $\textbf{DP-MLM}$, a new method for differentially private text rewriting based on leveraging masked language models (MLMs) to rewrite text in a semantically similar $\textit{and}$ obfuscated manner. We accomplish this with a simple contextualization technique, whereby we rewrite a text one token at a time. We find that utilizing encoder-only MLMs provides better utility preservation at lower $\varepsilon$ levels, as compared to previous methods relying on larger models with a decoder. In addition, MLMs allow for greater customization of the rewriting mechanism, as opposed to generative approaches. We make the code for $\textbf{DP-MLM}$ public and reusable, found at https://github.com/sjmeis/DPMLM .
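The token-by-token rewriting described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the standard exponential-mechanism reading of temperature sampling, in which logits are clipped to a bounded range `[clip_min, clip_max]`, the sensitivity is taken as `clip_max - clip_min`, and the temperature is set to `2 * sensitivity / epsilon`. The function `toy_logits` is a hypothetical stand-in for a masked language model's forward pass, and the clipping bounds and vocabulary are arbitrary choices for demonstration.

```python
import math
import random


def dp_sample(logits, epsilon, clip_min, clip_max, rng):
    """Sample one index with the exponential mechanism over clipped logits.

    Assumption: clipping bounds every score to [clip_min, clip_max], so the
    sensitivity is (clip_max - clip_min) and T = 2 * sensitivity / epsilon.
    """
    sensitivity = clip_max - clip_min
    temperature = 2.0 * sensitivity / epsilon
    clipped = [min(max(v, clip_min), clip_max) for v in logits]
    scaled = [v / temperature for v in clipped]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs


def toy_logits(context, position, vocab):
    """Hypothetical scorer standing in for an MLM: favors the original token,
    then words sharing its first letter. A real model would score the masked
    position given the surrounding context."""
    target = context[position]
    return [
        5.0 if word == target else (2.0 if word[0] == target[0] else 0.0)
        for word in vocab
    ]


def rewrite(tokens, vocab, epsilon, clip_min=0.0, clip_max=5.0, seed=0):
    """Replace each token independently; return the rewrite and total budget."""
    rng = random.Random(seed)
    output = []
    for i in range(len(tokens)):
        logits = toy_logits(tokens, i, vocab)
        index, _ = dp_sample(logits, epsilon, clip_min, clip_max, rng)
        output.append(vocab[index])
    # Sequential composition: n tokens at epsilon each cost n * epsilon.
    total_epsilon = epsilon * len(tokens)
    return output, total_epsilon


if __name__ == "__main__":
    vocab = ["the", "food", "was", "great", "fine", "meal", "that", "good"]
    rewritten, budget = rewrite(["the", "food", "was", "great"], vocab, epsilon=10.0)
    print(rewritten, budget)
```

A smaller `epsilon` raises the temperature and flattens the sampling distribution, so replacements drift further from the original tokens. The per-sentence budget grows linearly with the number of tokens, which is the $n\varepsilon$ cost mentioned in the TL;DR.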

Paper Structure

This paper contains 37 sections, 1 theorem, 11 equations, 2 figures, 8 tables, and 3 algorithms.

Introduction
Related Work
Foundations
Masked Language Modeling
Differential Privacy
Temperature Sampling as an Exponential Mechanism
Method
DP Masked Token Prediction
Rewriting Mechanism
Privacy Guarantees
Extending guarantees to a sentence
Experimental Setup
Utility Experiments
Utility Benchmarking
Model
...and 22 more sections

Key Result

Theorem 1

The proposed mechanism $M$, defined in Equation (eq:mech_prob), satisfies $\varepsilon$-LDP (local differential privacy).
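The paper's proof is not reproduced on this page. As a rough guide to why a result of this form holds, the textbook exponential-mechanism argument is sketched here. The symbols are assumptions introduced for illustration: $u(x, y)$ is a bounded score (for example, a clipped logit) for output token $y$ given input $x$, $\Delta$ is its sensitivity, $\mathcal{Y}$ is the output vocabulary, and $M$ samples $y$ with probability proportional to $\exp\left(\varepsilon\, u(x, y) / (2\Delta)\right)$. For any two inputs $x, x'$ and any output $y$:

$$
\frac{\Pr[M(x) = y]}{\Pr[M(x') = y]}
= \exp\!\left(\frac{\varepsilon\,\bigl(u(x, y) - u(x', y)\bigr)}{2\Delta}\right)
\cdot
\frac{\sum_{y' \in \mathcal{Y}} \exp\!\left(\frac{\varepsilon\, u(x', y')}{2\Delta}\right)}{\sum_{y' \in \mathcal{Y}} \exp\!\left(\frac{\varepsilon\, u(x, y')}{2\Delta}\right)}
\le e^{\varepsilon/2} \cdot e^{\varepsilon/2}
= e^{\varepsilon}.
$$

The first factor is at most $e^{\varepsilon/2}$ because $|u(x, y) - u(x', y)| \le \Delta$. The second is at most $e^{\varepsilon/2}$ because each term in the numerator exceeds the matching term in the denominator by at most that factor. Under these assumptions, sampling from a softmax with temperature $T = 2\Delta/\varepsilon$ is the same mechanism.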

Figures (2)

Figure 1: An example of Differentially Private Text Rewriting using Masked Language Models (DP-MLM). The left side shows a real example without contextualization, and the right shows the same example with contextualization. As can be seen, providing a concatenated context sentence (the original sentence) guides the private rewriting process to be more semantically similar than if performed without contextualization.
Figure 2: Average Utility Loss. This graph depicts the average utility loss for a given $\varepsilon$ value across four GLUE tasks. On average, DP-MLM leads to a lower utility loss than DP-Paraphrase or DP-Prompt.

Theorems & Definitions (2)

Theorem 1
proof
