On the Impact of Noise in Differentially Private Text Rewriting

Stephen Meisenbacher, Maulik Chevli, Florian Matthes

TL;DR

This work investigates the impact of differential privacy (DP) noise on text rewriting by introducing PrivFill, a sentence-infilling privatization mechanism that can operate both with and without DP guarantees. PrivFill is trained on large-scale infilling data from Wikipedia and Common Crawl and is evaluated against the DP baselines DP-BART and DP-Prompt on utility tasks (arXiv, BBC, DocNLI) and privacy tasks (Trustpilot, Yelp, Enron) to quantify the privacy-utility trade-off. The results show that non-DP PrivFill generally preserves utility and semantic coherence better than DP methods, though DP approaches offer stronger empirical privacy, especially against adaptive attackers. The paper argues for rethinking the reliance on DP in NLP, highlighting the value of non-DP privatization and the need for more usable, explainable privacy mechanisms. PrivFill is released as open source for further research.

Abstract

The field of text privatization often leverages the notion of $\textit{Differential Privacy}$ (DP) to provide formal guarantees in the rewriting or obfuscation of sensitive textual data. A common and nearly ubiquitous form of DP application necessitates the addition of calibrated noise to vector representations of text, either at the data- or model-level, which is governed by the privacy parameter $\varepsilon$. However, noise addition almost undoubtedly leads to considerable utility loss, thereby highlighting one major drawback of DP in NLP. In this work, we introduce a new sentence infilling privatization technique, and we use this method to explore the effect of noise in DP text rewriting. We empirically demonstrate that non-DP privatization techniques excel in utility preservation and can find an acceptable empirical privacy-utility trade-off, yet cannot outperform DP methods in empirical privacy protections. Our results highlight the significant impact of noise in current DP rewriting mechanisms, leading to a discussion of the merits and challenges of DP in NLP, as well as the opportunities that non-DP methods present.

Paper Structure

This paper contains 45 sections, 3 figures, 9 tables, 2 algorithms.

Figures (3)

  • Figure 1: PrivFill: a sentence infilling text rewriting method that leverages generative LMs to create privatized versions of input documents. For each sentence in an input document, the sentence infilling model is tasked with finding a suitable replacement, and these replacement predictions are concatenated to form a privatized rewritten document. (Pictured: example from the Yelp dataset using PrivFill with Flan-T5-base.)
  • Figure 2: Samples from our infilling train dataset (target infilling text in green, [sep] token inserted for readability).
  • Figure 3: Cosine Similarity (CS) of rewritten texts vs. number of sentences in the original document.
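
The sentence-by-sentence procedure described in the Figure 1 caption can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the `[blank]` placeholder, the regex sentence splitter, the choice to keep the original neighbouring sentences as context, and the `toy_infill` stand-in (used in place of a generative LM such as Flan-T5-base) are all assumptions made for the example.

```python
import re
from typing import Callable, List

MASK = "[blank]"  # hypothetical placeholder token, not taken from the paper


def split_sentences(document: str) -> List[str]:
    # Naive splitter (assumption): break after '.', '!' or '?' plus whitespace.
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def rewrite_document(document: str, infill: Callable[[str], str]) -> str:
    """Replace every sentence with an infilled prediction and concatenate."""
    sentences = split_sentences(document)
    replacements = []
    for i in range(len(sentences)):
        # Mask sentence i; the remaining original sentences serve as context
        # (assumption: the paper may construct the context differently).
        context = " ".join(sentences[:i] + [MASK] + sentences[i + 1:])
        replacements.append(infill(context))
    return " ".join(replacements)


def toy_infill(context: str) -> str:
    # Stand-in for a generative infilling model; it only reports how many
    # context words it saw, so the pipeline runs without any ML dependency.
    n_words = len(context.replace(MASK, "").split())
    return f"A sentence fitting {n_words} words of context."


if __name__ == "__main__":
    print(rewrite_document("The food was great. I will come back!", toy_infill))
```

In practice, `infill` would wrap a call to a trained infilling model. Passing it in as a callable keeps the document-level loop independent of whichever model, DP or non-DP, produces the replacement sentences.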