Spend Your Budget Wisely: Towards an Intelligent Distribution of the Privacy Budget in Differentially Private Text Rewriting
Stephen Meisenbacher, Chaeeun Joy Lee, Florian Matthes
TL;DR
This paper tackles the problem of distributing a fixed differential privacy (DP) budget across the tokens of a document in differentially private text rewriting. It introduces a Budget Distribution Toolkit that combines five token-level scoring methods (Information Content, POS informativeness, Named Entity Recognition, Word Importance, and Sentence Difference) and uses a linear-constraint framework to assign per-token budgets that sum to the total budget $\varepsilon$, so that the overall guarantee follows from DP composition. The authors implement two DP rewriting mechanisms (1-Diffractor and DP-BART) and evaluate privacy and utility on multiple datasets, showing that informed budget allocation often improves empirical privacy against attribute- and membership-inference attacks, at the cost of reduced utility in many settings. The work highlights the complexity of privacy in textual data and provides a foundation for future refinements in budget distribution, with the aim of balancing privacy protection against practical utility in real-world text sharing scenarios.
Abstract
$\textit{Differentially Private Text Rewriting}$ refers to a class of text privatization techniques in which (sensitive) input documents are $\textit{rewritten}$ under Differential Privacy (DP) guarantees. The motivation behind such methods is to hide both explicit and implicit identifiers that a text may contain, while retaining the semantic meaning of the original and thus preserving utility. Recent years have seen an uptick in research output in this field, offering a diverse array of word-, sentence-, and document-level DP rewriting methods. Common to these methods is the selection of a privacy budget (i.e., the $\varepsilon$ parameter), which governs the degree to which a text is privatized. One major limitation of previous works, stemming directly from the unique structure of language itself, is the lack of consideration of $\textit{where}$ the privacy budget should be allocated, as not all aspects of language, and therefore of text, are equally sensitive or personal. In this work, we are the first to address this shortcoming, asking how a given privacy budget can be intelligently and sensibly distributed across a target document. We construct and evaluate a toolkit of linguistics- and NLP-based methods for allocating a privacy budget to the constituent tokens of a text document. In a series of privacy and utility experiments, we empirically demonstrate that, given the same privacy budget, intelligent distribution leads to higher privacy levels and more positive trade-offs than a naive distribution of $\varepsilon$. Our work highlights the intricacies of text privatization with DP, and it calls for further work on finding more efficient ways to maximize the privatization benefits offered by DP in text rewriting.
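To make the contrast between a naive and an informed distribution of $\varepsilon$ concrete, the following is a minimal sketch and not the authors' toolkit. It assumes each token already has a sensitivity score in [0, 1] (for example, from one of the scoring methods named in the TL;DR) and splits a total budget so that more sensitive tokens receive a smaller share, which means more noise. The function names, the `min_share` floor, and the proportional weighting rule are all hypothetical choices made for illustration.

```python
from typing import List


def naive_budget(n_tokens: int, total_epsilon: float) -> List[float]:
    """Uniform baseline: every token receives the same share of the budget."""
    if n_tokens <= 0:
        return []
    return [total_epsilon / n_tokens] * n_tokens


def distribute_budget(
    scores: List[float],
    total_epsilon: float,
    min_share: float = 0.1,  # hypothetical floor so no token gets a zero budget
) -> List[float]:
    """Split total_epsilon across tokens, inversely to their sensitivity.

    scores: assumed per-token sensitivity in [0, 1]; higher means more sensitive.
    Returns per-token epsilons that sum to total_epsilon.
    """
    if total_epsilon <= 0:
        raise ValueError("total_epsilon must be positive")
    if not scores:
        return []
    # Clamp scores to [0, 1]; less sensitive tokens get a larger weight.
    weights = [min_share + (1.0 - min(max(s, 0.0), 1.0)) for s in scores]
    total_weight = sum(weights)
    # Normalize so the per-token budgets add up to the total budget.
    return [total_epsilon * w / total_weight for w in weights]


if __name__ == "__main__":
    tokens = ["Alice", "lives", "in", "Munich"]
    scores = [0.9, 0.2, 0.05, 0.8]  # made-up scores, for illustration only
    for tok, eps in zip(tokens, distribute_budget(scores, total_epsilon=4.0)):
        print(f"{tok}: epsilon = {eps:.3f}")
```

Because the per-token budgets sum to the same total in both functions, the two allocations are comparable under sequential composition; only where the budget is spent differs.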
