With Privacy, Size Matters: On the Importance of Dataset Size in Differentially Private Text Rewriting
Stephen Meisenbacher, Florian Matthes
TL;DR
Dataset size critically influences the privacy-utility trade-off in differentially private text rewriting. The authors conduct large-scale experiments across four datasets and multiple splits using three local DP rewriting mechanisms, introducing the gamma metric and an indistinguishability test to quantify privacy and utility at realistic scale. They demonstrate that larger datasets generally improve utility while maintaining or enhancing privacy protection, though the gains depend on the mechanism and the task. The work underscores the need for scale-aware evaluation in DP NLP to enable practical deployment, and it provides a foundation for future benchmarks and mechanism development at scale.
Abstract
Recent work in Differential Privacy with Natural Language Processing (DP NLP) has proposed numerous promising techniques in the form of text rewriting mechanisms. In the evaluation of these mechanisms, an often-ignored aspect is dataset size, or more precisely, the effect of dataset size on a mechanism's efficacy for utility and privacy preservation. In this work, we are the first to introduce this factor in the evaluation of DP text privatization, designing utility and privacy tests on large-scale datasets with dynamic split sizes. We run these tests on datasets of varying size with up to one million texts, and we focus on quantifying the effect of increasing dataset size on the privacy-utility trade-off. Our findings reveal that dataset size plays an integral part in evaluating DP text rewriting mechanisms; additionally, these findings call for more rigorous evaluation procedures in DP NLP and shed light on the future of DP NLP in practice and at scale.
