DP-BART for Privatized Text Rewriting under Local Differential Privacy
Timour Igamberdiev, Ivan Habernal
TL;DR
DP-BART tackles privatized text rewriting under local differential privacy (LDP) by combining clipping by value, neuron pruning, and additional noisy training to substantially reduce the noise required for privacy guarantees. Built on a pre-trained BART model, the approach mitigates the extreme sensitivity and resource demands that hinder transformer-based LDP, achieving improved downstream task performance under privacy constraints, especially with the DP-BART-PR+ variant. Experiments across five datasets show that pruning-driven dimensionality reduction yields meaningful privacy-utility gains, though the strict text adjacency constraint in LDP still imposes notable utility losses at low privacy budgets. The work highlights practical implications of the privacy-utility trade-off in text rewriting, discusses limitations such as domain mismatch and adjacency-induced noise, and points to future directions such as larger-scale pre-training and domain-specific privacy settings.
Abstract
Privatized text rewriting with local differential privacy (LDP) is a recent approach that enables sharing of sensitive textual documents while formally guaranteeing privacy protection to individuals. However, existing systems face several issues, such as formal mathematical flaws, unrealistic privacy guarantees, privatization of only individual words, as well as a lack of transparency and reproducibility. In this paper, we propose a new system, 'DP-BART', that largely outperforms existing LDP systems. Our approach uses a novel clipping method, iterative pruning, and further training of internal representations, which together drastically reduce the amount of noise required for DP guarantees. We run experiments on five textual datasets of varying sizes, rewriting them at different privacy guarantees and evaluating the rewritten texts on downstream text classification tasks. Finally, we thoroughly discuss the privatized text rewriting approach and its limitations, including the problem of the strict text adjacency constraint in the LDP paradigm that leads to the high noise requirement.
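
To make the link between clipping, pruning, and the noise requirement concrete, the following is a minimal sketch, not the authors' implementation. It assumes a generic Laplace mechanism applied to a representation vector whose entries are clipped by value to [c_min, c_max]. Under that assumption, any two clipped vectors differ by at most n * (c_max - c_min) in L1 norm, so the noise scale grows linearly with the number of dimensions n, and keeping fewer dimensions lowers it proportionally. The function names, the dimension counts (768 and 256), the clipping bounds, and the value of epsilon are all hypothetical and chosen only for illustration.

```python
import math
import random


def l1_sensitivity(n_dims, c_min, c_max):
    """Worst-case L1 distance between two vectors clipped to [c_min, c_max]."""
    return n_dims * (c_max - c_min)


def laplace_scale(n_dims, c_min, c_max, epsilon):
    """Laplace noise scale b = sensitivity / epsilon."""
    return l1_sensitivity(n_dims, c_min, c_max) / epsilon


def sample_laplace(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two i.i.d.
    exponential variables, each with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize(vec, c_min, c_max, epsilon, rng):
    """Clip each entry by value, then add independent Laplace noise."""
    clipped = [min(max(v, c_min), c_max) for v in vec]
    scale = laplace_scale(len(clipped), c_min, c_max, epsilon)
    return [v + sample_laplace(scale, rng) for v in clipped]


# Hypothetical settings: a 768-dimensional vector versus a pruned one
# that keeps only 256 dimensions, with the same bounds and epsilon.
full_scale = laplace_scale(768, -0.1, 0.1, epsilon=10.0)
pruned_scale = laplace_scale(256, -0.1, 0.1, epsilon=10.0)

rng = random.Random(0)
noisy = privatize([0.5, -0.3, 0.02, 0.0], -0.1, 0.1, epsilon=10.0, rng=rng)
```

In this sketch, keeping a third of the dimensions cuts the noise scale to a third at the same epsilon, which is the intuition behind pruning as a way to lower the noise needed for a given privacy guarantee. The sketch does not model the additional training described in the abstract.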
