DP-BART for Privatized Text Rewriting under Local Differential Privacy

Timour Igamberdiev, Ivan Habernal

TL;DR

DP-BART tackles privatized text rewriting under local differential privacy by integrating clipping by value, neuron pruning, and additional noisy training to substantially reduce the noise required for privacy guarantees. Built on a pre-trained BART model, the approach mitigates the extreme sensitivity and resource demands that hinder transformer-based LDP, achieving improved downstream task performance under privacy constraints, especially with the DP-BART-PR+ variant. Experiments across five datasets reveal that pruning-driven dimensionality reduction yields meaningful privacy/utility gains, though the strict text-adjacency in LDP still imposes notable utility losses at low privacy budgets. The work highlights practical implications for privacy-utility trade-offs in text rewriting, discusses limitations such as domain mismatch and adjacency-induced noise, and points to future directions like larger-scale pre-training and domain-specific privacy settings.

Abstract

Privatized text rewriting with local differential privacy (LDP) is a recent approach that enables sharing of sensitive textual documents while formally guaranteeing privacy protection to individuals. However, existing systems face several issues, such as formal mathematical flaws, unrealistic privacy guarantees, privatization of only individual words, as well as a lack of transparency and reproducibility. In this paper, we propose a new system 'DP-BART' that largely outperforms existing LDP systems. Our approach uses a novel clipping method, iterative pruning, and further training of internal representations, which drastically reduce the amount of noise required for DP guarantees. We run experiments on five textual datasets of varying sizes, rewriting them at different privacy guarantees and evaluating the rewritten texts on downstream text classification tasks. Finally, we thoroughly discuss the privatized text rewriting approach and its limitations, including the problem of the strict text adjacency constraint in the LDP paradigm that leads to the high noise requirement.


Paper Structure

This paper contains 47 sections, 3 theorems, 14 equations, 4 figures, 5 tables, 1 algorithm.

Key Result

Theorem 3.1

Let $f: \mathbb{R}^n \rightarrow \mathbb{R}^n$ be a function as in equation eqn:clv-clipping-full. The $\ell_1$ sensitivity $\Delta_1 f$ of this function is calculated as in equation eqn:laplace-clv-sens, where $C \in \mathbb{R}: C > 0$ is the clipping constant and $n \in \mathbb{N}$ is the dimensio
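
The theorem statement above is cut off and the referenced equations are not reproduced here, so the following is only a minimal sketch under stated assumptions. It assumes that clipping by value means restricting each of the $n$ coordinates to $[-C, C]$. Under that assumption, two clipped vectors can differ by at most $2C$ per coordinate, so their $\ell_1$ distance is at most $2Cn$; this bound is a standard fact about coordinate-wise clipping and is not quoted from the paper. The function names and the parameter values are hypothetical.

```python
import numpy as np

def clip_by_value(z, C):
    """Clip every coordinate of z to the interval [-C, C] (assumed clipping rule)."""
    return np.clip(z, -C, C)

def l1_sensitivity(C, n):
    """Worst-case l1 distance between two vectors clipped to [-C, C]^n."""
    return 2.0 * C * n

def laplace_mechanism(z, C, epsilon, rng):
    """Clip z, then add Laplace noise with scale = sensitivity / epsilon."""
    clipped = clip_by_value(z, C)
    scale = l1_sensitivity(C, clipped.size) / epsilon
    return clipped + rng.laplace(loc=0.0, scale=scale, size=clipped.shape)

# Hypothetical values, chosen only for illustration.
rng = np.random.default_rng(0)
z = rng.normal(size=8)
noisy = laplace_mechanism(z, C=0.5, epsilon=10.0, rng=rng)
```

Because the noise scale in this sketch grows linearly with both $C$ and $n$, a smaller clipping constant or fewer active dimensions would lower the noise needed for the same $\varepsilon$.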

Figures (4)

  • Figure 1: DP-BART-CLV
  • Figure 2: Pruning and re-training procedure for the DP-BART-PR model, illustrated for one document. Each $i^{th}$ neuron from a set of indices is set to $0$ for all tokens of the encoder output vectors $z \in \mathbb{R}^{l \times d_{tok}}$. These neuron indices are the same for any document. This process is repeated iteratively until performance starts to degrade.
  • Figure 3: Downstream test $F_1$ results (macro-averaged) for each dataset, using the four model types. Lower $\varepsilon$ corresponds to better privacy. Both original and rewrite-no-dp results can be seen on the right of each graph at $\varepsilon=\infty$. The rest of the results represent the rewrite-dp setting at different $\varepsilon$ values.
  • Figure 4: Local DP (left) vs. global DP (right). In the local framework, the aggregator does not have access to the original data, with each individual applying DP to their own private data point. In the global framework, the aggregator adds DP noise to the original data, given a specific query from an analyst.
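
The pruning step described in the Figure 2 caption (zeroing the same neuron indices for all tokens of the encoder output $z \in \mathbb{R}^{l \times d_{tok}}$) can be sketched as follows. This is an illustration only: the caption does not say how the indices are chosen, so the random selection below is a placeholder assumption, and the sizes and the name `prune_neurons` are hypothetical. The iterative re-training between pruning rounds is not shown.

```python
import numpy as np

def prune_neurons(z, pruned_idx):
    """Set the selected hidden dimensions to 0 for every token in z (shape l x d_tok)."""
    out = z.copy()
    out[:, pruned_idx] = 0.0
    return out

# Hypothetical sizes: 5 tokens, 16 hidden dimensions per token.
rng = np.random.default_rng(0)
l, d_tok = 5, 16
z = rng.normal(size=(l, d_tok))

# Placeholder selection rule (assumption): pick 12 indices at random.
# The same index set would be reused for every document.
pruned_idx = rng.choice(d_tok, size=12, replace=False)

z_pruned = prune_neurons(z, pruned_idx)
kept = d_tok - len(pruned_idx)  # dimensions that still carry information
```

Only the `kept` dimensions per token remain active after pruning, which is the dimensionality reduction the summary credits for the lower noise requirement.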

Theorems & Definitions (13)

  • Theorem 3.1
  • proof
  • Theorem 3.2
  • proof
  • Theorem 3.3
  • proof
  • proof
  • Definition A.1
  • Definition A.2
  • Definition A.3
  • ...and 3 more