Private prediction for large-scale synthetic text generation

Kareem Amin, Alex Bie, Weiwei Kong, Alexey Kurakin, Natalia Ponomareva, Umar Syed, Andreas Terzis, Sergei Vassilvitskii

TL;DR

This work generates differentially private (DP) synthetic text with large language models by using private prediction rather than private fine-tuning. It introduces three key innovations: better private token selection via a clipping-based exponential mechanism; fixed disjoint batches with parallel decoding, which avoid repeatedly re-sampling prefixes; and public predictions combined with the sparse vector technique, which emit tokens at no privacy cost when the public and private distributions align. The approach generates thousands of high-quality DP synthetic data points, far more than prior private-prediction methods, and shows improvements on in-context learning, fine-tuning, and structured-data tasks. The method offers a practical, scalable path to DP synthetic data for downstream learning and evaluation, and the paper notes its limitations and opportunities for extending privacy guarantees and utility.

Abstract

We present an approach for generating differentially private synthetic text using large language models (LLMs), via private prediction. In the private prediction framework, we only require the output synthetic data to satisfy differential privacy guarantees. This is in contrast to approaches that train a generative model on potentially sensitive user-supplied source data and seek to ensure the model itself is safe to release. We prompt a pretrained LLM with source data, but ensure that next-token predictions are made with differential privacy guarantees. Previous work in this paradigm reported generating a small number of examples (<10) at reasonable privacy levels, an amount of data that is useful only for downstream in-context learning or prompting. In contrast, we make changes that allow us to generate thousands of high-quality synthetic data points, greatly expanding the set of potential applications. Our improvements come from an improved privacy analysis and a better private selection mechanism, which makes use of the equivalence between the softmax layer for sampling tokens in LLMs and the exponential mechanism. Furthermore, we introduce a novel use of public predictions via the sparse vector technique, in which we do not pay privacy costs for tokens that are predictable without sensitive data; we find this to be particularly effective for structured data.
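
The link between softmax sampling and the exponential mechanism can be illustrated with a minimal sketch. It assumes a batch of per-prompt logit vectors, a clipping bound `c`, and a temperature `tau`; the recentering rule (shift each row so its maximum equals `c`, then clip to `[-c, c]`) and the function names are illustrative assumptions, not necessarily the paper's exact procedure.

```python
import numpy as np

def clip_logits(logits, c):
    """Recenter each row so its max equals c, then clip entries to [-c, c].

    Softmax is invariant to per-row shifts, so recentering alone does not
    change the sampling distribution; only the clipping does.
    (Assumed recentering rule, for illustration.)
    """
    shifted = logits - logits.max(axis=-1, keepdims=True) + c
    return np.clip(shifted, -c, c)

def private_next_token(logits, c, tau, rng):
    """Sample one token from the average of clipped logits.

    logits: array of shape (s, V), one row per sensitive prompt in the batch.
    Replacing one row moves each averaged entry by at most 2c/s, so sampling
    from softmax(avg / tau) has the form of an exponential mechanism whose
    score function is the averaged clipped logit.
    """
    avg = clip_logits(logits, c).mean(axis=0)
    scores = avg / tau
    scores = scores - scores.max()          # numerical stability only
    probs = np.exp(scores)
    probs = probs / probs.sum()
    token = int(rng.choice(len(probs), p=probs))
    return token, probs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch_logits = rng.normal(size=(8, 5)) * 10.0   # made-up logits
    token, probs = private_next_token(batch_logits, c=2.0, tau=1.0, rng=rng)
    print(token, probs.round(3))
```

A larger batch or a smaller `c` shrinks the influence of any single sensitive example on the averaged scores, at the cost of more distortion from clipping.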

Paper Structure

This paper contains 48 sections, 6 theorems, 14 equations, 12 figures, 7 tables, 3 algorithms.

Key Result

Theorem 1

Suppose Assumption assum:batch holds. Let $\rho = r\left(\frac{1}{2} \left(\frac{c}{s\tau}\right)^2 + \frac{2}{(s\sigma)^2}\right)$. For all $\varepsilon \ge 0$, Algorithm alg:main satisfies $(\varepsilon, \delta)$-differential privacy, where … Also, for all $\delta \in (0, 1]$, Algorithm alg:main satisfies $(\varepsilon, \delta)$-differential privacy, where …
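
As a rough numerical aid, the expression for $\rho$ can be evaluated directly. The sketch below uses the symbols $r, c, s, \tau, \sigma$ exactly as they appear in the formula, with made-up values. The conversion to $\varepsilon$ is an assumption: it treats $\rho$ as a zero-concentrated DP parameter and applies the standard bound $\varepsilon = \rho + 2\sqrt{\rho \ln(1/\delta)}$ from Bun and Steinke (2016), which may be looser than or different from the expressions in the theorem itself.

```python
import math

def rho(r, c, s, tau, sigma):
    """Evaluate rho = r * (0.5 * (c / (s * tau))**2 + 2 / (s * sigma)**2)."""
    return r * (0.5 * (c / (s * tau)) ** 2 + 2.0 / (s * sigma) ** 2)

def zcdp_to_eps(rho_value, delta):
    """Standard zCDP -> (eps, delta)-DP conversion (assumed, not from the theorem)."""
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1]")
    return rho_value + 2.0 * math.sqrt(rho_value * math.log(1.0 / delta))

if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to exercise the formula.
    rho_value = rho(r=100, c=10.0, s=100, tau=2.0, sigma=1.0)
    print(rho_value, zcdp_to_eps(rho_value, delta=1e-6))
```

Because $s$ enters both terms of $\rho$ through its square, doubling $s$ while holding the other parameters fixed cuts $\rho$ by a factor of four.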

Figures (12)

  • Figure 1: Algorithm alg:main, visualized. An LLM receives a batch of prompts, each instructing it to generate text similar to a piece of sensitive text. Synthetic text is generated token by token, by running inference on the batch in parallel. In each step, the logit vectors produced downstream of sensitive text are aggregated and sampled from with differential privacy. Every token sampled this way incurs a privacy cost, motivating us to include an auxiliary public prompt and sample from its logits when they are similar to the sensitive logits.
  • Figure 2: Example of $n$-shot in-context learning evaluation for synthetic data.
  • Figure 3: We sample a few hundred tokens using logits aggregation with no clipping. At each sampling step, we compute the L1 distances between the post-softmax distributions of aggregated clipped logits vs. aggregated unclipped logits, at various settings of $c$, and plot them in a histogram. We observe less error at lower choices of $c$ when clipping with recentering (note the $x$-axis scales).
  • Figure 4: Generation prompt for AGNews.
  • Figure 5: Generation prompt for TREC.
  • ...and 7 more figures
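
The public-prompt idea in the Figure 1 caption can be sketched with the textbook AboveThreshold form of the sparse vector technique. Everything specific here is an assumption made for illustration: the L1 distance as the similarity measure, Laplace noise, the threshold, the sensitivity value, and the names `AboveThreshold` and `choose_source`. The paper's own test and noise calibration may differ.

```python
import numpy as np

class AboveThreshold:
    """Textbook AboveThreshold with Laplace noise.

    The threshold noise is drawn once and reused across queries until a
    query is reported as above threshold, at which point it is redrawn.
    Each above-threshold report is what consumes privacy budget; the
    accounting itself is left to the caller.
    """

    def __init__(self, threshold, sensitivity, eps, rng):
        self.threshold = threshold
        self.sensitivity = sensitivity
        self.eps = eps
        self.rng = rng
        self._refresh()

    def _refresh(self):
        scale = 2.0 * self.sensitivity / self.eps
        self.noisy_threshold = self.threshold + self.rng.laplace(scale=scale)

    def is_above(self, query_value):
        scale = 4.0 * self.sensitivity / self.eps
        noisy_query = query_value + self.rng.laplace(scale=scale)
        above = bool(noisy_query >= self.noisy_threshold)
        if above:
            self._refresh()
        return above

def choose_source(private_probs, public_probs, svt):
    """Return 'public' when the two distributions look close, else 'private'."""
    distance = float(np.abs(private_probs - public_probs).sum())
    return "private" if svt.is_above(distance) else "public"

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    svt = AboveThreshold(threshold=0.5, sensitivity=1.0, eps=1.0, rng=rng)
    private = np.array([0.7, 0.2, 0.1])
    public = np.array([0.6, 0.3, 0.1])
    print(choose_source(private, public, svt))
```

When the answer is "public", the token would be drawn from the public prompt's distribution alone, which is the case the caption describes as avoiding the per-token privacy cost.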

Theorems & Definitions (14)

  • Definition 1: dwork2006our
  • Theorem 1: Privacy of Algorithm alg:main
  • Definition 2
  • Definition 3
  • Definition 4: bun2016concentrated
  • Lemma 1
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • ...and 4 more