Differentially Private Synthetic Data via Foundation Model APIs 2: Text

Chulin Xie; Zinan Lin; Arturs Backurs; Sivakanth Gopi; Da Yu; Huseyin A Inan; Harsha Nori; Haotian Jiang; Huishuai Zhang; Yin Tat Lee; Bo Li; Sergey Yekhanin

TL;DR

This work extends Private Evolution (PE) to text by introducing Aug-PE, an API-based method for generating differentially private synthetic text without DP finetuning. It combines random sample generation, private histogram-based selection, and variation via paraphrasing and fill-in-the-blanks, with adaptive text-length control to manage token costs. Across Yelp, OpenReview, and PubMed, Aug-PE achieves utility competitive with DP-finetuning baselines, especially when paired with stronger LLMs such as GPT-3.5, and it demonstrates improved efficiency and robustness to empirical privacy attacks. The approach provides a scalable, privacy-preserving route for DP text applications using only foundation-model APIs, with open-source code available.

Abstract

Text data has become extremely valuable due to the emergence of machine learning algorithms that learn from it. A lot of high-quality text data generated in the real world is private and therefore cannot be shared or used freely due to privacy concerns. Generating synthetic replicas of private text data with a formal privacy guarantee, i.e., differential privacy (DP), offers a promising and scalable solution. However, existing methods necessitate DP finetuning of large language models (LLMs) on private data to generate DP synthetic data. This approach is not viable for proprietary LLMs (e.g., GPT-3.5) and also demands considerable computational resources for open-source LLMs. Lin et al. (2024) recently introduced the Private Evolution (PE) algorithm to generate DP synthetic images with only API access to diffusion models. In this work, we propose an augmented PE algorithm, named Aug-PE, that applies to the complex setting of text. We use API access to an LLM and generate DP synthetic text without any model training. We conduct comprehensive experiments on three benchmark datasets. Our results demonstrate that Aug-PE produces DP synthetic text that yields competitive utility with the SOTA DP finetuning baselines. This underscores the feasibility of relying solely on API access of LLMs to produce high-quality DP synthetic texts, thereby facilitating more accessible routes to privacy-preserving LLM applications. Our code and data are available at https://github.com/AI-secure/aug-pe.

Paper Structure

This paper contains 37 sections, 2 theorems, 2 equations, 11 figures, 37 tables, and 1 algorithm.

Introduction
Background
Method
Preliminaries on Private Evolution (PE)
Aug-PE Design
Experiments
Understanding the Performance of Aug-PE
Understanding the Properties of Aug-PE
Validating the Design of Aug-PE
Conclusion
Privacy Analysis
Additional Experimental Details
Datasets and Downstream Tasks.
Implementation Details of Aug-PE.
Model and Hyperparameters
...and 22 more sections

Key Result

Theorem 1

Let $f: \mathbb{X} \rightarrow \mathbb{R}^d$ be a function with global $L_2$ sensitivity $\Delta$. For any $\varepsilon \geq 0$ and $\delta \in [0,1]$, the Gaussian output perturbation mechanism $M(x) = f(x) + Z$ with $Z \sim \mathcal{N}\left(0, \sigma^2 I\right)$ is $(\varepsilon, \delta)$-DP if and on
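The theorem statement above is cut off on this page. As a rough illustration of how such a result is used in practice, the sketch below calibrates the noise scale $\sigma$ with the standard analytic Gaussian condition from Balle and Wang (2018), $\Phi\left(\frac{\Delta}{2\sigma} - \frac{\varepsilon\sigma}{\Delta}\right) - e^{\varepsilon}\,\Phi\left(-\frac{\Delta}{2\sigma} - \frac{\varepsilon\sigma}{\Delta}\right) \leq \delta$, where $\Phi$ is the standard normal CDF. That this is the condition the truncated text goes on to state is an assumption. The function names (`gaussian_delta`, `calibrate_sigma`) and the bisection search are illustrative choices, not taken from the paper or its code.

```python
import math


def _phi(t: float) -> float:
    """Standard normal CDF, computed via erfc for accuracy in the tails."""
    return 0.5 * math.erfc(-t / math.sqrt(2.0))


def gaussian_delta(sigma: float, eps: float, sensitivity: float = 1.0) -> float:
    """Smallest delta for which N(0, sigma^2 I) noise gives (eps, delta)-DP.

    Uses the analytic Gaussian condition of Balle and Wang (2018);
    assumed here to be the condition the truncated theorem refers to.
    """
    a = sensitivity / (2.0 * sigma)
    b = eps * sigma / sensitivity
    return _phi(a - b) - math.exp(eps) * _phi(-a - b)


def calibrate_sigma(eps: float, delta: float, sensitivity: float = 1.0,
                    iters: int = 200) -> float:
    """Find (by bisection) a near-minimal sigma with gaussian_delta <= delta."""
    if not (0.0 < delta < 1.0) or eps < 0.0:
        raise ValueError("need 0 < delta < 1 and eps >= 0")
    lo, hi = 1e-6 * sensitivity, sensitivity
    # gaussian_delta decreases as sigma grows, so expand hi until it suffices.
    while gaussian_delta(hi, eps, sensitivity) > delta:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gaussian_delta(mid, eps, sensitivity) > delta:
            lo = mid  # too little noise
        else:
            hi = mid  # enough noise; try less
    return hi


if __name__ == "__main__":
    sigma = calibrate_sigma(eps=1.0, delta=1e-5)
    print(f"sigma for eps=1, delta=1e-5, sensitivity=1: {sigma:.4f}")
```

Because the condition is tight, the calibrated $\sigma$ comes out smaller than the classical bound $\sigma = \Delta\sqrt{2\ln(1.25/\delta)}/\varepsilon$ for the same $(\varepsilon, \delta)$, which means less noise is added for the same guarantee.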

Figures (11)

Figure 1: Instead of finetuning LLMs with DP-SGD to generate synthetic text, Aug-PE only requires inference APIs of LLMs. Aug-PE works with the latest open-source LLMs and API-based LLMs, for which DP-SGD finetuning is either hard to implement or infeasible, to generate DP synthetic text with improved utility on the OpenReview dataset.
Figure 2: Overview of Aug-PE. We use two private and two synthetic samples (reviews for the "restaurant" class) for illustration. Step 1 (RANDOM_API): we use prompts to generate random samples from the LLM. Step 2: we iterate through steps 2.1-2.3 to refine the synthetic samples towards the private samples. Step 2.1: each private sample votes for its closest synthetic sample (using self-embedding or mean embedding) in the embedding space induced by the embedding model $\Phi$. "A great spot for pizza" gets 2 votes, and the other sample gets 0 votes. We then add Gaussian noise to the votes to ensure DP. This gives us the DP Nearest Neighbor Histogram. Step 2.2: we resample the generated texts according to the histogram. We assume that only "A great spot for pizza" remains. Step 2.3 (VARIATION_API): we use prompts to ask the LLM to generate new similar samples, which are the initial synthetic samples in the next iteration. The prompts are simplified for illustration; see Additional Experimental Details for the complete prompts.
Figure 3: Efficiency comparison between DP-FT-Generator and Aug-PE on Yelp for generating 100k synthetic samples ($\epsilon=1$).
Figure 4: GPT-3.5 with adaptive text length achieves a comparable text length distribution to the original data on Yelp.
Figure 5: A larger temperature for GPT-3.5 leads to more diverse generation on Yelp, with a lower FID score.
...and 6 more figures
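The voting and resampling steps described in the Figure 2 caption (steps 2.1 and 2.2) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: embeddings are taken as precomputed NumPy arrays (a real run would obtain them from the embedding model $\Phi$), each synthetic sample is represented by its own embedding, distance is Euclidean, and negative noisy counts are clipped to zero before resampling. The function names `dp_nn_histogram` and `resample` are hypothetical.

```python
import numpy as np


def dp_nn_histogram(private_emb, synthetic_emb, sigma, rng):
    """Step 2.1 sketch: nearest-neighbor votes plus Gaussian noise.

    private_emb:   (n_private, d) array of private-sample embeddings
    synthetic_emb: (n_synthetic, d) array of synthetic-sample embeddings
    Returns the raw vote counts and their noisy version.
    """
    # Squared Euclidean distance between every private/synthetic pair.
    diff = private_emb[:, None, :] - synthetic_emb[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)
    # Each private sample casts one vote for its closest synthetic sample.
    nearest = dist2.argmin(axis=1)
    votes = np.bincount(nearest, minlength=len(synthetic_emb)).astype(float)
    # Gaussian noise on the vote counts provides the DP guarantee.
    noisy = votes + rng.normal(0.0, sigma, size=votes.shape)
    return votes, noisy


def resample(noisy_hist, n_keep, rng):
    """Step 2.2 sketch: draw synthetic indices in proportion to noisy votes.

    Clipping negatives to zero is an assumption made for this sketch.
    """
    weights = np.clip(noisy_hist, 0.0, None)
    total = weights.sum()
    if total == 0.0:
        probs = np.full(len(weights), 1.0 / len(weights))
    else:
        probs = weights / total
    return rng.choice(len(probs), size=n_keep, replace=True, p=probs)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy 2-D embeddings mirroring the caption: both private reviews
    # sit next to synthetic sample 0 ("A great spot for pizza").
    synthetic = np.array([[0.0, 0.0], [10.0, 10.0]])
    private = np.array([[0.1, 0.0], [0.0, 0.2]])
    votes, noisy = dp_nn_histogram(private, synthetic, sigma=1.0, rng=rng)
    print("votes:", votes)
    print("noisy votes:", noisy)
    print("kept indices:", resample(noisy, n_keep=2, rng=rng))
```

The indices that survive resampling would then be passed to VARIATION_API (step 2.3) to produce the next iteration's candidates; that call is omitted here because it depends on the LLM prompt and API being used.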

Theorems & Definitions (3)

Theorem 1: Analytic Gaussian Mechanism (Balle and Wang, 2018)
Theorem 2: Privacy Guarantee for Aug-PE (Algorithm 1)
Proof: Proof Sketch
