Secret-Protected Evolution for Differentially Private Synthetic Text Generation

Tianze Wang, Zhaoyu Chen, Jian Du, Yingtai Xiao, Linjun Zhang, Qiang Yan

TL;DR

SecPE introduces secret protection as a targeted alternative to uniform differential privacy for privacy-preserving text synthesis. By formalizing $(\mathbf{p},\mathbf{r})$-secret protection and relaxing Gaussian DP to focus on per-secret priors, SecPE achieves tighter utility-privacy trade-offs. The framework uses Secret Clustering to create noisy representatives and Protected Evolution to select high-quality samples, reducing computational complexity from $O(MN_{\mathrm{syn}})$ to $O(KN_{\mathrm{syn}})$ while preserving reconstruction guarantees. Empirically, SecPE shows lower Fréchet Inception Distance and higher downstream task accuracy than GDP-based Aug-PE across OpenReview, PubMed, and Yelp, with less noise required for the same protection level, highlighting the practicality of secret-aware privacy for synthetic text generation.

Abstract

Text data has become extremely valuable for training large language models (LLMs) and may even pave the way toward artificial general intelligence (AGI). Much of the high-quality text in the real world is private and cannot be freely used due to privacy concerns. Differentially private (DP) synthetic text generation has therefore been proposed, aiming to produce high-utility synthetic data while protecting sensitive information. However, existing DP synthetic text generation imposes uniform guarantees that often overprotect non-sensitive content, resulting in substantial utility loss and computational overhead. To address this, we propose Secret-Protected Evolution (SecPE), a novel framework that extends private evolution with secret-aware protection. Theoretically, we show that SecPE satisfies $(\mathbf{p}, \mathbf{r})$-secret protection, a relaxation of Gaussian DP that enables tighter utility-privacy trade-offs, while also substantially reducing computational complexity relative to baseline methods. Empirically, across the OpenReview, PubMed, and Yelp benchmarks, SecPE consistently achieves lower Fréchet Inception Distance (FID) and higher downstream task accuracy than GDP-based Aug-PE baselines, while requiring less noise to attain the same level of protection. Our results highlight that secret-aware guarantees can unlock more practical and effective privacy-preserving synthetic text generation.

Paper Structure

This paper contains 31 sections, 7 theorems, 34 equations, 7 figures, 7 tables, 3 algorithms.

Key Result

Lemma 3.2

Any $\mu$-GDP mechanism $\mathcal{A}$ provides $(\bm{p}, \bm{r})$-secret protection, where

Figures (7)

  • Figure 1: Overview of SecPE. The framework consists of two modules: (1) Secret Clustering: clustering is applied to public data and updated with noisy private data to form representative centers for voting; (2) Protected Evolution: in each iteration, the candidate synthetic data consist of high-quality samples from the previous iteration together with their LLM-generated variations, and new high-quality samples are selected based on similarity to the noisy representatives.
  • Figure 2: Results on PubMed. (Left) FID relative to the original data for SecPE and Aug-PE under $\bm{r}/ \bm{p} \in \{2,10,50,\infty\}$ using GPT-2 and Qwen-2.5-1.5B. (Right) Synthetic sequence-length distributions for the non-private $\text{SecPE}_{3000}$ and Aug-PE generated by GPT-2 and Qwen-2.5-1.5B, compared with the original data.
  • Figure 3: Noise ratio $\sigma_{\text{GDP}}/\sigma_{\text{secret}}$ comparing $(\bm{p},\bm{r})$-secret protection with Gaussian DP. (a): $N=8000$, $m=400$, varying $r/p$. (b): $N=8000$, $r/p=10$, varying the number of secrets $m$.
  • Figure 4: Voting distribution per label on Yelp. Top: raw votes from Aug-PE. Bottom: votes after clustering in SecPE.
  • Figure 5: FID and sequence-length distributions on OpenReview.
  • ...and 2 more figures
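
The selection step described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes details that this summary does not state: that candidates and representatives are compared by Euclidean distance in an embedding space, that each noisy representative casts one vote for its nearest synthetic candidate, and that the top-voted candidates are kept. The names `vote_histogram` and `select_top` are hypothetical. With $K$ representatives and $N_{\mathrm{syn}}$ candidates the distance matrix has $K \times N_{\mathrm{syn}}$ entries, which is consistent with the $O(KN_{\mathrm{syn}})$ cost quoted in the TL;DR.

```python
import numpy as np


def vote_histogram(syn_emb: np.ndarray, noisy_centers: np.ndarray) -> np.ndarray:
    """Return one vote count per synthetic candidate.

    syn_emb:       (N_syn, d) embeddings of candidate synthetic samples.
    noisy_centers: (K, d) noisy representatives (assumed already privatized).
    """
    # Pairwise squared Euclidean distances, shape (K, N_syn): O(K * N_syn) work.
    diff = noisy_centers[:, None, :] - syn_emb[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)
    # Each representative votes for its nearest candidate (assumed voting rule).
    nearest = dist2.argmin(axis=1)
    return np.bincount(nearest, minlength=syn_emb.shape[0])


def select_top(syn_emb: np.ndarray, noisy_centers: np.ndarray, n_keep: int) -> np.ndarray:
    """Return indices of the n_keep candidates with the most votes."""
    votes = vote_histogram(syn_emb, noisy_centers)
    # Stable sort so that ties are broken by original candidate order.
    order = np.argsort(-votes, kind="stable")
    return order[:n_keep]


if __name__ == "__main__":
    candidates = np.array([[0.0, 0.0], [10.0, 10.0], [5.0, 5.0]])
    centers = np.array([[0.1, 0.0], [0.0, 0.2], [9.0, 9.0]])
    print(vote_histogram(candidates, centers))
    print(select_top(candidates, centers, n_keep=2))
```

Because the votes are computed only from the noisy representatives, the private records are not touched again during selection in this sketch; how the representatives themselves are privatized is left outside its scope.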

Theorems & Definitions (19)

  • Definition 2.1: $(\epsilon,\delta)$-DP
  • Definition 2.2: GDP
  • Definition 3.1: Secret Protection
  • Lemma 3.2
  • proof
  • Theorem 1: Secret Clustering
  • Theorem 2: Privacy Guarantee for the SecPE Algorithm
  • Remark 3.3
  • Theorem 3: Naive Composition
  • proof
  • ...and 9 more
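
For readers unfamiliar with Gaussian DP (Definition 2.2), the sketch below shows the standard noise calibration from the GDP literature: a Gaussian mechanism with $\ell_2$ sensitivity $\Delta$ and noise standard deviation $\sigma$ satisfies $\mu$-GDP with $\mu = \Delta/\sigma$, and mechanisms with parameters $\mu_1, \dots, \mu_T$ compose to $\sqrt{\sum_i \mu_i^2}$-GDP. This is background only. It does not reproduce the paper's secret-protection calibration or its Theorem 3, and the function names are hypothetical.

```python
import math

import numpy as np


def gaussian_sigma_for_gdp(sensitivity: float, mu: float) -> float:
    """Noise std that makes a Gaussian mechanism mu-GDP: sigma = sensitivity / mu."""
    if sensitivity <= 0 or mu <= 0:
        raise ValueError("sensitivity and mu must be positive")
    return sensitivity / mu


def compose_gdp(mus) -> float:
    """Standard GDP composition: the result is sqrt(sum of mu_i^2)-GDP."""
    return math.sqrt(sum(m * m for m in mus))


def gaussian_mechanism(value, sensitivity: float, mu: float, rng: np.random.Generator):
    """Release value plus Gaussian noise calibrated for mu-GDP."""
    sigma = gaussian_sigma_for_gdp(sensitivity, mu)
    value = np.asarray(value, dtype=float)
    return value + rng.normal(0.0, sigma, size=value.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    histogram = np.array([12.0, 3.0, 7.0])
    print(gaussian_mechanism(histogram, sensitivity=1.0, mu=0.5, rng=rng))
```

A smaller $\mu$ means stronger privacy and therefore a larger $\sigma$; the noise ratio in Figure 3 compares this GDP noise level with the one needed under secret protection.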