Private Evolution Converges

Tomás González, Giulia Fanti, Aaditya Ramdas

TL;DR

This work revisits Private Evolution (PE), a training-free approach for differentially private synthetic data, and develops a realistic convergence theory that avoids prior unrealistic multiplicity assumptions. By introducing a tractable Euclidean-space variant of PE with a $D_{BL}$-projected nearest-neighbor histogram, the authors prove a worst-case $1$-Wasserstein convergence bound of the form $\mathbb{E}[W_1(\mu_S, \mu_{S_T})] \le \tilde{O}(d D \sigma^{1/d})$ under DP, with $\sigma$ tied to the Gaussian mechanism and privacy parameters. The analysis extends to Banach spaces, clarifies the relationship between PE and the Private Signed Measure Mechanism (PSMM), and shows how practical PE naturally implements a sequential version of PSMM. Empirical results on synthetic data and CIFAR-10-like tasks corroborate the theory and provide guidance on hyperparameter choices (e.g., the number of evolution steps $T$ and synthetic sample count $n_s$). The work deepens the theoretical understanding of DP synthetic data generation and offers concrete, principled settings to deploy PE effectively in practice.

Abstract

Private Evolution (PE) is a promising training-free method for differentially private (DP) synthetic data generation. While it achieves strong performance in some domains (e.g., images and text), its behavior in others (e.g., tabular data) is less consistent. To date, the only theoretical analysis of the convergence of PE depends on unrealistic assumptions about both the algorithm's behavior and the structure of the sensitive dataset. In this work, we develop a new theoretical framework to understand PE's practical behavior and identify sufficient conditions for its convergence. For $d$-dimensional sensitive datasets with $n$ data points from a convex and compact domain, we prove that under the right hyperparameter settings and given access to the Gaussian variation API proposed in \cite{PE23}, PE produces an $(\varepsilon, \delta)$-DP synthetic dataset with expected 1-Wasserstein distance $\tilde{O}(d(n\varepsilon)^{-1/d})$ from the original; this establishes worst-case convergence of the algorithm as $n \to \infty$. Our analysis extends to general Banach spaces as well. We also connect PE to the Private Signed Measure Mechanism, a method for DP synthetic data generation that has thus far not seen much practical adoption. We demonstrate the practical relevance of our theoretical findings in experiments.
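To get a feel for the stated rate, the following minimal sketch evaluates only the leading term $d(n\varepsilon)^{-1/d}$ of the bound. The function name `pe_rate_leading_term` is hypothetical, and the constants and logarithmic factors hidden by $\tilde{O}$ are ignored, so the numbers are meaningful only relative to one another.

```python
def pe_rate_leading_term(n: int, eps: float, d: int) -> float:
    """Leading term d * (n * eps) ** (-1 / d) of the stated W1 bound.

    Assumption: constants and log factors hidden by the O-tilde are dropped,
    so this is an illustration of the scaling, not the bound itself.
    """
    if n <= 0 or eps <= 0 or d <= 0:
        raise ValueError("n, eps and d must be positive")
    return d * (n * eps) ** (-1.0 / d)


if __name__ == "__main__":
    # The term shrinks as n grows, but more slowly in higher dimension d.
    for d in (1, 2, 8):
        values = [pe_rate_leading_term(n, eps=1.0, d=d) for n in (10**2, 10**4, 10**6)]
        print(f"d={d}:", [round(v, 4) for v in values])
```

For fixed $d$ the term decreases to zero as $n \to \infty$, which is the convergence statement; the exponent $-1/d$ means the decrease is much slower in high dimension.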

Paper Structure

This paper contains 34 sections, 13 theorems, 77 equations, 7 figures, 5 algorithms.

Key Result

Theorem 1.1

Consider a data domain $\Omega \subset \mathbb{R}^d$ with $\ell_2$ diameter $D$. For any dataset $S \in \Omega^n$ and $0<\varepsilon, \delta<1$, there exist APIs and a parameter setting such that PE (Algorithm \ref{alg:pe}) is $(\varepsilon,\delta)$-DP and outputs a synthetic dataset $S'$ satisfying $\mathbb{E}[W_1(\mu_S, \mu_{S'})] \le \tilde{O}\big(d(n\varepsilon)^{-1/d}\big)$, where $\mu_S$ is the empirical distribution of the dataset $S$ (and similarly $\mu_{S'}$ for $S'$), and $W_1$ is the

Figures (7)

  • Figure 1: High-level illustration of private evolution (PE). $S$ represents the sensitive data, shown in red. $S_t$ are the synthetic datasets, shown in blue, and are created with the variations from $V_t$ (in green) that are closest to $S$.
  • Figure 2: Impact of parameters on PE's performance. Top: performance of the last iterate of PE when run for different numbers of steps; 'Predicted $T$' marks the theoretically suggested value $T = 2\log(n\varepsilon)$. Bottom: same setup, but replacing $T$ by the number of synthetic samples $n_s$; 'Predicted $n_s$' marks the value of $n_s$ given in Theorem \ref{thm:PEconvergence_euclidean}. We repeat this for different sensitive sample sizes $n$, averaging over 100 runs. The plots illustrate the accuracy of our theoretical predictions.
  • Figure 3: We set the privacy parameters to $\varepsilon = 5, \delta = 10^{-4}$. For each $n\in \{100,200,\dots,600\}$, we set the hyperparameters according to Theorem \ref{thm:PEconvergence_euclidean} and run PE. We repeat the experiment $3$ times and report the average final FID achieved for each value of $n$; the FID decreases as $n$ grows.
  • Figure 4: We set the privacy parameters to $\varepsilon = 5, \delta = 10^{-4}$. We consider a fixed number of samples $n = 300$. Then, we set all the hyperparameters according to Theorem \ref{thm:PEconvergence_euclidean}, except the number of steps $T$. For each $T \in \{4,8,12,16,20\}$, we run PE $3$ times and report the average final FID achieved.
  • Figure 5: Recall from Section \ref{sec:simulations} that $\varepsilon = 1,\delta = 10^{-4}$. We set the number of private samples to $n = 1000$. We run PE on different initial datasets created with $\operatorname{Random\_API}$ and plot the accuracy trajectories over $2\log(n\varepsilon) = 12$ iterations. We note that the best number of steps depends heavily on the initialization from $\operatorname{Random\_API}$. The private dataset $S$ is a random dataset in $\Omega \cap \mathbb{R}^2_+$. Left: When $S_0 = S$ (i.e., $\Gamma_0 = 0$, the private data is the same as the initial synthetic data), the optimal number of PE steps is 0. Middle: When PE is initialized poorly (e.g., $S_0$ consists of only $(0,0)$, so $\Gamma_0$ is large), more iterations are needed. Right: Interpolating between the previous cases, parametrized by $\beta$ with $S_0 = (1-2\beta)S$: PE can improve or degrade performance depending on $\operatorname{Random\_API}$, but it never exceeds the worst-case error bound. Results are averaged over 100 runs. Our analysis explains this phenomenon (see text in Section \ref{sec:simulations}).
  • ...and 2 more figures
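The loop described in the Figure 1 caption (generate variations, let the sensitive points pick their closest variations, keep the popular ones) can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's Algorithm: the names `noisy_nn_histogram` and `pe_step` are hypothetical, the variation API is modeled as plain Gaussian perturbation, negative noisy votes are simply clipped to zero (the paper's Euclidean variant uses a $D_{BL}$ projection instead), and `sigma` is taken as given rather than calibrated to $(\varepsilon, \delta)$.

```python
import numpy as np


def noisy_nn_histogram(private, candidates, sigma, rng):
    """Each private point votes for its nearest candidate; Gaussian noise is added per bin."""
    # Pairwise squared Euclidean distances, shape (n_private, n_candidates).
    d2 = ((private[:, None, :] - candidates[None, :, :]) ** 2).sum(axis=-1)
    votes = np.bincount(d2.argmin(axis=1), minlength=len(candidates)).astype(float)
    return votes + rng.normal(0.0, sigma, size=len(candidates))


def pe_step(private, synthetic, sigma, n_variations, variation_scale, rng):
    """One illustrative PE step: vary, vote with noise, resample."""
    dim = synthetic.shape[1]
    # Assumed variation API: Gaussian perturbations of every synthetic point.
    noise = rng.normal(0.0, variation_scale, size=(n_variations,) + synthetic.shape)
    # Candidate pool = current synthetic points plus all their variations.
    variations = np.concatenate([synthetic[None], synthetic[None] + noise]).reshape(-1, dim)
    hist = noisy_nn_histogram(private, variations, sigma, rng)
    # Simplification: clip negative noisy counts instead of projecting.
    probs = np.clip(hist, 0.0, None)
    if probs.sum() <= 0:
        probs = np.ones_like(probs)  # fall back to uniform if all votes were clipped
    probs = probs / probs.sum()
    idx = rng.choice(len(variations), size=len(synthetic), p=probs)
    return variations[idx]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    private = rng.uniform(0.0, 1.0, size=(200, 2))  # stand-in for the sensitive data S
    synthetic = np.zeros((50, 2))                   # a poor initialization S_0
    for _ in range(5):
        synthetic = pe_step(private, synthetic, sigma=1.0, n_variations=4,
                            variation_scale=0.2, rng=rng)
    print(synthetic.mean(axis=0))
```

Only the histogram touches the sensitive data, which is why the noise is added there; a real deployment would also need privacy accounting across all steps, which this sketch omits.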

Theorems & Definitions (34)

  • Theorem 1.1: Convergence of PE (Informal)
  • Definition 2.1: Differential Privacy \cite{dwork2006:calibrating}
  • Definition 2.2: $1$-Wasserstein distance \cite{villani2008optimal}
  • Definition 2.3: DP Synthetic Data
  • Lemma 3.1: Lower bound for $\eta$-closeness
  • Remark 3.1
  • Theorem 4.1: Convergence of PE
  • Proof: Proof sketch of Theorem \ref{thm:PEconvergence_euclidean}
  • Remark 4.1
  • Proposition 4.1
  • ...and 24 more
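For reference, the two notions named in Definitions 2.1 and 2.2 are usually stated as below. These are the standard textbook forms, given here as a reading aid; the paper's exact statements may differ in details such as the neighboring relation on datasets (assumed here to be "differ in one record") or the norm (assumed here to be $\ell_2$, matching the $\ell_2$ diameter in Theorem 1.1).

```latex
% (epsilon, delta)-differential privacy: for all neighboring datasets S, S'
% and all measurable events E, a randomized algorithm A satisfies
\Pr[\mathcal{A}(S) \in E] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{A}(S') \in E] + \delta .

% 1-Wasserstein distance between probability measures mu and nu,
% where Pi(mu, nu) is the set of couplings with marginals mu and nu:
W_1(\mu, \nu) \;=\; \inf_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_{(x, y) \sim \pi}\,\|x - y\|_2
\;=\; \sup_{f \,:\, \mathrm{Lip}(f) \le 1} \left( \int f \, d\mu - \int f \, d\nu \right).
```

The second equality is the Kantorovich-Rubinstein duality, which expresses $W_1$ as the largest gap in expectation over 1-Lipschitz test functions.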