Leveraging Programmatically Generated Synthetic Data for Differentially Private Diffusion Training

Yujin Choi, Jinseong Park, Junyoung Byun, Jaewook Lee

TL;DR

The paper tackles the privacy-utility trade-off in differentially private diffusion models by introducing DP-SynGen, a framework that injects programmatically generated synthetic data into selective diffusion stages. By pre-training or partially training the coarse or cleaning phases on synthetic data and carefully scheduling where private data is used, DP-SynGen reduces the number of privacy-perturbed iterations and the overall privacy budget while preserving, or improving, sample quality. The approach is supported by toy-theory analyses and two main theorems, showing that there exists a diffusion time where synthetic data yields comparable learning outcomes to private data, enabling stage-wise substitution without privacy leakage. Empirically, DP-SynGen variants (Coarse, Cleaning, and FineTune) demonstrate competitive or superior FID and CAS scores on MNIST, Fashion-MNIST, and CelebA datasets, particularly under tight privacy budgets, with Dead-leaves synthetic data aiding cleaning stages and elbow-point thresholds guiding stage allocation. Overall, the work provides a practical, theoretically grounded path to enhancing differentially private generative diffusion models using synthetic data, potentially broadening the applicability of private generative modeling in practice.

Abstract

Programmatically generated synthetic data has been used in differentially private training for classification to improve performance without privacy leakage. However, because the synthetic data is generated by a random process, its distribution is distinguishable from that of real data, which makes transfer difficult. As a result, a model trained only on synthetic data generates unrealistic random images, which makes it challenging to adapt synthetic data for generative models. In this work, we propose DP-SynGen, which leverages programmatically generated synthetic data in diffusion models to address this challenge. By exploiting the three stages of diffusion models (coarse, context, and cleaning), we identify the stages where synthetic data can be used effectively. We verify theoretically and empirically that the cleaning and coarse stages can be trained without private data, replacing it with synthetic data to reduce the privacy budget. The experimental results show that DP-SynGen improves the quality of generated data by mitigating the negative impact of privacy-induced noise on the generation process.
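The stage-wise substitution described above can be illustrated with a small routing sketch. This is not the authors' implementation: the function names are hypothetical, the total step count T = 1000 is an assumption, and the stage boundaries (250 and 750) are taken from the figure captions listed later on this page. The sketch only decides, for a sampled diffusion timestep, which data source would train that step.

```python
# Illustrative sketch only. Names and T = 1000 are assumptions,
# not the paper's code. Boundaries follow the figure captions:
# cleaning t <= 250, context 250 < t <= 750, coarse t > 750.
T = 1000
CLEANING_MAX = 250
COARSE_MIN = 750


def stage_of(t: int) -> str:
    """Map a diffusion timestep in [1, T] to its stage name."""
    if not 1 <= t <= T:
        raise ValueError(f"timestep {t} outside [1, {T}]")
    if t <= CLEANING_MAX:
        return "cleaning"
    if t > COARSE_MIN:
        return "coarse"
    return "context"


def data_source(t: int, synthetic_stages: set) -> str:
    """Return which data source trains timestep t."""
    return "synthetic" if stage_of(t) in synthetic_stages else "private"


def private_fraction(synthetic_stages: set) -> float:
    """Fraction of timesteps that would still be trained on private data."""
    private = sum(
        1 for t in range(1, T + 1)
        if data_source(t, synthetic_stages) == "private"
    )
    return private / T


# A Coarse-style assignment: synthetic data for t > 750 only.
print(data_source(800, {"coarse"}), private_fraction({"coarse"}))
```

The fraction of timesteps is only a rough proxy here. The actual privacy budget depends on the DP training procedure (noise scale, sampling rate, number of private iterations), which this sketch does not model.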

Paper Structure

This paper contains 22 sections, 4 theorems, 15 equations, 11 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

For any two different data distributions $X_0$ and $Y_0$, let $X_t$ and $Y_t$ denote their respective states under the forward diffusion process at time $t$, as defined in the forward diffusion equation (eq:diffusion). Then, for any $\nu$ and $\gamma$, there exists $N$ such that, for any $n \geq N$, the following holds:

Figures (11)

  • Figure 1: Samples generated when the context stage $t \in (250, 750]$ uses (a) a model trained on synthetic data and (b) a model trained on private data. For the other stages, the private-data-trained and synthetic-data-trained models are used, respectively.
  • Figure 2: Samples from (a) a diffusion model trained with private data for the entire diffusion process and (b) a diffusion model trained with synthetic data for the coarse stage ($t > 750$) and private data for the other stages ($t \leq 750$).
  • Figure 3: Samples from (a) a diffusion model trained with private data for the entire diffusion process and (b) a diffusion model trained with synthetic data for the cleaning stage ($t \leq 250$) and private data for the other stages ($t > 250$).
  • Figure 4: Illustration of (a) DP-SynGen Coarse and (b) DP-SynGen Cleaning, with diffusion process from 0 (Image) to T (Noise). The gray range indicates training with synthetic data, while the blue range indicates training with private data.
  • Figure 5: Visualization of $\bar{\alpha}_\sigma$ and SNR used to search for the threshold $\tau$.
  • ...and 6 more figures
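
The quantities named in the Figure 5 caption can be computed with a short sketch. It assumes the standard DDPM forward process with a linear beta schedule, where $\bar{\alpha}_t = \prod_{s \leq t} (1 - \beta_s)$ and $\mathrm{SNR}(t) = \bar{\alpha}_t / (1 - \bar{\alpha}_t)$. The schedule endpoints, T = 1000, and the chord-distance elbow heuristic are all assumptions made for illustration. The page does not say how the paper locates its elbow point, so `elbow_index` is a hypothetical stand-in rather than the authors' procedure.

```python
# Illustrative sketch only. The linear schedule, its endpoints, and the
# elbow heuristic are assumptions, not taken from the paper.

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def alpha_bar(betas):
    """Cumulative product of (1 - beta_s) up to each step."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def snr(alpha_bars):
    """Signal-to-noise ratio alpha_bar / (1 - alpha_bar) per step."""
    return [a / (1.0 - a) for a in alpha_bars]


def elbow_index(ys):
    """Index with the largest vertical distance from the chord
    joining the first and last points (a simple elbow heuristic)."""
    n = len(ys)
    y0, y1 = ys[0], ys[-1]
    best_i, best_d = 0, -1.0
    for i, y in enumerate(ys):
        chord = y0 + (y1 - y0) * i / (n - 1)
        d = abs(y - chord)
        if d > best_d:
            best_i, best_d = i, d
    return best_i


abar = alpha_bar(linear_beta_schedule())
ratios = snr(abar)
tau_candidate = elbow_index(abar)  # hypothetical threshold candidate
print(tau_candidate, abar[tau_candidate], ratios[tau_candidate])
```

A different schedule (for example a cosine schedule) or a different elbow criterion would move the candidate threshold, so the printed index should be read as an illustration of the search, not as the paper's value of $\tau$.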

Theorems & Definitions (8)

  • Theorem 1
  • proof : sketch of proof
  • Theorem 2
  • proof : sketch of proof
  • Theorem : Restatement of the coarse-stage theorem (thm:coarse)
  • proof
  • Theorem : Restatement of the cleaning-stage theorem (thm:cleaning)
  • proof