Leveraging Programmatically Generated Synthetic Data for Differentially Private Diffusion Training
Yujin Choi, Jinseong Park, Junyoung Byun, Jaewook Lee
TL;DR
The paper tackles the privacy-utility trade-off in differentially private diffusion models by introducing DP-SynGen, a framework that injects programmatically generated synthetic data into selected diffusion stages. By pre-training or partially training the coarse or cleaning stages on synthetic data and carefully scheduling where private data is used, DP-SynGen reduces the number of privacy-perturbed iterations and the overall privacy budget while preserving, or even improving, sample quality. The approach is supported by toy-model analyses and two main theorems, which show that there exists a diffusion time at which synthetic data yields learning outcomes comparable to those of private data, enabling stage-wise substitution without privacy leakage. Empirically, the DP-SynGen variants (Coarse, Cleaning, and FineTune) achieve competitive or superior FID and CAS scores on the MNIST, Fashion-MNIST, and CelebA datasets, particularly under tight privacy budgets, with Dead-leaves synthetic data aiding the cleaning stage and elbow-point thresholds guiding stage allocation. Overall, the work provides a practical, theoretically grounded path to improving differentially private diffusion models with synthetic data, potentially broadening the applicability of private generative modeling.
Abstract
Programmatically generated synthetic data has been used in differentially private training for classification to enhance performance without privacy leakage. However, because the synthetic data is generated by a random process, its distribution is distinguishable from that of real data, which makes transfer difficult. As a result, a model trained with the synthetic data generates unrealistic random images, which makes it challenging to adapt such data for generative models. In this work, we propose DP-SynGen, which leverages programmatically generated synthetic data in diffusion models to address this challenge. By exploiting the three stages of diffusion models (coarse, context, and cleaning), we identify the stages where synthetic data can be used effectively. We verify theoretically and empirically that the cleaning and coarse stages can be trained without private data, replacing it with synthetic data to reduce the privacy budget. The experimental results show that DP-SynGen improves the quality of generated data by mitigating the negative impact of privacy-induced noise on the generation process.
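
The stage-wise substitution described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the cleaning stage covers the least noisy timesteps and the coarse stage the noisiest ones, and that only the context stage is trained on private data. The function names `stage_of` and `data_source` and the thresholds `t_clean` and `t_coarse` are hypothetical values chosen for illustration; the paper selects its thresholds by other means (the TL;DR mentions elbow points).

```python
from collections import Counter


def stage_of(t, num_steps, t_clean, t_coarse):
    """Map a diffusion timestep to a stage name.

    Assumption: t = 0 is the least noisy step, so small t is the
    cleaning stage and large t is the coarse stage.
    """
    if not 0 <= t < num_steps:
        raise ValueError("timestep out of range")
    if t < t_clean:
        return "cleaning"
    if t >= t_coarse:
        return "coarse"
    return "context"


def data_source(stage):
    """Pick the training data for a stage.

    Assumption: only the context stage touches private data, so only
    those timesteps would need differentially private updates.
    """
    return "private" if stage == "context" else "synthetic"


# Hypothetical schedule: 1000 timesteps with illustrative thresholds.
num_steps, t_clean, t_coarse = 1000, 100, 600

counts = Counter(
    data_source(stage_of(t, num_steps, t_clean, t_coarse))
    for t in range(num_steps)
)
print(dict(counts))
```

With these illustrative thresholds, half of the timesteps fall in the cleaning or coarse stages and would be trained on synthetic data, so only the remaining timesteps would require privacy-perturbed updates.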
