On the Inherent Privacy Properties of Discrete Denoising Diffusion Models
Rongzhe Wei, Eleonora Kreačić, Haoyu Wang, Haoteng Yin, Eli Chien, Vamsi K. Potluru, Pan Li
TL;DR
This work analyzes the inherent privacy of discrete diffusion models (DDMs) for discrete data through per-instance differential privacy (pDP). It derives a data-dependent pDP bound showing that privacy leakage increases along the generative trajectory. The bound has a main term that scales with dataset size and diffusion dynamics, and an error term capturing model-training and trajectory discrepancies. Faster diffusion decay improves privacy, and the bound is tight when generating a single sample. The authors also provide data-dependent quantities and algorithms to estimate per-instance leakage on real datasets, enabling data curation when releasing synthetic data. Experiments on synthetic and real data validate the theory and reveal practical privacy-utility trade-offs, including vulnerabilities to membership-inference attacks that can be mitigated by adjusting diffusion schedules. The results suggest that while DDMs offer useful synthetic data, they generally require additional privacy mechanisms (e.g., DP-SGD or PATE) for strong guarantees. The bound also provides a principled way to identify and prune privacy-sensitive data points prior to training. The work thus contributes a rigorous, data-aware framework for assessing and potentially improving privacy in diffusion-based synthetic data generation.
Abstract
Privacy concerns have led to a surge in the creation of synthetic datasets, with diffusion models emerging as a promising avenue. Although prior studies have performed empirical evaluations on these models, there has been a gap in providing a mathematical characterization of their privacy-preserving capabilities. To address this, we present a pioneering theoretical exploration of the privacy preservation inherent in discrete diffusion models (DDMs) for discrete dataset generation. Focusing on per-instance differential privacy (pDP), our framework elucidates the potential privacy leakage for each data point in a given training dataset, offering insights into how the privacy loss of each point correlates with the dataset's distribution. Our bounds also show that, when training on a dataset of $s$ data points, the privacy leakage of the DDM surges from $(\epsilon, O(\frac{1}{s^2\epsilon}))$-pDP to $(\epsilon, O(\frac{1}{s\epsilon}))$-pDP during the transition from the pure noise phase to the synthetic clean data phase, and that a faster decay in diffusion coefficients strengthens the privacy guarantee. Finally, we empirically verify our theoretical findings on both synthetic and real-world datasets.
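For readers unfamiliar with pDP, the block below sketches the standard per-instance definition and works through the size of the gap between the two bounds quoted above. The notation is an assumption made for illustration and is not taken from this page: $\mathcal{V}_0$ denotes the fixed training dataset, $v^*$ a fixed data point in it, $\mathcal{V}_1 = \mathcal{V}_0 \setminus \{v^*\}$ the adjacent dataset, and $\mathcal{M}$ the randomized mechanism (here, training a DDM and generating from it). The numbers $s = 1000$ and $\epsilon = 1$ are hypothetical, and the constants hidden by the $O(\cdot)$ notation are ignored.

$$
\begin{aligned}
&\text{$\mathcal{M}$ is $(\epsilon, \delta)$-pDP with respect to $(\mathcal{V}_0, v^*)$ if, for every measurable set $S$ and } \{i, j\} = \{0, 1\}: \\
&\qquad \Pr[\mathcal{M}(\mathcal{V}_i) \in S] \le e^{\epsilon} \, \Pr[\mathcal{M}(\mathcal{V}_j) \in S] + \delta . \\[6pt]
&\text{Scale of $\delta$ near pure noise:} \quad \frac{1}{s^2 \epsilon} = \frac{1}{1000^2 \cdot 1} = 10^{-6}, \\
&\text{Scale of $\delta$ near clean data:} \quad \frac{1}{s \epsilon} = \frac{1}{1000 \cdot 1} = 10^{-3}, \\
&\text{Ratio of the two scales:} \quad \frac{1/(s\epsilon)}{1/(s^2\epsilon)} = s = 1000 .
\end{aligned}
$$

Unlike standard differential privacy, which takes a worst case over all datasets and points, this definition fixes both the dataset and the point, which is why the resulting bound can depend on the data. Under these illustrative values, the $\delta$ scale grows by a factor of $s$ along the generative trajectory, so the relative gap widens as the dataset grows even though both scales shrink.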
