On the Inherent Privacy Properties of Discrete Denoising Diffusion Models

Rongzhe Wei; Eleonora Kreačić; Haoyu Wang; Haoteng Yin; Eli Chien; Vamsi K. Potluru; Pan Li

TL;DR

This work analyzes the inherent privacy of discrete diffusion models (DDMs) through per-instance differential privacy (pDP). It derives a data-dependent pDP bound showing that privacy leakage increases along the generative trajectory, from pure noise toward clean synthetic data. The bound has a main term governed by the dataset size and the diffusion coefficients, and an error term capturing model-training and trajectory discrepancies; faster decay of the diffusion coefficients improves privacy, and the bound is tight when generating a single sample. The authors also provide data-dependent quantities and algorithms to estimate per-instance leakage on real datasets, enabling data curation before synthetic data is released. Experiments on synthetic and real data validate the theory and reveal practical privacy-utility trade-offs, including vulnerabilities to membership-inference attacks that can be mitigated by adjusting diffusion schedules. The results suggest that while DDMs offer useful synthetic data, they generally require additional privacy mechanisms (e.g., DP-SGD or PATE) for strong guarantees. The framework also gives a principled way to identify and prune privacy-sensitive data points prior to training.

Abstract

Privacy concerns have led to a surge in the creation of synthetic datasets, with diffusion models emerging as a promising avenue. Although prior studies have evaluated these models empirically, a mathematical characterization of their privacy-preserving capabilities has been missing. To address this, we present the pioneering theoretical exploration of the privacy preservation inherent in discrete diffusion models (DDMs) for discrete dataset generation. Focusing on per-instance differential privacy (pDP), our framework elucidates the potential privacy leakage for each data point in a given training dataset, offering insights into how the privacy loss of each point correlates with the dataset's distribution. Our bounds also show that training on $s$ data points leads to a surge in privacy leakage from $(\epsilon, O(\frac{1}{s^2\epsilon}))$-pDP to $(\epsilon, O(\frac{1}{s\epsilon}))$-pDP of the DDM during the transition from the pure noise phase to the synthetic clean data phase, and that a faster decay in diffusion coefficients strengthens the privacy guarantee. Finally, we empirically verify our theoretical findings on both synthetic and real-world datasets.
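To make the two rates quoted in the abstract concrete, the following minimal sketch tabulates $\frac{1}{s^2\epsilon}$ (pure noise phase) against $\frac{1}{s\epsilon}$ (clean data phase) for a few dataset sizes. The constants hidden in the $O(\cdot)$ terms are not stated here, so they are set to 1 purely for illustration, and the function names are hypothetical.

```python
def delta_noise_phase(s: int, eps: float, c: float = 1.0) -> float:
    """Illustrative delta of order c / (s^2 * eps); c = 1 is an assumption."""
    return c / (s ** 2 * eps)


def delta_clean_phase(s: int, eps: float, c: float = 1.0) -> float:
    """Illustrative delta of order c / (s * eps); c = 1 is an assumption."""
    return c / (s * eps)


if __name__ == "__main__":
    eps = 10.0  # illustrative privacy parameter
    for s in (100, 1_000, 10_000):
        noisy = delta_noise_phase(s, eps)
        clean = delta_clean_phase(s, eps)
        # With equal constants, the ratio clean / noisy equals s.
        print(f"s={s:>6}  noise-phase={noisy:.2e}  clean-phase={clean:.2e}")
```

Under these assumptions the gap between the two rates is a factor of $s$, which is one way to read the "surge" in leakage along the generation trajectory.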

Paper Structure (55 sections, 27 theorems, 180 equations, 11 figures, 2 algorithms)

Introduction
More Related Work
Preliminaries
Main Results
Inherent Privacy Guarantees of DDMs
Impact of DDM Coefficients and Dataset Distributions on the Privacy Bound
Characterizing Data-dependent Quantities under Simple Distributions
The Algorithm for Evaluating the Privacy Bound in Eq. \ref{eq.main_theorem_1} on a Given Dataset
Experiments
Synthetic Experiments
Experiments on Real Datasets
Effectiveness of Privacy Bound Algorithm for Data Sensitivity Assessment
Evaluation of DDM Vulnerability to Black-box Membership Inference Attacks
Conclusion
Limitations and Future Work
...and 40 more sections

Key Result

Theorem 1

Given a dataset $\mathcal{V}_0$ with size $|\mathcal{V}_0| = s+1$ and a data point $\mathbf{v}^*\in \mathcal{V}_0$ to be protected, denote $\mathcal{V}_1$ such that $\mathcal{V}_1=\mathcal{V}_0 \backslash \{\mathbf{v}^*\}$. Assume the denoising networks trained on $\mathcal{V}_0$ and $\mathcal{V}_1$ where $\psi_t, \eta_t, c_t^*$ are data-dependent quantities determined by $\mathbf{v}^*$ and $\math

Figures (11)

Figure 1: An Illustration of Discrete Diffusion Models (DDMs).
Figure 2: Illustration of Data-dependent Quantities.
Figure 3: Illustration of the correlation between dataset similarity ($\text{Sim}(\mathbf{v}_i, \mathcal{V}_0 \backslash \{\mathbf{v}_i\}), \forall \mathbf{v}_i \in \mathcal{V}_0$) and pDP Leakage.
Figure 4: pDP Leakage in Eq. \ref{eq.main_theorem_1}. LEFT: Characterization of $\frac{n}{s^{\psi_t}}$. MIDDLE: Characterization of $(1 + c_t^*)\eta_t$. RIGHT: Characterization of Privacy Leakage (Main Privacy Term). Experimental Setup: a DDM with $k = 5, n = 5, T = 20, \epsilon = 10$ is trained on a dataset with $s = 1000$ following the distribution in Sec. \ref{subsec.eg} with parameter $p$. $\mathbf{v}^*$ is fixed so that each column has a non-majority category. Results are based on 5 independent runs.
Figure 5: LEFT: Privacy Leakage at $t = 1$ (Noise-free Regime). MIDDLE: Privacy Leakage at $t = 50$ (Noisy Regime). RIGHT: Privacy Leakage w.r.t. Decay Rate under Linear ($\alpha_t = 1 - \text{decay rate} \cdot \frac{t}{T}$) and Sigmoid ($\alpha_t = \frac{\text{Sigmoid}(3 \cdot \text{decay rate}) - \text{Sigmoid}(\frac{3t}{T} \cdot \text{decay rate})}{\text{Sigmoid}(3 \cdot \text{decay rate}) - 0.5}$) Schedules. Results are based on $5$ independent runs.
...and 6 more figures
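
The two schedules named in the Figure 5 caption can be written out directly. The sketch below is a plain transcription of the caption's formulas for $\alpha_t$; the function names and the example values $T = 50$ and decay rate $0.8$ are assumptions chosen only for illustration, not settings taken from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def linear_alpha(t: int, T: int, decay_rate: float) -> float:
    """Linear schedule from the caption: alpha_t = 1 - decay_rate * t / T."""
    return 1.0 - decay_rate * t / T


def sigmoid_alpha(t: int, T: int, decay_rate: float) -> float:
    """Sigmoid schedule from the caption.

    Requires decay_rate > 0; otherwise the denominator is zero.
    """
    top = sigmoid(3.0 * decay_rate)
    return (top - sigmoid(3.0 * t / T * decay_rate)) / (top - 0.5)


if __name__ == "__main__":
    T, rate = 50, 0.8  # hypothetical values for illustration
    for t in (0, 10, 25, 50):
        print(t, round(linear_alpha(t, T, rate), 4), round(sigmoid_alpha(t, T, rate), 4))
```

Both schedules start at $\alpha_0 = 1$. As written in the caption, the sigmoid schedule reaches $\alpha_T = 0$ for any positive decay rate, while the linear schedule ends at $1 - \text{decay rate}$.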

Theorems & Definitions (48)

Definition 1: ($\epsilon, \delta$)-Per-instance Differential Privacy (pDP) \cite{wang2019per}
Theorem 1: Inherent pDP Guarantees for DDMs
Theorem 2: Lower Bound on Inherent pDP Guarantees for DDMs
Theorem 3: Inherent DP Guarantee for DDMs (Informal)
Lemma A.1: Characterizing pDP with Coupled KL Divergence
Lemma A.2
Lemma A.3
Lemma A.4: Upper Bounding Coupled Conditional KL
Lemma A.5
Lemma A.6
...and 38 more
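
Definition 1 is listed above by title only. In its commonly stated form (the paper's exact wording may differ), a mechanism $\mathcal{M}$ is $(\epsilon, \delta)$-pDP with respect to $(\mathcal{V}_0, \mathbf{v}^*)$ if, for every measurable output set $O$, $\Pr[\mathcal{M}(\mathcal{V}_0) \in O] \le e^{\epsilon} \Pr[\mathcal{M}(\mathcal{V}_1) \in O] + \delta$ and the same inequality holds with $\mathcal{V}_0$ and $\mathcal{V}_1$ swapped. For a finite output space, the smallest such $\delta$ at a given $\epsilon$ is the larger of the two hockey-stick divergences between the output distributions. The sketch below computes it for two made-up distributions; it illustrates the definition only and is not the paper's bound or its evaluation algorithm.

```python
import math
from typing import Sequence


def hockey_stick(p: Sequence[float], q: Sequence[float], eps: float) -> float:
    """sum_i max(p_i - e^eps * q_i, 0).

    This is the smallest delta with P(O) <= e^eps * Q(O) + delta for all O,
    attained by the set O of outcomes where p_i > e^eps * q_i.
    """
    scale = math.exp(eps)
    return sum(max(pi - scale * qi, 0.0) for pi, qi in zip(p, q))


def smallest_delta(p: Sequence[float], q: Sequence[float], eps: float) -> float:
    """Tightest delta satisfying both directions of the pDP inequality."""
    return max(hockey_stick(p, q, eps), hockey_stick(q, p, eps))


if __name__ == "__main__":
    # Hypothetical output distributions of a generator trained with and
    # without the protected point; the numbers are invented for illustration.
    with_point = [0.5, 0.3, 0.2]
    without_point = [0.4, 0.4, 0.2]
    for eps in (0.0, 0.1, 1.0):
        print(eps, smallest_delta(with_point, without_point, eps))
```

At $\epsilon = 0$ the quantity reduces to the total variation distance between the two distributions, and it shrinks to zero once $e^{\epsilon}$ exceeds the largest probability ratio.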
