On the Inherent Privacy Properties of Discrete Denoising Diffusion Models

Rongzhe Wei, Eleonora Kreačić, Haoyu Wang, Haoteng Yin, Eli Chien, Vamsi K. Potluru, Pan Li

TL;DR

This work analyzes the inherent privacy of discrete diffusion models (DDMs) for discrete data through per-instance differential privacy (pDP). It derives a data-dependent pDP bound showing that privacy leakage increases along the generative trajectory, with a main term that scales with dataset size and diffusion dynamics, and an error term capturing model-training and trajectory discrepancies; faster diffusion decay improves privacy, and the bound is tight when generating a single sample. The authors also provide data-dependent quantities and algorithms to estimate per-instance leakage on real datasets, enabling data curation when releasing synthetic data. Experiments on synthetic and real data validate the theory and reveal practical privacy-utility trade-offs, including vulnerabilities to membership-inference attacks that can be mitigated by adjusting diffusion schedules. The results suggest that while DDMs offer useful synthetic data, they generally require additional privacy mechanisms (e.g., DP-SGD or PATE) for strong guarantees, and they provide a principled way to identify and prune privacy-sensitive data points prior to training. The work thus contributes a rigorous, data-aware framework for assessing and potentially improving privacy in diffusion-based synthetic data generation.

Abstract

Privacy concerns have led to a surge in the creation of synthetic datasets, with diffusion models emerging as a promising avenue. Although prior studies have performed empirical evaluations on these models, there has been a gap in providing a mathematical characterization of their privacy-preserving capabilities. To address this, we present the pioneering theoretical exploration of the privacy preservation inherent in discrete diffusion models (DDMs) for discrete dataset generation. Focusing on per-instance differential privacy (pDP), our framework elucidates the potential privacy leakage for each data point in a given training dataset, offering insights into how the privacy loss of each point correlates with the dataset's distribution. Our bounds also show that training with $s$-sized data points leads to a surge in privacy leakage from $(\epsilon, O(\frac{1}{s^2\epsilon}))$-pDP to $(\epsilon, O(\frac{1}{s\epsilon}))$-pDP of the DDM during the transition from the pure noise to the synthetic clean data phase, and a faster decay in diffusion coefficients amplifies the privacy guarantee. Finally, we empirically verify our theoretical findings on both synthetic and real-world datasets.
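
To make the scaling in the abstract concrete, the following minimal sketch compares the two $\delta$ orders for a fixed $\epsilon$. It assumes the constants hidden inside $O(\cdot)$ equal 1, which the paper does not claim, so only the growth in $s$ and $\epsilon$ is meaningful; the function name `delta_orders` is hypothetical.

```python
from typing import Tuple


def delta_orders(s: int, eps: float) -> Tuple[float, float]:
    """Leading-order delta at the pure-noise end and at the clean-data end.

    Assumption: every constant hidden in O(.) is set to 1, so the returned
    numbers illustrate scaling only and are not actual privacy guarantees.
    """
    if s <= 0 or eps <= 0:
        raise ValueError("s and eps must be positive")
    delta_noise = 1.0 / (s ** 2 * eps)  # order of delta near pure noise
    delta_clean = 1.0 / (s * eps)       # order of delta near clean samples
    return delta_noise, delta_clean


if __name__ == "__main__":
    for s in (100, 1000, 10000):
        d_noise, d_clean = delta_orders(s, eps=1.0)
        # The gap between the two ends of the trajectory grows linearly in s.
        print(s, d_noise, d_clean, d_clean / d_noise)
```

Under this assumption the ratio between the two orders is exactly $s$: larger training sets help at both ends of the trajectory, but the relative increase in leakage from noise to clean data widens.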

Paper Structure (55 sections, 27 theorems, 180 equations, 11 figures, 2 algorithms)

Key Result

Theorem 1

Given a dataset $\mathcal{V}_0$ with size $|\mathcal{V}_0| = s+1$ and a data point $\mathbf{v}^*\in \mathcal{V}_0$ to be protected, denote $\mathcal{V}_1$ such that $\mathcal{V}_1=\mathcal{V}_0 \backslash \{\mathbf{v}^*\}$. Assume the denoising networks trained on $\mathcal{V}_0$ and $\mathcal{V}_1$ where $\psi_t, \eta_t, c_t^*$ are data-dependent quantities determined by $\mathbf{v}^*$ and ...

Figures (11)

  • Figure 1: An Illustration of Discrete Diffusion Models (DDMs).
  • Figure 2: Illustration of Data-dependent Quantities.
  • Figure 3: Illustration of the correlation between dataset similarity ($\text{Sim}(\mathbf{v}_i, \mathcal{V}_0 \backslash \{\mathbf{v}_i\}), \forall \mathbf{v}_i \in \mathcal{V}_0$) and pDP Leakage.
  • Figure 4: pDP Leakage in Eq. \ref{eq.main_theorem_1}: LEFT: Characterization of $\frac{n}{s^{\psi_t}}$. MIDDLE: Characterization of $(1 + c_t^*)\eta_t$. RIGHT: Characterization of Privacy Leakage (Main Privacy Term). Experimental Setup: Given a specific DDM design $k = 5, n = 5, T = 20, \epsilon = 10$ trained on a dataset with $s = 1000$ following the distribution in Sec. \ref{subsec.eg} with parameter $p$. Fix $\mathbf{v}^*$ so that each column takes a non-majority category. Results are based on 5 independent runs.
  • Figure 5: LEFT: Privacy Leakage at $t = 1$ (Noise-free Regime). MIDDLE: Privacy Leakage at $t = 50$ (Noisy Regime). RIGHT: Privacy Leakage w.r.t. Decay Rate under Linear ($\alpha_t = 1 - \text{decay rate} \cdot \frac{t}{T}$) and Sigmoid ($\alpha_t = \frac{\text{Sigmoid}(3 \cdot \text{decay rate}) - \text{Sigmoid}(\frac{3t}{T} \cdot \text{decay rate})}{\text{Sigmoid}(3 \cdot \text{decay rate}) - 0.5}$) Schedules. Results are based on 5 independent runs.
  • ...and 6 more figures
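
The two schedules named in the Figure 5 caption can be evaluated directly from the formulas given there. The sketch below is an illustration only: the function names, the horizon `T = 100`, and the decay rates are assumptions chosen for the demo, not values taken from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def linear_alpha(t: int, T: int, decay_rate: float) -> float:
    """Linear schedule from the caption: alpha_t = 1 - decay_rate * t / T."""
    return 1.0 - decay_rate * t / T


def sigmoid_alpha(t: int, T: int, decay_rate: float) -> float:
    """Sigmoid schedule from the caption; requires decay_rate > 0.

    alpha_t = (S(3r) - S(3 t r / T)) / (S(3r) - 0.5), with S the logistic
    function and r the decay rate, so alpha_0 = 1 and alpha_T = 0.
    """
    if decay_rate <= 0:
        raise ValueError("decay_rate must be positive")
    top = sigmoid(3.0 * decay_rate)
    return (top - sigmoid(3.0 * t / T * decay_rate)) / (top - 0.5)


if __name__ == "__main__":
    T = 100  # assumed horizon for illustration
    for rate in (0.5, 1.0):  # assumed decay rates
        for t in (0, 1, 50, T):
            print(rate, t, linear_alpha(t, T, rate), sigmoid_alpha(t, T, rate))
```

Both schedules start at $\alpha_0 = 1$. The linear schedule ends at $1 - \text{decay rate}$, while the sigmoid schedule ends at $0$ for any positive decay rate, so the decay rate controls how quickly the coefficient falls rather than where it ends.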

Theorems & Definitions (48)

  • Definition 1: ($\epsilon, \delta$)-Per-instance Differential Privacy (pDP) (Wang, 2019)
  • Theorem 1: Inherent pDP Guarantees for DDMs
  • Theorem 2: Lower Bound on Inherent pDP Guarantees for DDMs
  • Theorem 3: Inherent DP Guarantee for DDMs (Informal)
  • Lemma A.1: Characterizing pDP with Coupled KL Divergence
  • Lemma A.2
  • Lemma A.3
  • Lemma A.4: Upper Bounding Coupled Conditional KL
  • Lemma A.5
  • Lemma A.6
  • ...and 38 more
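
Definition 1 is only listed here by name. In the standard per-instance formulation, a mechanism is $(\epsilon, \delta)$-pDP for a fixed pair $(\mathcal{V}_0, \mathbf{v}^*)$ if, for every output set, the output probabilities under $\mathcal{V}_0$ and under $\mathcal{V}_1 = \mathcal{V}_0 \backslash \{\mathbf{v}^*\}$ differ by at most a factor $e^{\epsilon}$ plus $\delta$, in both directions. As a hedged illustration of that condition, and not the paper's estimation algorithm, the sketch below computes the smallest such $\delta$ for two finite output distributions; `p0` and `p1` are hypothetical toy distributions.

```python
import math
from typing import Sequence


def hockey_stick(p: Sequence[float], q: Sequence[float], eps: float) -> float:
    """sup over output sets S of P(S) - e^eps * Q(S) for finite distributions.

    The supremum is attained by collecting every outcome where the
    difference p(x) - e^eps * q(x) is positive.
    """
    if len(p) != len(q):
        raise ValueError("distributions must share the same support")
    scale = math.exp(eps)
    return sum(max(pi - scale * qi, 0.0) for pi, qi in zip(p, q))


def smallest_delta(p0: Sequence[float], p1: Sequence[float], eps: float) -> float:
    """Smallest delta making the two-sided (eps, delta) condition hold."""
    return max(hockey_stick(p0, p1, eps), hockey_stick(p1, p0, eps))


if __name__ == "__main__":
    # Toy output distributions with and without the protected point (assumed).
    p0 = [0.6, 0.4]
    p1 = [0.5, 0.5]
    for eps in (0.0, 0.1, 1.0):
        print(eps, smallest_delta(p0, p1, eps))
```

Because the pair of datasets is fixed rather than worst-case, the resulting $\delta$ depends on how much the specific point $\mathbf{v}^*$ shifts the output distribution, which is the sense in which the guarantee is per-instance.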