DP-aware AdaLN-Zero: Taming Conditioning-Induced Heavy-Tailed Gradients in Differentially Private Diffusion

Tao Huang, Jiayang Meng, Xu Yang, Chen Hou, Hong Chen

TL;DR

This paper proposes DP-aware AdaLN-Zero, a drop-in, sensitivity-aware conditioning mechanism for conditional diffusion transformers that limits conditioning-induced gain without modifying the DP-SGD mechanism. It shows that sensitivity-aware conditioning can substantially improve private conditional diffusion training without sacrificing standard (non-private) performance.

Abstract

Condition injection enables diffusion models to generate context-aware outputs, which is essential for many time-series tasks. However, heterogeneous conditional contexts (e.g., observed history, missingness patterns or outlier covariates) can induce heavy-tailed per-example gradients. Under Differentially Private Stochastic Gradient Descent (DP-SGD), these rare conditioning-driven heavy-tailed gradients disproportionately trigger global clipping, resulting in outlier-dominated updates, larger clipping bias, and degraded utility under a fixed privacy budget. In this paper, we propose DP-aware AdaLN-Zero, a drop-in sensitivity-aware conditioning mechanism for conditional diffusion transformers that limits conditioning-induced gain without modifying the DP-SGD mechanism. DP-aware AdaLN-Zero jointly constrains conditioning representation magnitude and AdaLN modulation parameters via bounded re-parameterization, suppressing extreme gradient tail events before gradient clipping and noise injection. Empirically, DP-SGD equipped with DP-aware AdaLN-Zero improves interpolation/imputation and forecasting under matched privacy settings. We observe consistent gains on a real-world power dataset and two public ETT benchmarks over vanilla DP-SGD. Moreover, gradient diagnostics attribute these improvements to conditioning-specific tail reshaping and reduced clipping distortion, while preserving expressiveness in non-private training. Overall, these results show that sensitivity-aware conditioning can substantially improve private conditional diffusion training without sacrificing standard performance.
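The abstract describes bounding both the conditioning representation and the AdaLN modulation parameters through a bounded re-parameterization. The following is a minimal sketch of that idea, not the authors' implementation. The tanh squashing, the norm rescaling, the helper names (`bound_norm`, `bounded`, `adaln_modulate`, `gated_residual`), and the limits `g_max`, `b_max`, `a_max` are all assumptions chosen for illustration; the paper's exact parameterization is not given in this summary.

```python
import math

def bound_norm(vec, max_norm):
    """Rescale vec so its L2 norm is at most max_norm (assumed bounding rule)."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm <= max_norm or norm == 0.0:
        return list(vec)
    scale = max_norm / norm
    return [v * scale for v in vec]

def bounded(raw, limit):
    """Map an unconstrained value into [-limit, limit] via tanh (assumed choice)."""
    return limit * math.tanh(raw)

def layer_norm(x, eps=1e-5):
    """Plain layer normalization without learned affine parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def adaln_modulate(x, raw_gamma, raw_beta, g_max, b_max):
    """Compute LN(x) * (1 + gamma) + beta with gamma and beta bounded elementwise."""
    h = layer_norm(x)
    return [
        hv * (1.0 + bounded(rg, g_max)) + bounded(rb, b_max)
        for hv, rg, rb in zip(h, raw_gamma, raw_beta)
    ]

def gated_residual(x, branch, raw_alpha, a_max):
    """Compute x + alpha * branch with a bounded gate; raw_alpha = 0 gives alpha = 0."""
    return [
        xv + bounded(ra, a_max) * bv
        for xv, bv, ra in zip(x, branch, raw_alpha)
    ]

# Hypothetical usage: even extreme raw modulation values stay within the limits.
cond = bound_norm([3.0, 4.0], max_norm=1.0)   # conditioning vector, norm capped at 1
x = [1.0, 2.0, 3.0, 4.0]
mod = adaln_modulate(x, [50.0] * 4, [-50.0] * 4, g_max=0.5, b_max=0.1)
out = gated_residual(x, mod, [0.0] * 4, a_max=0.5)  # zero-initialized gate
```

Because tanh(0) = 0, a zero-initialized raw gate still yields a zero gate, so the block starts as an identity mapping as in AdaLN-Zero, while the gain that any conditioning input can induce stays capped by the chosen limits.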

Paper Structure

This paper contains 55 sections, 1 theorem, 54 equations, 5 figures, and 19 tables.

Key Result

Proposition 3.1

Under the DP-aware constraints (Eq. \ref{eq:dp_aware_bounds_main}) and the standard regularity assumptions, there exist non-negative, architecture-dependent constants $A_0,a_c,a_\gamma,a_\beta,a_\alpha \ge 0$, such that for every training example $z=(x,\mathbf{c})$,

Figures (5)

  • Figure 1: Condition-amplified extremes exist even without DP. We compare normal training with training under DP-aware constraints, both without DP. Under normal training, $\|g_{\mathrm{cond}}\|_2$ is comparable to $\|g_{\mathrm{other}}\|_2$ at typical quantiles (see the blue curve in Figure \ref{fig:gradnorm:ecdf:no-dp}), but $\|g_{\mathrm{cond}}\|_2$ exhibits rarer and heavier high-end tail events (see the blue curve in Figure \ref{fig:gradnorm:tail:no-dp}). In contrast, DP-aware constraints selectively suppress the high-end tail of $\|g_{\mathrm{cond}}\|_2$ (and consequently that of $\|g\|_2$) far more than that of $\|g_{\mathrm{other}}\|_2$: the 99th percentile ($p99$) drops by $\sim 3.5\times$ for $\|g_{\mathrm{cond}}\|_2$ vs. $\sim 1.2\times$ for $\|g_{\mathrm{other}}\|_2$. This indicates targeted suppression of conditioning-induced amplification rather than uniform shrinkage, and suggests that (in DP-SGD) clipping events would be disproportionately governed by rare conditioning-path extremes.
  • Figure 2: Gradient-norm distributions under DP training. Compared to DP-vanilla, DP-aware primarily suppresses the extreme tail of the conditioning pathway with minimal impact on the bulk of the distribution, indicating fewer condition-amplified outliers rather than uniform gradient shrinkage.
  • Figure 3: Gradient distributions under DP training (all $\sigma$ settings).
  • Figure 4: Clipping behavior under threshold $C$ (all $\sigma$ settings). We compare DP-vanilla and DP-aware in terms of clipping factor $\eta=\min(1,\frac{C}{\|g_{\mathrm{total}}\|})$ and clipping activation rate $p_{\mathrm{clip}}=\mathbb{P}(\|g_{\mathrm{total}}\|>C)$.
  • Figure 5: Training loss dynamics on PrivatePower. Training MSE loss curves for Non-DP, DP-vanilla, and DP-aware under several DP noise multipliers. Larger $\sigma$ yields a higher loss floor, while DP-aware closely follows DP-vanilla across noise levels.
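The diagnostics named in the captions above (the clipping factor $\eta$, the clipping activation rate $p_{\mathrm{clip}}$, and a high quantile such as $p99$) can be estimated from a list of per-example gradient norms. The sketch below is illustrative only: the function names, the nearest-rank quantile rule, and the sample norms are assumptions, not values or code from the paper.

```python
import math

def clip_factor(grad_norm, C):
    """eta = min(1, C / ||g||); equals 1 when the gradient is not clipped."""
    if grad_norm == 0.0:
        return 1.0
    return min(1.0, C / grad_norm)

def clip_rate(grad_norms, C):
    """Empirical estimate of p_clip = P(||g|| > C) over a set of examples."""
    return sum(1 for n in grad_norms if n > C) / len(grad_norms)

def quantile(values, q):
    """Nearest-rank quantile (one of several common conventions)."""
    s = sorted(values)
    idx = max(0, math.ceil(q * len(s)) - 1)
    return s[idx]

# Hypothetical per-example total gradient norms and clipping threshold.
norms = [0.5, 1.0, 2.0, 4.0]
C = 1.0

etas = [clip_factor(n, C) for n in norms]   # [1.0, 1.0, 0.5, 0.25]
p_clip = clip_rate(norms, C)                # 0.5: two of four norms exceed C
p99 = quantile(norms, 0.99)                 # 4.0 under the nearest-rank rule
```

Computing these statistics separately for the conditioning-path norm and the remaining-parameter norm would make it possible to check whether a change in $p_{\mathrm{clip}}$ comes from the tail of one pathway rather than from uniform shrinkage.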

Theorems & Definitions (1)

  • Proposition 3.1: Per-Example Gradient Bound with DP-aware Constraints