DP-aware AdaLN-Zero: Taming Conditioning-Induced Heavy-Tailed Gradients in Differentially Private Diffusion

Tao Huang; Jiayang Meng; Xu Yang; Chen Hou; Hong Chen

TL;DR

This paper proposes DP-aware AdaLN-Zero, a drop-in, sensitivity-aware conditioning mechanism for conditional diffusion transformers that limits conditioning-induced gain without modifying the DP-SGD mechanism. The results show that sensitivity-aware conditioning can substantially improve private conditional diffusion training without sacrificing standard (non-private) performance.

Abstract

Condition injection enables diffusion models to generate context-aware outputs, which is essential for many time-series tasks. However, heterogeneous conditional contexts (e.g., observed history, missingness patterns or outlier covariates) can induce heavy-tailed per-example gradients. Under Differentially Private Stochastic Gradient Descent (DP-SGD), these rare conditioning-driven heavy-tailed gradients disproportionately trigger global clipping, resulting in outlier-dominated updates, larger clipping bias, and degraded utility under a fixed privacy budget. In this paper, we propose DP-aware AdaLN-Zero, a drop-in sensitivity-aware conditioning mechanism for conditional diffusion transformers that limits conditioning-induced gain without modifying the DP-SGD mechanism. DP-aware AdaLN-Zero jointly constrains conditioning representation magnitude and AdaLN modulation parameters via bounded re-parameterization, suppressing extreme gradient tail events before gradient clipping and noise injection. Empirically, DP-SGD equipped with DP-aware AdaLN-Zero improves interpolation/imputation and forecasting under matched privacy settings. We observe consistent gains on a real-world power dataset and two public ETT benchmarks over vanilla DP-SGD. Moreover, gradient diagnostics attribute these improvements to conditioning-specific tail reshaping and reduced clipping distortion, while preserving expressiveness in non-private training. Overall, these results show that sensitivity-aware conditioning can substantially improve private conditional diffusion training without sacrificing standard performance.
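To make the clipping argument concrete, the following is a minimal sketch of one vanilla DP-SGD aggregation step (per-example global clipping to norm $C$, Gaussian noise with standard deviation $\sigma C$, then averaging). It is not the authors' implementation: the function names, the toy gradients, and the plain-list representation are assumptions made for illustration only.

```python
import math
import random


def l2_norm(vec):
    """Euclidean norm of a flat gradient vector."""
    return math.sqrt(sum(v * v for v in vec))


def clip_to_norm(grad, clip_norm):
    """Rescale grad so that its L2 norm is at most clip_norm."""
    norm = l2_norm(grad)
    factor = min(1.0, clip_norm / norm) if norm > 0.0 else 1.0
    return [g * factor for g in grad]


def dp_sgd_noisy_mean(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        clipped = clip_to_norm(grad, clip_norm)
        for i in range(dim):
            total[i] += clipped[i]
    std = noise_multiplier * clip_norm  # noise scale is tied to the clip norm
    noisy = [t + rng.gauss(0.0, std) for t in total]
    return [v / len(per_example_grads) for v in noisy]


# Hypothetical per-example gradients: two typical examples and one
# heavy-tailed outlier (e.g. driven by an unusual conditioning context).
grads = [[0.3, 0.4], [0.6, 0.8], [30.0, 40.0]]
clipped_norms = [l2_norm(clip_to_norm(g, 1.0)) for g in grads]

# Noise is switched off here only so the effect of clipping is visible.
update = dp_sgd_noisy_mean(
    grads, clip_norm=1.0, noise_multiplier=0.0, rng=random.Random(0)
)
```

In this toy batch the outlier is scaled down by a factor of 50, while the typical gradients pass through unchanged. The noise scale, however, is set by $C$ regardless of how many examples actually reach it, which is why suppressing extreme tail events before clipping can matter under a fixed privacy budget.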

Paper Structure (55 sections, 1 theorem, 54 equations, 5 figures, 19 tables)

Introduction
Related Work
Diffusion models and conditional generation.
Diffusion models for time series.
Differential privacy for deep learning.
Differentially private diffusion models.
Limitations in differentially private diffusion work.
Method
Preliminaries
Conditional Diffusion.
Vanilla DP-SGD.
Structural Limits of Global Clipping
DP-aware AdaLN-Zero Design
Sensitivity Analysis of DP-Aware AdaLN-Zero
Diagnostic of Gradient Dynamics
...and 40 more sections

Key Result

Proposition 3.1

Under the DP-aware constraints (Eq. \eqref{eq:dp_aware_bounds_main}) and the standard regularity assumptions, there exist non-negative, architecture-dependent constants $A_0, a_c, a_\gamma, a_\beta, a_\alpha \ge 0$ such that for every training example $z=(x,\mathbf{c})$,

Figures (5)

Figure 1: Condition-amplified extremes exist even without DP. We compare normal training with training under DP-aware constraints, both without DP. Under normal training, $\|g_{\mathrm{cond}}\|_2$ is comparable to $\|g_{\mathrm{other}}\|_2$ at typical quantiles (see the blue curve in Figure \ref{fig:gradnorm:ecdf:no-dp}), but $\|g_{\mathrm{cond}}\|_2$ exhibits rarer and heavier high-end tail events (see the blue curve in Figure \ref{fig:gradnorm:tail:no-dp}). In contrast, DP-aware constraints selectively suppress the high-end tail of $\|g_{\mathrm{cond}}\|_2$ (and consequently that of $\|g\|_2$) far more than that of $\|g_{\mathrm{other}}\|_2$: the 99th percentile ($p99$) drops by $\sim 3.5\times$ for $\|g_{\mathrm{cond}}\|_2$ vs. $\sim 1.2\times$ for $\|g_{\mathrm{other}}\|_2$. This indicates targeted suppression of conditioning-induced amplification rather than uniform shrinkage, and suggests that, under DP-SGD, clipping events would be disproportionately governed by rare conditioning-path extremes.
Figure 2: Gradient-norm distributions under DP training. Compared to DP-vanilla, DP-aware primarily suppresses the extreme tail of the conditioning pathway with minimal impact on the bulk of the distribution, indicating fewer condition-amplified outliers rather than uniform gradient shrinkage.
Figure 3: Gradient distributions under DP training (all $\sigma$ settings).
Figure 4: Clipping behavior under threshold $C$ (all $\sigma$ settings). We compare DP-vanilla and DP-aware in terms of clipping factor $\eta=\min(1,\frac{C}{\|g_{\mathrm{total}}\|})$ and clipping activation rate $p_{\mathrm{clip}}=\mathbb{P}(\|g_{\mathrm{total}}\|>C)$.
Figure 5: Training loss dynamics on PrivatePower. Training MSE loss curves for Non-DP, DP-vanilla, and DP-aware under several DP noise multipliers. Larger $\sigma$ yields a higher loss floor, while DP-aware closely follows DP-vanilla across noise levels.
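
The diagnostics named in the captions above (the clipping factor $\eta$, the clipping activation rate $p_{\mathrm{clip}}$, and tail quantiles such as $p99$) can be computed from a list of per-example gradient norms. The sketch below is illustrative only: the helper names, the nearest-rank quantile convention, and the sample norms are assumptions, not values or code from the paper.

```python
import math


def clipping_factor(grad_norm, clip_norm):
    """eta = min(1, C / ||g_total||); a zero gradient is left unscaled."""
    if grad_norm <= 0.0:
        return 1.0
    return min(1.0, clip_norm / grad_norm)


def clipping_rate(grad_norms, clip_norm):
    """Empirical estimate of p_clip = P(||g_total|| > C)."""
    return sum(1 for n in grad_norms if n > clip_norm) / len(grad_norms)


def quantile(values, q):
    """Nearest-rank quantile (one of several common conventions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Made-up per-example gradient norms with a single heavy-tail outlier.
norms = [0.2, 0.5, 0.8, 1.0, 4.0]
C = 1.0  # hypothetical clipping threshold

factors = [clipping_factor(n, C) for n in norms]
p_clip = clipping_rate(norms, C)
p99 = quantile(norms, 0.99)
median = quantile(norms, 0.5)
```

With these toy values only the outlier exceeds $C$, so a single example accounts for the entire clipping activity. Comparing `p99` with `median` gives a rough sense of how far the tail sits from the bulk of the distribution.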

Theorems & Definitions (1)

Proposition 3.1: Per-Example Gradient Bound with DP-aware Constraints
