Exploring Magnitude Preservation and Rotation Modulation in Diffusion Transformers

Eric Tillman Bill, Cristian Perez Jensen, Sotiris Anagnostidis, Dimitri von Rütte

TL;DR

This work addresses training stability in diffusion transformers by enforcing magnitude preservation throughout the DiT architecture and by introducing rotation modulation, a conditioning method based on learned rotations. The authors develop theoretical guarantees and practical mechanisms, including cosine attention, weight normalization, and rotation-based conditioning, that keep activation magnitudes bounded without traditional normalization layers. Ablations on small-scale DiT models show that the magnitude-preserving design reduces FID-10K by about 12.8%, and that rotation modulation combined with scaling is competitive with AdaLN while using about 5.4% fewer parameters. The paper also analyzes how activation magnitudes evolve during training and how the models converge. The authors plan to release their implementation publicly and see potential applicability to larger models and text-to-image settings.

Abstract

Denoising diffusion models exhibit remarkable generative capabilities, but remain challenging to train due to their inherent stochasticity, where high-variance gradient estimates lead to slow convergence. Previous works have shown that magnitude preservation helps stabilize training in the U-Net architecture. This work explores whether this effect extends to the Diffusion Transformer (DiT) architecture. As such, we propose a magnitude-preserving design that stabilizes training without normalization layers. Motivated by the goal of maintaining activation magnitudes, we additionally introduce rotation modulation, a novel conditioning method that uses learned rotations instead of traditional scaling or shifting. Through empirical evaluations and ablation studies on small-scale models, we show that magnitude-preserving strategies significantly improve performance, notably reducing FID scores by $\sim$12.8%. Further, we show that rotation modulation combined with scaling is competitive with AdaLN, while requiring $\sim$5.4% fewer parameters. This work provides insights into conditioning strategies and magnitude control. We will publicly release the implementation of our method.
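To make the idea of conditioning by rotation concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the magnitude $\mathcal{M}[\mathbf{x}]$ is the root-mean-square of the entries, and that the rotation acts on consecutive pairs of channels with one angle per pair; neither detail is stated in the abstract. The names `magnitude` and `rotation_modulate` and the fixed `angles` are hypothetical; in a real model the angles would be predicted from the conditioning signal.

```python
import math


def magnitude(x):
    """Root-mean-square magnitude of a vector (assumed definition of M[x])."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def rotation_modulate(x, angles):
    """Rotate each consecutive channel pair (x[2i], x[2i+1]) by angles[i].

    Assumption: the rotation is block-diagonal with 2x2 blocks. Each block is
    orthogonal, so the magnitude of the whole vector is left unchanged.
    """
    if len(x) != 2 * len(angles):
        raise ValueError("need exactly one angle per pair of channels")
    out = []
    for i, theta in enumerate(angles):
        a, b = x[2 * i], x[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([c * a - s * b, s * a + c * b])
    return out


# Hypothetical activation vector and conditioning-dependent angles.
x = [0.5, -1.0, 2.0, 0.25]
angles = [0.3, -1.2]
y = rotation_modulate(x, angles)

# The entries change, but the magnitude does not (up to rounding).
print(magnitude(x), magnitude(y))
```

Unlike a learned scale or shift, a rotation cannot change the magnitude of the activations, which is why it fits naturally into a magnitude-preserving design.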

Paper Structure

This paper contains 25 sections, 7 theorems, 28 equations, 14 figures, 6 tables.

Key Result

Lemma 1

Let $\mathbf{A} \in \mathbb{R}^{T \times T}$ be an unnormalized attention map and $\mathbf{V} \in \mathbb{R}^{T \times n}$ with $\mathcal{M}[\mathbf{v}_t] = \sigma$ for all $t \in [T]$. Further, define attention as … Then, $\mathcal{M}[\operatorname{att}(\mathbf{A},\mathbf{V})_t] \leq \sigma$ for all $t \in [T]$. As $\beta \to 0$, …
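The bound in the lemma can be checked numerically. The sketch below rests on two assumptions that the text above does not confirm: attention is taken to be a row-wise softmax of $\mathbf{A}$ followed by a weighted sum of the rows of $\mathbf{V}$, and $\mathcal{M}$ is taken to be the root-mean-square of the entries. Under those assumptions each output row is a convex combination of vectors of magnitude $\sigma$, so by the triangle inequality its magnitude cannot exceed $\sigma$. All function names are hypothetical.

```python
import math
import random


def magnitude(x):
    """Root-mean-square magnitude of a vector (assumed definition of M[x])."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(a - m) for a in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(A, V):
    """Assumed attention: softmax each row of A, then average the rows of V."""
    n = len(V[0])
    out = []
    for row in A:
        w = softmax(row)
        out.append([sum(w[s] * V[s][j] for s in range(len(V))) for j in range(n)])
    return out


def rescale(v, sigma):
    """Rescale v so that its magnitude equals sigma."""
    m = magnitude(v)
    return [sigma * x / m for x in v]


rng = random.Random(0)
T, n, sigma = 6, 8, 1.5

# Value vectors with magnitude exactly sigma, and arbitrary attention scores.
V = [rescale([rng.gauss(0.0, 1.0) for _ in range(n)], sigma) for _ in range(T)]
A = [[rng.gauss(0.0, 2.0) for _ in range(T)] for _ in range(T)]

out = attention(A, V)
max_mag = max(magnitude(row) for row in out)
print(max_mag <= sigma + 1e-9)
```

Because the output magnitude can only shrink under this form of attention, a magnitude-preserving design has to compensate for the loss elsewhere rather than rely on attention to keep magnitudes constant.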

Figures (14)

  • Figure 1: Effect of magnitude preservation and weight control on DiT-S/4.
  • Figure 2: Four samples from all models and configurations, where all samples were generated with the same seed, a guidance scale of 5.0, and an EMA relative standard deviation of 10%. The samples include "alp" (970), "schooner" (780), "cock" (7), and "daisy" (985).
  • Figure 3: EMA decaying factor curves. Decaying factor $Z(t-1)/Z(t)$ ($y$-axis) over the first 1000 steps ($x$-axis) for various relative standard deviations.
  • Figure 4: FID-10K across varying $\sigma_{\mathrm{rel}}$ values. Experiments were conducted with DiT-XS/2 under Config E.
  • Figure 5: Activation magnitude evolution across DiT blocks in DiT-S/4. For blocks 1–12, we show mean activation magnitudes (averaged over all labels and timesteps), with shaded areas representing $\pm3$ standard deviations. Plotted are the AdaLN-modulated input (see the section on modulation) and the output after the residual connection, for both the self-attention and MLP modules. Results are shown at initialization (left) and after 400K training steps (right) for each configuration.
  • ...and 9 more figures

Theorems & Definitions (7)

  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Lemma 6
  • Corollary 7