Exploring Magnitude Preservation and Rotation Modulation in Diffusion Transformers

Eric Tillman Bill; Cristian Perez Jensen; Sotiris Anagnostidis; Dimitri von Rütte

TL;DR

This work addresses training stability in diffusion transformers by enforcing magnitude preservation throughout the DiT architecture and by introducing rotation modulation, a novel conditioning method based on learned rotations. The authors develop theoretical guarantees and practical mechanisms, including cosine attention, weight normalization, and rotation-based conditioning, that keep activation magnitudes bounded without traditional normalization layers. Empirical ablations on small-scale DiT models show a substantial improvement in FID-10K (≈12.8% reduction), and rotation modulation combined with scaling is competitive with AdaLN while using ≈5.4% fewer parameters; the paper also analyzes magnitude evolution and convergence in detail. The approach broadens the set of conditioning strategies for diffusion transformers, the authors plan to release their implementation publicly, and the method may apply to larger models and text-to-image settings.

Abstract

Denoising diffusion models exhibit remarkable generative capabilities, but remain challenging to train due to their inherent stochasticity, where high-variance gradient estimates lead to slow convergence. Previous works have shown that magnitude preservation helps stabilize training in the U-Net architecture. This work explores whether this effect extends to the Diffusion Transformer (DiT) architecture. To this end, we propose a magnitude-preserving design that stabilizes training without normalization layers. Motivated by the goal of maintaining activation magnitudes, we additionally introduce rotation modulation, a novel conditioning method that uses learned rotations instead of traditional scaling or shifting. Through empirical evaluations and ablation studies on small-scale models, we show that magnitude-preserving strategies significantly improve performance, notably reducing FID scores by $\sim$12.8%. Further, we show that rotation modulation combined with scaling is competitive with AdaLN, while requiring $\sim$5.4% fewer parameters. This work provides insights into conditioning strategies and magnitude control. We will publicly release the implementation of our method.
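
To illustrate why a rotation is a natural fit for magnitude-preserving conditioning, the following is a minimal sketch, not the authors' implementation. It assumes that features are rotated in consecutive pairs by per-pair angles; the function name `rotate_pairs`, the pairing scheme, and the example angles are hypothetical, and the paper's actual parameterization may differ. The point it shows is only the general property that a rotation leaves the vector norm unchanged, whereas scaling does not.

```python
import math


def rotate_pairs(x, angles):
    """Rotate consecutive feature pairs (x[2i], x[2i+1]) by angles[i].

    Hypothetical illustration: in a model, the angles would be predicted
    from the conditioning signal (e.g. timestep or class embedding).
    """
    if len(x) != 2 * len(angles):
        raise ValueError("need exactly one angle per pair of features")
    out = []
    for i, theta in enumerate(angles):
        a, b = x[2 * i], x[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        # Standard 2D rotation applied to the pair (a, b).
        out.extend([c * a - s * b, s * a + c * b])
    return out


def norm(x):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in x))


x = [1.0, 2.0, -3.0, 0.5]
angles = [0.3, -1.2]  # assumed values, for illustration only

rotated = rotate_pairs(x, angles)
scaled = [1.5 * v for v in x]  # scale-style modulation, for contrast

print(norm(x), norm(rotated), norm(scaled))
```

Here `norm(rotated)` matches `norm(x)` up to floating-point error, while `norm(scaled)` is 1.5 times larger, which is the kind of magnitude drift that scale-and-shift conditioning can introduce.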
