Exploring Magnitude Preservation and Rotation Modulation in Diffusion Transformers
Eric Tillman Bill, Cristian Perez Jensen, Sotiris Anagnostidis, Dimitri von Rütte
TL;DR
This work addresses training stability in diffusion transformers by enforcing magnitude preservation throughout the DiT architecture and by introducing a novel conditioning method, rotation modulation, which applies learned rotations rather than scaling or shifting. The authors develop theoretical guarantees and practical mechanisms, including cosine attention, weight normalization, and rotation-based conditioning, that keep activation magnitudes bounded without traditional normalization layers. Empirical ablations on small-scale DiT models show a reduction in FID-10K of roughly 12.8%, and rotation modulation combined with scaling performs competitively with AdaLN while using fewer parameters. The authors also analyze how activation magnitudes evolve during training and how this relates to convergence. The approach broadens the set of conditioning strategies for diffusion transformers, the authors plan to release their implementation publicly, and the method may carry over to larger models and text-to-image settings.
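To make two of the mechanisms named above concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that cosine attention means computing attention logits from L2-normalized queries and keys, and that weight normalization means rescaling each output row of a weight matrix to unit norm. The function names, the `scale` parameter, and the `eps` constant are hypothetical and chosen only for illustration.

```python
import math

def l2_normalize(v, eps=1e-12):
    """Scale a vector to (approximately) unit Euclidean norm."""
    n = math.sqrt(sum(t * t for t in v))
    return [t / (n + eps) for t in v]

def cosine_attention_weights(query, keys, scale=1.0):
    """Attention over `keys` using cosine similarity as the logit.

    Assumption: logits are scale * cos(query, key), so each logit lies in
    [-scale, scale] regardless of how large the raw activations are.
    """
    q = l2_normalize(query)
    logits = [scale * sum(a * b for a, b in zip(q, l2_normalize(k)))
              for k in keys]
    # Numerically stable softmax over the bounded logits.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return logits, [e / total for e in exps]

def normalize_rows(weight):
    """Rescale every row (one per output unit) of a weight matrix to unit norm."""
    return [l2_normalize(row) for row in weight]

# The query has a large magnitude, yet the logits stay within [-1, 1].
logits, weights = cosine_attention_weights(
    [100.0, 0.0], [[5.0, 0.0], [0.0, 7.0], [-2.0, 0.0]]
)
```

Because the logits depend only on the angle between query and key, growth in activation magnitude cannot push the softmax into saturation, which is the property this sketch is meant to illustrate.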
Abstract
Denoising diffusion models exhibit remarkable generative capabilities, but remain challenging to train due to their inherent stochasticity, where high-variance gradient estimates lead to slow convergence. Previous works have shown that magnitude preservation helps stabilize training in the U-Net architecture. This work explores whether this effect extends to the Diffusion Transformer (DiT) architecture. To this end, we propose a magnitude-preserving design that stabilizes training without normalization layers. Motivated by the goal of maintaining activation magnitudes, we additionally introduce rotation modulation, a novel conditioning method that uses learned rotations instead of traditional scaling or shifting. Through empirical evaluations and ablation studies on small-scale models, we show that magnitude-preserving strategies significantly improve performance, notably reducing FID scores by $\sim$12.8%. Further, we show that rotation modulation combined with scaling is competitive with AdaLN, while requiring $\sim$5.4% fewer parameters. This work provides insights into conditioning strategies and magnitude control. We will publicly release the implementation of our method.
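The reason a rotation fits a magnitude-preserving design can be seen in a small sketch. The code below is an illustration under stated assumptions, not the method as implemented in the paper: it assumes features are grouped into consecutive pairs and that each pair is rotated by its own angle, which in a real model would be predicted from the conditioning signal. The names `rotate_pairs` and `norm` are hypothetical.

```python
import math

def rotate_pairs(x, angles):
    """Rotate consecutive feature pairs (x[0], x[1]), (x[2], x[3]), ... by `angles`.

    Assumption for illustration: one angle per pair. In a conditioned model
    the angles would be produced from the timestep or class embedding.
    """
    if len(x) != 2 * len(angles):
        raise ValueError("need exactly one angle per feature pair")
    out = []
    for i, theta in enumerate(angles):
        a, b = x[2 * i], x[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        # Standard 2D rotation applied to the pair (a, b).
        out.extend([c * a - s * b, s * a + c * b])
    return out

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(t * t for t in v))

x = [3.0, 4.0, 1.0, 2.0]
y = rotate_pairs(x, [math.pi / 2, 0.3])

# The conditioning changes the direction of x but not its magnitude.
assert math.isclose(norm(x), norm(y))
```

A learned scale or shift, by contrast, changes the norm of the activations directly, which is why a rotation is the natural choice when the goal is to condition without disturbing magnitudes.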
