Scaling Diffusion Transformers Efficiently via $\mu$P

Chenyu Zheng, Xinyu Zhang, Rongzhen Wang, Wei Huang, Zhi Tian, Weilin Huang, Jun Zhu, Chongxuan Li

TL;DR

This work generalizes Maximal Update Parametrization ($\mu$P) to diffusion Transformers, proving that forward passes of mainstream variants (U-ViT, DiT, PixArt-$\alpha$, MMDiT) conform to the standard $\mu$P formulation via the NexorT program. It establishes robust base hyperparameter transferability across widths, batch sizes, and training steps, and introduces a $\mu$Transfer protocol to move optimal base HPs from proxy models to target scales. Empirically, $\mu$P accelerates training and reduces tuning costs across several large diffusion models, achieving up to 2.9x faster convergence (DiT-XL-2-$\mu$P) and efficient scaling of PixArt-$\alpha$ (0.04B to 0.61B) and MMDiT (0.18B to 18B) with only a few percent of tuning FLOPs. The results position $\mu$P as a principled, scalable framework for diffusion Transformers, enabling practical large-scale generative modeling with limited hyperparameter tuning.

Abstract

Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization ($\mu$P) was proposed for vanilla Transformers; it enables stable HP transfer from small to large language models and dramatically reduces tuning costs. However, it remains unclear whether $\mu$P of vanilla Transformers extends to diffusion Transformers, which differ in both architecture and training objective. In this work, we generalize standard $\mu$P to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that $\mu$P of mainstream diffusion Transformers, including U-ViT, DiT, PixArt-$\alpha$, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing $\mu$P methodologies. Leveraging this result, we systematically demonstrate that DiT-$\mu$P enjoys robust HP transferability. Notably, DiT-XL-2-$\mu$P with transferred learning rate achieves 2.9 times faster convergence than the original DiT-XL-2. Finally, we validate the effectiveness of $\mu$P on text-to-image generation by scaling PixArt-$\alpha$ from 0.04B to 0.61B and MMDiT from 0.18B to 18B. In both cases, models under $\mu$P outperform their respective baselines while requiring only a small tuning cost: 5.5% of one training run for PixArt-$\alpha$ and 3% of the cost consumed by human experts for MMDiT-18B. These results establish $\mu$P as a principled and efficient framework for scaling diffusion Transformers.

Paper Structure

This paper contains 58 sections, 4 theorems, 25 equations, 10 figures, 17 tables, 4 algorithms.

Key Result

Theorem 3.1

The forward passes of mainstream diffusion Transformers (U-ViT [DBLP:conf/cvpr/uvit], DiT [DBLP:conf/iccv/dit], PixArt-$\alpha$ [DBLP:conf/iclr/pixelart], and MMDiT [DBLP:conf/icml/sd3]) can be represented within the NexorT Program. Therefore, their $\mu$P matches the standard $\mu$P presented in Table tab:
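To make "matching the standard $\mu$P" concrete, the sketch below shows how width-dependent scaling could be computed per weight type under an $abc$-parameterization. The table of standard $\mu$P is not reproduced in this excerpt, so the exponents in `MUP_ADAM` are an assumption based on one commonly cited formulation for Adam-type optimizers; the names `AbcRule`, `MUP_ADAM`, and `mup_scaling` are hypothetical, and the convention of scaling relative to a base width is likewise assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbcRule:
    a: float  # multiplier exponent: effective weight = ratio**(-a) * w
    b: float  # init exponent: init_std = base_std * ratio**(-b)
    c: float  # learning-rate exponent: lr = base_lr * ratio**(-c)


# ASSUMPTION: exponents for Adam-type optimizers in one common muP
# formulation. They are illustrative and not copied from the paper's table.
MUP_ADAM = {
    "input": AbcRule(a=0.0, b=0.0, c=0.0),
    "hidden": AbcRule(a=0.0, b=0.5, c=1.0),
    "output": AbcRule(a=1.0, b=0.0, c=0.0),
}


def mup_scaling(kind, width, base_width, base_std, base_lr):
    """Scale base HPs for one weight type from base_width to width."""
    rule = MUP_ADAM[kind]
    ratio = width / base_width  # how much wider the target is than the base
    return {
        "multiplier": ratio ** (-rule.a),
        "init_std": base_std * ratio ** (-rule.b),
        "lr": base_lr * ratio ** (-rule.c),
    }


if __name__ == "__main__":
    # Example values only: widening a hidden layer from 288 to 1152 (ratio 4).
    for kind in ("input", "hidden", "output"):
        print(kind, mup_scaling(kind, 1152, 288, base_std=0.02, base_lr=1e-3))
```

Under these assumed exponents, only hidden weights change their initialization and learning rate with width, while output weights receive a smaller forward multiplier. Base HPs such as `base_lr` are the quantities one would hope to keep fixed across widths.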

Figures (10)

  • Figure 1: Visualization results and efficiency of HP search under $\mu$P. (a) Samples generated by the MMDiT-$\mu$P-18B model exhibit strong fidelity and precision in aligning with the provided textual descriptions. (b) HP search for large diffusion Transformers is efficient under $\mu$P, requiring only $5.5\%$ of the FLOPs of a single training run for PixArt-$\alpha$ and just $3\%$ of the FLOPs consumed by human experts for MMDiT-18B.
  • Figure 2: An overview of applying $\mu$P to diffusion Transformers. (a) We illustrate the implementation of $\mu$P for DiT as an example. The $abc$-parameterization of each weight is adjusted based on its type and visualized using different colors. Modules that differ from the vanilla Transformer are also highlighted. (b) We $\mu$Transfer the optimal base HPs, found through multiple trials on small models, to pretrain the large target models.
  • Figure 3: DiT-$\mu$P enjoys base HP transferability. Unless otherwise specified, we use a model width of 288, a batch size of 256, and 200K training iterations. Missing data points indicate training instability, where the loss explodes. Under $\mu$P, the base learning rate can be transferred across model widths, batch sizes, and training steps.
  • Figure 4: $\mu$P accelerates the training of diffusion Transformers. In terms of FID-50K, DiT-XL-2-$\mu$P with transferred learning rate achieves 2.9$\times$ faster convergence than the original DiT-XL-2 and a slightly better final result.
  • Figure 5: Results of base HP search on proxy MMDiT-$\mu$P tasks. We train 0.18B MMDiT-$\mu$P proxy models with 80 different base HP settings. The optimal base HPs are transferred to the training of the 18B target model.
  • ...and 5 more figures
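
The $\mu$Transfer workflow described in Figures 2(b) and 5 (search base HPs on a small proxy, then reuse the best setting on the target) can be sketched as a plain grid search. This is a minimal illustration under stated assumptions, not the authors' code: `train_proxy` is a hypothetical user-supplied callable that trains the proxy model with the given base HPs and returns a final loss, and `tuning_cost_ratio` assumes every trial costs the same number of FLOPs.

```python
import math
from itertools import product


def mu_transfer_search(train_proxy, grid):
    """Grid-search base HPs on a proxy model; return (best_hps, best_loss).

    train_proxy: hypothetical callable mapping a dict of base HPs to a loss.
    grid: dict mapping each HP name to a list of candidate values.
    """
    best_hps, best_loss = None, math.inf
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        hps = dict(zip(names, values))
        loss = train_proxy(hps)
        if not math.isfinite(loss):
            continue  # treat NaN or inf as a diverged run and skip it
        if loss < best_loss:
            best_hps, best_loss = hps, loss
    return best_hps, best_loss


def tuning_cost_ratio(num_trials, proxy_flops, target_flops):
    """Search cost as a fraction of one target training run.

    ASSUMPTION: every proxy trial costs the same proxy_flops.
    """
    return num_trials * proxy_flops / target_flops


if __name__ == "__main__":
    # Toy stand-in for proxy training: the loss is smallest near
    # base_lr = 1e-3, and the largest learning rate "diverges".
    def toy_proxy(hps):
        if hps["base_lr"] >= 1e-2:
            return math.inf
        return abs(math.log10(hps["base_lr"]) + 3.0)

    print(mu_transfer_search(toy_proxy, {"base_lr": [1e-4, 1e-3, 1e-2]}))
```

Because the search runs only on the proxy, its total cost is the number of trials times the proxy's training cost. The selected base HPs would then be passed unchanged to the target model's $\mu$P configuration.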

Theorems & Definitions (8)

  • Theorem 3.1: $\mu$P of diffusion Transformers, proof in the appendix
  • Lemma B.1: Scalars-to-scalars transformation is representable by the NexorT Program, Lemma 2.6.2 in DBLP:conf/iclr/LittwinY23-TP4b
  • Lemma B.2: $\mu$P of representable architecture, Definition 2.9.12 in DBLP:conf/iclr/LittwinY23-TP4b
  • Theorem C.1
  • proof
  • proof
  • proof
  • proof