DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers

Minglei Shi, Ziyang Yuan, Haotian Yang, Xintao Wang, Mingwu Zheng, Xin Tao, Wenliang Zhao, Wenzhao Zheng, Jie Zhou, Jiwen Lu, Pengfei Wan, Di Zhang, Kun Gai

TL;DR

DiffMoE tackles the inefficiency of uniform input processing in diffusion transformers by introducing a batch-level global token pool and a dynamic capacity predictor for inference. The method enables cross-sample token interaction and adaptive resource allocation, yielding state-of-the-art performance on ImageNet diffusion tasks while maintaining parameter efficiency. It demonstrates strong results in text-to-image generation and provides detailed analysis of dynamic computation, scaling behavior, and inference trade-offs. Collectively, DiffMoE offers a scalable framework for diffusion models with broad applicability, including high-quality text-conditioned generation, while highlighting practical considerations for deployment and evaluation.

Abstract

Diffusion models have demonstrated remarkable success in various image generation tasks, but their performance is often limited by the uniform processing of inputs across varying conditions and noise levels. To address this limitation, we propose a novel approach that leverages the inherent heterogeneity of the diffusion process. Our method, DiffMoE, introduces a batch-level global token pool that enables experts to access global token distributions during training, promoting specialized expert behavior. To unleash the full potential of the diffusion process, DiffMoE incorporates a capacity predictor that dynamically allocates computational resources based on noise levels and sample complexity. Through comprehensive evaluation, DiffMoE achieves state-of-the-art performance among diffusion models on the ImageNet benchmark, substantially outperforming both dense architectures with 3x activated parameters and existing MoE approaches while maintaining 1x activated parameters. The effectiveness of our approach extends beyond class-conditional generation to more challenging tasks such as text-to-image generation, demonstrating its broad applicability across different diffusion model applications. Project Page: https://shiml20.github.io/DiffMoE/
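To make the batch-level global token pool more concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes an expert-choice style router in which each expert selects its highest-scoring tokens from the flattened pool of all samples in the batch, and it assumes that a capacity of 1 means each expert processes (number of pooled tokens) / (number of experts) tokens. The function name `batch_level_expert_choice`, the linear router, and the softmax scoring are illustrative assumptions.

```python
import numpy as np

def batch_level_expert_choice(tokens, router_weights, capacity=1.0):
    """Hypothetical sketch of routing over a batch-level token pool.

    tokens:         (batch, seq_len, dim) token features
    router_weights: (dim, num_experts) linear router (an assumption)
    capacity:       assumed to scale tokens-per-expert as capacity * N / E
    """
    batch, seq_len, dim = tokens.shape
    num_experts = router_weights.shape[1]

    # Flatten all samples into one pool so experts can see every token.
    pool = tokens.reshape(batch * seq_len, dim)

    # Softmax over experts gives each token an affinity per expert.
    logits = pool @ router_weights
    logits = logits - logits.max(axis=1, keepdims=True)
    scores = np.exp(logits)
    scores /= scores.sum(axis=1, keepdims=True)

    # Each expert picks its top-k tokens from the whole pool,
    # regardless of which sample the token came from.
    tokens_per_expert = max(1, int(capacity * pool.shape[0] / num_experts))
    order = np.argsort(-scores, axis=0)       # (N, E), best tokens first
    selected = order[:tokens_per_expert].T    # (E, k) indices into the pool
    return pool, scores, selected

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8, 16))          # 4 samples, 8 tokens each
router_weights = rng.normal(size=(16, 4))     # 4 experts
pool, scores, selected = batch_level_expert_choice(tokens, router_weights)

# Recover which sample each selected token belongs to.
sample_of_token = selected // tokens.shape[1]
```

Because selection runs over the flattened pool, a single expert may draw tokens from several samples (and hence several noise levels) at once, which is the cross-sample accessibility the abstract describes. Some tokens may be chosen by several experts and others by none under this assumed scheme.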

Paper Structure

This paper contains 37 sections, 30 equations, 18 figures, 15 tables, 4 algorithms.

Figures (18)

  • Figure 1: Token Accessibility and Dynamic Computation. (a) Token accessibility levels from token isolation to cross-sample interaction. Colors represent tokens in different samples, $t_i$ indicates noise levels. (b) Performance-accessibility analysis across architectures. (c) Computational dynamics during diffusion sampling, showing adaptive computation from noise to image. (d) Class-wise computation allocation from hard (technical diagrams) to easy (natural photos) tasks. Results from DiffMoE-L-E16-Flow (700K).
  • Figure 2: DiffMoE Architecture Overview. DiffMoE flattens tokens into a batch-level global token pool, where each expert maintains a fixed training capacity of $C_{\text{train}}^{E_i}=1$. During inference, a dynamic capacity predictor adaptively routes tokens across different sampling steps and conditions. Different colors denote tokens from distinct samples, while $t_i$ represents corresponding noise levels.
  • Figure 3: Training Loss Curves of Different Flow-based Models. DiffMoE with batch-level global token pool achieves consistently lower diffusion losses than baselines without batch-level global token pool.
  • Figure 4: Comparisons with the Baseline Models. We compare TC, EC, and Dense Models. DiffMoE-L-E16-Flow even surpasses DenseDiT-XL-Flow (1.5x params), achieving the best quality (14.41 FID50K w/o CFG at 700K). The results of the DDPM method remain consistent with those provided in the Appendix.
  • Figure 5: Text-to-Image Generation Loss Curves. Training loss comparison between DiffMoE-E16-T2I-Flow and Dense-DiT-T2I-Flow models over 160K steps. DiffMoE consistently achieves lower loss values, demonstrating superior convergence efficiency compared to the dense baseline.
  • ...and 13 more figures
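The inference-time behavior described in the Figure 2 caption, where a dynamic capacity predictor replaces the fixed training capacity, could be sketched as follows. This is an illustrative assumption rather than the paper's exact mechanism: it supposes the predictor emits one logit per (token, expert) pair and that a token is routed to an expert whenever the sigmoid of that logit reaches a threshold. The function name `dynamic_route` and the `threshold` parameter are hypothetical.

```python
import numpy as np

def dynamic_route(predictor_logits, threshold=0.5):
    """Hypothetical threshold-based routing from capacity-predictor logits.

    predictor_logits: (num_tokens, num_experts) array (assumed shape)
    threshold:        assumed cutoff on the sigmoid probability
    """
    probs = 1.0 / (1.0 + np.exp(-predictor_logits))
    mask = probs >= threshold                 # True where a token is routed
    # Average number of experts activated per token: a simple proxy
    # for how much computation this step uses.
    avg_experts_per_token = mask.sum() / mask.shape[0]
    return mask, avg_experts_per_token

logits = np.array([[2.0, -1.0],
                   [-3.0, -2.0],
                   [1.0, 0.5]])
mask, avg = dynamic_route(logits)
strict_mask, strict_avg = dynamic_route(logits, threshold=0.7)
```

Under this assumption the amount of computation is no longer fixed: a token can activate several experts or none, so the average varies with the input, and raising the threshold trades quality for speed.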