EC-DIT: Scaling Diffusion Transformers with Adaptive Expert-Choice Routing

Haotian Sun, Tao Lei, Bowen Zhang, Yanghao Li, Haoshuo Huang, Ruoming Pang, Bo Dai, Nan Du

TL;DR

This paper tackles the scalability of diffusion-based text-to-image generation by introducing EC-DiT, a diffusion transformer architecture that uses adaptive expert-choice routing to allocate computation unevenly across image patches and diffusion steps. By replacing dense FFNs with MoE layers in which each expert selects its tokens using global sequence context, EC-DiT scales up to 97B parameters with improved training convergence and text-to-image alignment, surpassing dense and token-choice MoE baselines on the GenEval and DSG metrics. The approach improves image quality and alignment with only modest inference overhead, as validated by extensive ablations across model sizes, number of experts, and compute allocation patterns. Overall, EC-DiT offers a scalable, interpretable, and efficient pathway to large-scale diffusion models with adaptive, context-aware computation for complex text prompts.

Abstract

Diffusion transformers have been widely adopted for text-to-image synthesis. While scaling these models up to billions of parameters shows promise, the effectiveness of scaling beyond current sizes remains underexplored and challenging. By explicitly exploiting the computational heterogeneity of image generations, we develop a new family of Mixture-of-Experts (MoE) models (EC-DIT) for diffusion transformers with expert-choice routing. EC-DIT learns to adaptively optimize the compute allocated to understand the input texts and generate the respective image patches, enabling heterogeneous computation aligned with varying text-image complexities. This heterogeneity provides an efficient way of scaling EC-DIT up to 97 billion parameters and achieving significant improvements in training convergence, text-to-image alignment, and overall generation quality over dense models and conventional MoE models. Through extensive ablations, we show that EC-DIT demonstrates superior scalability and adaptive compute allocation by recognizing varying textual importance through end-to-end training. Notably, in text-to-image alignment evaluation, our largest models achieve a state-of-the-art GenEval score of 71.68% and still maintain competitive inference speed with intuitive interpretability.
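
To make the routing idea concrete, here is a minimal sketch of expert-choice routing in plain Python. It is not the paper's implementation. It assumes the common expert-choice formulation: a softmax over experts gives each token's affinity, and each expert then keeps its top-scoring tokens up to a fixed capacity. The names expert_choice_route, experts_per_token, logits, and capacity are hypothetical, and the router logits are made-up illustrative values.

```python
import math


def expert_choice_route(scores, capacity):
    """Each expert picks its top-`capacity` tokens (illustrative sketch).

    scores[t][e] is a hypothetical router logit of token t for expert e.
    Returns one list per expert of (token_index, gate_weight) pairs.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])

    # Softmax over experts for each token gives token-to-expert affinities.
    affinity = []
    for row in scores:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        z = sum(exps)
        affinity.append([v / z for v in exps])

    # Unlike token-choice routing, the expert does the choosing: it ranks
    # all tokens in the sequence and keeps the `capacity` best ones.
    assignments = []
    for e in range(num_experts):
        ranked = sorted(range(num_tokens),
                        key=lambda t: affinity[t][e], reverse=True)
        assignments.append([(t, affinity[t][e]) for t in ranked[:capacity]])
    return assignments


def experts_per_token(assignments, num_tokens):
    """Count how many experts selected each token."""
    counts = [0] * num_tokens
    for chosen in assignments:
        for t, _ in chosen:
            counts[t] += 1
    return counts


# Four tokens, three experts. Token 0 is "background-like": no expert
# prefers it strongly. Tokens 1-3 each score highly for two experts.
logits = [
    [3.0, 3.0, 3.0],
    [2.0, 2.0, 0.0],
    [0.0, 2.0, 2.0],
    [2.0, 0.0, 2.0],
]
assignments = expert_choice_route(logits, capacity=2)
print(experts_per_token(assignments, len(logits)))  # -> [0, 2, 2, 2]
```

In this toy case token 0 is processed by no expert while the other tokens are each processed by two, which is the kind of uneven per-token compute the abstract describes. Every expert still handles exactly `capacity` tokens, so the load across experts stays balanced by construction.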

Paper Structure (30 sections, 8 equations, 11 figures, 5 tables, 2 algorithms)

This paper contains 30 sections, 8 equations, 11 figures, 5 tables, 2 algorithms.

Figures (11)

  • Figure 1: Performance of EC-DiT. (a) Across four model configurations, EC-DiT consistently demonstrates superior performance in text-to-image alignment compared to the baseline models with similar activated parameters per prediction. (b) Scaling up with EC-DiT improves text-to-image alignment and visual detail rendering. The alignment score in (a) is the average of the GenEval and DSG scores.
  • Figure 2: EC-DiT architecture. The router leverages information from the entire sequence to adaptively route the most suitable tokens to each expert. Through this heterogeneous routing, more computation is allocated to detailed image areas, such as the space station and moon, while less computation is used for rendering the background.
  • Figure 3: DSG comparison. Scaling up with EC-DiT elevates performance.
  • Figure 4: Inference-time efficiency. The circle size is proportional to the total activated parameters. Inference time represents the time elapsed to generate 500 samples on $8\times$H100 GPUs. EC-DiT shows superior performance compared to dense models, with less than 30% additional overhead.
  • Figure 5: Comparison with token-choice baselines. FID and CLIP Score of EC-DiT-XXL (EC) versus GShard (GS) with top-2 token-choice routing.
  • ...and 6 more figures