Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance

Yujie Wei, Shiwei Zhang, Hangjie Yuan, Yujin Han, Zhekai Chen, Jiayu Wang, Difan Zou, Xihui Liu, Yingya Zhang, Yu Liu, Hongming Shan

TL;DR

This work addresses the limited gains from applying Mixture-of-Experts (MoE) to Diffusion Transformers (DiTs) by identifying two properties of visual tokens, high spatial redundancy and functional heterogeneity, that hinder expert specialization. It introduces ProMoE, an MoE framework with a two-step router: conditional routing first separates unconditional from conditional image tokens, and prototypical routing then assigns conditional tokens to experts according to their similarity to learnable latent prototypes. A routing contrastive loss (RCL) further promotes intra-expert coherence and inter-expert diversity. On ImageNet, ProMoE consistently outperforms dense DiTs and prior state-of-the-art MoE methods under both DDPM and Rectified Flow objectives, with notable gains at larger scales and with fewer activated parameters. The results suggest that explicit routing guidance with semantic prototypes is an effective way to scale diffusion-based vision models, with potential applicability to other modalities and conditioning regimes.
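
The two-step routing summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `two_step_route`, the precomputed boolean mask for unconditional tokens, the single unconditional expert (marked by index -1), the one-prototype-per-expert layout, and the use of cosine similarity with top-k selection are all assumptions made for illustration.

```python
import numpy as np

def _unit(x):
    """L2-normalize rows so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

def two_step_route(tokens, is_unconditional, prototypes, top_k=1):
    """Hypothetical two-step router (illustrative only).

    tokens:           (n_tokens, dim) token features
    is_unconditional: (n_tokens,) bool mask, assumed to be given
    prototypes:       (n_experts, dim), one prototype per expert (assumption)
    Returns (n_tokens, top_k) expert indices; -1 marks the unconditional expert.
    """
    tokens = np.asarray(tokens, dtype=float)
    prototypes = np.asarray(prototypes, dtype=float)
    mask = np.asarray(is_unconditional, dtype=bool)

    # Prototypical routing: rank experts by cosine similarity to each token.
    similarity = _unit(tokens) @ _unit(prototypes).T      # (n_tokens, n_experts)
    ranked = np.argsort(-similarity, axis=1)[:, :top_k]   # most similar first

    # Conditional routing: unconditional tokens bypass the prototypes entirely.
    return np.where(mask[:, None], -1, ranked)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.2, 1.0]]
mask = [False, False, True, False]        # third token is unconditional
prototypes = [[1.0, 0.0], [0.0, 1.0]]     # two conditional experts
print(two_step_route(tokens, mask, prototypes))
```

In this toy input the third token is flagged as unconditional, so it maps to -1 regardless of its content, while the remaining tokens go to whichever prototype they are most similar to.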

Abstract

Mixture-of-Experts (MoE) has emerged as a powerful paradigm for scaling model capacity while preserving computational efficiency. Despite its notable success in large language models (LLMs), existing attempts to apply MoE to Diffusion Transformers (DiTs) have yielded limited gains. We attribute this gap to fundamental differences between language and visual tokens. Language tokens are semantically dense with pronounced inter-token variation, while visual tokens exhibit spatial redundancy and functional heterogeneity, hindering expert specialization in vision MoE. To this end, we present ProMoE, an MoE framework featuring a two-step router with explicit routing guidance that promotes expert specialization. Specifically, this guidance encourages the router to partition image tokens into conditional and unconditional sets via conditional routing according to their functional roles, and to refine the assignments of conditional image tokens through prototypical routing with learnable prototypes based on semantic content. Moreover, the similarity-based expert allocation in latent space enabled by prototypical routing offers a natural mechanism for incorporating explicit semantic guidance, and we validate that such guidance is crucial for vision MoE. Building on this, we propose a routing contrastive loss that explicitly enhances the prototypical routing process, promoting intra-expert coherence and inter-expert diversity. Extensive experiments on the ImageNet benchmark demonstrate that ProMoE surpasses state-of-the-art methods under both Rectified Flow and DDPM training objectives. Code and models will be made publicly available.
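
The abstract does not give the formula of the routing contrastive loss. As a rough illustration of the stated goal (tokens pulled toward the prototype of their assigned expert and pushed away from the others), the sketch below uses a generic InfoNCE-style objective as a stand-in. The function name `routing_contrastive_loss`, the `temperature` parameter, and the use of cosine similarity are assumptions, not details from the paper.

```python
import numpy as np

def routing_contrastive_loss(tokens, prototypes, assignment, temperature=0.1):
    """InfoNCE-style stand-in for a routing contrastive loss (assumption).

    tokens:     (n_tokens, dim) token features
    prototypes: (n_experts, dim) one prototype per expert
    assignment: (n_tokens,) index of the expert each token is routed to
    """
    tokens = np.asarray(tokens, dtype=float)
    prototypes = np.asarray(prototypes, dtype=float)
    assignment = np.asarray(assignment, dtype=int)

    # Cosine similarity between every token and every prototype.
    t = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    p = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-8)
    logits = (t @ p.T) / temperature

    # Numerically stable log-softmax over experts.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The assigned prototype is the positive; all other prototypes are negatives.
    return float(-log_prob[np.arange(len(t)), assignment].mean())

prototypes = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
aligned = routing_contrastive_loss(tokens, prototypes, [0, 1])
swapped = routing_contrastive_loss(tokens, prototypes, [1, 0])
print(aligned < swapped)
```

Under this stand-in, a routing in which each token sits close to its assigned prototype yields a lower loss than one in which the assignments are swapped, which is the qualitative behavior the abstract describes as intra-expert coherence and inter-expert diversity.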

Paper Structure

This paper contains 29 sections, 6 equations, 12 figures, 13 tables, 1 algorithm.

Figures (12)

  • Figure 1: (a) We randomly sample 1k intermediate-layer tokens from 110 ImageNet classes for 10-cluster k-means clustering (differentiated by color). With class names/labels as inputs, LLM tokens form compact, well-separated clusters with high semantic density, whereas visual tokens are diffuse. This disparity is quantified by the ratio of inter- to intra-class distance ($19.283\gg0.748$). (b) We measure inter-expert diversity using singular value decomposition on each MoE layer's expert weight matrices and computing the mean similarity of the subspaces spanned by their top-k left singular vectors (Hu et al., 2021). Incorporating routing guidance (Ours) enhances expert diversity.
  • Figure 2: Overview of the ProMoE architecture. The input tokens are split by conditional routing into unconditional and conditional subsets. Unconditional image tokens are processed by unconditional experts. Conditional image tokens are assigned to experts by prototypical routing with learnable prototypes. Routing contrastive learning explicitly enhances semantic guidance in prototypical routing.
  • Figure 3: Comparisons and scaling results across diverse settings.
  • Figure 4: Samples generated by ProMoE-XL-Flow after 2M iterations with a classifier-free guidance (cfg) scale of 4.0.
  • Figure 5: Training loss curve comparisons.
  • ...and 7 more figures
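
The two diagnostics named in the Figure 1 caption can be sketched in NumPy as follows. The caption does not spell out either formula, so both definitions here are assumptions: the distance ratio is taken as the mean pairwise centroid distance divided by the mean point-to-centroid distance, and the subspace similarity uses the normalized form $\|U_A^\top U_B\|_F^2 / k$ over the top-k left singular vectors. The function names are hypothetical.

```python
import numpy as np

def inter_intra_ratio(x, labels):
    """Mean distance between class centroids over mean distance to own centroid."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    centroids = np.stack([x[labels == c].mean(axis=0) for c in classes])
    intra = np.mean([
        np.linalg.norm(x[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(classes)
    ])
    dists = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    inter = dists[np.triu_indices(len(classes), k=1)].mean()
    return float(inter / intra)

def subspace_similarity(w_a, w_b, k):
    """Similarity in [0, 1] of the subspaces spanned by top-k left singular vectors."""
    u_a = np.linalg.svd(np.asarray(w_a, dtype=float), full_matrices=False)[0][:, :k]
    u_b = np.linalg.svd(np.asarray(w_b, dtype=float), full_matrices=False)[0][:, :k]
    return float(np.linalg.norm(u_a.T @ u_b, "fro") ** 2 / k)

def mean_pairwise_similarity(expert_weights, k):
    """Average subspace similarity over all distinct expert pairs in one layer."""
    n = len(expert_weights)
    sims = [
        subspace_similarity(expert_weights[i], expert_weights[j], k)
        for i in range(n) for j in range(i + 1, n)
    ]
    return float(np.mean(sims))

rng = np.random.default_rng(0)
experts = [rng.standard_normal((16, 8)) for _ in range(4)]  # toy expert weights
print(mean_pairwise_similarity(experts, k=4))
```

With these definitions, a lower mean pairwise similarity corresponds to more diverse experts, and a higher inter- to intra-class ratio corresponds to more compact, better-separated clusters.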