Guiding the Experts: Semantic Priors for Efficient and Focused MoE Routing

Chengxi Min, Wei Wang, Yahui Liu, Weixin Ye, Enver Sangineto, Qi Wang, Yao Zhao

TL;DR

This work addresses inefficient and opaque routing in Soft MoE for vision by exploiting the latent semantic structure in its dispatch weights. It introduces a foreground-guided auxiliary loss that aligns expert activation with semantically meaningful foreground regions, and a lightweight LayerScale mechanism that stabilizes information flow in skip connections. Foreground priors are extracted with external networks and compared with the dispatch weights through a spatial overlap-based loss. This improves routing quality and convergence, yielding consistent gains on ImageNet-1K and smaller datasets while making expert specialization more interpretable. The method requires only minimal architectural changes and points toward more efficient, semantically grounded MoE routing in vision models.

Abstract

Mixture-of-Experts (MoE) models have emerged as a promising direction for scaling vision architectures efficiently. Among them, Soft MoE improves training stability by assigning each token to all experts via continuous dispatch weights. However, current designs overlook the semantic structure implicitly encoded in these weights, resulting in suboptimal expert routing. In this paper, we find that dispatch weights in Soft MoE inherently exhibit segmentation-like patterns but are not explicitly aligned with semantic regions. Motivated by this observation, we propose a foreground-guided enhancement strategy. Specifically, we introduce a spatially aware auxiliary loss that encourages expert activation to align with semantic foreground regions. To further reinforce this supervision, we integrate a lightweight LayerScale mechanism that improves information flow and stabilizes optimization in skip connections. Our method requires only minor architectural adjustments and can be integrated directly into existing Soft MoE frameworks. Comprehensive experiments on ImageNet-1K and multiple smaller-scale classification benchmarks show consistent performance improvements and reveal more interpretable expert routing.
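
As an illustration of the LayerScale idea mentioned in the abstract, the sketch below assumes the standard form of LayerScale, in which a learnable per-channel scale `gamma` multiplies the output of a residual branch before it is added back to the skip path. The zero initial value follows the Figure 3 caption; the class name `LayerScaleResidual`, the `branch` callable, and the exact placement of the scale are assumptions, not details confirmed by this page. Plain Python lists are used for readability; a real model would use a tensor library.

```python
class LayerScaleResidual:
    """Residual connection y = x + gamma * branch(x) with per-channel gamma.

    Hypothetical illustration: the placement of the scale on the branch
    (rather than on the skip path) is an assumption.
    """

    def __init__(self, dim, init_value=0.0):
        # One scale per channel; zero init makes the block an identity at start.
        self.gamma = [init_value] * dim

    def __call__(self, x, branch):
        out = branch(x)  # e.g. the output of a Soft MoE layer for one token
        return [xi + g * oi for xi, g, oi in zip(x, self.gamma, out)]


# Usage: with gamma = 0 the branch contributes nothing, so y equals x.
block = LayerScaleResidual(dim=3)
x = [1.0, 2.0, 3.0]
y_initial = block(x, lambda v: [10.0 * t for t in v])

# After training, gamma would have moved away from zero; emulate that here.
block.gamma = [0.5, 0.5, 0.5]
y_scaled = block(x, lambda v: [10.0 * t for t in v])
```

Starting from zero means the scaled branch is gradually "switched on" by optimization, which is one plausible reading of how the mechanism stabilizes training.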

Paper Structure

This paper contains 18 sections, 11 equations, 10 figures, 8 tables, 1 algorithm.

Figures (10)

  • Figure 1: Visualization of dispatch weight maps during training from scratch at different epochs (20, 60, and 100). The color spectrum ranges from blue to yellow, indicating increasing dispatch weights. The maps illustrate how the expert routing evolves as training progresses over 100 epochs.
  • Figure 2: Foreground mask generation pipeline, which combines Grounding DINO with SAM.
  • Figure 3: Overview of our proposed method. We first compute the average dispatch weights from the Soft MoE module and apply thresholding based on their mean value to generate a binary weight mask. We encourage this weight mask to overlap with the prior foreground mask as much as possible, guiding expert attention toward semantically meaningful regions in the image. Additionally, we introduce a LayerScale module with an initial value of zero, which adaptively regulates the information flow in skip connections during training. As shown at the bottom of the figure, maximizing the overlap between the dispatch masks and the prior masks leads to more diverse and improved specialization among the four selected experts (chosen from a pool of 32 experts).
  • Figure 4: Visualization of dispatch weight maps under different ablation settings. Colors range from blue (low weights) to yellow (high weights), indicating expert assignment intensity. (a) Baseline model without auxiliary loss or LayerScale; (b) Model with only auxiliary loss, without LayerScale; (c) Our full method with auxiliary loss and LayerScale applied at the 8th Soft MoE layer; (d) Our full method with auxiliary loss and LayerScale applied at the 7th Soft MoE layer; (e) Our full method with auxiliary loss and LayerScale applied at both the 7th and 8th Soft MoE layers.
  • Figure 5: Effect of loss weight $\lambda$ on accuracy.
  • ...and 5 more figures
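
The mask comparison described in the Figure 3 caption (average the dispatch weights, threshold them at their mean, then measure overlap with the prior foreground mask) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, intersection-over-union is assumed as the overlap measure because the caption does not specify one, and the hard threshold shown here is not differentiable, so the loss actually used for training may be formulated differently. Plain Python lists over flattened patches are used for readability.

```python
def mean_threshold_mask(dispatch_weights):
    """Binary mask from dispatch weights (rows: slots, columns: patches)."""
    num_slots = len(dispatch_weights)
    num_patches = len(dispatch_weights[0])
    # Average the weight each patch receives across all slots.
    avg = [sum(row[p] for row in dispatch_weights) / num_slots
           for p in range(num_patches)]
    # Threshold at the mean of the averaged weights.
    threshold = sum(avg) / num_patches
    return [1 if w > threshold else 0 for w in avg]


def overlap_iou(weight_mask, foreground_mask):
    """Intersection-over-union of two binary masks (assumed overlap measure)."""
    inter = sum(w & f for w, f in zip(weight_mask, foreground_mask))
    union = sum(w | f for w, f in zip(weight_mask, foreground_mask))
    return inter / union if union else 0.0


def overlap_loss(dispatch_weights, foreground_mask):
    """Auxiliary term that shrinks as the two masks overlap more."""
    return 1.0 - overlap_iou(mean_threshold_mask(dispatch_weights),
                             foreground_mask)


# Usage: two slots over four patches, with a hypothetical foreground prior.
dispatch = [[0.1, 0.4, 0.4, 0.1],
            [0.1, 0.3, 0.5, 0.1]]
foreground = [0, 1, 1, 1]
mask = mean_threshold_mask(dispatch)      # patches above the mean weight
loss = overlap_loss(dispatch, foreground)  # 1 - IoU(mask, foreground)
```

In a full training setup this term would be added to the classification loss with the weight $\lambda$ whose effect is reported in Figure 5.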