Long-Tailed Distribution-Aware Router For Mixture-of-Experts in Large Vision-Language Model

Chaoxiang Cai, Longrong Yang, Minghe Weng, Xuewei Li, Zequn Qin, Xi Li

Abstract

The mixture-of-experts (MoE) architecture, which replaces dense networks with sparse ones, has attracted significant attention in large vision-language models (LVLMs) for achieving comparable performance while activating far fewer parameters. Existing MoE architectures for LVLMs primarily focus on token-to-expert routing (TER), encouraging different experts to specialize in processing specific tokens. However, these methods typically rely on a load-balancing mechanism and neglect the inherent distributional differences between the vision and language modalities. To address this limitation, we propose the Long-Tailed Distribution-aware Router (LTDR) for vision-language TER, which tackles two key challenges: (1) Modality-specific distribution-aware routing. We observe that language TER generally follows a relatively uniform distribution, whereas vision TER exhibits a long-tailed distribution. This modality discrepancy motivates the design of specialized routing strategies for each modality. (2) Vision-specific dynamic expert activation. Recognizing the importance of high-information vision tail tokens, we introduce a data-augmentation-inspired strategy that increases the number of activated experts, ensuring sufficient learning for these rare but informative tokens. Our approach achieves consistent improvements, boosting performance by 1.2% / 2.1% on vision-language benchmarks and 1.6% on vision benchmarks.

Paper Structure

This paper contains 30 sections, 12 equations, 11 figures, 21 tables.

Figures (11)

  • Figure 1: (a) Our goal is to ensure that vision tail tokens are sufficiently learned within specialized experts. (b) Distribution of TER probability variance. Language TER with load balancing is uniform; vision TER without load balancing exhibits a long-tailed distribution; and vision TER with load balancing shows a biased long-tailed characteristic. (c) GMoE with and without load balancing. Removing load balancing from vision tokens improves performance.
  • Figure 2: (a) The modality-specific distribution-aware router allows vision and language tokens to be routed with different expert loads, adapting to their respective modality distributions. (b) Vision-specific dynamic expert activation applies a data-augmentation-inspired strategy so that experts sufficiently process important vision tail tokens.
  • Figure 3: Expert load of MoE-LLaVA with StableLM-1.6B and LTDR on MME. LTDR does not significantly increase the load on the slowest experts.
  • Figure 4: Expert token load across layers. Bar heights indicate token proportions assigned to experts. LTDR yields a more balanced expert utilization.
  • Figure 5: Expert token load across modalities. Bar heights indicate token proportions assigned to experts. LTDR yields more balanced cross-modal expert utilization.
  • ...and 6 more figures
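
To make the two components named in the abstract and in Figure 2 more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that (i) the load-balancing auxiliary loss is the common Switch-style form E · Σ f_i · P_i and is applied to language tokens only, and (ii) a vision token counts as a "tail" token when the variance of its routing probabilities exceeds a threshold. The function names and the parameters base_k, tail_k and var_threshold are hypothetical and do not come from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def route_token(logits, is_vision, base_k=1, tail_k=2, var_threshold=0.05):
    """Pick experts for one token; returns (expert_indices, probabilities)."""
    probs = softmax(logits)
    k = base_k
    # Assumption: a vision token with high routing-probability variance is
    # treated as a tail token and gets more activated experts.
    if is_vision and variance(probs) > var_threshold:
        k = tail_k
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:k], probs


def load_balancing_loss(token_probs, token_experts, num_experts):
    """Switch-style auxiliary loss: E * sum_i f_i * P_i (top-1 assignment)."""
    n = len(token_probs)
    if n == 0:
        return 0.0
    frac = [0.0] * num_experts          # f_i: share of tokens sent to expert i
    for experts in token_experts:
        frac[experts[0]] += 1.0 / n
    mean_prob = [sum(p[i] for p in token_probs) / n for i in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))


def route_batch(batch_logits, vision_mask, num_experts, **route_kwargs):
    """Route a mixed batch; only language tokens enter the balancing loss."""
    assignments, lang_probs, lang_experts = [], [], []
    for logits, is_vision in zip(batch_logits, vision_mask):
        experts, probs = route_token(logits, is_vision, **route_kwargs)
        assignments.append(experts)
        if not is_vision:               # assumption: vision tokens are exempt
            lang_probs.append(probs)
            lang_experts.append(experts)
    aux_loss = load_balancing_loss(lang_probs, lang_experts, num_experts)
    return assignments, aux_loss


if __name__ == "__main__":
    logits = [[4.0, 0.0, 0.0, 0.0],    # vision token with a peaked router output
              [0.0, 0.0, 0.0, 0.0],    # vision token with a flat router output
              [0.0, 3.0, 0.0, 0.0]]    # language token
    assignments, aux = route_batch(logits, [True, True, False], num_experts=4)
    print(assignments, aux)
```

In this sketch the peaked vision token receives two experts while the flat one receives one, and the auxiliary loss is computed from the language token alone. The paper's actual tail-token criterion and loss may differ.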