Unified Sparse Mixture of Experts

Giang Do, Hung Le, Truyen Tran

TL;DR

The paper tackles SMoEs' brittle routing under fixed top-$k$ by recasting expert and token selection as a Linear Programming problem. It introduces Unified Sparse MoE (USMoE), consisting of a Unified Score $f_{USMoE}(S)=\alpha f_e(S)+\beta f_t(S)$ with $\alpha+\beta=1$ and a Unified Mechanism that jointly considers token and expert dimensions to select the most similar token-expert pairs. The authors prove the global optimality of the USMoE routing under a budget constraint and demonstrate that this approach mitigates representation collapse and information leakage common in traditional routing schemes. Extensive experiments across large language models and vision tasks, in training-free, fine-tuning, and pre-training settings, show up to 10% performance gains or up to 14% inference cost reductions, with robust performance under corruption and across fractional Top-$k$ configurations. These results indicate that USMoE provides a principled, scalable, and robust routing framework for sparse Mixture of Experts in both NLP and computer vision contexts.
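
To make the Unified Score concrete, here is a minimal sketch of the linear combination $\alpha f_e(S)+\beta f_t(S)$ with $\alpha+\beta=1$. The summary does not define $f_e$ and $f_t$, so this sketch assumes $f_e$ is a softmax over the expert dimension (row-wise) and $f_t$ is a softmax over the token dimension (column-wise). The names `softmax` and `unified_score` are hypothetical, and the paper's actual scoring functions may differ.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def unified_score(S, alpha=0.5):
    """Return alpha * f_e(S) + beta * f_t(S) with beta = 1 - alpha.

    S is a T x N list of lists (tokens x experts).
    ASSUMPTION: f_e is a row-wise softmax and f_t a column-wise softmax;
    the source only states that the two scores are linearly combined.
    """
    beta = 1.0 - alpha
    T, N = len(S), len(S[0])
    # f_e: normalize each token's scores across experts (rows).
    f_e = [softmax(row) for row in S]
    # f_t: normalize each expert's scores across tokens (columns).
    cols = [softmax([S[i][j] for i in range(T)]) for j in range(N)]
    f_t = [[cols[j][i] for j in range(N)] for i in range(T)]
    return [
        [alpha * f_e[i][j] + beta * f_t[i][j] for j in range(N)]
        for i in range(T)
    ]


# Toy example: 3 tokens, 2 experts.
S = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
U = unified_score(S, alpha=0.5)
```

Under these assumptions, setting `alpha=1.0` recovers a purely expert-normalized score and `alpha=0.0` a purely token-normalized one; intermediate values blend the two views of the same matrix.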

Abstract

Sparse Mixture of Experts (SMoEs) models scale the capacity of models while maintaining constant computational overhead. Early designs typically relied on a fixed value of $k$, where $k$ represents either the number of experts selected per token or the number of tokens assigned per expert. However, these approaches encounter three key limitations: they may fail to route to important experts or tokens, may assign irrelevant ones, and often suffer from representation collapse among experts. This paper reexamines SMoEs through the lens of Linear Programming, and proposes a Unified Sparse Mixture of Experts (USMoE) framework that addresses these limitations. Specifically, our approach introduces a unified mechanism that integrates information from both the expert and token dimensions, and a unified scoring function that linearly combines similarity scores between experts and tokens. We provide both theoretical justification and empirical evidence demonstrating USMoE's effectiveness in overcoming the limitations of traditional routing methods. Through comprehensive evaluations on both clean and corrupted settings for large language models and vision tasks, under both training-free and training scenarios, USMoE achieves up to a 10% performance improvement over standard approaches or reduces inference costs by up to 14%, while maintaining competitive accuracy.
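
The first two limitations of fixed-$k$ routing can be illustrated on a toy score matrix. The sketch below is not the authors' code: `token_choice` and `expert_choice` are hypothetical helper names, and the scores are made up for illustration. Both routers spend the same budget of four token-expert pairs.

```python
def token_choice(S, k):
    """Token Choice: every token (row) keeps its k highest-scoring experts."""
    pairs = set()
    for i, row in enumerate(S):
        best = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        pairs.update((i, j) for j in best)
    return pairs


def expert_choice(S, k):
    """Expert Choice: every expert (column) keeps its k highest-scoring tokens."""
    pairs = set()
    for j in range(len(S[0])):
        best = sorted(range(len(S)), key=lambda i: S[i][j], reverse=True)[:k]
        pairs.update((i, j) for i in best)
    return pairs


# Illustrative scores (rows = tokens, columns = experts); values are invented.
S = [
    [0.9, 0.80],  # token 0: relevant to both experts
    [0.8, 0.20],  # token 1: relevant to expert 0 only
    [0.7, 0.10],  # token 2: relevant to expert 0 only
    [0.1, 0.05],  # token 3: noise, irrelevant to every expert
]

tc = token_choice(S, k=1)   # 4 tokens x 1 expert each = 4 pairs
ec = expert_choice(S, k=2)  # 2 experts x 2 tokens each = 4 pairs

print("Token Choice pairs: ", sorted(tc))
print("Expert Choice pairs:", sorted(ec))
```

In this example Token Choice is forced to route the noise token 3 and cannot give token 0 its second strong expert, while Expert Choice drops token 2 entirely and fills expert 1's quota with the weak pair (token 1, expert 1).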

Paper Structure

This paper contains 30 sections, 3 theorems, 28 equations, 12 figures, 16 tables, 1 algorithm.

Key Result

Proposition 3.1

Let $S \in \mathbb{R}^{T \times N}$ be the compatibility score matrix and $c \in \mathbb{N}$ a global routing budget. Consider the objective $\max_{X} \sum_{i,j} x_{ij} S_{ij}$, subject to the constraint $\sum_{i,j} x_{ij} \leq c$, with $x_{ij} \in \{0, 1\}$. Let $X_{\text{USMoE}} = \text{TopK}(S, c)$ be the binary mask produced by selecting the top-$c$ entries of $S$ globally. Then for any other feasible binary routing matrix $X_T
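
The optimality claim can be sanity-checked numerically on a small matrix. The following sketch assumes the objective is the total score of the selected pairs and that all scores are non-negative; it compares a global top-$c$ selection against an exhaustive search over every mask with at most $c$ ones. The function names and the example scores are hypothetical, and this is a brute-force check on one instance, not a proof.

```python
import math
from itertools import combinations


def global_topk(S, c):
    """Select the c highest-scoring (token, expert) pairs over the whole matrix."""
    cells = [(i, j) for i in range(len(S)) for j in range(len(S[0]))]
    ranked = sorted(cells, key=lambda p: S[p[0]][p[1]], reverse=True)
    return set(ranked[:c])


def total_score(S, pairs):
    """Sum of S[i][j] over the selected pairs (assumed routing objective)."""
    return sum(S[i][j] for i, j in pairs)


def brute_force_best(S, c):
    """Best total score over all binary masks with at most c selected pairs."""
    cells = [(i, j) for i in range(len(S)) for j in range(len(S[0]))]
    best = 0.0  # the empty mask is always feasible
    for size in range(1, c + 1):
        for subset in combinations(cells, size):
            best = max(best, total_score(S, subset))
    return best


# Illustrative non-negative scores (rows = tokens, columns = experts).
S = [
    [0.9, 0.80],
    [0.8, 0.20],
    [0.7, 0.10],
    [0.1, 0.05],
]
c = 4

selected = global_topk(S, c)
matches = math.isclose(total_score(S, selected), brute_force_best(S, c))
print("selected pairs:", sorted(selected))
print("global top-c matches exhaustive optimum:", matches)
```

The exhaustive search grows combinatorially with $T \times N$, so it is only practical for toy sizes; the global top-$c$ selection needs a single sort over the flattened scores.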

Figures (12)

  • Figure 1: We compare token routing performance in vision tasks using 7×7 images, where each token is color-coded based on its assigned expert under setting $c=t$, where $c$ is a sparsity constraint and $t$ is the number of image patches. Left: When the object is easy to distinguish, Expert Choice (EC) fails to assign different experts appropriately. Token Choice (TC) performs better but still does not align perfectly with the actual object, while USMoE correctly separates the object. Right: In more challenging images, both Expert Choice (EC) and Token Choice (TC) fail to distinguish between object and background. In contrast, USMoE successfully differentiates the object from the background, demonstrating greater efficiency in vision tasks compared to EC and TC, as further shown in Section \ref{sec:exp}.
  • Figure 2: An illustration of our USMoE selection mechanism (middle) is shown, which incorporates information from both the token and expert dimensions. In contrast, Token Choice (TC, left) considers only the expert dimension, while Expert Choice (EC, right) focuses solely on the token dimension. Tokens marked with # represent noisy or irrelevant tokens. TC struggles to handle these # tokens effectively, and EC is prone to missing important tokens. USMoE addresses both issues, making our method more robust than traditional MoE approaches. This robustness is demonstrated in both theoretical analysis and experimental results. 'E' denotes experts. Best viewed in color.
  • Figure 3: The performance of Unified Score (USMoE), Softmax (TC), and Sigmoid (EC) across the MTEB benchmark. Sigmoid outperforms Softmax on tasks without prompting, indicating stronger semantic representations.
  • Figure 4: Performance comparison of USMoE, Token Choice (TC), Expert Choice (EC), and MoEE across MTEB tasks and advanced SMoE models. The best result for each row is highlighted in bold. Best viewed in color.
  • Figure 5: Performance comparison of USMoE, Token Choice (TC), and Expert Choice (EC) using QwenMoE and OLMoE on the Supervised Fine-Tuning task on the Alpaca dataset for 2K steps under both clean and corrupted settings. Training and validation perplexity over training steps are reported; lower values are better.
  • ...and 7 more figures

Theorems & Definitions (8)

  • Proposition 3.1
  • Definition 3.2: Unified Score Function
  • Lemma 3.3
  • Definition A.1: General Routing Function $\mathcal{R}$
  • Lemma A.2
  • proof
  • proof
  • proof