Unified Sparse Mixture of Experts

Giang Do; Hung Le; Truyen Tran

TL;DR

The paper addresses the brittle routing of Sparse Mixture of Experts (SMoE) models under a fixed top-$k$ by recasting expert and token selection as a Linear Programming problem. It introduces Unified Sparse MoE (USMoE), which consists of a Unified Score, $f_{USMoE}(S)=\alpha f_e(S)+\beta f_t(S)$ with $\alpha+\beta=1$, and a Unified Mechanism that jointly considers the token and expert dimensions to select the most similar token-expert pairs. The authors prove that USMoE routing is globally optimal under a budget constraint and show that it mitigates the representation collapse and information leakage common in traditional routing schemes. Experiments on large language models and vision tasks, in training-free, fine-tuning, and pre-training settings, show up to 10% performance gains or up to 14% lower inference cost, with robust performance under corruption and across fractional top-$k$ configurations. These results indicate that USMoE provides a principled, scalable, and robust routing framework for sparse Mixture of Experts in both NLP and computer vision contexts.
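The Unified Score and Unified Mechanism described above can be illustrated with a minimal NumPy sketch. Several details are assumptions made only for illustration and are not specified in this summary: $f_e$ is taken to be a softmax over the expert axis, $f_t$ a softmax over the token axis, and the budget constraint is taken to be a cap on the total number of selected token-expert pairs. The function names `unified_score` and `unified_route` are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def unified_score(S, alpha=0.5):
    # S has shape (num_tokens, num_experts) and holds token-expert similarities.
    # Assumption: f_e normalizes over experts, f_t normalizes over tokens.
    beta = 1.0 - alpha                 # enforces alpha + beta = 1
    f_e = softmax(S, axis=1)           # each token's distribution over experts
    f_t = softmax(S, axis=0)           # each expert's distribution over tokens
    return alpha * f_e + beta * f_t

def unified_route(S, budget, alpha=0.5):
    # Assumption: the budget caps the total number of selected pairs.
    # Keep the `budget` highest-scoring (token, expert) pairs overall,
    # instead of a fixed k per token or per expert.
    scores = unified_score(S, alpha)
    top = np.argsort(scores, axis=None)[::-1][:budget]
    mask = np.zeros(scores.shape, dtype=bool)
    mask[np.unravel_index(top, scores.shape)] = True
    return mask, scores

rng = np.random.default_rng(0)
S = rng.normal(size=(6, 4))            # 6 tokens, 4 experts (toy sizes)
mask, scores = unified_route(S, budget=12, alpha=0.5)
print(mask.sum(axis=1))                # experts assigned per token (may vary)
```

In this sketch a budget of 12 pairs for 6 tokens matches the total compute of top-2 routing, but the number of experts per token is free to vary, which is the behavior a fixed top-$k$ rules out.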

Abstract

Sparse Mixture of Experts (SMoE) models scale model capacity while keeping computational overhead constant. Early designs typically relied on a fixed value of $k$, where $k$ represents either the number of experts selected per token or the number of tokens assigned per expert. However, these approaches encounter three key limitations: they may fail to route to important experts or tokens, may assign irrelevant ones, and often suffer from representation collapse among experts. This paper reexamines SMoEs through the lens of Linear Programming and proposes a Unified Sparse Mixture of Experts (USMoE) framework that addresses these limitations. Specifically, our approach introduces a unified mechanism that integrates information from both the expert and token dimensions, and a unified scoring function that linearly combines similarity scores between experts and tokens. We provide both theoretical justification and empirical evidence demonstrating USMoE's effectiveness in overcoming the limitations of traditional routing methods. Through comprehensive evaluations in both clean and corrupted settings for large language models and vision tasks, under both training-free and training scenarios, USMoE achieves up to a 10% performance improvement over standard approaches or reduces inference costs by up to 14% while maintaining competitive accuracy.
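The Linear Programming view can be illustrated with a generic budgeted selection problem. The notation below is assumed for illustration and may differ from the paper's own formulation: $s_{ij} \ge 0$ is the score of token $i$ for expert $j$, $x_{ij}$ indicates whether the pair is selected, $N$ and $E$ are the numbers of tokens and experts, and $B \le NE$ is the total budget of token-expert pairs.

```latex
\max_{x} \; \sum_{i=1}^{N} \sum_{j=1}^{E} s_{ij}\, x_{ij}
\quad \text{s.t.} \quad
\sum_{i=1}^{N} \sum_{j=1}^{E} x_{ij} \le B,
\qquad 0 \le x_{ij} \le 1 .
```

Under these assumptions an optimal solution sets $x_{ij}=1$ for the $B$ largest scores and $x_{ij}=0$ elsewhere. In the same notation, fixed top-$k$ routing corresponds to adding a per-token constraint $\sum_{j} x_{ij} = k$ (experts chosen per token) or a per-expert constraint $\sum_{i} x_{ij} = k$ (tokens chosen per expert). Either constraint can exclude a high-scoring pair or force in a low-scoring one, which corresponds to the first two limitations listed above.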
