
Routing-Free Mixture-of-Experts

Yilun Liu, Jinru Han, Sikuan Yan, Volker Tresp, Yunpu Ma

Abstract

Standard Mixture-of-Experts (MoE) models rely on centralized routing mechanisms that introduce rigid inductive biases. We propose Routing-Free MoE, which eliminates hard-coded centralized designs, including external routers, Softmax, Top-K selection, and load balancing, and instead encapsulates all activation functionality within the individual experts, optimized directly through continuous gradient flow so that each expert determines its activation entirely on its own. We further introduce a unified adaptive load-balancing framework that simultaneously optimizes expert-balancing and token-balancing objectives through a configurable interpolation, allowing flexible and customizable resource allocation. Extensive experiments show that Routing-Free MoE consistently outperforms baselines with better scalability and robustness. We analyze its behavior in detail and offer insights that may facilitate future MoE design and optimization.
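To make the contrast with router-based MoE concrete, the following is a minimal sketch, not the paper's implementation, of a routing-free layer in which every expert carries its own gate. The class names, the sigmoid gate, and the dense evaluation of all experts are illustrative assumptions; the paper's actual expert-internal activation mechanism and how it is sparsified are specified in later sections.

```python
# Minimal sketch of a routing-free MoE layer (illustrative; not the paper's code).
# Each expert carries its own gate and decides its own activation; there is no
# external router, no Softmax over experts, and no Top-K selection.
import torch
import torch.nn as nn


class RoutingFreeExpert(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )
        # Expert-internal gate: a scalar activation score per token, trained
        # end-to-end through a continuous (sigmoid) gradient path.
        self.gate_proj = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = torch.sigmoid(self.gate_proj(x))  # (batch, seq, 1), in (0, 1)
        return a * self.ffn(x)                # the expert scales its own output


class RoutingFreeMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(
            [RoutingFreeExpert(d_model, d_hidden) for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Outputs are summed; each expert's own gate decides how much it contributes.
        return sum(expert(x) for expert in self.experts)
```

Because each gate is a smooth function trained by ordinary backpropagation, no Softmax over experts, Top-K selection, or auxiliary router is required in this sketch; any sparsity must emerge from the experts' own gates.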

Figures (8)

  • Figure 1: Standard MoE relies on routing to orchestrate expert activations. Routing-Free MoE lets each expert independently determine its own activation. Green indicates activated components; red, inactive components; yellow, trainable components.
  • Figure 2: Routing-Free MoE consistently outperforms standard MoE, AoE [lv2025aoe], and ReMoE [wang2024remoe] in language modeling. All models are trained on OpenWebText [Gokaslan2019OpenWeb] under identical environments and their best-performing configurations, as described in Section \ref{sec:exp-setup}. FLOPs are estimated for one epoch.
  • Figure 3: Load-balancing for tokens and experts. Routing-Free MoE introduces a unified load-balancing framework that simultaneously optimizes expert-balancing and token-balancing through a configurable interpolation (a schematic sketch follows this figure list).
  • Figure 4: Training dynamics of Routing-Free MoE at scale S, with $r=16$, $\lambda_0=10^{-10}$, $\eta=0.02$, and $\alpha=10^{-3}$.
  • Figure 5: Training dynamics of Routing-Free MoE by $\alpha$, with $r=16$, $\lambda_0=10^{-10}$, and $\eta=0.02$.
  • ...and 3 more figures
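As a companion to Figure 3, the following is a schematic sketch, under assumed definitions, of an interpolated load-balancing penalty. The function name `unified_balance_loss`, the interpolation weight `beta`, and the squared coefficient-of-variation form are placeholders, not the paper's objective; the paper's own formulation and hyperparameters ($r$, $\lambda_0$, $\eta$, $\alpha$ in Figures 4 and 5) are defined in later sections.

```python
# Schematic sketch of an interpolated load-balancing penalty (cf. Figure 3);
# the exact objective and the roles of the paper's hyperparameters are defined
# in the paper itself -- `beta` and the squared-CV form here are assumptions.
import torch


def unified_balance_loss(gates: torch.Tensor, beta: float = 0.5) -> torch.Tensor:
    """gates: (num_tokens, num_experts) expert-internal activation scores in (0, 1).

    Interpolates an expert-balancing term (every expert should receive a similar
    total load) with a token-balancing term (every token should consume a similar
    amount of expert capacity).
    """
    expert_load = gates.sum(dim=0)  # total load received by each expert
    token_load = gates.sum(dim=1)   # total capacity consumed by each token

    def squared_cv(v: torch.Tensor) -> torch.Tensor:
        # Squared coefficient of variation: 0 when perfectly balanced.
        return v.var(unbiased=False) / (v.mean() ** 2 + 1e-9)

    # beta = 1 -> pure expert balancing; beta = 0 -> pure token balancing.
    return beta * squared_cv(expert_load) + (1.0 - beta) * squared_cv(token_load)
```

Sliding `beta` between 0 and 1 trades off the two objectives, which is the kind of configurable resource allocation the abstract and Figure 3 describe.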