Mixture of Universal Experts: Scaling Virtual Width via Depth-Width Transformation

Yilong Chen, Naibin Gu, Junyuan Shang, Zhenyu Zhang, Yuchen Feng, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang

TL;DR

This work proposes Mixture of Universal Experts (MoUE), a generalization of MoE that introduces a new scaling dimension, Virtual Width. MoUE outperforms matched MoE baselines by up to 1.3% across scaling regimes and enables progressive conversion of existing MoE checkpoints with gains of up to 4.2%.

Abstract

Mixture-of-Experts (MoE) decouples model capacity from per-token computation, yet its scalability remains limited by the physical dimensions of depth and width. To overcome this, we propose Mixture of Universal Experts (MoUE), a MoE generalization introducing a novel scaling dimension: Virtual Width. MoUE reuses a universal, layer-agnostic expert pool across layers, converting depth into virtual width under a fixed per-token activation budget. However, two challenges remain: a routing path explosion from recursive expert reuse, and a mismatch between the exposure induced by reuse and conventional load-balancing objectives. We address these with three core components: a Staggered Rotational Topology for structured expert sharing, a Universal Expert Load Balance for depth-aware exposure correction, and a Universal Router with lightweight trajectory state for coherent multi-step routing. Empirically, MoUE consistently outperforms matched MoE baselines by up to 1.3% across scaling regimes, enables progressive conversion of existing MoE checkpoints with up to 4.2% gains, and reveals a new scaling dimension for MoE architectures.
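
To make the idea of cross-layer reuse under a fixed activation budget more concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not define the Staggered Rotational Topology, so the rotating-window rule here (each layer sees a contiguous window of the shared pool, shifted by a fixed stride per layer) is an assumption chosen purely for illustration. The function names `visible_universal_experts` and `route_top_k` and the parameters `pool_size`, `window`, `stride`, and `k` are likewise hypothetical.

```python
# Illustrative sketch only. The window/stride rule is an ASSUMPTION and is
# not taken from the paper's definition of the Staggered Rotational Topology.


def visible_universal_experts(layer, pool_size, window, stride):
    """Return the shared-pool expert indices that `layer` may route to.

    Each layer sees a contiguous window of the pool. The window start
    rotates by `stride` per layer and wraps around the pool, so the same
    expert is reachable from several layers.
    """
    start = (layer * stride) % pool_size
    return [(start + i) % pool_size for i in range(window)]


def route_top_k(scores, candidates, k):
    """Select the k highest-scoring candidate experts for one token.

    `k` stays constant across layers, which models a fixed per-token
    activation budget regardless of how large the shared pool is.
    """
    ranked = sorted(candidates, key=lambda e: scores[e], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    pool_size, window, stride, k = 8, 4, 2, 2
    # Hypothetical router scores for one token over the whole pool.
    scores = [0.1, 0.9, 0.3, 0.5, 0.2, 0.8, 0.4, 0.7]
    for layer in range(4):
        candidates = visible_universal_experts(layer, pool_size, window, stride)
        print(layer, candidates, route_top_k(scores, candidates, k))
```

In this toy setup every layer activates only `k` experts, while the set of experts reachable across all layers covers the whole pool, which is one way to read "converting depth into virtual width".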

Paper Structure (33 sections, 16 equations, 10 figures, 5 tables, 1 algorithm)

Figures (10)

  • Figure 1: Overview of MoUE. A shared pool of Universal Experts is accessible from multiple layers, enabling recursive reuse under a fixed activation budget.
  • Figure 2: Expert similarity heatmap across shallow (L2, L3) and deep (L8, L9) layers in a trained MoE. Adjacent layers exhibit strong similarity, and specific experts re-emerge across distant layers, motivating cross-layer expert reuse.
  • Figure 3: An illustration of MoUE. Each layer routes tokens to a small set of experts, consisting of layer-local experts and a shared pool of Universal Experts (UEs). UEs are connected to multiple layers (under a constrained topology), enabling cross-layer reuse while keeping the per-token activation budget fixed.
  • Figure 4: LBL only focuses on balancing expert selection within each layer, while ULBL promotes balanced and diverse expert selection both across depth and width.
  • Figure 5: Left: LM training loss versus training steps. Middle: Validation loss versus training FLOPs. Right: Max/Mean ratio versus training steps (routing skew / load-balance indicator).
  • ...and 5 more figures
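
The Max/Mean ratio named in the Figure 5 caption can be illustrated with a short sketch. The caption does not say exactly how the ratio is computed, so this assumes it is the largest per-expert token count divided by the mean per-expert token count, where 1.0 means perfectly balanced routing. The function name `max_mean_ratio` and its arguments are hypothetical.

```python
from collections import Counter


def max_mean_ratio(assignments, num_experts):
    """Routing-skew indicator: max expert load divided by mean expert load.

    `assignments` is a flat list of expert indices, one entry per routed
    token (ASSUMED definition; the figure caption gives only the name).
    Experts that receive no tokens still count toward the mean.
    """
    counts = Counter(assignments)
    loads = [counts.get(e, 0) for e in range(num_experts)]
    mean = sum(loads) / num_experts
    if mean == 0:
        raise ValueError("no tokens were routed")
    return max(loads) / mean


if __name__ == "__main__":
    print(max_mean_ratio([0, 1, 2, 3], 4))  # balanced routing
    print(max_mean_ratio([0, 0, 0, 1], 4))  # skewed toward expert 0
```

For a pool shared across layers, the same function could be applied to assignments pooled over all layers rather than per layer, which is closer in spirit to the depth-and-width balancing described in the Figure 4 caption.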