Mixture of Universal Experts: Scaling Virtual Width via Depth-Width Transformation

Yilong Chen; Naibin Gu; Junyuan Shang; Zhenyu Zhang; Yuchen Feng; Jiawei Sheng; Tingwen Liu; Shuohuan Wang; Yu Sun; Hua Wu; Haifeng Wang

TL;DR

This work proposes Mixture of Universal Experts (MoUE), a generalization of MoE that introduces a new scaling dimension, Virtual Width. MoUE outperforms matched MoE baselines by up to 1.3% across scaling regimes and enables progressive conversion of existing MoE checkpoints with gains of up to 4.2%.

Abstract

Mixture-of-Experts (MoE) decouples model capacity from per-token computation, yet its scalability remains limited by the physical dimensions of depth and width. To overcome this, we propose Mixture of Universal Experts (MoUE), a generalization of MoE that introduces a novel scaling dimension: Virtual Width. MoUE reuses a universal, layer-agnostic expert pool across layers, converting depth into virtual width under a fixed per-token activation budget. Two challenges arise from this design: a routing path explosion caused by recursive expert reuse, and a mismatch between the exposure induced by reuse and conventional load-balancing objectives. We address these with three core components: a Staggered Rotational Topology for structured expert sharing, a Universal Expert Load Balance for depth-aware exposure correction, and a Universal Router with lightweight trajectory state for coherent multi-step routing. Empirically, MoUE consistently outperforms matched MoE baselines by up to 1.3% across scaling regimes, enables progressive conversion of existing MoE checkpoints with up to 4.2% gains, and reveals a new scaling dimension for MoE architectures.
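To make the idea of converting depth into virtual width concrete, the following is a minimal counting sketch. It is not the paper's implementation. The connectivity rule in `rotational_window` (a contiguous window over the shared pool, shifted by a fixed stride per layer) is only an assumed stand-in for the Staggered Rotational Topology, and every name and number in it is hypothetical. It shows that sharing a pool across layers increases the number of expert choices available along the depth of the network while the number of physically stored experts stays fixed.

```python
def rotational_window(layer: int, pool_size: int, window: int, stride: int) -> list[int]:
    """Indices of shared experts visible to `layer`.

    Hypothetical rule for illustration only: a contiguous window over the
    shared pool whose start shifts by `stride` per layer and wraps around.
    """
    start = (layer * stride) % pool_size
    return [(start + i) % pool_size for i in range(window)]


def count_expert_slots(num_layers: int, local_per_layer: int,
                       pool_size: int, window: int, stride: int) -> tuple[int, int]:
    """Return (physical experts stored, routable expert slots summed over layers)."""
    # Physical experts: layer-local experts plus one copy of the shared pool.
    physical = num_layers * local_per_layer + pool_size
    # Routable slots: at each layer, a token may pick from the layer's local
    # experts or from the shared experts in that layer's window.
    routable = sum(
        local_per_layer + len(rotational_window(layer, pool_size, window, stride))
        for layer in range(num_layers)
    )
    return physical, routable


if __name__ == "__main__":
    for layer in range(4):
        print(layer, rotational_window(layer, pool_size=8, window=4, stride=2))
    print(count_expert_slots(num_layers=4, local_per_layer=2,
                             pool_size=8, window=4, stride=2))
```

With these toy settings, each shared expert is reachable from two different layers, so the routable slots exceed the stored experts. The per-token activation budget is unaffected by this count, since it depends only on how many experts the router selects at each layer.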

Paper Structure (33 sections, 16 equations, 10 figures, 5 tables, 1 algorithm)

Introduction
Preliminary
Motivation: Case for Parameter Reuse
The Mixture of Universal Experts (MoUE)
General Framework
Structural Optimization: Staggered Connectivity
Universal Expert Load Balance (UELB)
Universal Router
Progressive Warm-Start
Experiments
Experimental Setup
Main Result
Ablation Studies
Analysis
Related Work
...and 18 more sections

Figures (10)

Figure 1: Overview of MoUE. A shared pool of Universal Experts is accessible from multiple layers, enabling recursive reuse under a fixed activation budget.
Figure 2: Expert similarity heatmap across shallow (L2, L3) and deep (L8, L9) layers in a trained MoE. Adjacent layers exhibit strong similarity, and specific experts re-emerge across distant layers, motivating cross-layer expert reuse.
Figure 3: An illustration of MoUE. Each layer routes tokens to a small set of experts, consisting of layer-local experts and a shared pool of Universal Experts (UEs). UEs are connected to multiple layers (under a constrained topology), enabling cross-layer reuse while keeping the per-token activation budget fixed.
Figure 4: LBL balances expert selection only within each layer, whereas ULBL promotes balanced and diverse expert selection across both depth and width.
Figure 5: Left: LM training loss versus training steps. Middle: Validation loss versus training FLOPs. Right: Max/Mean ratio versus training steps (routing skew / load-balance indicator).
...and 5 more figures
