Training Matryoshka Mixture-of-Experts for Elastic Inference-Time Expert Utilization

Yaoxiang Wang, Qingguo Hu, Yucheng Ding, Ruizhe Wang, Yeyun Gong, Jian Jiao, Yelong Shen, Peng Cheng, Jinsong Su

TL;DR

The paper addresses the brittleness of fixed-$k$ training in MoE models when inference-time elasticity is desired. It introduces Matryoshka MoE (M-MoE), which trains with a variable number of active experts to enforce coarse-to-fine hierarchical gating, with layer-wise stochasticity proving most effective. Experiments on a 20B MoE demonstrate that a single M-MoE model matches or exceeds the performance of an ensemble of specialist models across $k \in [1,6]$. The same model also enables novel layer-wise inference budgets and supports analysis of router behavior, including nested rankings and expert specialization. This approach offers practical, cost-efficient elastic inference for large-scale MoEs and informs deployment strategies that allocate computational budgets across layers for performance-efficiency trade-offs.

Abstract

Mixture-of-Experts (MoE) has emerged as a promising paradigm for efficiently scaling large language models without a proportional increase in computational cost. However, the standard Top-K router training strategy prevents MoE models from realizing their full potential for elastic inference. When the number of activated experts is altered at inference time, these models exhibit precipitous performance degradation. In this work, we introduce Matryoshka MoE (M-MoE), a training framework that instills a coarse-to-fine structure directly into the expert ensemble. By systematically varying the number of activated experts during training, M-MoE compels the model to learn a meaningful ranking: top-ranked experts collaborate to provide essential, coarse-grained capabilities, while subsequent experts add progressively finer-grained detail. We explore this principle at multiple granularities, identifying a layer-wise randomization strategy as the most effective. Our experiments demonstrate that a single M-MoE model achieves remarkable elasticity, with its performance at various expert counts closely matching that of an entire suite of specialist models, but at only a fraction of the total training cost. This flexibility not only unlocks elastic inference but also enables optimizing performance by allocating different computational budgets to different model layers. Our work paves the way for more practical and adaptable deployments of large-scale MoE models.
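The layer-wise randomization described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each MoE layer draws its own expert count $k$ uniformly from $[1,6]$ at every training step, and that the gate weights of the selected experts are renormalized to sum to one. The function names `top_k_route` and `sample_layer_budgets`, the layer and expert counts, and the random router logits are all hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(logits, k):
    """Select the k highest-scoring experts and renormalize their weights.

    Renormalizing over the selected experts is an assumption of this sketch.
    """
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]
    norm = sum(probs[i] for i in chosen)
    return chosen, [probs[i] / norm for i in chosen]


def sample_layer_budgets(num_layers, k_min, k_max, rng):
    """Draw an independent expert count for every MoE layer.

    Uniform sampling over [k_min, k_max] is an assumption of this sketch.
    """
    return [rng.randint(k_min, k_max) for _ in range(num_layers)]


# One simulated training step over a toy model (sizes are hypothetical).
rng = random.Random(0)
num_layers, num_experts = 4, 8
budgets = sample_layer_budgets(num_layers, 1, 6, rng)
for layer, k in enumerate(budgets):
    # Stand-in for the router output of one token at this layer.
    logits = [rng.gauss(0.0, 1.0) for _ in range(num_experts)]
    experts, weights = top_k_route(logits, k)
    print(f"layer {layer}: k={k}, experts={experts}")
```

Because the experts are taken as a prefix of one score-sorted ranking, the set chosen at a smaller $k$ is always contained in the set chosen at a larger $k$ for the same token. Training across many values of $k$ is what the abstract credits with making that ranking meaningful, so that the earliest-ranked experts carry the coarse capabilities.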

Paper Structure

This paper contains 36 sections, 7 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: MMLU score of DeepSeek-V2-Lite, Qwen3-30B-A3B-Base, and RedNote-Dots.LLM1.Base under varying numbers of activated experts.
  • Figure 2: Heatmaps illustrating the router's expert ranking consistency for the Top-k ($k=6$) model (top) and our M-MoE-Layer model (bottom). A bright color signifies a high correlation, indicating a strong nested, Matryoshka-like ranking structure.
  • Figure 3: Comparison of MODS for the Top-k and our model. Lower MODS indicates greater expert specialization.
  • Figure 4: MMLU score of the M-MoE-Layer model evaluated at different inference expert counts ($k=1, 2, 4, 6$) throughout continual pre-training. The x-axis represents training steps from the start of M-MoE training.