
Three Phases of Expert Routing: How Load Balance Evolves During Mixture-of-Experts Training

Charafeddine Mouzouni

Abstract

We model Mixture-of-Experts (MoE) token routing as a congestion game with a single effective parameter, the congestion coefficient gamma_eff, that quantifies the balance-quality tradeoff. Tracking gamma_eff across training checkpoints of two open-source MoE models, OLMoE-1B-7B (20 checkpoints, with dense sampling in the surge region) and OpenMoE-8B (6 checkpoints), reveals a three-phase trajectory: a surge phase where the router learns to balance load (gamma_eff: 14 to 36-39, peaking in the step 30K-40K region), a stabilization phase where experts specialize under steady balance (B_0: 2.4 to 2.3, steps 100K-400K), and a relaxation phase where the router trades balance for quality as experts differentiate (gamma_eff: 27 to 9, steps 400K-1.2M). This non-monotone trajectory, invisible to post-hoc analysis of converged models, reveals that early MoE training prioritizes balance while late training prioritizes quality. The theoretical framework is honest about its limits: the single-type equilibrium reduces to temperature-scaled softmax (held-out L1: MFG = 0.199 vs. softmax = 0.200). The game is not a better predictor; it reveals what the temperature means and, critically, how that temperature evolves. We complement the dynamics with an effective congestion decomposition, a multi-type extension that improves load prediction via token clustering on all 16 layers (mean: 30%), scope diagnostics (K/M, epsilon_l), and robustness verification across four independent quality estimators (r >= 0.89). All confidence intervals are from bootstrap resampling over 50 independent text batches.

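The abstract reports that all confidence intervals come from bootstrap resampling over 50 independent text batches. A minimal sketch of a percentile bootstrap CI for a per-batch statistic such as gamma_eff (the function name, batch values, and parameters here are illustrative assumptions, not the paper's code):

```python
import numpy as np

def bootstrap_ci(batch_values, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of a per-batch statistic,
    e.g. gamma_eff estimated separately on each independent text batch.
    (Illustrative sketch; not the paper's implementation.)"""
    rng = np.random.default_rng(seed)
    batch_values = np.asarray(batch_values, dtype=float)
    n = batch_values.size
    means = np.empty(n_boot)
    for b in range(n_boot):
        # Resample batches with replacement and record the resampled mean.
        means[b] = rng.choice(batch_values, size=n, replace=True).mean()
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# 50 simulated per-batch estimates near the surge-phase value of ~36
vals = np.random.default_rng(1).normal(36.0, 2.0, size=50)
lo, hi = bootstrap_ci(vals)
```

Resampling over whole batches rather than individual tokens respects the independence structure the abstract describes: tokens within a batch are correlated, batches are not.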

Paper Structure

This paper contains 57 sections, 8 theorems, 19 equations, 2 figures, and 6 tables.

Key Result

Proposition 2.2

The MFG equilibrium with linear congestion and entropy regularization exists, is unique, and lies in the interior of $\Delta_M$ (all experts receive positive load).
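Proposition 2.2 claims an interior equilibrium: under entropy regularization, every expert receives strictly positive load. A minimal numerical sketch of this behavior (not the paper's code; the quality vector, the linear cost $-q_i + \gamma\,\mathrm{load}_i$, and the damped best-response iteration are illustrative assumptions):

```python
import numpy as np

def mfg_equilibrium(q, gamma=2.0, tau=1.0, iters=500, damping=0.5):
    """Illustrative sketch of an entropy-regularized congestion equilibrium.
    One token type routes over M experts with qualities q; expert i's cost
    is -q[i] + gamma * load[i]. The entropy-regularized best response is
    softmax((q - gamma * p) / tau); we iterate (with damping) to a fixed
    point, which stays in the interior of the simplex."""
    q = np.asarray(q, dtype=float)
    p = np.full(q.shape, 1.0 / q.size)           # start uniform
    for _ in range(iters):
        z = (q - gamma * p) / tau
        br = np.exp(z - z.max())                 # numerically stable softmax
        br /= br.sum()
        p = (1 - damping) * p + damping * br     # damped update
    return p

p = mfg_equilibrium([2.0, 1.0, 0.0, -1.0], gamma=2.0, tau=1.0)
```

Because each update is a convex combination of strictly positive softmax outputs, every expert's load stays strictly positive, matching the interiority claim; raising gamma flattens the equilibrium loads, which is the balance-quality tradeoff the congestion coefficient quantifies.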

Figures (2)

  • Figure 1: Effective congestion $\gamma_{\mathrm{eff}}$ across 20 training checkpoints of OLMoE-1B-7B. The three-phase trajectory---surge, stabilization, relaxation---is the paper's central finding. Shaded band: 95% bootstrap CIs (where available). Open circles: dense-sample checkpoints (20 texts, no CI). The inverted-U shape, with a ${\geq}\,4.2\times$ peak-to-final ratio, is invisible to analysis of the converged model alone.
  • Figure 2: Robustness of the three-phase trajectory to quality estimation method. All four estimators reproduce the surge--stabilization--relaxation pattern ($r \geq 0.89$ vs. default mean). The three-phase finding is not an artifact of the quality proxy.

Theorems & Definitions (30)

  • Definition 2.1: MFG equilibrium
  • Proposition 2.2: Existence, uniqueness, interiority
  • proof
  • Remark 2.3: MoE isomorphism
  • Theorem 2.4: Softmax equivalence
  • proof
  • Remark 2.5: Significance
  • Definition 3.1: Effective congestion
  • Theorem 3.2: Identification
  • proof
  • ...and 20 more