From Score Distributions to Balance: Plug-and-Play Mixture-of-Experts Routing

Rana Shahout, Colin Cai, Yilun Du, Minlan Yu, Michael Mitzenmacher

TL;DR

The paper tackles the memory and latency challenges of inference in large Mixture-of-Experts models by addressing load imbalance across experts. It introduces LASER, a plug-and-play, inference-time routing algorithm that adapts to the gate-score distribution and real-time loads to balance utilization without retraining. LASER expands or constrains the candidate pool based on score distribution, then assigns tokens to the least-loaded experts, with per-layer parameterization to match layer-specific score patterns. Empirical results on Mixtral-8x7B and DeepSeek-MoE-16b-chat across four benchmarks show substantial reductions in expert imbalance and improved latency/throughput while maintaining near-baseline accuracy, highlighting LASER’s practical impact for scalable MoE inference.

Abstract

Mixture-of-Experts (MoE) models can scale parameter capacity by routing each token to a subset of experts through a learned gate function. While conditional routing reduces training costs, it shifts the burden to inference memory: expert parameters and activations consume memory, limiting the number of experts per device. As tokens are routed, some experts become overloaded while others are underutilized. Because experts are mapped to GPUs, this imbalance translates directly into degraded system performance in terms of latency, throughput, and cost. We present LASER, a plug-and-play, inference-time routing algorithm that balances load while preserving accuracy. LASER adapts to the shape of the gate's score distribution. When scores provide a clear preference, it routes to the strongest experts; when scores are more uniform, it broadens the set of viable experts and routes to the least-loaded among them. Because LASER relies only on gate scores from a trained model, it integrates directly into existing MoE inference pipelines without retraining or finetuning. We evaluate LASER on Mixtral-8x7B and DeepSeek-MoE-16b-chat across four datasets (ARC-Easy, ARC-Challenge, MMLU, and GSM8K). LASER improves load balancing, translating into lower latency and higher throughput, while keeping the accuracy changes negligible.
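
The routing behavior described in the abstract can be illustrated with a minimal per-token sketch. This is not the paper's Algorithm 1: the function name `route_token`, the single threshold test on the top-$k$ score mass, the default values, and the tie-breaking rule are all assumptions made for illustration. Only the ideas of a candidate pool of size $c$, a threshold $\varepsilon_{\text{high}}$ that disables expansion, and assignment to the least-loaded candidates come from the text.

```python
import numpy as np

def route_token(scores, loads, k=2, c=4, eps_high=0.8):
    """Pick k experts for one token (illustrative sketch, not the paper's algorithm).

    scores   : gate probabilities for this token, one per expert
    loads    : NumPy integer array of current per-expert token counts (updated in place)
    k        : experts per token
    c        : candidate pool size (c == k reduces to vanilla top-k)
    eps_high : assumed threshold on top-k score mass above which expansion is disabled
    """
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")  # experts by descending score
    top_k = order[:k]

    if c <= k or scores[top_k].sum() >= eps_high:
        # Skewed scores: the gate has a clear preference, keep the top-k experts.
        chosen = top_k
    else:
        # Flatter scores: widen to the top-c experts and take the k least-loaded.
        pool = order[:c]
        by_load = np.argsort(loads[pool], kind="stable")  # ties favor higher score
        chosen = pool[by_load[:k]]

    loads[chosen] += 1
    return sorted(int(e) for e in chosen)
```

With loads `[5, 4, 0, 1, 0]`, a skewed score vector such as `[0.6, 0.3, 0.05, 0.03, 0.02]` keeps experts 0 and 1, whereas a near-uniform vector such as `[0.22, 0.21, 0.20, 0.19, 0.18]` is redirected to the lightly loaded experts 2 and 3. A real implementation would operate on batches and use the per-layer parameters the paper describes.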

Paper Structure

This paper contains 18 sections, 4 equations, 12 figures, 1 table, 1 algorithm.

Figures (12)

  • Figure 1: Layer-wise gate score distribution variability in Mixtral-8$\times$7B. Rows correspond to datasets (GSM8K, MMLU, Wiki).
  • Figure 2: Comparison of routing strategies. Each figure shows 5 experts with $k=2$ experts per token; the icon above each expert indicates load. Left (Vanilla): top-$k$ routing always picks the two highest-scoring experts (e.g., experts 1 and 2), even if they are overloaded. Right (LASER): routing adapts to score distribution and load. For token 1, the skewed distribution leads to the top-2 choice; for token 2, the uniform distribution lets LASER assign to the least-loaded experts; for token 3, expert 3’s score is close to expert 2 and it is less loaded, so LASER selects experts 1 and 3.
  • Figure 3: Per-layer max violation (MV) on Mixtral-8$\times$7B ($k=2$) across GSM8K, MMLU, ARC-Easy, and ARC-Challenge. We compare LASER with candidate pool sizes $c=2,3,4$ against the load-only lower bound (MV=0). Increasing $c$ reduces MV across most layers, with the largest improvements in middle layers where gate score distributions are flatter. In contrast, the final layers have high top-$M_2$ mass; our setting of $\varepsilon_{\text{high}}$ disables expansion in these layers, so MV remains unchanged but accuracy is preserved.
  • Figure 4: Expert utilization in Mixtral-8$\times$7B on GSM8K for different candidate pool sizes ($c=2,3,4$). For $c=2$, token assignments concentrate on a few experts, leading to imbalance. As $c$ increases, tokens spread more evenly across experts, producing smoother utilization patterns.
  • Figure 5: Mixtral-8$\times$7B ($k=2$). LASER maintains accuracy while reducing imbalance ($I_{\text{agg}}$) across datasets. The largest improvement appears on GSM8K (up to $1.63\times$ reduction in mean $I_{\text{agg}}$). When $c=k$, LASER matches vanilla top-$k$ routing.
  • ...and 7 more figures