A Replicate-and-Quantize Strategy for Plug-and-Play Load Balancing of Sparse Mixture-of-Experts LLMs

Zijie Liu, Jie Peng, Jinhao Duan, Zirui Liu, Kaixiong Zhou, Mingfu Liang, Luke Simon, Xi Liu, Zhaozhuo Xu, Tianlong Chen

TL;DR

This paper presents a systematic analysis of expert routing during inference in Sparse Mixture-of-Experts LLMs and identifies three findings: (i) load imbalance persists and worsens with larger batch sizes, (ii) selection frequency does not reliably reflect expert importance, and (iii) overall expert workload and importance can be estimated using a small calibration set.

Abstract

Sparse Mixture-of-Experts (SMoE) architectures are increasingly used to scale large language models efficiently, delivering strong accuracy under fixed compute budgets. However, SMoE models often suffer from severe load imbalance across experts, where a small subset of experts receives most tokens while others are underutilized. Prior work has focused mainly on training-time solutions such as routing regularization or auxiliary losses, leaving inference-time behavior, which is critical for deployment, less explored. We present a systematic analysis of expert routing during inference and identify three findings: (i) load imbalance persists and worsens with larger batch sizes, (ii) selection frequency does not reliably reflect expert importance, and (iii) overall expert workload and importance can be estimated using a small calibration set. These insights motivate inference-time mechanisms that rebalance workloads without retraining or router modification. We propose Replicate-and-Quantize (R&Q), a training-free and near-lossless framework for dynamic workload rebalancing. In each layer, heavy-hitter experts are replicated to increase parallel capacity, while less critical experts and replicas are quantized to remain within the original memory budget. We also introduce a Load-Imbalance Score (LIS) to measure routing skew by comparing heavy-hitter load to an equal allocation baseline. Experiments across representative SMoE models and benchmarks show up to 1.4x reduction in imbalance with accuracy maintained within +/-0.6%, enabling more predictable and efficient inference.
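
The replicate-then-quantize step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of a single heavy hitter and a single quantized expert per layer, and the 16-bit and 8-bit precisions are all assumptions made for illustration. The importance scores are hypothetical inputs, kept separate from the token counts because selection frequency does not reliably reflect importance.

```python
from dataclasses import dataclass


@dataclass
class ExpertPlan:
    expert_id: int
    bits: int         # storage precision of this copy (assumed values)
    is_replica: bool  # True for the extra copy of the heavy hitter


def plan_replicate_and_quantize(token_counts, importance, full_bits=16, low_bits=8):
    """Hypothetical per-layer plan: replicate the busiest expert, quantize the least important.

    token_counts[i]: tokens routed to expert i on a calibration set
    importance[i]:   importance score of expert i (higher means more important)
    """
    n = len(token_counts)
    if n < 2 or len(importance) != n:
        raise ValueError("need matching lists with at least two experts")

    # Heavy hitter: the expert that received the most calibration tokens.
    heavy = max(range(n), key=lambda i: token_counts[i])
    # Least important expert, excluding the heavy hitter itself.
    least = min((i for i in range(n) if i != heavy), key=lambda i: importance[i])

    plan = [ExpertPlan(i, low_bits if i == least else full_bits, False) for i in range(n)]
    # The replica of the heavy hitter is stored at low precision.
    plan.append(ExpertPlan(heavy, low_bits, True))
    return plan


def memory_units(plan):
    """Relative memory in units of (bits x parameters of one expert)."""
    return sum(p.bits for p in plan)


def balanced_loads(token_counts, plan):
    """Idealized load per copy, assuming tokens split evenly across an expert's copies."""
    copies = {}
    for p in plan:
        copies[p.expert_id] = copies.get(p.expert_id, 0) + 1
    return [token_counts[p.expert_id] / copies[p.expert_id] for p in plan]


counts = [50, 10, 20, 20]
scores = [0.9, 0.4, 0.1, 0.7]
plan = plan_replicate_and_quantize(counts, scores)
print(memory_units(plan))            # 64, the same as 4 experts at 16 bits
print(balanced_loads(counts, plan))  # [25.0, 10.0, 20.0, 20.0, 25.0]
```

Under these assumed precisions, one 8-bit replica plus one expert quantized from 16 to 8 bits leaves the total unchanged, which is one way the "original memory budget" constraint could be met. The even split across copies is an idealization; how tokens are actually dispatched between an expert and its replica is not specified in the text above.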

Paper Structure (17 sections, 3 equations, 13 figures, 3 tables, 3 algorithms)

Figures (13)

  • Figure 1: Overview of the proposed Replicate-and-Quantize (R&Q) framework. Compared to vanilla SMoE inference, which suffers from severe load imbalance due to overused heavy-hitter experts, R&Q mitigates this issue by replicating high-load experts with quantized copies while simultaneously quantizing less important ones within the same memory budget. This achieves more balanced expert utilization and faster inference without requiring model retraining.
  • Figure 2: Allocated tokens versus the inverse of the pruning-based importance score for experts in the first MoE block of LLaMA-MoE on the PIQA dataset. Each point represents one expert. The weak correlation between token allocation and inverse importance illustrates that heavily utilized experts are not necessarily the most important for task performance, motivating the decoupling of routing frequency and expert importance in our R&Q design.
  • Figure 3: Effect of Batch Size on Load Imbalance. This figure reports the LIS (Definition 1) for six representative tasks when evaluated under batch sizes of 1 and 32 using the Switch Transformer (8 experts). Across all datasets, larger batch sizes consistently amplify load imbalance, as independently routed tokens increasingly concentrate on a small subset of experts. This highlights the scalability limitation of static routing and underscores the need for inference-time adaptation such as R&Q to maintain balanced expert utilization.
  • Figure 4: Overview of the Replicate-and-Quantize (R&Q) framework compared to vanilla MoE. (a) Vanilla MoE exhibits inference-time load imbalance, where a small number of heavy-hitter experts receive a disproportionate share of tokens due to the router’s allocation pattern. (b) The proposed R&Q method identifies heavy-hitter and less important experts using a small calibration set, replicates the former, and quantizes both the latter and the replicas under the same memory budget. During inference, the replicated experts mitigate routing bottlenecks and reduce token concentration, leading to more balanced expert utilization without requiring any additional training.
  • Figure 5: Evaluating expert removal strategies in SMoEs. Accuracy (%) after removing one expert per layer, comparing our Wanda-based selection (“Ours”) against Random and Heavy-hitter baselines, with Raw as the unmodified model. Across both LLaMA-MoE and Switch Transformer (8 Experts), our method consistently preserves or improves accuracy, indicating more reliable identification of less-important experts. This confirms that importance-aware pruning maintains task fidelity while offering additional compression headroom.
  • ...and 8 more figures

Theorems & Definitions (1)

  • Definition 1: Load Imbalance Score
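
The formal statement of Definition 1 is not reproduced on this page. The abstract only says that the LIS compares heavy-hitter load to an equal allocation baseline. A minimal sketch of one score with that shape follows, assuming the ratio of the busiest expert's token count to the per-expert share under a uniform split. The function name and the exact formula are assumptions and may differ from the paper's definition.

```python
def load_imbalance_score(token_counts):
    """Assumed form: max expert load divided by the equal-allocation share.

    A value of 1.0 means perfectly balanced routing; larger values mean the
    heaviest expert receives that many times its fair share of tokens.
    """
    total = sum(token_counts)
    if total == 0:
        raise ValueError("no tokens were routed")
    equal_share = total / len(token_counts)  # tokens per expert if split evenly
    return max(token_counts) / equal_share


print(load_imbalance_score([25, 25, 25, 25]))  # 1.0
print(load_imbalance_score([50, 10, 20, 20]))  # 2.0
```

With this assumed form, a lower score after rebalancing would indicate that the heaviest expert copy is closer to its fair share of tokens.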