Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts

Shwai He, Weilin Cai, Jiayi Huang, Ang Li

TL;DR

This work identifies the Straggler Effect in Mixture of Experts (MoE) inference, where heavily loaded experts bottleneck latency because token-to-expert assignment is imbalanced. It proposes Capacity-Aware Token Drop, which caps per-expert load at $C = \gamma \bar{N}$ and applies score-based discarding to the excess tokens, plus Capacity-Aware Expanded Drop, which widens each token's local candidate set to $k+m$ experts so that underutilized experts receive more work. The methods yield substantial speedups (up to $1.87\times$ per layer in some setups) with little effect on accuracy (e.g., a $0.2\%$ average improvement on Mixtral-8$\times$7B-Instruct), and extend to multimodal MoE, where image-token redundancy enables aggressive dropping. Together, the results offer practical guidance for reducing inference latency and improving resource utilization in language and multimodal MoE deployments.
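The capacity rule $C = \gamma \bar{N}$ can be made concrete with a toy calculation. The sketch below is illustrative only: it assumes $\bar{N}$ is the mean number of token assignments per expert, rounds the capacity down (the summary does not state a rounding rule), and uses hypothetical helper names (`expert_loads`, `straggler_ratio`, `capacity`) that do not come from the paper.

```python
from collections import Counter

def expert_loads(assignments, num_experts):
    """Count token assignments per expert.

    assignments: one list of expert ids per token (its top-k choices).
    """
    counts = Counter(e for experts in assignments for e in experts)
    return [counts.get(e, 0) for e in range(num_experts)]

def straggler_ratio(loads):
    """Heaviest load divided by mean load; 1.0 means perfectly balanced."""
    mean_load = sum(loads) / len(loads)
    return max(loads) / mean_load

def capacity(loads, gamma):
    """C = gamma * N_bar, rounded down (rounding rule is an assumption)."""
    mean_load = sum(loads) / len(loads)
    return int(gamma * mean_load)

# Toy routing: 8 tokens, top-2 over 4 experts, with expert 0 heavily favoured.
assignments = [[0, 1], [0, 2], [0, 1], [0, 3], [0, 2], [0, 1], [1, 2], [0, 3]]
loads = expert_loads(assignments, num_experts=4)
ratio = straggler_ratio(loads)
cap = capacity(loads, gamma=1.5)
print(loads, ratio, cap)
```

In this toy case the busiest expert holds 7 of the 16 assignments while the mean is 4, so under expert parallelism the other experts would wait for it. A capacity factor of $\gamma = 1.5$ would cap every expert at 6 assignments.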

Abstract

The Mixture of Experts (MoE) is an effective architecture for scaling large language models by leveraging sparse expert activation to balance performance and efficiency. However, under expert parallelism, MoE suffers from inference inefficiencies due to imbalanced token-to-expert assignment, where underloaded experts complete computations early but must wait for overloaded experts, leading to global delays. We define this phenomenon as the Straggler Effect, as the most burdened experts dictate the overall inference latency. To address this, we first propose Capacity-Aware Token Drop, which enforces expert capacity limits by discarding excess tokens from overloaded experts, effectively reducing load imbalance with minimal performance impact (e.g., $30\%$ speedup with only $0.9\%$ degradation on OLMoE). Next, given the presence of low-load experts remaining well below the capacity threshold, we introduce Capacity-Aware Expanded Drop, which allows tokens to include additional local experts in their candidate set before enforcing strict local capacity constraints, thereby improving load balance and enhancing the utilization of underused experts. Extensive experiments on both language and multimodal MoE models demonstrate the effectiveness of our approach, yielding substantial gains in expert utilization, model performance, and inference efficiency, e.g., applying Expanded Drop to Mixtral-8$\times$7B-Instruct yields a $0.2\%$ average performance improvement and a $1.85\times$ inference speedup.
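As a rough illustration of Token Drop, the sketch below caps each expert at $C = \gamma \bar{N}$ assignments and discards the rest. It assumes that the discarded assignments are the ones with the lowest gating scores at each overloaded expert (the abstract says only that excess tokens are discarded), that $\bar{N}$ is the mean number of assignments per expert, and that the capacity is rounded down. The function name `token_drop` and the `(token, expert, score)` route format are hypothetical, chosen for this example.

```python
def token_drop(routes, num_experts, gamma):
    """Enforce a per-expert capacity on a list of routing decisions.

    routes: list of (token_id, expert_id, gating_score) tuples.
    Returns (kept, dropped, cap), where kept and dropped are lists of
    (token_id, expert_id) pairs.
    """
    mean_load = len(routes) / num_experts   # N_bar: mean assignments per expert
    cap = int(gamma * mean_load)            # C = gamma * N_bar (floor is an assumption)

    per_expert = {e: [] for e in range(num_experts)}
    for tok, exp, score in routes:
        per_expert[exp].append((score, tok))

    kept, dropped = [], []
    for exp, items in per_expert.items():
        # Highest score first; the sort is stable, so ties keep arrival order.
        items.sort(key=lambda pair: -pair[0])
        kept += [(tok, exp) for _, tok in items[:cap]]
        dropped += [(tok, exp) for _, tok in items[cap:]]
    return kept, dropped, cap

# Toy example: expert 0 receives five assignments, expert 1 receives one.
routes = [
    (0, 0, 0.9), (1, 0, 0.2), (2, 0, 0.7),
    (3, 0, 0.4), (4, 0, 0.8), (5, 1, 0.6),
]
kept, dropped, cap = token_drop(routes, num_experts=2, gamma=1.0)
print(cap, kept, dropped)
```

With $\gamma = 1.0$ the capacity is 3, so expert 0 keeps its three highest-scoring tokens and sheds the other two, while expert 1 stays well below capacity. That leftover headroom is the situation Expanded Drop is meant to exploit.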

Paper Structure

This paper contains 31 sections, 12 equations, 19 figures, 7 tables, 2 algorithms.

Figures (19)

  • Figure 1: Illustration of the Straggler Effect in MoE Inference, where the most burdened experts dictate the overall latency.
  • Figure 2: Expert-wise load, where each load value is divided by $\bar{N}$ for clarity. To ensure generality, we visualize loads across different datasets.
  • Figure 3: Illustration of Capacity-Aware Token Drop (a) and Expanded Drop (b). Both methods first select experts based on gating scores. In Token Drop, tokens exceeding the local device capacity are discarded prior to All-to-All communication. Expanded Drop enhances expert utilization by allowing each token to consider additional $m$ candidate experts on the same device while still enforcing strict local capacity constraints.
  • Figure 4: Speedup of a single MoE layer compared to the baseline without capacity constraints, achieved through two capacity-aware inference methods: Token Drop and Expanded Drop.
  • Figure 5: End-to-End Model Speedup
  • ...and 14 more figures
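The Figure 3 caption describes Expanded Drop as letting each token consider $m$ additional candidate experts on the same device before the capacity limit is applied. The sketch below is one possible reading of that description, not the authors' implementation. It assumes a single device (so every expert is local), widens each token's candidate set from its top-$k$ to its top-$(k+m)$ experts by gating score, and then keeps the highest-scoring candidates at each expert up to $C = \gamma \bar{N}$, with $\bar{N}$ taken as the mean top-$k$ load per expert. The name `expanded_drop` and the score-matrix input are hypothetical.

```python
def expanded_drop(scores, k, m, gamma):
    """Expand each token's candidates to top-(k+m), then cap every expert.

    scores[t][e]: gating score of token t for expert e (all experts local).
    Returns (kept, cap), where kept maps expert id -> list of token ids.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    # Capacity based on the mean top-k load (assumed definition of N_bar).
    cap = int(gamma * num_tokens * k / num_experts)

    candidates = {e: [] for e in range(num_experts)}
    for t, row in enumerate(scores):
        ranked = sorted(range(num_experts), key=lambda e: -row[e])
        for e in ranked[:k + m]:            # m = 0 reduces to plain Token Drop
            candidates[e].append((row[e], t))

    kept = {}
    for e, items in candidates.items():
        items.sort(key=lambda pair: -pair[0])   # highest gating score first
        kept[e] = [t for _, t in items[:cap]]
    return kept, cap

# Toy example: every token prefers expert 0, so top-1 routing overloads it.
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
plain, cap = expanded_drop(scores, k=1, m=0, gamma=1.0)
expanded, _ = expanded_drop(scores, k=1, m=1, gamma=1.0)
print(cap, plain, expanded)
```

In this toy case, plain dropping leaves expert 1 idle and discards tokens 2 and 3 entirely, whereas the expanded candidate set routes those two tokens to expert 1 without raising any expert's load above the capacity.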