Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection

Vima Gupta, Jae Hyung Ju, Kartik Sinha, Ada Gavrilovska, Anand Padmanabha Iyer

TL;DR

Lynx addresses the memory bandwidth bottleneck in batched Mixture-of-Experts inference by introducing run-time dynamic expert remapping that is workload-agnostic and calibration-free. It identifies token importance via router confidence, ranks each token's top-$k$ experts by batch-wide significance, and applies phase-aware optimizations to the memory-bound decode phase, using lightweight CUDA kernels to remap tokens to a smaller set of active experts. Empirically, Lynx improves throughput by up to 1.23x with minimal accuracy loss (on several tasks it improves accuracy by up to 4%) across multiple model families and benchmarks, and it boosts existing techniques such as offloading and quantization by up to 1.38x. The approach is designed as a plug-and-play addition to existing LLM serving stacks, offering a practical path to efficient large-scale MoE deployment in production settings.
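
The remapping idea summarized above can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation: the function name `remap_batch`, the scoring rule (summing router weights per expert across the batch), and the substitution rule (replace a dropped expert with the highest-scoring active expert the token does not already use) are all assumptions chosen for illustration. Lynx itself performs this step in CUDA kernels.

```python
from collections import Counter


def remap_batch(topk_choices, keep):
    """Illustrative batch-aware expert remapping (hypothetical helper).

    topk_choices: one list per token of (expert_id, router_weight) pairs.
    keep: number of experts allowed to stay active for this batch.
    Returns (active_expert_set, remapped_choices).
    """
    # Assumed batch-wide score: total router weight each expert receives.
    scores = Counter()
    for choices in topk_choices:
        for expert, weight in choices:
            scores[expert] += weight

    # Keep the highest-scoring experts; ties go to the lower expert id.
    active = sorted(scores, key=lambda e: (-scores[e], e))[:keep]
    active_set = set(active)

    remapped = []
    for choices in topk_choices:
        # Active experts this token already routes to.
        used = {e for e, _ in choices if e in active_set}
        new_choices = []
        for expert, weight in choices:
            if expert in active_set:
                new_choices.append((expert, weight))
                continue
            # Assumed rule: substitute the best active expert not yet used.
            substitute = next((e for e in active if e not in used), None)
            if substitute is not None:
                used.add(substitute)
                new_choices.append((substitute, weight))
            # If no unused active expert remains, the slot is dropped.
        remapped.append(new_choices)
    return active_set, remapped


# Three tokens, top-2 routing over four experts, shrink to two active experts.
batch = [
    [(0, 0.7), (1, 0.3)],
    [(0, 0.6), (2, 0.4)],
    [(3, 0.55), (1, 0.45)],
]
active, remapped = remap_batch(batch, keep=2)
print(active)
print(remapped)
```

In this toy batch, four distinct experts would be fetched without remapping; after remapping, every token routes only to the two retained experts, which is the source of the reduced memory traffic in the decode phase.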

Abstract

Selective parameter activation provided by Mixture-of-Experts (MoE) models has made them a popular choice in modern foundational models. However, MoEs face a fundamental tension when employed for serving. Batching, critical for performance in serving, forces the activation of all experts, thereby negating MoEs' benefits and exacerbating memory bandwidth bottlenecks. Existing work on efficient MoE inference is unable to resolve this tension even with extensive workload-specific tuning. We present LYNX, a system that enables efficient MoE inference in a workload-agnostic fashion. Exploiting several key observations that we uncover in this work, LYNX provides a lightweight run-time dynamic expert remapping technique that depends only on information already available in the models. Our evaluation of LYNX on four state-of-the-art model families across nine benchmarks shows that it achieves up to 1.23x improvement in throughput while simultaneously improving accuracy by up to 4% in the majority of the tasks, and incurs only a negligible accuracy loss of less than 1 percentage point on significantly hard tasks. Further, LYNX is complementary to existing techniques, additionally boosting their performance by up to 1.38x.

Paper Structure

This paper contains 28 sections, 2 equations, 16 figures, 1 table.

Figures (16)

  • Figure 1: Impact of batching on decode latency. Even at modest batch sizes (8-16), all experts are activated, forcing a memory fetch of all model parameters. Left: Decode latency increases 2.5× as batch size grows from 1 to 32, because a larger number of experts is activated. Right: The number of activated experts rapidly saturates at the maximum (8) with increasing batch size.
  • Figure 2: Left: Prefill latency remains relatively constant ($\sim$1100 ms) regardless of the number of active experts, indicating compute-bound behavior. Right: Decode latency scales linearly with the number of active experts, revealing memory bandwidth as the bottleneck.
  • Figure 3: Comparison of expert activation patterns across different granularities: (Left) aggregate dataset-level uniformity, (Right) batch-level skew for Mixtral 8x7B (Upper) and Qwen2 (Lower) (experts combined for readability).
  • Figure 4: System architecture of Lynx. Left: The phase-aware optimizer identifies memory-bound inference phases; layer-level components are compiled using CUDA Graphs for maximum performance. Center: Lynx activates far fewer experts, reducing memory traffic (two out of four are activated in this example). Right: Lynx first bins the expert choices of each token, then computes batch-wide scores for each expert, selects the critical experts, and finally remaps the expert choices of all tokens to the critical expert set.
  • Figure 5: Impact of token confidence on accuracy for selective expert routing. Reassigning high confidence tokens maintains better accuracy than reassigning low confidence tokens.
  • ...and 11 more figures
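
The saturation described in the Figure 1 caption can be checked with a back-of-the-envelope estimate. The sketch below assumes each token picks its `top_k` experts uniformly at random and independently of other tokens, which is a simplification (Figure 3 reports batch-level skew), and it assumes top-2 routing over the 8 experts shown in Figure 1. Under those assumptions, the expected number of activated experts for a decode batch of $B$ tokens is $N\,(1 - (1 - k/N)^B)$. The function name is hypothetical.

```python
def expected_active_experts(num_experts, top_k, batch_size):
    """Expected number of distinct experts hit by a decode batch.

    Assumes uniform, independent routing: each token selects top_k distinct
    experts, so a given expert is skipped by one token with probability
    (1 - top_k / num_experts).
    """
    p_idle = (1 - top_k / num_experts) ** batch_size
    return num_experts * (1 - p_idle)


# 8 experts as in Figure 1; top_k=2 is an assumption about the routing.
for batch_size in (1, 2, 4, 8, 16, 32):
    estimate = expected_active_experts(8, 2, batch_size)
    print(f"batch={batch_size:2d}  expected active experts ~ {estimate:.2f}")
```

Even under this idealized model, the estimate exceeds 7 of 8 experts by batch size 8, which is consistent with the caption's observation that modest batch sizes already force nearly all expert weights to be fetched.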