Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection

Vima Gupta; Jae Hyung Ju; Kartik Sinha; Ada Gavrilovska; Anand Padmanabha Iyer

TL;DR

Lynx addresses the memory bandwidth bottleneck in batched Mixture-of-Experts (MoE) inference with run-time dynamic expert remapping that is workload-agnostic and calibration-free. It identifies token importance via router confidence, ranks top-$k$ experts by batch-wide significance, and applies phase-aware optimizations to the memory-bound decode phase, using lightweight CUDA kernels to remap tokens to a smaller set of active experts. Across four model families and nine benchmarks, Lynx improves throughput by up to 1.23x while improving accuracy by up to 4% on the majority of tasks and losing less than 1 percentage point on the hardest ones. It also complements existing techniques such as offloading and quantization, boosting their performance by up to 1.38x. The approach is designed as a plug-and-play addition to existing LLM serving stacks, offering a practical path to efficient large-scale MoE deployment in production settings.
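The remapping idea summarized above can be illustrated with a small sketch. This is not the Lynx implementation, which uses CUDA kernels inside a serving stack. It is a pure-Python illustration under stated assumptions: the function names (`topk_indices`, `remap_batch`), the `budget` parameter, and the choice of summed router weight as the batch-wide significance score are all hypothetical and chosen only for illustration.

```python
def topk_indices(probs, k):
    """Return the indices of the k largest entries, highest first (ties by index)."""
    return sorted(range(len(probs)), key=lambda e: (-probs[e], e))[:k]


def remap_batch(router_probs, k, budget):
    """Restrict a batch to `budget` active experts and remap tokens onto them.

    router_probs: one list of per-expert router probabilities per token.
    Returns (active_experts, per_token_expert_assignments).
    """
    num_experts = len(router_probs[0])
    assert k <= budget <= num_experts
    selections = [topk_indices(p, k) for p in router_probs]

    # Assumed significance score: total router weight an expert receives
    # from the tokens in the batch that selected it.
    score = [0.0] * num_experts
    for probs, chosen in zip(router_probs, selections):
        for e in chosen:
            score[e] += probs[e]

    # Keep only the `budget` most significant experts for this batch.
    active = sorted(topk_indices(score, budget))
    active_set = set(active)

    remapped = []
    for probs, chosen in zip(router_probs, selections):
        kept = [e for e in chosen if e in active_set]
        # Replace dropped experts with the token's next-best active experts.
        spare = [e for e in topk_indices(probs, num_experts)
                 if e in active_set and e not in kept]
        remapped.append(kept + spare[: k - len(kept)])
    return active, remapped


# Three tokens, four experts, top-2 routing, at most three active experts.
example = [
    [0.60, 0.30, 0.05, 0.05],
    [0.50, 0.10, 0.35, 0.05],
    [0.10, 0.20, 0.30, 0.40],
]
active, remapped = remap_batch(example, k=2, budget=3)
print(active)    # [0, 2, 3]
print(remapped)  # [[0, 2], [0, 2], [3, 2]]
```

In this toy batch, expert 1 is selected by only one token with modest router weight, so it is dropped and that token's second slot is remapped to expert 2. Fewer distinct experts are then loaded for the batch, which is the source of the bandwidth savings the summary describes.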

Abstract

The selective parameter activation provided by Mixture-of-Experts (MoE) models has made them a popular choice in modern foundation models. However, MoEs face a fundamental tension when employed for serving. Batching, critical for performance in serving, forces the activation of all experts, thereby negating MoEs' benefits and exacerbating memory bandwidth bottlenecks. Existing work on efficient MoE inference is unable to resolve this tension even with extensive workload-specific tuning. We present LYNX, a system that enables efficient MoE inference in a workload-agnostic fashion. Exploiting several key observations that we uncover in this work, LYNX provides a lightweight run-time dynamic expert remapping technique that depends only on information already available in the models. Our evaluation of LYNX on four state-of-the-art model families across nine benchmarks shows that it achieves up to a 1.23x improvement in throughput while simultaneously improving accuracy by up to 4% in the majority of the tasks, and incurs only a negligible accuracy loss of less than 1 percentage point on significantly harder tasks. Further, LYNX is complementary to existing techniques, boosting their performance by up to an additional 1.38x.
