
SERE: Similarity-based Expert Re-routing for Efficient Batch Decoding in MoE Models

Juntong Wu, Jialiang Cheng, Fuyu Lv, Ou Dan, Li Yuan

TL;DR

SERE tackles the tension between batched decoding and sparse activation in MoE models by leveraging similarity among experts to dynamically re-route tokens away from redundant secondary experts to their most similar primary counterparts. It precomputes layer-wise expert similarity matrices from calibration data and preserves critical experts to protect capacity, while introducing a fast CUDA kernel for seamless integration with vLLM. Across multiple MoE models and OpenCompass benchmarks, SERE achieves up to 2x decoding speedups with minimal loss in accuracy, outperforming static pruning and other dynamic skipping methods in practical deployment scenarios. The approach demonstrates how structured intra-model redundancy can be exploited to improve throughput without retraining, paving the way for cost-efficient and latency-sensitive large-scale MoE deployments.

Abstract

Mixture-of-Experts (MoE) architectures employ sparse activation to deliver faster training and inference with higher accuracy than dense LLMs. However, in production serving, MoE models require batch inference to optimize hardware efficiency, which may cause excessive expert activation and thus slow the memory-bound decoding stage. To address the fundamental tension between batch decoding and expert sparsity, we present SERE, a Similarity-based Expert Re-routing method for Efficient batch decoding in MoE models. SERE dynamically reduces the number of active experts in an input-aware manner by re-routing tokens from secondary experts to their most similar primary counterparts. It also leverages similarity patterns to identify and preserve critical experts, thereby preventing capability loss. Notably, SERE avoids static expert pruning or merging, instead enabling dynamic expert skipping based on batch-level expert redundancy. Additionally, we provide an efficient custom CUDA kernel for SERE, enabling plug-and-play use in vLLM with only a single-line code change. Extensive experiments on various complex reasoning benchmarks demonstrate that SERE achieves up to 2.0x speedup with minimal quality loss, providing a practical solution for cost-efficient and latency-sensitive large-scale MoE deployment. Code implementation of SERE can be found in https://github.com/JL-Cheng/SERE.
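
The abstract does not say how expert similarity is measured. As a rough illustration only, the sketch below assumes similarity is the mean per-token cosine similarity between expert outputs on the same calibration tokens. The function name `expert_similarity` and the array layout are hypothetical and are not taken from the SERE code base.

```python
import numpy as np

def expert_similarity(outputs: np.ndarray) -> np.ndarray:
    """Pairwise expert similarity for one MoE layer (illustrative assumption).

    outputs: array of shape (num_experts, num_tokens, hidden) holding each
             expert's output on the same calibration tokens.
    Returns a (num_experts, num_experts) matrix of mean cosine similarities.
    """
    # Normalize every per-token output vector to unit length.
    norms = np.linalg.norm(outputs, axis=-1, keepdims=True)
    unit = outputs / np.clip(norms, 1e-12, None)
    # Sum the per-token cosine for each expert pair, then average over tokens.
    return np.einsum("ith,jth->ij", unit, unit) / outputs.shape[1]

# Toy calibration data: expert 1 is a slightly perturbed copy of expert 0.
rng = np.random.default_rng(0)
outputs = rng.standard_normal((3, 16, 8))
outputs[1] = outputs[0] + 0.01 * rng.standard_normal((16, 8))

sim = expert_similarity(outputs)
print(np.round(sim, 3))
```

In this toy setup the near-duplicate pair (experts 0 and 1) scores close to 1, while the unrelated expert 2 scores much lower. One such matrix per layer would be computed once, offline, and then reused at decoding time.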

Paper Structure

This paper contains 35 sections, 1 theorem, 19 equations, 14 figures, 13 tables, 2 algorithms.

Key Result

Theorem 1

We consider replacing a single expert $\mathbf{E}^{(i)}_a$ in layer $i$ with another expert $\widetilde{\mathbf{E}}^{(i)}_a$ while keeping all other experts and routing weights unchanged, yielding a modified layer $\widetilde{\mathcal{N}}_i$. Let $\mathcal{F} = \mathcal{N}_k \circ \cdots$ [...] where the substitution error is [...]

Figures (14)

  • Figure 1: Larger batches activate more experts. With a fixed batch size, more experts increase decoding time.
  • Figure 2: Visualizations of SERE’s Performance. (a) Across all tasks, SERE ($K$=2) exhibits negligible performance loss, while SERE ($K$=1) still outperforms all baselines. (b) SERE significantly reduces batch decoding time, achieving up to 2$\times$ acceleration.
  • Figure 3: Illustration of SERE, using $4$ tokens and $4$ experts as an example. Tokens are first routed to their top-$2$ experts. SERE preserves the primary experts (1 and 4) and attempts to re-route the secondary experts (2 and 3). As a result, Expert 2 is replaced by Expert 1, while Expert 3 remains active because its similarity to all active experts falls below the threshold.
  • Figure 4: Visualization of the expert similarity matrices and the average expert similarity across all layers in Qwen3-30B-A3B (Yang et al., 2025).
  • Figure 5: Weights Distribution
  • ...and 9 more figures
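
The re-routing step described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' CUDA kernel: primary experts are taken to be each token's top-ranked expert(s), secondary experts are compared against primary experts only, and the names `sere_reroute`, `num_primary`, and `threshold`, as well as the similarity values, are hypothetical. Experts are 0-indexed here, so Experts 1 to 4 in the caption correspond to ids 0 to 3.

```python
import numpy as np

def sere_reroute(topk_ids, sim, num_primary=1, threshold=0.5):
    """Re-route secondary experts to their most similar primary expert.

    topk_ids: (num_tokens, k) expert ids, sorted by descending router weight.
    sim:      (num_experts, num_experts) precomputed similarity matrix.
    Returns the re-routed id array and the {secondary: primary} mapping.
    """
    topk_ids = np.asarray(topk_ids)
    # Assumption: primary experts are the union of each token's top choices.
    primary = np.unique(topk_ids[:, :num_primary])
    secondary = np.setdiff1d(np.unique(topk_ids), primary)

    mapping = {}
    for e in secondary:
        best = primary[np.argmax(sim[e, primary])]
        # Replace only when the closest primary expert is similar enough.
        if sim[e, best] >= threshold:
            mapping[int(e)] = int(best)

    rerouted = topk_ids.copy()
    for src, dst in mapping.items():
        rerouted[topk_ids == src] = dst
    return rerouted, mapping

# Figure 3 setting: 4 tokens, 4 experts, top-2 routing (made-up similarities).
topk_ids = [[0, 1], [0, 2], [3, 1], [3, 2]]
sim = np.array([
    [1.0, 0.9, 0.3, 0.1],
    [0.9, 1.0, 0.2, 0.2],
    [0.3, 0.2, 1.0, 0.4],
    [0.1, 0.2, 0.4, 1.0],
])
rerouted, mapping = sere_reroute(topk_ids, sim)
print(mapping)
print(rerouted.tolist())
```

With these values, expert id 1 (Expert 2 in the caption) is folded into id 0 (Expert 1), while id 2 (Expert 3) stays active because none of its similarities reaches the threshold, so the batch touches three experts instead of four. How the router weights are combined when a token ends up with the same expert twice is not covered by the caption and is left out of the sketch.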

Theorems & Definitions (4)

  • Definition 1: MoE Layer Structure
  • Definition 2: Expert Similarity
  • Theorem 1: Expert Substitution Error Bound
  • proof