Ban&Pick: Enhancing Performance and Efficiency of MoE-LLMs via Smarter Routing

Yuanteng Chen; Peisong Wang; Yuantian Shao; Nanxin Zeng; Chang Xu; Jian Cheng

TL;DR

This work targets routing inefficiencies in fine-grained Mixture-of-Experts LLMs, where routing saturates early during pre-training and balanced activation suppresses the potential of specialized experts. It introduces Ban&Pick, a post-training, plug-and-play framework with two components: Pick amplifies key domain-specialized experts, and Ban dynamically prunes redundant experts, improving accuracy and inference speed without retraining. Empirical results across math, code, and general reasoning benchmarks on DeepSeek and Qwen3 MoE-LLMs show that Pick alone yields average accuracy gains (e.g., ~2.83%), Ban delivers notable speedups with minimal accuracy loss, and Ban&Pick combines both improvements (e.g., ~1.99% accuracy gain with ~1.25x speedup on Qwen3-30B-A3B; ~1.33% with ~1.26x on Qwen3-235B-A22B). By identifying and leveraging key experts while trimming redundancy, the approach offers a scalable path to more effective and efficient MoE inference without architectural changes.

Abstract

Sparse Mixture-of-Experts (MoE) has become a key architecture for scaling large language models (LLMs) efficiently. Recent fine-grained MoE designs introduce hundreds of experts per layer, with multiple experts activated per token, enabling stronger specialization. However, during pre-training, routers are optimized mainly for stability and robustness: they converge prematurely and enforce balanced usage, which limits model performance and efficiency at inference. In this work, we uncover two overlooked issues: (i) a few highly influential experts are underutilized due to premature and balanced routing decisions; and (ii) enforcing a fixed number of active experts per token introduces substantial redundancy. Instead of retraining models or redesigning MoE architectures, we introduce Ban&Pick, a post-training, plug-and-play strategy for smarter routing. Pick discovers and reinforces key experts, a small group with outsized impact on performance, leading to notable accuracy gains across domains. Ban further dynamically prunes redundant experts based on layer and token sensitivity, delivering faster inference with minimal accuracy loss. Experiments on fine-grained MoE-LLMs (DeepSeek, Qwen3) across math, code, and general reasoning benchmarks demonstrate that Ban&Pick delivers free performance gains and inference acceleration without retraining or architectural changes. For instance, on Qwen3-30B-A3B, it improves accuracy from 80.67 to 84.66 on AIME2024 and from 65.66 to 68.18 on GPQA-Diamond, while accelerating inference by 1.25x under vLLM.
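
To make the two ideas concrete, the following is a minimal sketch of how Pick and Ban could modify top-k routing for a single token. It is not the authors' implementation: the abstract does not give the exact rules, so the function name `route`, the additive logit `boost` for key experts, and the relative-probability cutoff `ban_ratio` are all hypothetical choices made for illustration. In particular, the sketch prunes per token only, whereas the abstract states that Ban also accounts for layer sensitivity.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(logits, k, key_experts=frozenset(), boost=0.0, ban_ratio=0.0):
    """Return {expert_index: normalized weight} for one token.

    Assumed (hypothetical) rules, not taken from the paper:
      - Pick: add `boost` to the router logit of each expert in `key_experts`.
      - Ban: after top-k selection, drop any selected expert whose routing
        probability is below `ban_ratio` times the top expert's probability.
    """
    # Pick (assumed form): bias the router toward the key experts.
    adjusted = [
        logit + boost if i in key_experts else logit
        for i, logit in enumerate(logits)
    ]
    probs = softmax(adjusted)

    # Standard top-k selection over the (possibly boosted) probabilities.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    # Ban (assumed form): keep only experts close enough to the strongest one,
    # so the number of active experts varies per token instead of being fixed.
    cutoff = ban_ratio * probs[top[0]]
    kept = [i for i in top if probs[i] >= cutoff]

    # Renormalize the weights of the surviving experts.
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5, 0.1, -1.0, -2.0]
    print("baseline:", sorted(route(logits, k=4)))
    print("ban only:", sorted(route(logits, k=4, ban_ratio=0.3)))
    print("ban+pick:", sorted(route(logits, k=4, key_experts={3},
                                    boost=1.5, ban_ratio=0.3)))
```

With `boost=0.0` and `ban_ratio=0.0` the function reduces to ordinary top-k routing, which mirrors the plug-and-play framing: both adjustments act on router outputs at inference time and leave the expert weights untouched.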
