Ban&Pick: Enhancing Performance and Efficiency of MoE-LLMs via Smarter Routing

Yuanteng Chen, Peisong Wang, Yuantian Shao, Nanxin Zeng, Chang Xu, Jian Cheng

TL;DR

This work targets routing inefficiencies in fine-grained Mixture-of-Experts (MoE) LLMs, where routers converge early in pre-training and balanced activation suppresses the potential of specialized experts. It introduces Ban&Pick, a post-training, plug-and-play framework with two components: Pick amplifies key domain-specialized experts, and Ban dynamically prunes redundant experts, improving accuracy and inference speed without retraining. Across math, code, and general reasoning benchmarks on DeepSeek and Qwen3 MoE-LLMs, Pick alone yields average accuracy gains (e.g., ~2.83%), Ban delivers notable speedups with minimal accuracy loss, and Ban&Pick combines both (e.g., ~1.99% accuracy gain with ~1.25x speedup on Qwen3-30B-A3B; ~1.33% with ~1.26x on Qwen3-235B-A22B). By identifying and exploiting key experts while trimming redundancy, the approach offers a scalable path to more effective and efficient MoE inference without architectural changes.

Abstract

Sparse Mixture-of-Experts (MoE) has become a key architecture for scaling large language models (LLMs) efficiently. Recent fine-grained MoE designs introduce hundreds of experts per layer, with multiple experts activated per token, enabling stronger specialization. However, during pre-training, routers are optimized mainly for stability and robustness: they converge prematurely and enforce balanced usage, limiting the full potential of model performance and efficiency at inference. In this work, we uncover two overlooked issues: (i) a few highly influential experts are underutilized due to premature and balanced routing decisions; and (ii) enforcing a fixed number of active experts per token introduces substantial redundancy. Instead of retraining models or redesigning MoE architectures, we introduce Ban&Pick, a post-training, plug-and-play strategy for smarter routing. Pick discovers and reinforces key experts (a small group with outsized impact on performance), leading to notable accuracy gains across domains. Ban further dynamically prunes redundant experts based on layer and token sensitivity, delivering faster inference with minimal accuracy loss. Experiments on fine-grained MoE-LLMs (DeepSeek, Qwen3) across math, code, and general reasoning benchmarks demonstrate that Ban&Pick delivers free performance gains and inference acceleration without retraining or architectural changes. For instance, on Qwen3-30B-A3B, it improves accuracy from 80.67 to 84.66 on AIME2024 and from 65.66 to 68.18 on GPQA-Diamond, while accelerating inference by 1.25x under vLLM.
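
To make the two ideas in the abstract concrete, the following is a minimal sketch of top-k routing for a single token with a Pick-style and a Ban-style adjustment. It is an illustration under stated assumptions, not the authors' procedure: the function `route`, the parameters `key_experts` and `ban_threshold`, and the relative-probability pruning rule are all hypothetical, since the abstract does not specify how key experts are reinforced or how redundancy is measured.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(logits, k, key_experts=(), ban_threshold=0.0):
    """Return {expert_index: weight} for one token.

    key_experts and ban_threshold are hypothetical knobs used only for
    illustration; they are not taken from the paper.
    """
    assert 0.0 <= ban_threshold <= 1.0
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Pick-style step (assumed): key experts take slots first,
    # then the ordinary top-k ranking fills the remaining slots.
    chosen = list(key_experts)[:k]
    for i in ranked:
        if len(chosen) >= k:
            break
        if i not in chosen:
            chosen.append(i)

    # Ban-style step (assumed): drop a selected expert when its router
    # probability is small relative to the strongest selected expert.
    # Key experts are never dropped, and the strongest one always survives.
    top = max(probs[i] for i in chosen)
    kept = [i for i in chosen
            if i in key_experts or probs[i] >= ban_threshold * top]

    # Renormalize the weights of the surviving experts so they sum to 1.
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5, 0.0, -1.0, 3.0]
    print(route(logits, k=3))                      # plain top-3 routing
    print(route(logits, k=3, key_experts=(4,)))    # expert 4 forced in
    print(route(logits, k=3, ban_threshold=0.3))   # weakest selection dropped
```

With a fixed budget of 6 active experts per token, dropping one of them removes 1/6 (about 16.7%) of the expert computation for that token, which is the saving quoted in the Figure 1 caption.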

Paper Structure

This paper contains 40 sections, 6 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: The intuition and empirical findings on expert utilization. Left (Pick): forcibly activating key experts (e.g., E43) improves accuracy on Winogrande (Sakaguchi et al., 2021). Right (Ban): reducing experts from 6 to 5 cuts 16.7% computation without accuracy loss.
  • Figure 2: Expert specialization in fine-grained MoE (Qwen3‑30B‑A3B). Left: expert usage frequency across three tasks (math, general, code). Right: token word clouds for 3 high-frequency experts.
  • Figure 3: Impact of pruning (measured by KL divergence, left axis) and enhancing (by accuracy gain, right axis) for domain‑specialized experts.
  • Figure 4: Comparison of five designed enhancement methods for key experts, evaluated by accuracy on three widely used math benchmarks.
  • Figure 5: Sensitivity analysis of MoE experts: (a) layer-wise and (b) token-wise. Both dimensions exhibit large variance, motivating a dynamic pruning strategy.
  • ...and 4 more figures