Faster, Smaller, and Smarter: Task-Aware Expert Merging for Online MoE Inference
Ziyi Han, Xutong Liu, Ruiting Zhou, Xiangxiang Dai, John C. S. Lui
TL;DR
The paper tackles online inference for Sparse Mixture of Experts (SMoE) models, where explicit task tags are often unavailable and a full MoE model incurs high latency and memory costs. It introduces Tanbr, a tree-structured adaptive neural bandit router that performs task-aware expert merging: it estimates the task distribution from historical data and searches the continuous space of merging weights with a UCB-guided neural bandit. The method achieves a sublinear regret bound of $\mathcal{O}(\sqrt{T} \log(T))$ over $T$ rounds and, in experiments, reduces inference latency by at least 45% and memory usage by up to 25% while maintaining accuracy across multiple tasks and architectures. These results suggest that pre-trained MoE models can be deployed more efficiently in edge and other resource-constrained environments.
Abstract
Sparse Mixture of Experts (SMoE) has become a preferred architecture for scaling Transformer capacity without increasing computational cost, as it activates only a small subset of experts for each input. However, deploying such an approach for \textit{online inference} remains challenging due to the large size of a full SMoE model and the complexity of expert routing, especially in resource-constrained edge networks. Moreover, task information is often unavailable during online inference, which makes task-level routing error-prone. In this work, we propose a novel tree-structured adaptive neural bandit router, \texttt{Tanbr}, to enable efficient and reliable online MoE inference. Instead of relying on explicit task tags, \texttt{Tanbr} estimates the task distribution over time from historical data and uses it to guide task-aware expert merging within a given pre-trained MoE. To handle the large continuous space of merging weights, \texttt{Tanbr} employs a binary tree to progressively partition the space and generate finer candidate weights. It then applies a neural bandit to learn the non-linear mapping from merging weights to model performance and to select the best expert merging. We prove that \texttt{Tanbr} achieves a sublinear regret bound of $\mathcal{O}(\sqrt{T} \log(T))$ over $T$ rounds despite operating over a continuous decision space, matching the regret bounds of existing methods. Extensive experiments show that \texttt{Tanbr} reduces inference latency by at least $45\%$ and memory usage by up to $25\%$, while maintaining high accuracy relative to many state-of-the-art methods.
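The two mechanisms named in the abstract, weighted expert merging and binary-tree partitioning of the merging-weight space, can be illustrated with a minimal sketch. It is not the authors' implementation and makes several simplifying assumptions: there are only two experts, so the merging weights reduce to a single scalar w in [0, 1]; experts are flat parameter lists; and a plain empirical-mean UCB score stands in for the neural bandit that \texttt{Tanbr} uses. All names (`merge_experts`, `Node`, `tree_ucb_search`, `split_after`) are hypothetical.

```python
import math

def merge_experts(experts, weights):
    """Merge experts into one by a weighted average of their parameters."""
    assert len(experts) == len(weights)
    assert abs(sum(weights) - 1.0) < 1e-9  # merging weights lie on the simplex
    dim = len(experts[0])
    return [sum(w * e[i] for w, e in zip(weights, experts)) for i in range(dim)]

class Node:
    """One cell [lo, hi] of the weight space; its midpoint is the candidate weight."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.pulls = 0
        self.total = 0.0

    @property
    def center(self):
        return 0.5 * (self.lo + self.hi)

    def mean(self):
        return self.total / self.pulls

    def ucb(self, t):
        if self.pulls == 0:
            return float("inf")  # unexplored cells are tried first
        bonus = math.sqrt(2.0 * math.log(t) / self.pulls)
        # Assumption: the cell width is added as a crude discretization allowance.
        return self.mean() + bonus + (self.hi - self.lo)

def tree_ucb_search(reward_fn, rounds=200, split_after=4):
    """Pick the leaf with the highest UCB each round; split well-sampled leaves in two."""
    leaves = [Node(0.0, 1.0)]
    evaluated = []
    for t in range(1, rounds + 1):
        node = max(leaves, key=lambda n: n.ucb(t))
        if node.pulls == 0:
            evaluated.append(node)
        node.total += reward_fn(node.center)
        node.pulls += 1
        if node.pulls >= split_after and node.hi - node.lo > 1e-3:
            mid = node.center
            leaves.remove(node)
            leaves.extend([Node(node.lo, mid), Node(mid, node.hi)])  # finer candidates
    best = max(evaluated, key=lambda n: n.mean())
    return best.center, leaves

# Toy stand-in for "model performance as a function of the merging weight".
def toy_reward(w):
    return 1.0 - (w - 0.3) ** 2

best_w, leaves = tree_ucb_search(toy_reward)
merged = merge_experts([[1.0, 2.0], [3.0, 4.0]], [best_w, 1.0 - best_w])
```

In a real setting `reward_fn` would evaluate the merged model on incoming requests, and the weight space would have one dimension per expert rather than one scalar. The tree idea stays the same: cells that look promising are split into finer candidates, while unpromising regions remain coarse.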
