Expert-Token Resonance MoE: Bidirectional Routing with Efficiency Affinity-Driven Active Selection
Jing Li, Zhijie Sun, Dachao Lin, Xuan He, Binfan Zheng, Yi Lin, Rongqian Zhao, Xin Chen
TL;DR
This paper introduces Expert-Token Resonance (ETR), a bidirectional routing framework for MoE models that dynamically combines token-choice routing and expert-choice routing during training. By combining affinity routing based on Grouped Average Pooling (GrAP), a cosine-similarity-based affinity metric, a locality loss, and an adaptive expert-capacity strategy, ETR reduces communication bottlenecks and prevents expert homogenization. The approach improves end-to-end training efficiency by up to 46.6% and yields consistent gains on downstream benchmarks, while also lowering memory usage and improving load balance. This unified routing paradigm supports scalable, efficient MoE deployments with stronger expert specialization and is applicable to large-scale LLMs.
Abstract
Mixture-of-Experts (MoE) architectures enable efficient scaling of large language models by activating only a subset of parameters per input. However, existing MoE models suffer from two critical limitations: (1) inefficient token-to-expert routing that causes excessive communication overhead, and (2) expert homogenization that leads to redundant computation. Current approaches address these challenges separately and so fail to improve training efficiency and model performance at the same time. We present Expert-Token Resonance (ETR), a theoretically grounded bidirectional routing mechanism that rethinks expert-token interactions in MoE architectures. Our key insight is that optimal routing requires adaptive coordination between token-choice routing (TCR) during early training phases and expert-choice routing (ECR) in later stages. We prove that this dynamic approach maximizes the training success rate (the probability of correct token-expert assignments) while reducing the expert capacity lower bound by up to 40%. ETR incorporates three technical innovations: (1) an affinity-based routing architecture using Grouped Average Pooling (GrAP) that reduces computational complexity from O(d^2) to O(d^2/D) while maintaining orthogonality to prevent expert homogenization; (2) a bidirectional selection mechanism that lets both tokens and experts actively participate in the routing process based on cosine similarity scores; and (3) an adaptive capacity strategy that dynamically adjusts expert capacity bounds based on training progress, eliminating communication bubbles in All-to-All operations. Extensive experiments on Ascend NPU clusters show that ETR improves end-to-end training efficiency by 5.4% to 46.6% over baseline MoE implementations, with performance gains of 9.7% to 14.5% across the GDAD, GPQA, HumanEval, and TeleQnA benchmarks.
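To make the two routing directions concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes each expert is represented by a single embedding vector, scores token-expert pairs by cosine similarity, and switches from TCR to ECR at a fixed fraction of training. The function names, the `switch_at` threshold, and the hard switch itself are hypothetical; the abstract states only that TCR is favored early and ECR later.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def affinity_matrix(tokens, experts):
    """aff[i][j] is the cosine affinity of token i to expert j."""
    return [[cosine(t, e) for e in experts] for t in tokens]

def token_choice(aff, k):
    """TCR: every token selects its k highest-affinity experts."""
    pairs = set()
    for i, row in enumerate(aff):
        best = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        pairs.update((i, j) for j in best)
    return pairs

def expert_choice(aff, capacity):
    """ECR: every expert selects its `capacity` highest-affinity tokens."""
    pairs = set()
    for j in range(len(aff[0])):
        best = sorted(range(len(aff)), key=lambda i: aff[i][j], reverse=True)[:capacity]
        pairs.update((i, j) for i in best)
    return pairs

def route(aff, k, capacity, progress, switch_at=0.5):
    """Assumed schedule: TCR while progress < switch_at, ECR afterwards."""
    if progress < switch_at:
        return token_choice(aff, k)
    return expert_choice(aff, capacity)

# Three toy tokens and two toy experts in a 2-dimensional space.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
experts = [[1.0, 0.0], [0.0, 1.0]]
aff = affinity_matrix(tokens, experts)

early = route(aff, k=1, capacity=1, progress=0.1)  # token-choice phase
late = route(aff, k=1, capacity=1, progress=0.9)   # expert-choice phase
print(sorted(early))  # (token, expert) pairs chosen by tokens
print(sorted(late))   # (token, expert) pairs chosen by experts
```

In this toy setting, token choice routes every token but sends two of the three to expert 0, whereas expert choice with a capacity of 1 keeps the load even and leaves token 1 unassigned. That trade-off between coverage and balance is the one a bidirectional scheme has to manage.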
