Expert-Token Resonance MoE: Bidirectional Routing with Efficiency Affinity-Driven Active Selection

Jing Li, Zhijie Sun, Dachao Lin, Xuan He, Binfan Zheng, Yi Lin, Rongqian Zhao, Xin Chen

TL;DR

Expert-Token Resonance (ETR) is a bidirectional routing framework for MoE models that dynamically combines token-choice routing and expert-choice routing during training. By combining Grouped Average Pooling (GrAP) based affinity, a cosine-similarity affinity metric, a locality loss, and an adaptive expert-capacity strategy, ETR reduces communication bottlenecks and prevents expert homogenization. The approach improves end-to-end training efficiency by up to 46.6% and consistently improves downstream-task results across diverse benchmarks, while also lowering memory usage and improving load balance. This unified routing paradigm targets scalable, efficient MoE deployments with stronger expert specialization for large-scale LLMs.

Abstract

Mixture-of-Experts (MoE) architectures enable efficient scaling of large language models by activating only a subset of parameters per input. However, existing MoE models suffer from two critical limitations: (1) inefficient token-to-expert routing that causes excessive communication overhead, and (2) expert homogenization that leads to redundant computations. Current approaches address these challenges separately and so fail to improve training efficiency and model performance at the same time. We present Expert-Token Resonance (ETR), a theoretically grounded bidirectional routing mechanism that fundamentally reimagines expert-token interactions in MoE architectures. Our key insight is that optimal routing requires adaptive coordination between token-choice routing (TCR) during early training phases and expert-choice routing (ECR) in later stages. We prove that this dynamic approach maximizes the training success rate (the probability of correct token-expert assignments) while reducing the expert capacity lower bound by up to 40%. ETR incorporates three technical innovations: (1) an affinity-based routing architecture using Grouped Average Pooling (GrAP) that reduces computational complexity from O(d^2) to O(d^2/D) while maintaining orthogonality to prevent expert homogenization; (2) a bidirectional selection mechanism that enables both tokens and experts to actively participate in the routing process based on cosine similarity scores; and (3) an adaptive capacity strategy that dynamically adjusts expert bounds based on training progress, eliminating communication bubbles in All-to-All operations. Extensive experiments on Ascend NPU clusters demonstrate that ETR achieves 5.4% to 46.6% improvements in end-to-end training efficiency compared to baseline MoE implementations, with 9.7% to 14.5% performance gains across GDAD, GPQA, HumanEval, and TeleQnA benchmarks.
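
To make the two routing directions named in the abstract concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes each expert is represented by a single vector and that affinity is the plain cosine similarity between a token vector and an expert vector. The function names, the toy vectors, and the values `k=1` and `capacity=2` are all hypothetical. How ETR switches between or combines the two directions during training is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def affinity_matrix(tokens, experts):
    """affinity[t][e] is the cosine similarity of token t and expert e."""
    return [[cosine(t, e) for e in experts] for t in tokens]


def token_choice(affinity, k):
    """TCR: every token selects its k highest-affinity experts."""
    return [
        sorted(range(len(row)), key=lambda e: -row[e])[:k]
        for row in affinity
    ]


def expert_choice(affinity, capacity):
    """ECR: every expert selects its `capacity` highest-affinity tokens."""
    n_tokens = len(affinity)
    n_experts = len(affinity[0])
    return [
        sorted(range(n_tokens), key=lambda t: -affinity[t][e])[:capacity]
        for e in range(n_experts)
    ]


# Hypothetical toy data: four 2-D tokens and two orthogonal expert vectors.
tokens = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.8, 0.3]]
experts = [[1.0, 0.0], [0.0, 1.0]]

aff = affinity_matrix(tokens, experts)
tcr = token_choice(aff, k=1)          # one expert list per token
ecr = expert_choice(aff, capacity=2)  # one token list per expert

print("TCR (experts chosen by each token):", tcr)
print("ECR (tokens chosen by each expert):", ecr)
```

On this toy input, token choice sends three of the four tokens to expert 0, while expert choice with a capacity of 2 gives each expert exactly two tokens. This illustrates the general trade-off between the two schemes: token choice follows each token's preference but can overload an expert, whereas expert choice fixes the load per expert.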

Paper Structure

This paper contains 31 sections, 3 theorems, 25 equations, 13 figures, and 3 tables.

Key Result

Theorem 5

Under Assumptions ass:dis and ass:irr, the training success rate of TCR in each sample $\bm{x}$ is and the training success rate of ECR is $\forall i \in [n]$,

Figures (13)

  • Figure 1: The illustrative diagram of GrAP.
  • Figure 2: The illustration of affinity score.
  • Figure 3: The architecture of the gate network along with the hybrid TCR + ECR router.
  • Figure 4: The time consumption during training iterations with different schemes and cluster sizes.
  • Figure 5: The average composition of computation, communication, overlap, and idle with different schemes and cluster sizes.
  • ...and 8 more figures

Theorems & Definitions (6)

  • Definition 2: training success rate
  • Theorem 5
  • Corollary 6
  • Remark 7
  • Lemma 8
  • proof