Advancing MoE Efficiency: A Collaboration-Constrained Routing (C2R) Strategy for Better Expert Parallelism Design

Mohan Zhang, Pingzhi Li, Jie Peng, Mufan Qiu, Tianlong Chen

TL;DR

This work identifies load imbalance and cross-device communication as core bottlenecks in sparsely activated Mixture-of-Experts (MoE) models and introduces a collaboration-constrained routing (C2R) strategy. By profiling expert collaboration to form specialized groups and constraining routing to each expert's top collaborators, the approach reduces inter-device traffic and enables co-location of related experts, yielding consistent accuracy gains (average ~0.51% on LLaMA-MoE and ~0.33% on Qwen-MoE) and substantial wall-clock time savings (an extra 20–30% on top of MegaBlocks). The paper also demonstrates a Pareto-optimal balance between collaboration and specialization via the hyperparameter Top-$T$, and analyzes expert collaboration patterns to justify the design. Overall, C2R offers a model-system co-design path that improves MoE efficiency while preserving or enhancing accuracy, enabling more scalable deployment of large transformer models.

Abstract

Mixture-of-Experts (MoE) has successfully scaled up models while maintaining nearly constant computing costs. By employing a gating network to route input tokens, it selectively activates a subset of expert networks to process the corresponding token embeddings. In practice, however, MoE efficiency is hard to achieve for two key reasons: (1) imbalanced expert activation, which leads to substantial idle time during model or expert parallelism and to insufficient capacity utilization; and (2) massive communication overhead, induced by the numerous expert routing combinations in expert parallelism at the system level. Previous works typically formulate the problem as a load imbalance issue, characterized by the gating network favoring certain experts over others, or attribute it to static execution, which fails to adapt to the dynamic expert workload at runtime. In this paper, we explore it from a new perspective, a higher-order view and analysis of MoE routing policies: expert collaboration and specialization, where some experts tend to activate broadly with others (collaborative), while others are more likely to activate only with a specific subset of experts (specialized). Our experiments reveal that most experts tend to be overly collaborative, leading to increased communication overhead from repeatedly sending tokens to different accelerators. To this end, we propose a novel collaboration-constrained routing (C2R) strategy to encourage more specialized expert groups and to improve expert utilization, and we present an efficient implementation of MoE that further leverages expert specialization. We achieve an average performance improvement of 0.51% and 0.33% on LLaMA-MoE and Qwen-MoE, respectively, across ten downstream NLP benchmarks, and reduce the all-to-all communication costs between GPUs, bringing an extra 20%-30% savings in total running time on top of the existing SoTA, i.e., MegaBlocks.

Paper Structure

This paper contains 25 sections, 4 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of C2R. (a) shows the process of expert profiling where we obtain the expert collaboration matrix for each layer of the MoE model; (b) describes the mechanism of our C2R strategy. It first selects the $\texttt{top-}1$ expert for a given token ($\mathtt{Expert\ }i$ here) and then selects the remaining $\mathtt{K-1}$ experts from list $\texttt{Top-}\mathtt{T}(\mathtt{Expert\ }i)$; (c) shows our efficient expert parallelism design.
  • Figure 2: Visualization of the expert collaboration matrix in several intermediate layers of LLaMA-MoE after SFT. (a): Results with the conventional top-$\mathtt{K}$ routing strategy. (b): Results with our C2R strategy ($\mathtt{T}=2$). (c): Comparison of the average collaboration degree between Baseline and our C2R strategy. A darker pixel in (a) and (b) indicates a higher number of tokens routed simultaneously to the corresponding experts (indexed by row and column) within the given layer, which means these two experts collaborate more frequently. Note that many pixels in (b) have a value of 0, meaning that the corresponding two experts are never selected simultaneously, while most of the pixels in (a) have a light color, indicating a nonzero value. (c) demonstrates that experts in our model exhibit a higher degree of specialization.
  • Figure 3: Performance and collaboration degree comparison of LLaMA-MoE. (a) and (b) show the performance comparison between our C2R strategy (Ours) and the conventional top-$\mathtt{K}$ routing strategy (Baseline) on two groups of downstream tasks, namely Reasoning tasks and NLU tasks, with hyperparameter $\mathtt{T}$ varying from 1 to 6. (c) shows the collaboration degree comparison between Baseline and Ours under different values of hyperparameter $\mathtt{T}$ in different layers of the model. Note that since the LLaMA-MoE model we use selects 2 out of 8 experts per layer, our method degenerates to the conventional top-$\mathtt{K}$ routing strategy (i.e., the baseline) when $\mathtt{T}=7$, so we omit this case.
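
The profiling and routing steps described in the captions of Figures 1 and 2 can be illustrated with a small sketch. This is not the authors' implementation: the function names (`collaboration_matrix`, `top_t_collaborators`, `c2r_route`), the list-based data layout, and the tie-breaking rule are all assumptions made for illustration, and the sketch ignores batching, gate normalization, and expert placement across devices.

```python
from itertools import combinations


def collaboration_matrix(routed, num_experts):
    """Count how often each pair of experts is selected for the same token.

    `routed` holds, for one layer, the expert indices chosen per token
    (hypothetical layout, e.g. collected while profiling a top-K router).
    """
    counts = [[0] * num_experts for _ in range(num_experts)]
    for experts in routed:
        for a, b in combinations(sorted(set(experts)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def top_t_collaborators(counts, t):
    """For each expert, list its t most frequently co-selected experts."""
    table = []
    for i, row in enumerate(counts):
        others = [j for j in range(len(row)) if j != i]
        # Descending co-selection count; ties go to the lower index
        # (an arbitrary choice made for this sketch).
        others.sort(key=lambda j: (-row[j], j))
        table.append(others[:t])
    return table


def c2r_route(scores, collaborators, k):
    """Select the top-1 expert, then k-1 more from its Top-T list only.

    `scores` are one token's gate scores. If the Top-T list has fewer
    than k-1 entries, fewer than k experts are returned.
    """
    first = max(range(len(scores)), key=lambda e: scores[e])
    allowed = sorted(collaborators[first], key=lambda e: scores[e], reverse=True)
    return [first] + allowed[: k - 1]


# Toy profiling data: 5 tokens, 4 experts, 2 experts per token.
routed = [[0, 1], [0, 1], [0, 2], [1, 3], [2, 3]]
counts = collaboration_matrix(routed, num_experts=4)

scores = [0.1, 0.2, 0.4, 0.3]  # gate scores for one new token
constrained = c2r_route(scores, top_t_collaborators(counts, t=1), k=2)
unconstrained = c2r_route(scores, top_t_collaborators(counts, t=3), k=2)
print(constrained)    # [2, 0]
print(unconstrained)  # [2, 3]
```

In this toy run, expert 2 has the highest score, and with $\mathtt{T}=1$ its only allowed partner is expert 0, so the token goes to experts 2 and 0 even though expert 3 scores higher. With $\mathtt{T}$ equal to the number of experts minus one, every other expert is allowed and the result matches plain top-$\mathtt{K}$ routing, which mirrors the degenerate case noted in the Figure 3 caption.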