Advancing MoE Efficiency: A Collaboration-Constrained Routing (C2R) Strategy for Better Expert Parallelism Design
Mohan Zhang, Pingzhi Li, Jie Peng, Mufan Qiu, Tianlong Chen
TL;DR
This work identifies load imbalance and cross-device communication as core bottlenecks in sparsely activated Mixture-of-Experts (MoE) models and introduces a collaboration-constrained routing (C2R) strategy. By profiling expert collaboration to form specialized groups and constraining routing to each expert's top collaborators, the approach reduces inter-device traffic and enables co-location of related experts. It yields consistent accuracy gains (0.51% on average on LLaMA-MoE and 0.33% on Qwen-MoE) and an extra 20–30% wall-clock time savings on top of MegaBlocks. The paper also demonstrates a Pareto-optimal balance between collaboration and specialization, controlled by the hyperparameter Top-$T$, and analyzes expert collaboration patterns to justify the design. Overall, C2R offers a model-system co-design path that improves MoE efficiency while preserving or improving accuracy, enabling more scalable deployment of large transformer models.
Abstract
Mixture-of-Experts (MoE) has successfully scaled up models while keeping computing costs nearly constant. A gating network routes each input token to a selected subset of expert networks, which process the corresponding token embeddings. In practice, however, MoE efficiency is hard to achieve for two key reasons: (1) imbalanced expert activation, which leads to substantial idle time during model or expert parallelism and to insufficient capacity utilization; and (2) massive communication overhead, induced by the numerous expert routing combinations in expert parallelism at the system level. Previous works typically either formulate this as a load imbalance issue, in which the gating network favors certain experts over others, or attribute it to static execution that fails to adapt to the dynamic expert workload at runtime. In this paper, we approach it from a new perspective, a higher-order view and analysis of MoE routing policies: expert collaboration and specialization, where some experts tend to activate broadly with others (collaborative), while others are more likely to activate only with a specific subset of experts (specialized). Our experiments reveal that most experts tend to be overly collaborative, which increases communication overhead because tokens are repeatedly sent to different accelerators. To this end, we propose a novel collaboration-constrained routing (C2R) strategy that encourages more specialized expert groups and improves expert utilization, and we present an efficient MoE implementation that further leverages expert specialization. We achieve an average performance improvement of 0.51% and 0.33% on LLaMA-MoE and Qwen-MoE, respectively, across ten downstream NLP benchmarks, and reduce the all-to-all communication costs between GPUs, bringing an extra 20%-30% savings in total running time on top of the existing SoTA, i.e., MegaBlocks.
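To make the idea of collaboration profiling and constrained routing concrete, the following is a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not specify the exact procedure, so the sketch assumes that (a) collaboration is measured by counting how often two experts are activated for the same token, (b) each expert keeps its `T` most frequent collaborators, and (c) the top-1 expert is chosen freely while the remaining `k - 1` experts are restricted to that expert's collaborator list. All function names (`collaboration_counts`, `top_collaborators`, `constrained_route`) are hypothetical.

```python
from itertools import permutations


def collaboration_counts(topk_idx, num_experts):
    """Count how often each ordered pair of experts is activated for the same token."""
    counts = [[0] * num_experts for _ in range(num_experts)]
    for experts in topk_idx:
        for i, j in permutations(experts, 2):
            counts[i][j] += 1
    return counts


def top_collaborators(counts, T):
    """For each expert, return its T most frequently co-activated experts (self excluded)."""
    result = []
    for i, row in enumerate(counts):
        others = [j for j in range(len(row)) if j != i]
        # Stable sort: ties are broken in favor of the lower expert index.
        others.sort(key=lambda j: -row[j])
        result.append(others[:T])
    return result


def constrained_route(logits, collaborators, k):
    """Assumed routing rule: free top-1 choice, then the best k-1 experts
    taken only from the top-1 expert's collaborator list (requires T >= k-1)."""
    routes = []
    for scores in logits:
        first = max(range(len(scores)), key=lambda e: scores[e])
        allowed = sorted(collaborators[first], key=lambda e: -scores[e])
        routes.append([first] + allowed[: k - 1])
    return routes


# Toy profiling data: top-2 expert choices for four tokens over four experts.
profile = [[0, 1], [0, 1], [2, 3], [0, 2]]
counts = collaboration_counts(profile, num_experts=4)
collab = top_collaborators(counts, T=1)

# Unconstrained top-2 routing would pick experts 2 and 3 for this token;
# under the assumed constraint, the second expert must be a collaborator of expert 2.
routes = constrained_route([[0.1, 0.2, 0.9, 0.8]], collab, k=2)
```

Under these assumptions, a smaller `T` forces tokens into tighter expert groups, so co-locating each group on one device would keep more of the token traffic local. A larger `T` moves back toward unconstrained top-k routing.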
