MoEntwine: Unleashing the Potential of Wafer-scale Chips for Large-scale Expert Parallel Inference

Xinru Tang, Jingxiang Hou, Dingcheng Jiang, Taiquan Wei, Jiaxin Liu, Jinyi Deng, Huizheng Wang, Qize Yang, Haoran Shang, Chao Li, Yang Hu, Shouyi Yin

TL;DR

MoEntwine tackles the bottlenecks of large-scale mixture-of-experts (MoE) inference on wafer-scale chips (WSCs) by co-designing a communication-aware layer mapping and a non-invasive load-balancing strategy. ER-Mapping reduces costly all-to-all traffic by entwining the attention and MoE mappings within compact Full Token Domains, while NI-Balancer hides expert migrations on idle network links and exploits temporal locality for agile balancing. The reported gains are up to 62% less communication with ER-Mapping, 54% and 22% improvements in MoE computation and communication with NI-Balancer, and an average 39% higher per-device MoE performance than NVL72. By matching the mapping to the mesh topology and to workload dynamics, the work shows that WSCs are a practical platform for large-scale expert parallelism.

Abstract

As large language models (LLMs) continue to scale up, mixture-of-experts (MoE) has become a common technique in SOTA models. MoE models rely on expert parallelism (EP) to alleviate the memory bottleneck, which introduces all-to-all communication to dispatch and combine tokens across devices. However, in widely adopted GPU clusters, high-overhead cross-node communication makes all-to-all expensive, hindering the adoption of EP. Recently, wafer-scale chips (WSCs) have emerged as a platform that integrates numerous devices on a wafer-sized interposer. WSCs provide a unified high-performance network connecting all devices, which makes them promising hosts for MoE models. Yet their network is restricted to a mesh topology, causing imbalanced communication pressure and performance loss. Moreover, the lack of on-wafer disk leads to high-overhead expert migration on the critical path. To fully unleash this potential, we first propose Entwined Ring Mapping (ER-Mapping), which co-designs the mapping of attention and MoE layers to balance communication pressure and achieve better performance. We find that under ER-Mapping, the distributions of cold and hot links in the attention and MoE layers are complementary. Therefore, to hide the migration overhead, we propose the Non-invasive Balancer (NI-Balancer), which splits a complete expert migration into multiple steps and alternately utilizes the cold links of both layers. Evaluation shows that ER-Mapping reduces communication by up to 62%. NI-Balancer further delivers 54% and 22% improvements in MoE computation and communication, respectively. Compared with the SOTA NVL72 supernode, the WSC platform delivers an average 39% higher per-device MoE performance owing to its scalability to larger EP.
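The claim that a mesh topology leads to imbalanced communication pressure can be illustrated with a small counting exercise. The sketch below is not the paper's mapping or evaluation setup. It assumes a hypothetical 4x4 mesh, uniform all-to-all traffic (every device sends one flow to every other device), and X-then-Y dimension-order routing; the function names `xy_route` and `all_to_all_link_load` are made up for this illustration.

```python
from collections import Counter


def xy_route(src, dst):
    """Return the directed links used by an X-then-Y route (assumed routing policy)."""
    x, y = src
    dst_x, dst_y = dst
    links = []
    # Travel along the X dimension first.
    while x != dst_x:
        next_x = x + (1 if dst_x > x else -1)
        links.append(((x, y), (next_x, y)))
        x = next_x
    # Then travel along the Y dimension.
    while y != dst_y:
        next_y = y + (1 if dst_y > y else -1)
        links.append(((x, y), (x, next_y)))
        y = next_y
    return links


def all_to_all_link_load(n):
    """Count flows per directed link for uniform all-to-all on an n x n mesh."""
    nodes = [(x, y) for x in range(n) for y in range(n)]
    load = Counter()
    for src in nodes:
        for dst in nodes:
            if src != dst:
                load.update(xy_route(src, dst))
    return load


load = all_to_all_link_load(4)
hottest = max(load.values())
coldest = min(load.values())
print(f"directed links: {len(load)}, hottest: {hottest}, coldest: {coldest}")
```

Under these assumptions, links near the center of the 4x4 mesh carry 16 flows while links at the edge carry 12, and the gap widens as the mesh grows. This only shows why hot and cold links arise on a mesh; how ER-Mapping redistributes that pressure is described in the paper itself.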

Paper Structure

This paper contains 35 sections, 2 equations, 17 figures, 1 table, 1 algorithm.

Figures (17)

  • Figure 1: (a) MoE latency breakdown of DeepSeek-V3 with $EP$ equal to the device count; total latency equals the maximum of computation and communication time. (b) System architecture of DGX, NVL72, and WSC.
  • Figure 2: (a) Structure of an LLM with sparse MoE layers that activate 2 of 4 experts for each token. (b) Illustration of a parallelism strategy with DP=2, TP=2 for the attention layer and EP=4 for the MoE layer.
  • Figure 3: (a) Top view and (b) cross-section view of a wafer-scale chip.
  • Figure 4: The EP that each cluster can achieve and the corresponding per-device MoE performance.
  • Figure 5: (a) All-reduce of attention layer with DP=4, TP=4. (b) All-to-all of MoE layer with EP=16. Each face denotes an expert.
  • ...and 12 more figures