MoEntwine: Unleashing the Potential of Wafer-scale Chips for Large-scale Expert Parallel Inference
Xinru Tang, Jingxiang Hou, Dingcheng Jiang, Taiquan Wei, Jiaxin Liu, Jinyi Deng, Huizheng Wang, Qize Yang, Haoran Shang, Chao Li, Yang Hu, Shouyi Yin
TL;DR
MoEntwine tackles the bottlenecks of large-scale mixture-of-experts (MoE) inference on wafer-scale chips (WSCs) by co-designing communication-aware layer mappings and a non-invasive load-balancing strategy. ER-Mapping reduces costly all-to-all traffic by entwining the attention and MoE mappings within compact Full Token Domains, while NI-Balancer hides expert migrations on idle network resources and exploits temporal locality for agile balancing. Together, these methods yield substantial gains: up to 62% lower all-to-all latency, up to 54% improvement in MoE computation, 22% improvement in MoE communication, and about 39% higher per-device MoE performance than NVL72. These results demonstrate the practical potential of WSCs for massive expert parallelism. The work advances wafer-scale MoE deployment by aligning architectural topology with workload dynamics, enabling scalable, low-latency inference for future LLMs.
Abstract
As large language models (LLMs) continue to scale up, mixture-of-experts (MoE) has become a common technique in state-of-the-art (SOTA) models. MoE models rely on expert parallelism (EP) to alleviate the memory bottleneck, which introduces all-to-all communication to dispatch and combine tokens across devices. However, in widely adopted GPU clusters, high-overhead cross-node communication makes all-to-all expensive, hindering the adoption of EP. Recently, wafer-scale chips (WSCs) have emerged as a platform that integrates numerous devices on a wafer-sized interposer. WSCs provide a unified high-performance network connecting all devices, making them a promising platform for hosting MoE models. Yet their network is restricted to a mesh topology, which causes imbalanced communication pressure and performance loss. Moreover, the lack of on-wafer disk leads to high-overhead expert migration on the critical path. To fully unleash this potential, we first propose Entwined Ring Mapping (ER-Mapping), which co-designs the mapping of attention and MoE layers to balance communication pressure and achieve better performance. We find that under ER-Mapping, the distributions of cold and hot links in the attention and MoE layers are complementary. Therefore, to hide the migration overhead, we propose the Non-invasive Balancer (NI-Balancer), which splits a complete expert migration into multiple steps and alternately utilizes the cold links of both layers. Evaluation shows that ER-Mapping reduces communication by up to 62%. NI-Balancer further delivers 54% and 22% improvements in MoE computation and communication, respectively. Compared with the SOTA NVL72 supernode, the WSC platform delivers 39% higher per-device MoE performance on average, owing to its scalability to larger EP.
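To make the claim about imbalanced communication pressure on a mesh concrete, the following minimal Python sketch counts how many flows of a uniform all-to-all cross each directed link of an n x n mesh. It is an illustration only, not the paper's model: the dimension-order (X-then-Y) routing policy, the unit-sized flows, and the function names `xy_route` and `all_to_all_link_load` are all assumptions introduced here.

```python
from collections import Counter


def xy_route(src, dst):
    """Return the directed links on the path src -> dst.

    Assumed routing policy (not taken from the paper): dimension-order
    routing that first travels along X, then along Y.
    """
    (x, y), (dx, dy) = src, dst
    links = []
    while x != dx:
        nx = x + (1 if dx > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dy:
        ny = y + (1 if dy > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def all_to_all_link_load(n):
    """Count the flows crossing each directed link of an n x n mesh
    when every device sends one unit-sized flow to every other device."""
    nodes = [(x, y) for x in range(n) for y in range(n)]
    load = Counter()
    for s in nodes:
        for d in nodes:
            if s != d:
                for link in xy_route(s, d):
                    load[link] += 1
    return load


if __name__ == "__main__":
    for n in (4, 8):
        load = all_to_all_link_load(n)
        print(n, "hottest link:", max(load.values()),
              "coldest link:", min(load.values()))
```

Under these assumptions, a 4 x 4 mesh already shows central links carrying 16 flows while edge links carry 12, and the gap widens as the mesh grows. This is the kind of hot-link and cold-link imbalance that a communication-aware mapping would aim to even out.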
