Multi-Head LatentMoE and Head Parallel: Communication-Efficient and Deterministic MoE Parallelism
Chenwei Cui, Rockwell Jackson, Benjamin Joseph Herrera, Ana María Tárano, Hannah Kerner
TL;DR
This work tackles the high cost of training ultra-sparse Mixture-of-Experts models by replacing traditional Expert Parallel with a Head-Parallel, Multi-Head LatentMoE design. It decouples routing from all-to-all traffic and introduces IO-aware routing and IO-aware expert computation to achieve $O(1)$ communication with respect to the number of activated experts $k$, balanced load, and deterministic inter-GPU patterns. Empirical results show up to $1.61\times$ faster training at the 4B-parameter scale (and $1.11\times$ at 2B) with comparable or better accuracy, and a substantial reduction in inter-GPU communication volume for small $k$ values. The approach makes multi-billion-parameter foundation-model research more accessible by improving both efficiency and hardware practicality, especially in ultra-sparse regimes.
Abstract
Large language models have transformed many applications but remain expensive to train. Sparse Mixture of Experts (MoE) addresses this through conditional computation, with Expert Parallel (EP) as the standard distributed training method. However, EP has three limitations: communication cost grows linearly with the number of activated experts $k$, load imbalance affects latency and memory usage, and data-dependent communication requires metadata exchange. We propose Multi-Head LatentMoE and Head Parallel (HP), a new architecture and parallelism achieving $O(1)$ communication cost regardless of $k$, completely balanced traffic, and deterministic communication, all while remaining compatible with EP. To accelerate Multi-Head LatentMoE, we propose IO-aware routing and expert computation. Compared to MoE with EP, Multi-Head LatentMoE with HP trains up to $1.61\times$ faster while having identical performance. With doubled granularity, it achieves higher overall performance while still being $1.11\times$ faster. Our method makes multi-billion-parameter foundation model research more accessible.
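To make the scaling claim concrete, the following is a minimal back-of-the-envelope sketch, not the authors' cost model. It assumes that under EP each token's hidden vector of width `d_model` is sent to each of its $k$ activated experts and returned (worst case, with all experts on remote GPUs), and that under HP each token incurs one fixed-size exchange of total width `d_model` in each direction, whatever $k$ is. The function names, the factor of 2 for dispatch plus combine, and the 2-byte element size are all illustrative assumptions.

```python
def ep_bytes_per_token(k: int, d_model: int, bytes_per_elem: int = 2) -> int:
    """Assumed Expert Parallel traffic for one token.

    The token is dispatched to k experts and the k outputs are sent back,
    so the volume grows linearly with k (worst case: every expert is remote).
    """
    return 2 * k * d_model * bytes_per_elem


def hp_bytes_per_token(d_model: int, bytes_per_elem: int = 2) -> int:
    """Assumed Head Parallel traffic for one token.

    Illustrative assumption: one fixed-size exchange out and one back,
    independent of how many experts are activated.
    """
    return 2 * d_model * bytes_per_elem


if __name__ == "__main__":
    d_model = 1024  # hypothetical hidden size
    for k in (1, 2, 4, 8):
        ep = ep_bytes_per_token(k, d_model)
        hp = hp_bytes_per_token(d_model)
        print(f"k={k}: EP={ep} B, HP={hp} B, ratio={ep / hp:.1f}")
```

Under these assumptions the EP-to-HP traffic ratio equals $k$, which is one way to read the contrast between linear and $O(1)$ communication cost. Real volumes depend on expert placement, precision, and any metadata exchange, none of which this sketch models.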
