Multi-Head LatentMoE and Head Parallel: Communication-Efficient and Deterministic MoE Parallelism

Chenwei Cui, Rockwell Jackson, Benjamin Joseph Herrera, Ana María Tárano, Hannah Kerner

TL;DR

This work tackles the high cost of training ultra-sparse Mixture-of-Experts models by replacing traditional Expert Parallel with a Head-Parallel, Multi-Head LatentMoE design. It decouples routing from all-to-all traffic and introduces IO-aware routing and IO-aware expert computation to achieve $O(1)$ communication with respect to the number of activated experts $k$, balanced load, and deterministic inter-GPU patterns. Empirical results show up to $1.61\times$ faster training at the 4B-parameter scale (and $1.11\times$ at 2B) with comparable or better accuracy, and a substantial reduction in inter-GPU communication volume for small $k$ values. The approach makes multi-billion-parameter foundation-model research more accessible by improving both efficiency and hardware practicality, especially in ultra-sparse regimes.

Abstract

Large language models have transformed many applications but remain expensive to train. Sparse Mixture of Experts (MoE) addresses this through conditional computation, with Expert Parallel (EP) as the standard distributed training method. However, EP has three limitations: communication cost grows linearly with the number of activated experts $k$, load imbalance affects latency and memory usage, and data-dependent communication requires metadata exchange. We propose Multi-Head LatentMoE and Head Parallel (HP), a new architecture and parallelism achieving $O(1)$ communication cost regardless of $k$, completely balanced traffic, and deterministic communication, all while remaining compatible with EP. To accelerate Multi-Head LatentMoE, we propose IO-aware routing and expert computation. Compared to MoE with EP, Multi-Head LatentMoE with HP trains up to $1.61\times$ faster while having identical performance. With doubled granularity, it achieves higher overall performance while still being $1.11\times$ faster. Our method makes multi-billion-parameter foundation model research more accessible.
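The scaling claim in the abstract can be illustrated with a back-of-envelope count of all-to-all elements. This is a minimal sketch, not the paper's cost model: it assumes that EP sends one full hidden vector per activated expert (worst case, every copy goes to a remote GPU), that HP sends each token's sub-tokens once regardless of $k$, and that the model width equals `num_heads * d_head`. The function names and the example sizes are hypothetical.

```python
def ep_all_to_all_elements(num_tokens, d_model, k):
    """Assumed EP upper bound: each token is dispatched to k experts and
    combined back, so traffic is linear in k (factor 2 = dispatch + combine)."""
    return 2 * num_tokens * k * d_model


def hp_all_to_all_elements(num_tokens, num_heads, d_head):
    """Assumed HP volume: each token's num_heads sub-tokens are exchanged once
    and returned once; routing to k experts happens after the exchange."""
    return 2 * num_tokens * num_heads * d_head


# Hypothetical sizes, chosen only for illustration.
num_tokens, num_heads, d_head = 2048, 8, 128
d_model = num_heads * d_head  # assumption: heads partition the hidden width

for k in (1, 2, 4, 8):
    ep = ep_all_to_all_elements(num_tokens, d_model, k)
    hp = hp_all_to_all_elements(num_tokens, num_heads, d_head)
    print(f"k={k}: EP={ep:,} elements, HP={hp:,} elements, ratio={ep / hp:.1f}")
```

Under these assumptions the EP-to-HP ratio equals $k$, which is the sense in which the HP cost is $O(1)$ in $k$ while the EP cost is linear.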

Paper Structure (23 sections, 9 equations, 6 figures, 5 tables, 2 algorithms)


Figures (6)

  • Figure 1: Comparison of feedforward architectures. (a) Standard MLP applies a single feedforward network to each token. (b) Standard MoE uses a router to dynamically select experts from a single set. (c) LatentMoE performs routing first, then applies linear down-projection before expert computation and linear up-projection afterward. (d) Multi-Head LatentMoE projects each token into multiple sub-tokens, each processed by an independent MoE module with its own separately-trained router and expert set. Orange blocks denote activated experts; gray blocks denote inactive experts. Black lines indicate data flow.
  • Figure 2: Token clustering for expert computation expressed as block-sparse attention. $Q$ represents input tokens and $K^\top$ represents expert weights. Colors encode expert assignments. Clustering reorders $Q$ so that tokens assigned to the same expert become contiguous, yielding a block-diagonal sparsity pattern.
  • Figure 3: Comparison of Expert Parallel (EP) and Head Parallel (HP) under varying load imbalance. Left: all-to-all latency, including time spent waiting for the slowest GPU. Right: peak VRAM usage across all GPUs. We simulate load imbalance on 4 GPUs using a Zipf distribution with varying skew: skew=0.0 corresponds to uniform distribution (25% per GPU), skew=1.0 assigns 80.8% of tokens to GPU 0, and skew=2.0 assigns 99.8% to GPU 0. EP latency and memory grow with both $k$ and skew, while HP remains constant for any $k$.
  • Figure 4: Comparison of naive routing with torch.matmul and our IO-aware routing. Left: latency for forward and backward passes. Right: memory footprint for forward and backward combined. Our IO-aware routing maintains constant memory footprint regardless of the number of experts, and its backward pass remains nearly constant due to sparse gradient computation. Experiments use $B=40$, $T=2048$, $N_h=8$, $d_h=128$.
  • Figure 5: Comparison of naive expert computation (using grouped GEMM) and our IO-aware expert computation under Multi-Head LatentMoE, where multiple heads increase the number of sub-tokens. Left: forward pass latency. Right: backward pass latency (log scale). $d_e$ denotes the number of hidden neurons per expert. Naive grouped GEMM scales poorly with the number of experts, while our method remains efficient across all configurations. Experiments use $B=4$, $T=512$, $N_h=8$, $d_h=128$, $k=4$.
  • ...and 1 more figure
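The token clustering described in the Figure 2 caption can be sketched in a few lines. This is an illustrative pure-Python version, not the paper's kernel: it uses a stable sort on expert ids to make each expert's tokens contiguous, records the segment boundaries that give the block-diagonal pattern, and builds the inverse permutation needed to restore the original token order afterward. The helper names and the example assignment are hypothetical.

```python
from itertools import accumulate


def cluster_by_expert(expert_ids, num_experts):
    """Return (order, offsets): `order` lists token indices grouped by expert,
    and tokens of expert e occupy positions offsets[e]:offsets[e + 1]."""
    # sorted() is stable, so tokens keep their relative order within an expert.
    order = sorted(range(len(expert_ids)), key=lambda i: expert_ids[i])
    counts = [0] * num_experts
    for e in expert_ids:
        counts[e] += 1
    offsets = [0] + list(accumulate(counts))
    return order, offsets


def invert_permutation(order):
    """Map each original token index to its position in the clustered layout."""
    inverse = [0] * len(order)
    for new_pos, old_pos in enumerate(order):
        inverse[old_pos] = new_pos
    return inverse


# Hypothetical assignment of 7 tokens to 3 experts (one expert per token).
expert_ids = [2, 0, 1, 0, 2, 1, 0]
order, offsets = cluster_by_expert(expert_ids, num_experts=3)
clustered_ids = [expert_ids[i] for i in order]
inverse = invert_permutation(order)

print(order)          # token indices, grouped by expert
print(offsets)        # segment boundaries per expert
print(clustered_ids)  # expert ids are now contiguous runs
```

Each segment `offsets[e]:offsets[e + 1]` can then be multiplied by expert `e`'s weights as one dense block, and `inverse` scatters the results back to the original token positions.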