FlashMoE: Fast Distributed MoE in a Single Kernel
Osayamen Jonathan Aimuyo, Byungsoo Oh, Rachee Singh
TL;DR
Distributed Mixture-of-Experts (MoE) models suffer from CPU-driven orchestration, kernel-launch overheads, and synchronous AlltoAll communication, all of which waste GPU cycles. FlashMoE addresses this with a single persistent GPU kernel that fuses the dispatch, compute, and combine phases. A symmetric tensor layout enables payload-efficient, device-initiated DMA across GPUs, and an in-kernel actor model schedules tile-level tasks. The approach introduces a unified task abstraction $t=(\mathcal{M}, \star, \phi)$ and demonstrates non-blocking inter-GPU transfers within a fully fused design, achieving up to $6\times$ lower latency, $9\times$ higher GPU utilization, and $5.7\times$ higher throughput on an 8-GPU H100 node while running in FP32 (the baselines use FP16). This GPU-native co-design reduces idle time and kernel-launch overhead, enabling scalable performance for ultra-sparse MoE configurations and motivating future extensions toward training and broader GPU-based distributed ML pipelines.
Abstract
The computational sparsity of Mixture-of-Experts (MoE) models enables sub-linear growth in compute cost as model size increases, offering a scalable path to training massive neural networks. However, existing implementations suffer from low GPU utilization, significant latency overhead, and a fundamental inability to leverage task locality, primarily due to CPU-managed scheduling, host-initiated communication, and frequent kernel launches. To overcome these limitations, we develop FlashMoE, a fully GPU-resident MoE operator that fuses expert computation and inter-GPU communication into a single persistent GPU kernel. FlashMoE enables fine-grained pipelining of the dispatch, compute, and combine phases, eliminating launch overheads and reducing idle gaps. Unlike existing work, FlashMoE replaces bulk-synchronous collectives with one-sided, device-initiated, inter-GPU (R)DMA transfers, which improves payload efficiency by removing the bloated or redundant network payloads that arise in sparsely activated layers. When evaluated on a single node with 8 H100 GPUs, using MoE models with up to 128 experts and 16K-token sequences, FlashMoE achieves up to $9\times$ higher GPU utilization, $6\times$ lower latency, $5.7\times$ higher throughput, and $4\times$ better overlap efficiency than state-of-the-art baselines, even though FlashMoE runs in FP32 while the baselines use FP16. FlashMoE shows that principled GPU kernel-hardware co-design is key to unlocking the performance ceiling of large-scale distributed ML. We provide code at https://github.com/osayamenja/FlashMoE.
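To make the payload-efficiency claim concrete, the sketch below compares the bytes a device would send during dispatch under a capacity-padded collective with the bytes needed when only routed tokens are transferred. It is an illustrative back-of-the-envelope model, not FlashMoE's actual layout or transfer logic: the helper names `route_top1` and `payload_bytes`, the uniform random top-1 routing, the hidden size of 2048, and the choice of capacity are all assumptions made for the example. Only the 128-expert and 16K-token figures come from the abstract.

```python
import random


def route_top1(num_tokens, num_experts, seed=0):
    """Hypothetical router: assign each token to one expert uniformly at random.

    Returns a list with the number of tokens routed to each expert.
    """
    rng = random.Random(seed)
    counts = [0] * num_experts
    for _ in range(num_tokens):
        counts[rng.randrange(num_experts)] += 1
    return counts


def payload_bytes(tokens_per_expert, capacity, hidden_dim, bytes_per_elem=4):
    """Return (padded, exact) dispatch payload sizes in bytes.

    padded: every expert buffer is sent at full capacity, filled or not,
            as a fixed-size bulk collective would require.
    exact:  only the tokens actually routed to each expert are sent
            (capped at capacity), as one-sided transfers permit.
    """
    elem_bytes = hidden_dim * bytes_per_elem
    padded = len(tokens_per_expert) * capacity * elem_bytes
    exact = sum(min(n, capacity) for n in tokens_per_expert) * elem_bytes
    return padded, exact


if __name__ == "__main__":
    # 16K tokens and 128 experts follow the abstract; the rest is assumed.
    counts = route_top1(num_tokens=16 * 1024, num_experts=128)
    capacity = max(counts)  # assumed: capacity sized for the busiest expert
    padded, exact = payload_bytes(counts, capacity, hidden_dim=2048)
    print(f"padded payload: {padded / 2**20:.1f} MiB")
    print(f"exact payload:  {exact / 2**20:.1f} MiB")
    print(f"padding overhead: {padded / exact:.2f}x")
```

Under this model the gap between the two figures widens as routing becomes more skewed or as the expert count grows relative to the tokens per device, which is the sparsely activated regime the abstract refers to.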
