FlashMoE: Fast Distributed MoE in a Single Kernel

Osayamen Jonathan Aimuyo, Byungsoo Oh, Rachee Singh

TL;DR

Distributed Mixture-of-Experts (MoE) models suffer from CPU-driven orchestration, kernel-launch overheads, and synchronous AlltoAll communication, all of which waste GPU cycles. FlashMoE addresses this with a single persistent GPU kernel that fuses the dispatch, compute, and combine phases. A symmetric tensor layout enables payload-efficient, device-initiated DMA across GPUs, and an in-kernel actor model schedules tile-level tasks. The approach introduces a unified task abstraction $t=(\mathcal{M}, \star, \phi)$ and demonstrates non-blocking inter-GPU transfers within a fully fused design, achieving up to $6\times$ lower latency, $9\times$ higher GPU utilization, and $5.7\times$ higher throughput on 8-GPU nodes while using FP32. This GPU-native co-design reduces idle time and kernel-launch overhead, enabling scalable performance for ultra-sparse MoE configurations and motivating future extensions toward training and broader GPU-based distributed ML pipelines.

Abstract

The computational sparsity of Mixture-of-Experts (MoE) models enables sub-linear growth in compute cost as model size increases, thus offering a scalable path to training massive neural networks. However, existing implementations suffer from low GPU utilization, significant latency overhead, and a fundamental inability to leverage task locality, primarily due to CPU-managed scheduling, host-initiated communication, and frequent kernel launches. To overcome these limitations, we develop FlashMoE, a fully GPU-resident MoE operator that fuses expert computation and inter-GPU communication into a single persistent GPU kernel. FlashMoE enables fine-grained pipelining of the dispatch, compute, and combine phases, eliminating launch overheads and reducing idle gaps. Unlike existing work, FlashMoE replaces bulk-synchronous collectives with one-sided, device-initiated, inter-GPU (R)DMA transfers, which improves payload efficiency by avoiding bloated or redundant network payloads in sparsely activated layers. When evaluated on an 8-H100 GPU node with MoE models comprising up to 128 experts and 16K token sequences, FlashMoE achieves up to 9x higher GPU utilization, 6x lower latency, 5.7x higher throughput, and 4x better overlap efficiency compared to state-of-the-art baselines, despite using FP32 while the baselines use FP16. FlashMoE shows that principled GPU kernel-hardware co-design is key to unlocking the performance ceiling of large-scale distributed ML. We provide code at https://github.com/osayamenja/FlashMoE.
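
To make the payload-efficiency claim concrete, the following is a minimal back-of-the-envelope sketch, not FlashMoE's implementation. It assumes, purely for illustration, that a collective-based baseline pads every expert buffer to a fixed capacity, while a payload-efficient transfer sends only the tokens actually routed. The function names, the `capacity` parameter, and the routing counts are all hypothetical.

```python
def padded_payload_bytes(num_experts, capacity, hidden_dim, bytes_per_elem):
    """Bytes sent when every expert buffer is padded to a fixed capacity.

    Assumption for illustration: the baseline sends `capacity` token slots
    per expert regardless of how many tokens were actually routed there.
    """
    return num_experts * capacity * hidden_dim * bytes_per_elem


def exact_payload_bytes(tokens_per_expert, hidden_dim, bytes_per_elem):
    """Bytes sent when only the routed tokens are transferred."""
    return sum(tokens_per_expert) * hidden_dim * bytes_per_elem


# Hypothetical sparse routing: most experts receive few or no tokens.
tokens_per_expert = [3, 0, 5, 0, 0, 1, 0, 7]
capacity = 8          # hypothetical per-expert slot count
hidden_dim = 1024     # hypothetical embedding width
bytes_per_elem = 4    # FP32

padded = padded_payload_bytes(len(tokens_per_expert), capacity,
                              hidden_dim, bytes_per_elem)
exact = exact_payload_bytes(tokens_per_expert, hidden_dim, bytes_per_elem)

print(f"padded: {padded} B, exact: {exact} B, ratio: {padded / exact:.1f}x")
```

Under these made-up numbers the padded transfer moves four times as many bytes as the exact one; the gap widens as routing becomes sparser relative to the padded capacity.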

Paper Structure

This paper contains 26 sections, 2 theorems, 11 equations, 15 figures, 4 tables, 4 algorithms.

Key Result

Theorem 3.1

The symmetric tensor layout $L$ is write-write conflict-free.
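
The property in Theorem 3.1 can be illustrated with a toy check. The sketch below is not the paper's definition of $L$, which this summary does not reproduce. It assumes a hypothetical receive buffer indexed by (source GPU, local expert, slot) and verifies by enumeration that no memory cell is written by two different source GPUs. A second, deliberately flawed layout that omits the source index shows what a write-write conflict looks like.

```python
from itertools import product


def partitioned_offset(src_gpu, local_expert, slot, num_local_experts, capacity):
    """Hypothetical layout: each source GPU owns a disjoint plane of cells."""
    return (src_gpu * num_local_experts + local_expert) * capacity + slot


def shared_offset(src_gpu, local_expert, slot, num_local_experts, capacity):
    """Flawed layout for contrast: the source index is ignored."""
    return local_expert * capacity + slot


def is_write_write_conflict_free(offset_fn, num_gpus, num_local_experts, capacity):
    """Return True if no cell can be written by two distinct source GPUs."""
    owner = {}  # cell offset -> the source GPU that writes it
    for src, expert, slot in product(range(num_gpus),
                                     range(num_local_experts),
                                     range(capacity)):
        off = offset_fn(src, expert, slot, num_local_experts, capacity)
        if owner.setdefault(off, src) != src:
            return False  # two different writers target the same cell
    return True


print(is_write_write_conflict_free(partitioned_offset, 8, 4, 16))
print(is_write_write_conflict_free(shared_offset, 8, 4, 16))
```

In the partitioned layout, one-sided writes from different peers can proceed without synchronization among writers because their target cells never overlap; the flawed layout would need locking or ordering to stay correct.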

Figures (15)

  • Figure 2: Transformer blocks (a) without MoE, (b) with MoE, and (c) with distributed MoE and expert parallelism. T, E, and O represent input tokens, experts, and output activations, respectively.
  • Figure 3: Comparing FlashMoE with state-of-the-art techniques that either do not overlap communication and computation (left, top) or do some overlap (left, middle). FlashMoE is a persistent kernel that fuses all computation and communication of the MoE operator (left, bottom). FlashMoE implements device-initiated computation (gate, expert FFN, scale) and communication tasks (right).
  • Figure 4: GPU utilization averaged across 100 MoE forward passes on 2 A100s with 300 GB/s unidirectional bandwidth. We observe up to 90% idle time due to kernel launch gaps and non-overlapping communication.
  • Figure 5: FlashMoE Fused Kernel
  • Figure 6: DMoE Functional Dependencies Expressed as a Chain of Actor Interactions. We denote $S_b$, $S_h$, and $P$ as the Subscriber, Scheduler and Processor actors, respectively. For any actor $a \in \{S_b,\>S_h,\>P\}$, $a^i$ identifies an actor on GPU $i$. We define $D^j_i$ as the operator, where GPU $j$ dispatches packets of tiles to GPU $i$. This diagram expresses task dependencies at the granularity of a tile, namely $GEMM_0$, $GEMM_1$, combine and communication produce an output tile. Notifications occur as signals propagated through shared memory (subscriber $\leftrightarrow$ scheduler) or global memory (scheduler $\leftrightarrow$ processor or inter-GPU communication). Note one-sided inter-GPU transfers (packet or single tile) are coupled with a signal to notify $S_b^j$ on the receiving GPU $j$ of the message's delivery.
  • ...and 10 more figures

Theorems & Definitions (5)

  • Theorem 3.1
  • Definition C.1
  • Definition C.2
  • Theorem C.1
  • proof