Accelerating MoE Model Inference with Expert Sharding

Oana Balmau, Anne-Marie Kermarrec, Rafael Pires, André Loureiro Espírito Santo, Martijn de Vos, Milos Vujasinovic

TL;DR

MoEShard tackles inefficient MoE encoder inference on multi-GPU systems caused by skewed token routing and inter-GPU communication. It introduces tensor sharding of MoE experts, distributing $W_i$ column-wise and $W_o$ row-wise across GPUs to achieve perfect load balancing without token dropping, and fuses computations to reduce kernel launches. The design includes a six-step forward pass, a shard-friendly workflow, and a MegaBlocks-based sparse matmul optimization, with experimental results showing up to $6.4\times$ TTFT speedups against DeepSpeed on 4-GPU setups. These results demonstrate the viability of expert tensor sharding as a practical strategy for efficient MoE inference and offer a blueprint for deploying encoder MoEs at scale.

Abstract

Mixture of experts (MoE) models achieve state-of-the-art results in language modeling but suffer from inefficient hardware utilization due to imbalanced token routing and communication overhead. While prior work has focused on optimizing MoE training and decoder architectures, inference for encoder-based MoE models in a multi-GPU setting with expert parallelism remains underexplored. We introduce MoEShard, an inference system that achieves perfect load balancing through tensor sharding of MoE experts. Unlike existing approaches that rely on heuristic capacity factors or drop tokens, MoEShard evenly distributes computation across GPUs and ensures full token retention, maximizing utilization regardless of routing skewness. We achieve this through a strategic row- and column-wise decomposition of expert matrices. This reduces idle time and avoids bottlenecks caused by imbalanced expert assignments. Furthermore, MoEShard minimizes kernel launches by fusing decomposed expert computations, significantly improving throughput. We evaluate MoEShard against DeepSpeed on encoder-based architectures, demonstrating speedups of up to $6.4\times$ in time to first token (TTFT). Our results show that tensor sharding, when properly applied to experts, is a viable and effective strategy for efficient MoE inference.
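The row- and column-wise decomposition described above can be illustrated with a minimal NumPy sketch. It is not the MoEShard implementation: the function names, the ReLU activation, and the single-process simulation of GPUs as list entries are all assumptions made for illustration. The sketch only shows why splitting $W_i$ by columns and $W_o$ by rows lets each GPU compute an equal share of every expert, with the partial outputs summed to recover the unsharded result.

```python
import numpy as np


def expert_forward(x, w_i, w_o):
    """Unsharded reference expert: activation(x @ W_i) @ W_o.

    ReLU is an assumption here; any elementwise activation works the same way.
    """
    return np.maximum(x @ w_i, 0.0) @ w_o


def sharded_expert_forward(x, w_i, w_o, num_gpus):
    """Simulate expert tensor sharding on a single process (hypothetical helper).

    W_i is split column-wise and W_o row-wise, so shard g holds a
    (d_model, d_ff / num_gpus) slice of W_i and the matching
    (d_ff / num_gpus, d_model) slice of W_o. The hidden size d_ff must be
    divisible by num_gpus for np.split to succeed.
    """
    w_i_shards = np.split(w_i, num_gpus, axis=1)  # column-wise split
    w_o_shards = np.split(w_o, num_gpus, axis=0)  # row-wise split

    # Each "GPU" sees every token routed to this expert and computes a
    # partial output of full shape (num_tokens, d_model).
    partials = [
        np.maximum(x @ wi, 0.0) @ wo
        for wi, wo in zip(w_i_shards, w_o_shards)
    ]

    # Summing the partial outputs (an all-reduce in a real deployment)
    # recovers the unsharded result.
    return np.sum(partials, axis=0)


rng = np.random.default_rng(0)
tokens = rng.standard_normal((5, 8))    # 5 tokens, d_model = 8
w_in = rng.standard_normal((8, 16))     # d_ff = 16
w_out = rng.standard_normal((16, 8))

reference = expert_forward(tokens, w_in, w_out)
sharded = sharded_expert_forward(tokens, w_in, w_out, num_gpus=4)
```

Because the activation is applied elementwise, each column slice of the hidden layer can be computed independently. Every shard does the same amount of work no matter how many tokens the router sends to a given expert, which is the load-balancing property the abstract refers to.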

Paper Structure

This paper contains 13 sections, 1 equation, 4 figures.

Figures (4)

  • Figure 1: ECDF of token distribution per expert for the first and last layer, for a Switch transformer.
  • Figure 2: Expert computations with and without MoEShard. An expert consists of matrices $W_i$ (in green) and $W_o$ (in red).
  • Figure 3: The average TTFT of MoEShard with respect to DeepSpeed for varying numbers of experts (left) and batch sizes (right).
  • Figure 4: The average TTFT of MoEShard with and without MegaBlocks enabled, for varying numbers of experts (left) and batch sizes (right).