MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing

Seokjin Go, Divya Mahajan

TL;DR

MoETuner addresses the performance bottlenecks of large Mixture-of-Experts models by balancing per-GPU token processing and minimizing inter-GPU communication through an ILP-based expert placement framework. It leverages cross-layer routing dependencies to cluster and map experts to GPUs via two ILPs (ILP1 for intra-layer clustering and ILP2 for cluster-to-GPU placement), using a profiling phase on a subset of data to estimate routing statistics. On Mixtral-8x7B, it yields end-to-end speedups of $9.3\%$ (single-node) and $17.5\%$ (multi-node) and reduces token-processing tail latency significantly, validating improvements in both computation and communication. This approach provides a scalable path to more efficient MoE inference on large interconnects and multi-node clusters, with potential applicability to future MoE models beyond Mixtral.

Abstract

The Mixture-of-Experts (MoE) model architecture has emerged as a promising solution for scaling transformer models efficiently, offering sparse activation that reduces computational costs while increasing model capacity. However, as MoE models scale, they must be distributed across GPU devices and thus face critical performance bottlenecks due to their large memory footprint. Expert parallelism distributes experts across GPUs but faces key challenges, including unbalanced token routing and expert activation, which result in communication tail latency and processing inefficiencies. While existing solutions address some of these issues, they fail to resolve the dual challenges of load imbalance and communication skew. The imbalance in token processing load across experts causes uneven processing times on different GPUs, while communication skew between GPUs leads to unbalanced inter-GPU data transfers. These factors degrade the performance of MoE models by increasing tail latency and reducing overall throughput. To address these limitations, we propose an Integer Linear Programming (ILP) formulation to optimize expert placement by jointly considering token load, communication, and computation costs. We exploit the property that there is a token routing dependency across layers: tokens routed to a specific expert in one layer are likely to be routed to a limited set of experts in the subsequent layer. Our solution, MoETuner, offers an optimal expert-to-GPU assignment that minimizes inter-GPU token routing costs and balances token processing across devices, thereby reducing tail latency and end-to-end execution time. Experimental results demonstrate end-to-end speedups of 9.3% and 17.5% for single-node and multi-node inference, respectively, showcasing the potential of our ILP-based optimization for offering expert-parallel solutions for next-generation MoEs.
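
To make the placement objective in the abstract concrete, the following is a minimal sketch, not the paper's ILP. It assumes a toy setting of two layers, four experts per layer, and two GPUs, and it replaces the ILP solver with brute-force enumeration, which is only feasible at this tiny scale. The routing matrix `T`, the helper names, and the lexicographic objective (cross-GPU tokens first, then the heavier GPU's load) are all illustrative assumptions.

```python
from itertools import combinations

# Hypothetical profiled routing counts (made-up numbers):
# T[i][j] = tokens that visited expert i in layer 0 and then expert j in layer 1.
T = [
    [90, 5, 3, 2],
    [4, 80, 10, 6],
    [2, 8, 85, 5],
    [6, 4, 10, 80],
]
N = len(T)  # experts per layer


def balanced_splits(n):
    """Yield every placement that puts n // 2 experts on GPU 0 and the rest on GPU 1."""
    for gpu0 in combinations(range(n), n // 2):
        yield [0 if e in gpu0 else 1 for e in range(n)]


def cross_gpu_tokens(place0, place1):
    """Count tokens whose layer-0 expert and layer-1 expert sit on different GPUs."""
    return sum(
        T[i][j]
        for i in range(N)
        for j in range(N)
        if place0[i] != place1[j]
    )


def gpu_loads(place1):
    """Tokens processed per GPU in layer 1 under the given placement."""
    loads = [0, 0]
    for j in range(N):
        loads[place1[j]] += sum(T[i][j] for i in range(N))
    return loads


def best_placement():
    """Exhaustively search placements; minimize cross-GPU tokens, then max GPU load."""
    best = None
    for p0 in balanced_splits(N):
        for p1 in balanced_splits(N):
            cost = (cross_gpu_tokens(p0, p1), max(gpu_loads(p1)))
            if best is None or cost < best[0]:
                best = (cost, p0, p1)
    return best


if __name__ == "__main__":
    (cross, max_load), place0, place1 = best_placement()
    print("cross-GPU tokens:", cross, "max GPU load:", max_load)
    print("layer 0 placement:", place0)
    print("layer 1 placement:", place1)
```

Because the toy matrix is dominated by its diagonal, the search co-locates experts that tend to follow each other across layers, which is the cross-layer dependency the abstract describes. An ILP formulation expresses the same trade-off with binary assignment variables and scales to realistic expert counts, where enumeration does not.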

Paper Structure

This paper contains 21 sections, 9 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: Token routing statistics for Mixtral-8x7B. Each colored circle represents an expert in a layer, and the lines connecting them illustrate the number of tokens routed between pairs of experts. Thicker lines indicate a higher volume of routed tokens, highlighting key routing dependencies.
  • Figure 2: An example of Mixture-of-Experts (MoE) model execution. (a) Single-GPU execution with all experts local to the GPU. (b) Expert-parallel execution, where experts are distributed across GPUs, requiring inter-GPU communication through all-to-all operations.
  • Figure 3: Time distribution of representative operations during the forward pass of Mixtral-8x7B. The inference time is primarily dominated by all-to-all communication between GPU pairs, particularly in multi-node environments.
  • Figure 4: Expert activation frequency of Mixtral-8x7B, highlighting significant load imbalance across layers. Darker regions indicate a higher skew. For example, in layer 14, experts 0 and 1 process 64% of the total tokens.
  • Figure 5: Number of tokens dispatched across different GPU pairs. Certain GPU pairs experience substantially higher communication volumes compared to others.
  • ...and 9 more figures
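
The statistics behind Figures 1 and 4 (tokens routed between expert pairs and per-expert activation frequency) can be sketched from a routing trace as follows. This is a hedged illustration only: the `traces` data is made up, the function names are hypothetical, and each token is assumed to pick a single expert per layer for simplicity, whereas Mixtral-8x7B routes each token to two experts.

```python
from collections import Counter

# Hypothetical trace: traces[t][l] is the expert chosen for token t at layer l.
traces = [
    [0, 1, 1],
    [0, 1, 0],
    [2, 3, 1],
    [0, 1, 1],
    [1, 0, 0],
]


def activation_share(traces, layer):
    """Fraction of tokens handled by each expert in one layer (cf. Figure 4)."""
    counts = Counter(token[layer] for token in traces)
    total = sum(counts.values())
    return {expert: count / total for expert, count in counts.items()}


def transition_counts(traces, layer):
    """Tokens routed from each expert in `layer` to each expert in `layer + 1` (cf. Figure 1)."""
    return Counter((token[layer], token[layer + 1]) for token in traces)


if __name__ == "__main__":
    print(activation_share(traces, 0))
    print(transition_counts(traces, 0).most_common(3))
```

A skewed `activation_share` points to load imbalance within a layer, while a few dominant entries in `transition_counts` indicate the cross-layer routing dependency that a placement optimizer can exploit.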