ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference

Xin He, Shunkang Zhang, Yuxin Wang, Haiyan Yin, Zihao Zeng, Shaohuai Shi, Zhenheng Tang, Xiaowen Chu, Ivor Tsang, Ong Yew Soon

TL;DR

This work introduces ExpertFlow, a system for efficient MoE inference that accommodates flexible routing and schedules experts between CPU and GPU. It also implements a dynamic token scheduling strategy that rearranges input tokens across batches.

Abstract

Sparse Mixture of Experts (MoE) models, while outperforming dense Large Language Models (LLMs), face significant deployment challenges during inference due to their high memory demands. Existing offloading techniques, which swap activated and idle experts between the GPU and CPU, often suffer from rigid expert caching mechanisms. These mechanisms either fail to adapt to dynamic routing, leading to inefficient cache utilization, or incur prohibitive costs for prediction training. To tackle these inference-specific challenges, we introduce ExpertFlow, a system designed to enhance inference efficiency by accommodating flexible routing and enabling efficient expert scheduling between CPU and GPU, which reduces overhead and boosts system performance. Central to our approach is a predictive routing path-based offloading mechanism that uses a lightweight predictor to forecast routing paths before computation begins. This proactive strategy allows for real-time error correction in expert caching, significantly increasing cache hit ratios and reducing the frequency of expert transfers, thereby minimizing I/O overhead. Additionally, we implement a dynamic token scheduling strategy that optimizes MoE inference by rearranging input tokens across different batches. This method both reduces the number of activated experts per batch and improves computational efficiency. Our extensive experiments demonstrate that ExpertFlow achieves up to 93.72% GPU memory savings and enhances inference speed by 2 to 10 times compared to baseline methods, highlighting its utility as a robust solution for resource-constrained inference scenarios.
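
The predict-then-correct idea described in the abstract can be illustrated with a minimal sketch. This is not ExpertFlow's implementation: the helper name `plan_corrections`, the `capacity` parameter, and the expert IDs are hypothetical, and the predictor's output is simply supplied as a list rather than computed by a model.

```python
from typing import List, Sequence, Tuple


def plan_corrections(
    predicted: Sequence[int], actual: Sequence[int], capacity: int
) -> Tuple[float, List[int], List[int]]:
    """Compare prefetched experts with the experts the router actually selects.

    Returns (cache hit ratio, missed experts to fetch, unwanted experts to evict).
    Hypothetical helper, for illustration only.
    """
    # Prefetch predicted experts in order, deduplicated, up to the cache capacity.
    cache = list(dict.fromkeys(predicted))[:capacity]
    # Experts the router really needs, deduplicated, in first-use order.
    needed = list(dict.fromkeys(actual))
    resident, required = set(cache), set(needed)
    # Mispredictions come in two kinds: experts we need but did not load,
    # and experts we loaded but do not need.
    missed = [e for e in needed if e not in resident]
    unwanted = [e for e in cache if e not in required]
    hit_ratio = (len(needed) - len(missed)) / len(needed) if needed else 1.0
    return hit_ratio, missed, unwanted


# Assumed toy prediction: experts 22 and 23 are prefetched, but 22 and 24 are used.
hit_ratio, missed, unwanted = plan_corrections(
    predicted=[22, 23], actual=[22, 24], capacity=2
)
print(hit_ratio, missed, unwanted)
```

In this toy case one of the two required experts is already resident, so only the missed expert needs a transfer and the unwanted expert is the natural eviction candidate.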

Paper Structure

This paper contains 31 sections, 7 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Overview of ExpertFlow system.
  • Figure 2: Illustration of Vanilla MoE and our ExpertFlow. Our predictor assesses the status of all experts across layers in a single pass before MoE computations begin, enabling proactive expert scheduling. $B$, $S$, $L$, and $E$ indicate the batch size, the sequence length, the number of MoE layers, and the number of experts per layer, respectively.
  • Figure 3: The workflow of Expert Cache Engine (ECE). ECE pre-schedules the experts based on the routing path predictions before MoE computation. During MoE computation, ECE detects incorrect predictions, such as with unwanted expert $e_{23}$ and missed expert $e_{24}$, and quickly prioritizes them in the high-priority queue for swapping during the execution of $e_{22}$, ensuring efficient operation.
  • Figure 4: An example of the Token Scheduler applied to a single MoE layer with 4 experts $e_i$, showing two batches, each with 4 tokens $t_i$. Left: The cache mechanism becomes inefficient. Right: The tokens with similar expert selection are grouped into the same batch to reduce the number of active experts and increase the token load per expert, enhancing cache hit ratio and computational efficiency.
  • Figure 5: Comparison between sequential offload pipeline and our Multi-Stream Overlapping Pipeline.
  • ...and 4 more figures
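
The token regrouping described in the Figure 4 caption (one MoE layer, 4 experts, two batches of 4 tokens) can be sketched as follows. This is only a greedy illustration under stated assumptions, not the paper's scheduling algorithm: the names `active_experts` and `regroup` are hypothetical, each token is assumed to select a single expert, and the token-to-expert assignments are invented for the example.

```python
from typing import List, Sequence, Set, Tuple

# (token id, predicted expert ids for one MoE layer); assumed representation.
Token = Tuple[str, Set[int]]


def active_experts(batch: Sequence[Token]) -> Set[int]:
    """Union of experts that must be resident to process this batch."""
    out: Set[int] = set()
    for _, experts in batch:
        out |= experts
    return out


def regroup(batches: List[List[Token]]) -> List[List[Token]]:
    """Greedy regrouping: sort all tokens by their predicted experts,
    then cut the sorted pool back into batches of the original sizes."""
    sizes = [len(b) for b in batches]
    pool = sorted(
        (tok for batch in batches for tok in batch),
        key=lambda tok: (sorted(tok[1]), tok[0]),
    )
    out, start = [], 0
    for n in sizes:
        out.append(pool[start:start + n])
        start += n
    return out


# Assumed routing: every batch initially touches all 4 experts.
batches = [
    [("t1", {1}), ("t2", {2}), ("t3", {3}), ("t4", {4})],
    [("t5", {1}), ("t6", {2}), ("t7", {3}), ("t8", {4})],
]
before = [len(active_experts(b)) for b in batches]
after = [len(active_experts(b)) for b in regroup(batches)]
print(before, after)
```

With these assumed assignments, each regrouped batch activates two experts instead of four, and each active expert receives two tokens instead of one, which is the effect the caption attributes to grouping tokens with similar expert selections.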