Klotski: Efficient Mixture-of-Expert Inference via Expert-Aware Multi-Batch Pipeline

Zhiyuan Fang, Yuegui Huang, Zicong Hong, Yufeng Lyu, Wuhui Chen, Yue Yu, Fan Yu, Zibin Zheng

TL;DR

Klotski addresses memory bottlenecks in MoE inference by introducing an expert-aware multi-batch pipeline that orchestrates computations around hot and gate-activated experts across multiple batches. It couples a constraint-sensitive I/O-compute planner, adaptive tensor placement across heterogeneous memory, and a correlation-aware expert prefetcher to achieve near-zero pipeline bubbles. The approach enables high-throughput MoE inference in resource-constrained environments, including single-GPU deployments of very large models, and is evaluated on Mixtral MoE models; the reported throughput improvement over state-of-the-art techniques is up to 85.12x. By balancing computation, I/O, and memory across GPU, CPU, and disk, the work makes MoE deployment more practical and informs future memory- and I/O-aware designs for sparse models.

Abstract

Mixture of Experts (MoE), with its distinctive sparse structure, enables the scaling of language models up to trillions of parameters without significantly increasing computational costs. However, the substantial parameter size presents a challenge for inference, as the expansion in GPU memory cannot keep pace with the growth in parameters. Although offloading techniques utilise memory from the CPU and disk and parallelise I/O and computation for efficiency, the computation time of each expert in MoE models is often shorter than its I/O time, resulting in numerous bubbles in the pipeline. Therefore, we propose Klotski, an efficient MoE inference engine that significantly reduces pipeline bubbles through a novel expert-aware multi-batch pipeline paradigm. The proposed paradigm uses batch processing to extend the computation time of the current layer so that it overlaps with the loading time of the next layer. Although this idea has been effectively applied to dense models, more batches may activate more experts in an MoE model, leading to longer loading times and more bubbles. Thus, unlike traditional approaches, we balance computation and I/O time and minimise bubbles by orchestrating the experts' inference order based on their heterogeneous computation and I/O requirements and their activation patterns under different batch numbers. Moreover, to adapt to different hardware environments and models, we design a constraint-sensitive I/O-compute planner and a correlation-aware expert prefetcher for a schedule that minimises pipeline bubbles. Experimental results demonstrate that Klotski achieves a superior throughput-latency trade-off compared to state-of-the-art techniques, with throughput improvements of up to 85.12x.
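
The trade-off described in the abstract can be illustrated with a toy model. The sketch below is not Klotski's planner or scheduler. It assumes uniform, independent top-k routing, and every function name and timing parameter (`compute_per_batch`, `io_per_layer`, `io_per_expert`) is hypothetical and chosen only for illustration. It estimates the idle time ("bubble") left when the computation of several batches overlaps the loading of the next layer, first for a dense layer with a fixed load time and then for an MoE layer whose load time grows with the number of distinct experts activated.

```python
def dense_bubble(n_batches, compute_per_batch, io_per_layer):
    # Idle time left when n batches of current-layer compute
    # overlap the (fixed) load time of the next dense layer.
    return max(0.0, io_per_layer - n_batches * compute_per_batch)


def expected_active_experts(n_tokens, n_experts, top_k):
    # Expected number of distinct experts hit, ASSUMING each token picks
    # top_k experts uniformly at random, independently of other tokens.
    p_untouched = (1.0 - top_k / n_experts) ** n_tokens
    return n_experts * (1.0 - p_untouched)


def moe_bubble(n_batches, tokens_per_batch, n_experts, top_k,
               compute_per_batch, io_per_expert):
    # In this toy model the I/O time scales with the number of distinct
    # experts that must be loaded, so it grows as batches are added.
    n_tokens = n_batches * tokens_per_batch
    active = expected_active_experts(n_tokens, n_experts, top_k)
    return max(0.0, active * io_per_expert - n_batches * compute_per_batch)


if __name__ == "__main__":
    # Hypothetical numbers: 8 experts, top-2 routing, 1 token per batch.
    for n in (1, 2, 4, 8, 16):
        dense = dense_bubble(n, 1.0, 4.0)
        moe = moe_bubble(n, 1, 8, 2, 1.0, 2.0)
        print(f"batches={n:2d} dense_bubble={dense:.2f} moe_bubble={moe:.2f}")
```

With these made-up parameters, the dense bubble shrinks as batches are added, while the MoE bubble initially grows because each extra batch tends to activate experts that were not yet loaded. This is the effect the abstract cites as the motivation for ordering expert computations rather than only increasing the batch count.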

Paper Structure

This paper contains 24 sections, 5 equations, 15 figures, 3 tables, 1 algorithm.

Figures (15)

  • Figure 1: Comparison of three kinds of pipeline. We use multiple computations of the current layer to overlap the I/O of the next layer to reduce inter-layer bubbles and adjust the experts' computation order to reduce intra-layer bubbles.
  • Figure 2: Architecture and inference process of MoE models.
  • Figure 3: Illustration of offloading an LLM in a multi-level storage system. Only a few layers of parameters can be placed in VRAM, and the rest are placed in DRAM and disk. param. refers to parameters.
  • Figure 4: Construction process of strawman offloading strategy designed for MoE Models. Each row represents a batch.
  • Figure 5: The expert heatmaps in Mixtral-8$\times$7B, decoder part of switch-base-8 and switch-base-16. The darker the color, the higher the frequency of selection.
  • ...and 10 more figures