MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching

Tairan Xu, Leyang Xue, Zhan Lu, Adrian Jackson, Luo Mai

TL;DR

MoE-Gen tackles the bottleneck of high-throughput MoE inference on a single GPU by introducing module-based batching that aggregates large input batches for the attention and expert modules using host memory. It couples a DAG-based engine, full KV-cache offloading to CPU memory, and CPU-accelerated attention to maximize GPU utilization and overlap computation with data transfers. The approach yields substantial throughput gains over model-based and continuous batching baselines across multiple MoE models and tasks, while enabling cost-efficient deployment on commodity hardware. This work advances practical offline MoE inference by balancing memory hierarchy, computation, and I/O to unlock large-scale MoE models on limited hardware.

Abstract

This paper presents MoE-Gen, a high-throughput MoE inference system optimized for single-GPU execution. Existing inference systems rely on model-based or continuous batching strategies, originally designed for interactive inference, which result in excessively small batches for MoE's key modules (the attention and expert modules), leading to poor throughput. To address this, we introduce module-based batching, which accumulates tokens in host memory and dynamically launches large batches on GPUs to maximize utilization. Additionally, we optimize the choice of batch sizes for each module in an MoE to fully overlap GPU computation and communication, maximizing throughput. Evaluation demonstrates that MoE-Gen achieves 8-31x higher throughput compared to state-of-the-art systems employing model-based batching (FlexGen, MoE-Lightning, DeepSpeed), and offers even greater throughput improvements over continuous batching systems (e.g., vLLM and Ollama) on popular MoE models (DeepSeek and Mixtral) across offline inference tasks. MoE-Gen's source code is publicly available at https://github.com/EfficientMoE/MoE-Gen.
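
The accumulate-then-launch idea behind module-based batching can be illustrated with a minimal sketch. This is not MoE-Gen's implementation: the class `ModuleBatcher`, its methods, the module names, and the batch sizes are all hypothetical, and a plain Python list stands in for the host-memory buffer. Each module has its own batch size, inputs are buffered on the host, and a launch happens only once a full batch for that module is available.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class ModuleBatcher:
    """Hypothetical sketch: buffer inputs per module on the host, launch in large batches."""

    def __init__(self, batch_sizes: Dict[str, int],
                 launch: Callable[[str, List[int]], None]) -> None:
        self.batch_sizes = batch_sizes  # per-module batch size (assumed values)
        self.launch = launch            # stand-in for a GPU kernel launch
        self.buffers: Dict[str, List[int]] = defaultdict(list)

    def submit(self, module: str, tokens: List[int]) -> None:
        """Append tokens to the module's host buffer; launch every full batch."""
        buf = self.buffers[module]
        buf.extend(tokens)
        size = self.batch_sizes[module]
        while len(buf) >= size:
            self.launch(module, buf[:size])
            del buf[:size]

    def flush(self) -> None:
        """Launch whatever is left, even if it is smaller than a full batch."""
        for module, buf in self.buffers.items():
            if buf:
                self.launch(module, list(buf))
                buf.clear()


# Record (module, batch length) instead of running anything on a GPU.
launched = []
batcher = ModuleBatcher(
    batch_sizes={"attention": 4, "expert": 8},
    launch=lambda module, batch: launched.append((module, len(batch))),
)
for step in range(5):
    tokens = [2 * step, 2 * step + 1]  # two new tokens per step
    batcher.submit("attention", tokens)
    batcher.submit("expert", tokens)
batcher.flush()
```

In this toy run the expert module is launched less often than the attention module but with larger batches, which is the property the abstract attributes to choosing batch sizes per module rather than per model.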

Paper Structure

This paper contains 17 sections, 3 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Illustration of one layer in MoE models.
  • Figure 2: Model-based batching employs a single, unified batch size throughout the entire forward pass, whereas module-based batching iteratively processes modules with small batches to form larger batches.
  • Figure 3: Left: Achieved FLOPs in the non-offloading scenario. This metric represents the number of floating-point operations performed by an expert module, normalized by the GPU compute time. Right: Percentage of GPU idle time in the offloading scenario on an NVIDIA A5000 (PCIe 4.0, 32 GB/s). This metric measures the ratio of the expert module’s execution time to the time required to transfer the necessary weights from the CPU to the GPU.
  • Figure 4: Fetching traffic over the dataset, showing that fully offloading the KV-cache benefits performance. Measured using Mixtral-8x7B with a CPU KV-cache capacity of 128GB. We pad or truncate each prompt to the same length and decode to the same length.
  • Figure 5: MoE-Gen system components.
  • ...and 2 more figures
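
The offloading metric in the Figure 3 caption compares an expert module's execution time with the time needed to move its weights from the CPU to the GPU. A rough way to reason about it is sketched below. The 32 GB/s figure is the PCIe 4.0 bandwidth quoted in the caption; the weight size, the compute time, and the formula `1 - compute / transfer` for the idle fraction are illustrative assumptions, not values or definitions taken from the paper.

```python
def transfer_time_s(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to copy the expert weights from CPU to GPU at a given link bandwidth."""
    return weight_bytes / bandwidth_bytes_per_s


def gpu_idle_fraction(compute_time_s: float, transfer_s: float) -> float:
    """Assumed model: the GPU idles whenever compute finishes before the transfer does."""
    if transfer_s <= 0:
        return 0.0
    return max(0.0, 1.0 - compute_time_s / transfer_s)


PCIE4_BANDWIDTH = 32e9   # bytes/s, from the Figure 3 caption
EXPERT_WEIGHTS = 320e6   # bytes, hypothetical expert size
SMALL_BATCH_COMPUTE = 0.002  # seconds, hypothetical expert compute time

t_transfer = transfer_time_s(EXPERT_WEIGHTS, PCIE4_BANDWIDTH)
idle = gpu_idle_fraction(SMALL_BATCH_COMPUTE, t_transfer)
print(f"transfer: {t_transfer * 1e3:.1f} ms, idle fraction: {idle:.0%}")
```

Under this model, a larger expert batch raises the compute time per weight transfer, so the idle fraction shrinks toward zero once compute takes at least as long as the transfer.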