MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache

Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, Mahesh Marina

TL;DR

MoE-Infinity introduces a sparsity-aware expert cache that enables efficient MoE inference on personal machines with limited GPU memory. It records per-layer expert activations in Expert Activation Matrices (EAMs) and uses a historical Expert Activation Matrix Collection (EAMC) to predict future activations. These predictions drive targeted prefetching and a layer-aware eviction policy that keep the cache compact and effective. The approach yields 3.1–16.7× per-token latency improvements over vLLM, Ollama, DeepSpeed and BrainStorm across DeepSeek and Mixtral models and a range of LLM tasks, and it remains effective in long-context scenarios while keeping CPU-GPU overhead small. The source code is publicly available, supporting local deployment of large MoE LLMs on commodity hardware.

Abstract

This paper presents MoE-Infinity, an efficient MoE inference system designed for personal machines with limited GPU memory capacity. The key idea behind MoE-Infinity is that on personal machines, which are often single-user environments, MoE-based LLMs typically operate with a batch size of one. In this setting, MoE models exhibit a high degree of activation sparsity, meaning that a small number of experts are frequently reused when generating tokens during the decode phase. Leveraging this idea, we design a sparsity-aware expert cache, which traces the sparse activation of experts during inference and carefully selects the traces that represent the sparsity pattern. By analyzing these selected traces, MoE-Infinity guides the replacement and prefetching of the expert cache, providing 3.1–16.7× per-token latency improvements over numerous state-of-the-art systems, including vLLM, Ollama, DeepSpeed and BrainStorm, across various MoE models (DeepSeek and Mixtral) when handling different LLM tasks. MoE-Infinity's source code is publicly available at https://github.com/EfficientMoE/MoE-Infinity
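The tracing-and-prediction idea described above can be illustrated with a small sketch. It is not MoE-Infinity's implementation: the function names (`record_activation`, `predict_experts`), the use of cosine similarity to match a partial trace against stored traces, and the toy matrices are all assumptions made for illustration. An Expert Activation Matrix is modelled here as a `layers × experts` array of activation counts, and the collection (EAMC) as a plain list of such arrays.

```python
import numpy as np

def record_activation(eam, layer, expert_ids):
    """Increment the count of every expert the router selected at `layer`."""
    for e in expert_ids:
        eam[layer, e] += 1

def cosine_similarity(a, b):
    """Cosine similarity of two matrices, flattened; 0.0 if either is all zeros."""
    a = a.ravel().astype(float)
    b = b.ravel().astype(float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def predict_experts(partial_eam, eamc, layer, top_k):
    """Pick the stored trace most similar to the partial one (assumed metric),
    then return the top_k most-activated experts of that trace at `layer`."""
    best = max(eamc, key=lambda m: cosine_similarity(partial_eam, m))
    ranked = np.argsort(-best[layer], kind="stable")
    return [int(e) for e in ranked[:top_k]]

# Toy collection of historical traces: 2 layers, 4 experts each (made-up data).
eamc = [
    np.array([[5, 0, 0, 1], [0, 4, 2, 0]]),
    np.array([[0, 3, 3, 0], [1, 0, 0, 5]]),
]

# Trace of the current sequence: only layer 0 has been executed so far.
current = np.zeros((2, 4), dtype=int)
record_activation(current, layer=0, expert_ids=[0, 3])

# Candidate experts to prefetch for layer 1.
prefetch = predict_experts(current, eamc, layer=1, top_k=2)
```

In this toy run the partial trace overlaps only with the first stored matrix, so the experts suggested for layer 1 are the two that matrix activated most often there. A real system would also need to decide which traces are worth keeping in the collection, which this sketch does not attempt.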

Paper Structure (19 sections, 9 figures, 3 tables, 1 algorithm)


Figures (9)

  • Figure 1: MoE inference on a GPU with the full model offloaded to CPU memory. E[0,1] refers to the expert module at layer 0 with index 1.
  • Figure 2: Expert reuse counts over decoding iterations for two sample sequences, and merged over 1000 sequences. Darker colour indicates higher normalized reuse. Sampled from the last layer of Mixtral-8x7B (top, 20 decoding iterations) and DeepSeek-V2-Lite (bottom, 256 decoding iterations).
  • Figure 3: Clustering the activation matrices with K-means; matrices within the same group have similar values. The activation state is modelled by a Markov chain.
  • Figure 5: Example of computing activation likelihood.
  • Figure 6: Example of integrating caching with prefetching. LRU is the most commonly implemented technique in SOTA systems such as vLLM, Llama.cpp and DeepSpeed; Statistical Count is implemented in BrainStorm.
  • ...and 4 more figures
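
The Figure 6 caption contrasts two baseline replacement policies, LRU and statistical count. The sketch below shows textbook versions of both on a toy access trace. It is not taken from vLLM, Llama.cpp, DeepSpeed or BrainStorm, and it does not show MoE-Infinity's own layer-aware policy; the class names, the `(layer, expert)` key format and the trace are assumptions for illustration.

```python
from collections import Counter, OrderedDict

class LRUExpertCache:
    """Evicts the expert that was used least recently."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()  # expert key -> placeholder for weights

    def access(self, key):
        """Return True on a hit; on a miss, load the expert, evicting if full."""
        if key in self.slots:
            self.slots.move_to_end(key)      # mark as most recently used
            return True
        if len(self.slots) >= self.capacity:
            self.slots.popitem(last=False)   # drop the least recently used entry
        self.slots[key] = None
        return False

class CountExpertCache:
    """Evicts the cached expert with the lowest cumulative access count."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = set()
        self.counts = Counter()

    def access(self, key):
        self.counts[key] += 1
        if key in self.slots:
            return True
        if len(self.slots) >= self.capacity:
            victim = min(self.slots, key=lambda k: self.counts[k])
            self.slots.remove(victim)
        self.slots.add(key)
        return False

def hit_count(cache, trace):
    """Replay a trace of (layer, expert) keys and count cache hits."""
    return sum(cache.access(key) for key in trace)

# Hypothetical trace in which expert (0, 1) is reused far more than the others.
trace = [(0, 1), (0, 1), (0, 2), (0, 3), (0, 1), (0, 2), (0, 1)]
lru_hits = hit_count(LRUExpertCache(capacity=2), trace)
count_hits = hit_count(CountExpertCache(capacity=2), trace)
```

On this trace the count-based cache keeps the frequently reused expert resident and scores more hits than LRU, which evicts it as soon as two other experts are touched. The result depends entirely on the trace, so it says nothing about how the policies compare on real workloads.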