ProMoE: Fast MoE-based LLM Serving using Proactive Caching
Xiaoniu Song, Zihang Zhong, Rong Chen, Haibo Chen
TL;DR
ProMoE tackles the GPU memory bottleneck of MoE-based LLMs on edge devices with a proactive caching system that predicts which experts will be needed next and prefetches them before they are used. It combines a learned predictor (trained offline) with stride prefetching and a set of coordination techniques (chunked prefetching, early preemption, and reordered inference) to keep data transfer off the critical path of inference. The approach is integrated into popular LLM frameworks and evaluated across multiple models and datasets, achieving substantial speedups over reactive caching and existing offloading baselines (on average 2.20x in prefill and 2.07x in decode against existing offloading solutions), which enables faster edge deployment of MoE-based LLMs. The work also introduces the GoodPred metric to quantify predictor quality and provides open-source tooling for reproducibility and further development.
Abstract
The promising applications of large language models are often limited by the constrained GPU memory capacity available on edge devices. Mixture-of-Experts (MoE) models help address this issue by activating only a subset of the model's parameters during computation. This approach allows the unused parameters to be offloaded to host memory, thereby reducing the overall GPU memory demand. However, existing cache-based offloading solutions handle cache misses reactively, which significantly impacts system performance. In this paper, we introduce ProMoE, a novel proactive caching system that utilizes intermediate results to predict subsequent expert usage. By proactively fetching experts in advance, ProMoE eliminates passive cache misses, removes loading time from the critical path, and reduces the performance overhead associated with offloading. Our evaluations demonstrate that ProMoE achieves an average speedup of 2.20x (up to 3.21x) and 2.07x (up to 5.02x) in the prefill and decode stages, respectively, compared to existing offloading solutions.
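The difference between reactive and proactive handling of cache misses can be illustrated with a toy simulation. The sketch below is not ProMoE's implementation: the `ExpertCache` class, the `count_blocking_loads` function, the LRU eviction policy, and the oracle predictor are all assumptions made for illustration, and a "blocking load" simply stands for an expert that has to be fetched while inference waits for it.

```python
from collections import OrderedDict


class ExpertCache:
    """Toy fixed-capacity LRU cache of expert IDs resident in GPU memory."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()

    def contains(self, expert):
        return expert in self.resident

    def load(self, expert):
        # Already resident: just mark it as most recently used.
        if expert in self.resident:
            self.resident.move_to_end(expert)
            return
        # Cache full: evict the least recently used expert first.
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)
        self.resident[expert] = True


def count_blocking_loads(trace, capacity, predictor=None):
    """Count experts that must be loaded on demand (on the critical path).

    trace:     expert IDs in the order inference uses them.
    predictor: optional callable(step) -> expert ID expected at step + 1
               (hypothetical interface); None means purely reactive caching.
    """
    cache = ExpertCache(capacity)
    blocking = 0
    for step, expert in enumerate(trace):
        if not cache.contains(expert):
            blocking += 1  # a miss: inference would stall on this load
        cache.load(expert)
        # Proactive path: fetch the predicted next expert ahead of its use.
        if predictor is not None and step + 1 < len(trace):
            cache.load(predictor(step))
    return blocking


trace = [0, 1, 2, 3, 0, 1, 2, 3]
reactive = count_blocking_loads(trace, capacity=2)
proactive = count_blocking_loads(trace, capacity=2,
                                 predictor=lambda step: trace[step + 1])
print(reactive, proactive)  # 8 1
```

With a cache of two experts and a cyclic access pattern over four, the reactive cache misses on every access, while a perfect predictor leaves only the very first load on the critical path. A real predictor is imperfect and prefetches compete with on-demand loads for transfer bandwidth, which this sketch does not model.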
