Enabling MoE on the Edge via Importance-Driven Expert Scheduling
Guoying Zhu, Meng Li, Haipeng Dai, Xuechen Liu, Weijun Wang, Keran Li, Jun Xiao, Ligeng Chen, Wei Wang
TL;DR
This paper tackles the challenge of deploying fine-grained Mixture of Experts (MoE) LLMs on memory-limited edge devices by introducing SMoE, an importance-driven expert scheduler with substitution. SMoE replaces low-importance activated experts with functionally similar experts already resident in GPU memory, and prefetches the highest-scoring experts so that loading overlaps with computation. A CPU-assisted loading pipeline balances the workload between CPU and GPU. The approach is formalized with per-layer substitution objectives, an expert-cache router, and a score-aware eviction policy. It is validated across multiple MoE models, GPUs, and edge-like workloads, achieving up to 48% lower decoding latency and a GPU expert-cache hit rate above 60% with near-lossless accuracy. The solution significantly reduces PCIe transfers and CPU overhead, enabling more scalable and private edge deployments of MoE LLMs for real-world applications.
Abstract
The Mixture of Experts (MoE) architecture has emerged as a key technique for scaling Large Language Models by activating only a subset of experts per query. Deploying MoE on consumer-grade edge hardware, however, is constrained by limited device memory, making dynamic expert offloading essential. Unlike prior work that treats offloading purely as a scheduling problem, we leverage expert importance to guide decisions, substituting low-importance activated experts with functionally similar ones already cached in GPU memory, thereby preserving accuracy. As a result, this design reduces memory usage and data transfer, while largely eliminating PCIe overhead. In addition, we introduce a scheduling policy that maximizes the reuse ratio of GPU-cached experts, further boosting efficiency. Extensive evaluations show that our approach delivers 48% lower decoding latency with over 60% expert cache hit rate, while maintaining nearly lossless accuracy.
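The substitution idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `schedule_experts`, the use of the normalized router score as the importance measure, the fixed `threshold`, and the precomputed `similar` ranking are all assumptions made for illustration, since this summary does not specify them.

```python
def schedule_experts(scores, cached, similar, threshold):
    """Plan one MoE layer's expert execution for a single token.

    scores:    dict expert_id -> router score of each activated expert
    cached:    set of expert ids currently resident in GPU memory
    similar:   dict expert_id -> expert ids ranked by functional similarity
               (hypothetical; assumed to be precomputed offline)
    threshold: normalized score below which an expert counts as
               low-importance (hypothetical tuning knob)

    Returns (plan, to_load): plan maps each activated expert to the expert
    that is actually executed; to_load lists experts to fetch from host memory.
    """
    total = sum(scores.values())
    plan, to_load = {}, []

    # Activated experts that are already cached run as-is (cache hits).
    taken = {e for e in scores if e in cached}
    for e in taken:
        plan[e] = e

    # Handle cache misses, most important first.
    misses = sorted((e for e in scores if e not in cached),
                    key=lambda e: -scores[e])
    for e in misses:
        substitute = None
        if scores[e] / total < threshold:
            # Low importance: try the most similar cached expert that is
            # not already being used for this token.
            substitute = next(
                (c for c in similar.get(e, []) if c in cached and c not in taken),
                None,
            )
        if substitute is None:
            # Important expert, or no suitable substitute: load it.
            plan[e] = e
            to_load.append(e)
        else:
            plan[e] = substitute
            taken.add(substitute)
    return plan, to_load


# Experts 0, 3, and 5 are activated; only expert 0 is cached.
plan, to_load = schedule_experts(
    scores={0: 0.6, 3: 0.3, 5: 0.1},
    cached={0, 4, 7},
    similar={3: [4, 7], 5: [4, 7]},
    threshold=0.2,
)
```

In this example, expert 3 is above the threshold and is therefore loaded, while the low-importance expert 5 is replaced by the cached expert 4, so only one expert crosses the PCIe bus instead of two.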
