Enabling MoE on the Edge via Importance-Driven Expert Scheduling

Guoying Zhu, Meng Li, Haipeng Dai, Xuechen Liu, Weijun Wang, Keran Li, Jun Xiao, Ligeng Chen, Wei Wang

TL;DR

This paper tackles the challenge of deploying fine-grained Mixture of Experts (MoE) LLMs on memory-limited edge devices by introducing SMoE, an importance-driven expert scheduler with substitution. SMoE replaces low-importance active experts with functionally similar experts already resident in GPU memory and prefetches top-score experts to overlap loading with computation; a CPU-assisted loading pipeline balances the workload. The approach is formalized with per-layer substitution objectives, an expert-cache router, and a score-aware eviction policy. Evaluated across multiple MoE models, GPUs, and edge-like workloads, it achieves up to 48% lower decoding latency and a GPU-cache hit rate above 60% with near-lossless accuracy. It also significantly reduces PCIe transfers and CPU overhead, enabling more scalable and private edge deployments of MoE LLMs.

Abstract

The Mixture of Experts (MoE) architecture has emerged as a key technique for scaling Large Language Models by activating only a subset of experts per query. Deploying MoE on consumer-grade edge hardware, however, is constrained by limited device memory, making dynamic expert offloading essential. Unlike prior work that treats offloading purely as a scheduling problem, we leverage expert importance to guide decisions, substituting low-importance activated experts with functionally similar ones already cached in GPU memory, thereby preserving accuracy. As a result, this design reduces memory usage and data transfer, while largely eliminating PCIe overhead. In addition, we introduce a scheduling policy that maximizes the reuse ratio of GPU-cached experts, further boosting efficiency. Extensive evaluations show that our approach delivers 48% lower decoding latency with over 60% expert cache hit rate, while maintaining nearly lossless accuracy.
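The substitution idea in the abstract can be illustrated with a minimal Python sketch for one token at one layer. It is not the paper's algorithm: the function name `schedule_layer`, the rank cutoff `n_keep` used to decide which active experts count as important, and the precomputed `similarity` table are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def schedule_layer(
    scores: Dict[int, float],
    top_k: int,
    n_keep: int,
    cached: Set[int],
    similarity: Dict[Tuple[int, int], float],
) -> Tuple[List[int], List[int]]:
    """Return (experts_to_run, experts_to_load) for one token at one layer.

    All names and the rank-based importance cutoff are hypothetical.
    """
    # The router activates the top_k experts by score.
    active = sorted(scores, key=lambda e: scores[e], reverse=True)[:top_k]

    to_run: List[int] = []
    to_load: List[int] = []
    for rank, expert in enumerate(active):
        if expert in cached:
            # Cache hit: run the expert that is already in GPU memory.
            to_run.append(expert)
            continue
        # Substitutes must be cached, inactive, and not already chosen.
        candidates = sorted(cached - set(active) - set(to_run))
        if rank < n_keep or not candidates:
            # High-importance miss (or nothing to substitute with):
            # pay the transfer cost and load the real expert.
            to_load.append(expert)
            to_run.append(expert)
        else:
            # Low-importance miss: run the most similar cached expert instead.
            best = max(candidates, key=lambda c: similarity.get((expert, c), 0.0))
            to_run.append(best)
    return to_run, to_load


run, load = schedule_layer(
    scores={0: 0.5, 1: 0.3, 2: 0.1, 3: 0.05, 4: 0.05},
    top_k=3,
    n_keep=1,
    cached={1, 3, 4},
    similarity={(2, 3): 0.9, (2, 4): 0.2},
)
print(run, load)
```

In this toy call, expert 0 is a high-score miss and is loaded, expert 1 is a cache hit, and the low-score miss on expert 2 is served by cached expert 3, so only one expert crosses PCIe instead of two.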

Paper Structure

This paper contains 19 sections, 7 equations, 19 figures, 4 tables, 2 algorithms.

Figures (19)

  • Figure 1: Traditional MoE layer vs. our expert scheduler with substitution (via substituting low-score experts and prefetching top-score experts).
  • Figure 2: Online Expert Offloading in MoE LLMs at one layer. Step 1: Router selects the active experts. Step 2: CPU computes part of the active experts in CPU memory. Step 3: Part of active experts and CPU-computed expert results are transferred to GPU memory via PCIe. Step 4: GPU processes experts, consolidating those results with CPU-computed results.
  • Figure 3: CPU/GPU (A6000) time for expert-token computation and PCIe time for expert loading, measured on 3 MoE LLMs.
  • Figure 4: A few high-scoring active experts greatly impact accuracy, while low-scoring ones resemble inactive experts.
  • Figure 5: Our idea: prefetching top-score experts and replacing low-score experts in each iteration at one layer.
  • ...and 14 more figures
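
The prefetching idea in Figure 5, combined with the score-aware eviction policy named in the TL;DR, can be sketched as a small per-layer cache. This is only an illustration under stated assumptions: the class `ScoreAwareExpertCache`, its `prefetch` method, and the rule "evict the resident expert with the lowest latest router score" are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


class ScoreAwareExpertCache:
    """Fixed-capacity set of GPU-resident experts for one layer (hypothetical)."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.score: Dict[int, float] = {}  # resident expert id -> latest score

    def prefetch(self, router_scores: Dict[int, float], n_prefetch: int) -> List[int]:
        """Admit the n_prefetch top-score experts; return the ids that were loaded."""
        # Refresh the scores of experts that are already resident.
        for expert in self.score:
            if expert in router_scores:
                self.score[expert] = router_scores[expert]

        top = sorted(router_scores, key=lambda e: router_scores[e], reverse=True)
        top = top[:n_prefetch]

        loaded: List[int] = []
        for expert in top:
            if expert in self.score:
                continue  # already in GPU memory, nothing to transfer
            if len(self.score) >= self.capacity:
                # Evict the lowest-score resident that is not itself a top expert.
                victims = [e for e in self.score if e not in top]
                if not victims:
                    continue  # cache is full of current top experts
                victim = min(victims, key=lambda e: self.score[e])
                del self.score[victim]
            self.score[expert] = router_scores[expert]
            loaded.append(expert)
        return loaded


cache = ScoreAwareExpertCache(capacity=2)
first = cache.prefetch({0: 0.6, 1: 0.3, 2: 0.1}, n_prefetch=2)
second = cache.prefetch({0: 0.1, 1: 0.2, 2: 0.7}, n_prefetch=1)
print(first, second, sorted(cache.score))
```

In the second call, expert 2 becomes the top-score expert, and expert 0 is evicted because its refreshed score is the lowest among the residents. A real system would issue the load asynchronously so that it overlaps with computation; the sketch only tracks which experts would be transferred.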