OD-MoE: On-Demand Expert Loading for Cacheless Edge-Distributed MoE Inference

Liujianfu Wang, Yuyang Du, Yuchen Pan, Soung Chang Liew, Jiacheng Liu, Kexin Chen

TL;DR

The paper tackles memory bottlenecks in edge MoE inference by eliminating expert caches and loading experts fully on demand. It uses Scaled Emulative Prediction (SEP), a fast shadow-emulation flow, to forecast future expert activations, and it coordinates decoding across distributed edge nodes. Key innovations include group-based parallel loading, round-robin scheduling, and KV/token alignment that keeps the predictor accurate during autoregressive decoding. Together these let edge devices with less than 1 GB of GPU memory perform MoE inference with near-full-precision QA. On a 10-node testbed, OD-MoE reaches about 75% of the decoding speed of a fully GPU-cached deployment while using only about 1/3 of the GPU memory, and SEP achieves up to 99.94% recall. The work is open-sourced for reproducibility and broader adoption.

Abstract

Mixture-of-Experts (MoE), while offering significant advantages as a Large Language Model (LLM) architecture, faces substantial challenges when deployed on low-cost edge devices with tight memory constraints. Expert offloading mitigates this issue by storing expert parameters in CPU memory and caching a subset of popular experts in GPU memory. Although this approach improves GPU memory utilization by caching only the likely-used experts, the GPU memory reserved for expert caching is underutilized compared with dense LLMs. This paper presents OD-MoE, a distributed MoE inference framework that obviates the need for expert caches via fully on-demand expert loading. OD-MoE is built upon two key mechanisms: 1) parallelizing expert loading and expert computation across distributed edge nodes, and 2) an ultra-accurate emulative predictor that forecasts expert activations multiple layers ahead while expert computation is ongoing. With these innovations, OD-MoE dynamically loads each target expert to one of the distributed nodes just-in-time before its activation and promptly evicts it afterward, freeing GPU memory for subsequent experts. We comprehensively benchmark OD-MoE against state-of-the-art MoE offloading systems on a ten-node testbed. Experimental results show that: 1) OD-MoE achieves 99.94% expert activation prediction accuracy, substantially surpassing all existing methods; and 2) OD-MoE delivers approximately 75% of the decoding speed of a fully GPU-cached MoE deployment while using only 1/3 of the GPU memory. More importantly, by eliminating the need for expert caches, OD-MoE enables MoE inference on edge nodes with less than 1 GB of GPU memory, paving the way for practical MoE deployment on low-cost IoT devices at the edge in the LLM era.
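
The load-then-evict cycle described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: the names `WorkerNode`, `decode_one_token`, and `lookahead` are hypothetical, each layer is assigned to a single node in round-robin order (the paper assigns layers to groups of worker nodes), and loading is modeled as an instantaneous set insertion rather than a transfer that overlaps with computation.

```python
from dataclasses import dataclass, field


@dataclass
class WorkerNode:
    """Hypothetical edge node that keeps only soon-to-be-used experts in GPU memory."""
    name: str
    resident: set = field(default_factory=set)  # (layer, expert) pairs currently loaded
    peak_resident: int = 0                      # high-water mark of loaded experts

    def load(self, layer, expert):
        self.resident.add((layer, expert))
        self.peak_resident = max(self.peak_resident, len(self.resident))

    def run_layer(self, layer, needed):
        """Run the experts a layer actually activates, then evict that layer's experts."""
        misses = 0
        for expert in needed:
            if (layer, expert) not in self.resident:
                # Misprediction: the expert was not prefetched, so load it now.
                # In a real system this would stall the decoding pipeline.
                misses += 1
                self.load(layer, expert)
        # Evict everything held for this layer, used or not, to free memory.
        self.resident = {key for key in self.resident if key[0] != layer}
        return misses


def decode_one_token(predicted, actual, workers, lookahead=2):
    """Simulate one decoding step; predicted/actual are per-layer lists of expert ids."""
    num_layers = len(actual)

    def node_for(layer):
        return workers[layer % len(workers)]  # assumed round-robin assignment

    def prefetch(layer):
        for expert in predicted[layer]:
            node_for(layer).load(layer, expert)

    # Warm-up: prefetch the first `lookahead` layers before any computation.
    for layer in range(min(lookahead, num_layers)):
        prefetch(layer)

    misses = 0
    for layer in range(num_layers):
        upcoming = layer + lookahead
        if upcoming < num_layers:
            prefetch(upcoming)  # issued while the current layer is being computed
        misses += node_for(layer).run_layer(layer, actual[layer])
    return misses
```

With perfect predictions and a `lookahead` no larger than the number of nodes, each node in this toy model never holds more than one layer's experts at a time, which is the intuition behind running without an expert cache. Every misprediction shows up as a miss that forces a blocking load.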

Paper Structure

This paper contains 15 sections, 3 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Architecture of OD-MoE. The example here shows the ten-node testbed we have developed, which includes eight worker nodes, one main node, and one shadow node.
  • Figure 2: Timing diagram of OD-MoE illustrating the round-robin scheduling scheme therein. Main node computation for layer $l$ is denoted by $M_l$. Shadow node computation for layer $l$ is denoted by $S_l$. Expert loading and expert computation for layer $l$ are denoted by $EL_l$ and $EC_l$, respectively.
  • Figure 3: Expert-selection recall rate versus output token index. Three different quantization schemes (NF4, INT8, and FP16) have been considered for the shadow model, while the original model is realized with a precision of FP32.
  • Figure 4: Illustrative timing diagram of OD-MoE, without alignment for the shadow model.
  • Figure 5: Illustrative timing diagram of OD-MoE, with alignment for the shadow model.
  • ...and 5 more figures
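
The expert-selection recall rate plotted in Figure 3 can be illustrated with a short function. The definition used here is an assumption, since this page does not spell it out: recall is taken as the fraction of experts actually activated by the original model that the shadow model also predicted, pooled over all layers. The function name `expert_selection_recall` is hypothetical.

```python
def expert_selection_recall(predicted, actual):
    """Fraction of actually activated experts that were also predicted.

    predicted, actual: per-layer lists of expert ids (assumed input layout).
    """
    hits = 0
    total = 0
    for predicted_layer, actual_layer in zip(predicted, actual):
        activated = set(actual_layer)
        hits += len(activated & set(predicted_layer))
        total += len(activated)
    if total == 0:
        raise ValueError("recall is undefined when no experts were activated")
    return hits / total


# Three layers with two activated experts each; one of six is mispredicted.
recall = expert_selection_recall(
    predicted=[[0, 1], [2, 3], [4, 5]],
    actual=[[0, 1], [2, 5], [4, 5]],
)
print(f"{recall:.4f}")  # prints 0.8333
```

Under this definition, a recall of 99.94% would mean that roughly 6 in every 10,000 activated experts were not prefetched.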