MoE-SpAc: Efficient MoE Inference Based on Speculative Activation Utility in Heterogeneous Edge Scenarios

Shuhuai Li, Jianghao Lin, Dongdong Ge, Yinyu Ye

TL;DR

This paper introduces MoE-SpAc, an MoE inference framework with three components: a Speculative Utility Estimator that tracks expert demand, a Heterogeneous Workload Balancer that dynamically partitions computation via online integer optimization, and an Asynchronous Execution Engine that handles prefetching and eviction in a single utility space.

Abstract

Mixture-of-Experts (MoE) models enable scalable performance but face severe memory constraints on edge devices. Existing offloading strategies struggle with I/O bottlenecks because autoregressive expert activation is dynamic and carries little information. In this paper, we propose to repurpose Speculative Decoding (SD) not merely as a compute accelerator but as an informative lookahead sensor for memory management, a view supported by our theoretical and empirical analyses. We therefore introduce MoE-SpAc, an MoE inference framework that integrates a Speculative Utility Estimator to track expert demand, a Heterogeneous Workload Balancer to dynamically partition computation via online integer optimization, and an Asynchronous Execution Engine to unify prefetching and eviction in a single utility space. Extensive experiments on seven benchmarks demonstrate that MoE-SpAc achieves a 42% improvement in tokens per second (TPS) over the SOTA SD-based baseline, and an average 4.04x speedup over all standard baselines. Code is available at https://github.com/lshAlgorithm/MoE-SpAc.
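
To make the idea of a frequency-valued utility signal concrete, here is a minimal Python sketch. It is not the authors' implementation: the input format, the function names `expert_utility` and `plan_cache`, and the tie-breaking rule are all assumptions made for illustration. It counts how many speculated tokens route to each expert in one layer, then ranks cached and uncached experts in the same ordering so that prefetch and eviction decisions come from one score.

```python
from collections import Counter

def expert_utility(draft_activations):
    """Count how many draft tokens routed to each expert (hypothetical format).

    draft_activations: one list of expert ids per speculated token,
    e.g. the top-k routing result for that token in a single layer.
    """
    counts = Counter()
    for experts in draft_activations:
        counts.update(set(experts))  # a token counts each expert at most once
    return counts

def plan_cache(utility, cached, capacity):
    """Rank every expert by the same utility and return (prefetch, evict)."""
    candidates = set(utility) | set(cached)
    # Highest utility first; ties favour experts already cached, then lower id.
    ranked = sorted(candidates,
                    key=lambda e: (-utility.get(e, 0), e not in cached, e))
    keep = set(ranked[:capacity])
    prefetch = sorted(keep - set(cached))   # wanted but not yet resident
    evict = sorted(set(cached) - keep)      # resident but no longer wanted
    return prefetch, evict

# Four speculated tokens, each routed to two experts (made-up routing).
draft = [[0, 3], [3, 5], [3, 7], [0, 3]]
utility = expert_utility(draft)
prefetch, evict = plan_cache(utility, cached={1, 5}, capacity=2)
print(dict(utility), prefetch, evict)
```

With a single autoregressive token, each expert's count would be either 0 or 1; a multi-token draft turns that binary signal into a frequency, which is what lets the ranking separate heavily reused experts from one-off ones.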

Paper Structure

This paper contains 38 sections, 1 theorem, 30 equations, 12 figures, 4 tables, and 1 algorithm.

Key Result

Theorem 1

(Bounded Drift of Expert Gating Scores) Let $\mathbf{h}_t^{(0)}$ and $\mathbf{h}_{t+1}^{(0)}$ be the input embeddings for two consecutive inference steps with bounded initial divergence $\delta_{in} = \|\mathbf{h}_t^{(0)} - \mathbf{h}_{t+1}^{(0)}\|$. Assume the Attention and FFN modules in the Draft where $\sigma^j = \beta^j(1 + N \alpha^j)$ represents the layer-wise expansion factor.
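
As a purely arithmetic illustration of the expansion factor defined above, take hypothetical values $\beta^j = 1.1$, $N = 8$, and $\alpha^j = 0.05$. These numbers are assumptions chosen for the example; this excerpt does not state what the three symbols denote or how large they are.

$$\sigma^j = \beta^j\,(1 + N\alpha^j) = 1.1\,(1 + 8 \times 0.05) = 1.1 \times 1.4 = 1.54$$

For fixed $\alpha^j$ and $\beta^j$, the factor grows linearly in $N$.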

Figures (12)

  • Figure 1: The advantages of speculative decoding (SD, bottom) over traditional autoregressive decoding (AR, top) from both theoretical (left) and practical (right) perspectives. Theoretically, SD enables expert reuse and transforms binary, low-information AR signals into informative frequency-valued ones. Practically, MoE-SpAc masks I/O latency by asynchronously prefetching experts during the drafting phase, unlike AR, which suffers from blocking loads.
  • Figure 2: Overall framework of MoE-SpAc.
  • Figure 3: Pipeline of a single-layer forward pass in the SD scenario. GPU and CPU denote computation on the respective device. $h$ in I/O denotes the transmission of the tokens' hidden states. Note that $T^*_{IO}$ denotes the prefetching time for another layer and is simplified here for clarity.
  • Figure 4: The hot-or-cold online prediction accuracy of MoE-SpAc (SD) and HybriMoE (AR) on MMLU-Pro. Since the decoding length per step is different between SD and AR, we report the averaged accuracy of HybriMoE as the dashed line.
  • Figure 5: Stability and sensitivity analysis. Left: Impact of expert cache ratios; despite an OOM boundary at 21% due to draft model allocation, MoE-SpAc yields superior throughput compared to existing works. Middle: Scalability across generation lengths, showing consistent gains over baselines in long-context tasks. Right: Effect of the threshold cap K, where small values reduce performance by limiting the precision of the utility score.
  • ...and 7 more figures
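
The GPU/CPU split shown in Figure 3 can be illustrated with a toy assignment problem. The sketch below is an assumption-laden illustration, not the paper's formulation: the cost constants `t_gpu`, `t_cpu`, and `t_load` are made-up units, the function name `balance` is hypothetical, and exhaustive search stands in for the online integer optimization that the abstract mentions. Each expert is placed on the GPU or the CPU so that the slower of the two devices finishes as early as possible.

```python
from itertools import product

def balance(loads, cached, t_gpu=1.0, t_cpu=4.0, t_load=6.0):
    """Brute-force the GPU/CPU assignment that minimises the layer makespan.

    loads:  {expert_id: number of tokens routed to that expert}
    cached: set of expert ids already resident in GPU memory
    All cost constants are illustrative units, not measurements.
    """
    experts = sorted(loads)
    best = None
    for assign in product((0, 1), repeat=len(experts)):  # 1 = GPU, 0 = CPU
        # GPU time: per-token compute plus a load penalty for uncached experts.
        gpu = sum(loads[e] * t_gpu + (0.0 if e in cached else t_load)
                  for e, a in zip(experts, assign) if a)
        # CPU time: slower per-token compute, but no weight transfer needed.
        cpu = sum(loads[e] * t_cpu
                  for e, a in zip(experts, assign) if not a)
        makespan = max(gpu, cpu)  # the two devices work in parallel
        if best is None or makespan < best[0]:
            best = (makespan, {e for e, a in zip(experts, assign) if a})
    return best

makespan, on_gpu = balance({0: 2, 3: 4, 5: 1, 7: 1}, cached={0, 3})
print(makespan, sorted(on_gpu))
```

In this toy instance the heavily used, already cached experts stay on the GPU, while the two lightly used, uncached experts run on the CPU because loading them would cost more than computing them in place. Exhaustive search is only viable for a handful of experts; it is used here to keep the example short.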

Theorems & Definitions (1)

  • Theorem 1