AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference

Shuzhang Zhong, Ling Liang, Yuan Wang, Runsheng Wang, Ru Huang, Meng Li

TL;DR

AdapMoE addresses the on-demand loading bottleneck in edge MoE inference by unifying adaptive sensitivity-based gating, predictive prefetching, and DP-based adaptive caching. The framework dynamically reduces the number of activated experts, bounding the accuracy loss with a Fisher-information-inspired threshold, and allocates cache capacity with a knapsack-like formulation. It reports a 25% reduction in the average number of activated experts and a 1.35× speedup without accuracy degradation. Key innovations include a threshold-based gating policy, cross-layer prefetching guided by activation similarity, and a dynamic programming cache allocator that adapts to platform-specific demands. Together, these components enable efficient MoE inference on resource-constrained devices and make large sparse models more practical to deploy.

Abstract

Mixture-of-Experts (MoE) models are designed to enhance the efficiency of large language models (LLMs) without proportionally increasing the computational demands. However, their deployment on edge devices still faces significant challenges due to high on-demand loading overheads from managing sparsely activated experts. This paper introduces AdapMoE, an algorithm-system co-design framework for efficient MoE inference. AdapMoE features adaptive expert gating and management to reduce the on-demand loading overheads. We observe that expert loading is heterogeneous across layers and tokens, and based on this observation we propose a sensitivity-based strategy that adjusts the number of activated experts dynamically. We also integrate advanced prefetching and cache management techniques to further reduce the loading latency. Through comprehensive evaluations on various platforms, we demonstrate that AdapMoE consistently outperforms existing techniques, reducing the average number of activated experts by 25% and achieving a 1.35× speedup without accuracy degradation. Code is available at: https://github.com/PKU-SEC-Lab/AdapMoE.
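
To make the idea of threshold-based gating concrete, here is a minimal sketch of one possible rule. It is an illustration under stated assumptions, not the paper's method: the function name `adaptive_top2`, the fixed scalar `threshold`, and the rule "drop the second expert when its renormalized weight is small" are all hypothetical, and the sensitivity-derived, per-layer threshold described above is replaced by a plain constant.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_top2(router_logits, threshold):
    """Return [(expert_index, weight), ...] for a single token.

    Hypothetical rule (an assumption, not taken from the paper):
    start from standard top-2 routing, then skip the second expert
    when its renormalized weight falls below `threshold`, so that
    only one expert has to be loaded for this token.
    """
    probs = softmax(router_logits)
    # Rank experts by router probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    first, second = ranked[0], ranked[1]
    # Renormalize over the two selected experts, as in top-2 routing.
    pair_mass = probs[first] + probs[second]
    w1 = probs[first] / pair_mass
    w2 = probs[second] / pair_mass
    if w2 < threshold:
        # The second expert contributes little: activate only the first.
        return [(first, 1.0)]
    return [(first, w1), (second, w2)]


if __name__ == "__main__":
    # One dominant expert: the second expert is skipped.
    print(adaptive_top2([5.0, 0.0, 0.0, 0.0], threshold=0.1))
    # Two equally scored experts: both are kept.
    print(adaptive_top2([1.0, 1.0, 0.0, 0.0], threshold=0.1))
```

In this sketch a larger `threshold` skips the second expert more often, trading accuracy for fewer expert loads; the abstract indicates that AdapMoE sets this trade-off from a sensitivity estimate rather than a hand-picked constant.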

Paper Structure (21 sections, 15 equations, 9 figures, 2 tables, 1 algorithm)

Figures (9)

  • Figure 1: GPU resource utilization and task execution timeline: (a) hardware specs comparison between NVIDIA A100 and 4090 GPUs; (b) GPU time distribution; (c) task timeline with offloading.
  • Figure 2: Expert weight score distribution: (a) scores of the top-1 expert per layer; (b), (c) examples of weight score distributions.
  • Figure 3: Cosine similarities between the input of each layer's MoE block and the input of the next layer's MoE block.
  • Figure 4: Overview of AdapMoE.
  • Figure 5: Adaptive prefetching workflow. Gate i represents the gating function in layer i. Adaptive gating enables prefetching experts for multiple subsequent layers.
  • ...and 4 more figures