Speculating Experts Accelerates Inference for Mixture-of-Experts

Vivan Madan, Prajwal Singhania, Abhinav Bhatele, Tom Goldstein, Ashwinee Panda

Abstract

Mixture-of-Experts (MoE) models have gained popularity as a means of scaling the capacity of large language models (LLMs) while maintaining sparse activations and reduced per-token compute. However, in memory-constrained inference settings, expert weights must be offloaded to CPU, creating a performance bottleneck from CPU-GPU transfers during decoding. We propose an expert prefetching scheme that leverages currently computed internal model representations to speculate future experts, enabling memory transfers to overlap with computation. Across multiple MoE architectures, we demonstrate that future experts can be reliably predicted by these internal representations. We also demonstrate that executing speculated experts generally maintains downstream task accuracy, thus preserving more effective compute-memory overlap by eliminating the need to re-fetch true router-selected experts. Integrated into an optimized inference engine, our approach achieves up to a 14% reduction in time per output token (TPOT) over on-demand loading of experts from CPU memory. For MoEs where speculative execution alone yields suboptimal accuracy, we further examine lightweight estimators that improve expert prediction hit rates, thereby reducing performance degradation. Our code is open source and available at https://github.com/axonn-ai/yalis/tree/offload_prefetch.

Paper Structure

This paper contains 22 sections, 5 equations, 11 figures, 3 tables, and 1 algorithm.

Figures (11)

  • Figure 1: Expert prefetching in a pre-norm MoE block. The normalized residual stream $s_l$ and default vector $d_l$ at layer $l$ form the quasi-hidden state $q_l$, which is used to predict the next layer's experts, enabling CPU-GPU memory transfer to overlap with computation.
  • Figure 2: Nsight Systems trace for inference with Qwen-30B-A3B. Compared to on-demand loading of active experts (top), our expert prefetching approach (bottom) effectively overlaps CPU-GPU memory transfers with GPU computation, reducing transfer overhead on the critical path.
  • Figure 3: Comparison of cosine similarity between the quasi-hidden state $q_l$ constructed at layer $l$ and the true router input $s_{l+1}$.
  • Figure 4: Per-layer expert prefetch hit rates (recall@k) obtained using the quasi-hidden state $q_l$ versus the baseline $s_l$ across models.
  • Figure 5: Per-layer expert rank alignment between speculative and ground-truth routing. For each layer, we report the proportion for which the expert at a given rank (determined by its routing weight) matches between prefetched and ground-truth routing.
  • ...and 6 more figures
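
The two quantities reported in Figures 3 and 4, cosine similarity and prefetch hit rate (recall@k), can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names (`topk_experts`, `recall_at_k`, `cosine_similarity`) are hypothetical, the router is assumed to be a single linear map, and the arrays standing in for $q_l$ and $s_{l+1}$ are synthetic random data rather than real model activations.

```python
import numpy as np


def topk_experts(logits: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k largest router logits per token (unordered within the k)."""
    return np.argpartition(-logits, k - 1, axis=-1)[..., :k]


def recall_at_k(pred_logits: np.ndarray, true_logits: np.ndarray, k: int) -> float:
    """Fraction of the true top-k experts that also appear in the predicted top-k."""
    pred = topk_experts(pred_logits, k)
    true = topk_experts(true_logits, k)
    hits = [len(set(p.tolist()) & set(t.tolist())) for p, t in zip(pred, true)]
    return float(np.mean(hits)) / k


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two (tokens, hidden) arrays."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return num / den


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens, hidden, n_experts, k = 64, 32, 16, 4

    # Synthetic stand-ins (assumptions): the true router input s_{l+1} and a
    # noisy approximation of it playing the role of the quasi-hidden state q_l.
    s_next = rng.normal(size=(tokens, hidden))
    q = s_next + 0.3 * rng.normal(size=(tokens, hidden))

    # Assumed linear router for layer l+1: logits = x @ W.
    W = rng.normal(size=(hidden, n_experts))

    print("mean cosine similarity:", cosine_similarity(q, s_next).mean())
    print("recall@k:", recall_at_k(q @ W, s_next @ W, k))
```

In this toy setup, a higher recall@k means fewer speculated experts would need to be re-fetched after the true router runs, which is the property that lets transfers stay off the critical path.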