Pre-Attention Expert Prediction and Prefetching for Mixture-of-Experts Large Language Models

Shien Zhu, Samuel Bohl, Robin Oester, Gustavo Alonso

TL;DR

The paper tackles the expert-loading bottleneck in Mixture-of-Experts LLMs by proposing pre-attention, same-layer expert prediction. It shows that the activations entering the attention block of a layer can predict that layer's top-$k$ routing decisions using lightweight two-layer linear predictors trained with a ranking-aware loss. Because the prediction uses only same-layer inputs, it also covers the first layer and can run in parallel with self-attention. Across three MoE models, the approach achieves up to $97.62\%$ exact-match accuracy and substantial I/O savings, outperforming cross-layer and hybrid baselines by roughly $15$–$19$ percentage points in exact-match accuracy, with benefits reported for both cloud and edge deployments. The work also provides practical deployment guidance, reports significant latency reductions, and suggests extending prefetching to larger MoE architectures and dynamic provisioning.

Abstract

Mixture-of-Experts (MoE) Large Language Models (LLMs) scale up model size efficiently while keeping inference cost relatively low. Because MoE models activate only a subset of the experts, related work has proposed expert prediction and caching methods that prefetch experts for faster inference. However, existing approaches predict from the activations of the previous layer, which yields low accuracy and leaves the first layer unoptimized. Applying complex layers, or even training standalone networks, for better prediction introduces high computation overhead. In this paper, we propose pre-attention expert prediction to achieve accurate and lightweight expert prefetching. The key insight is that some functions in LLMs are ranking-preserving, which indicates that the ranking of the selected experts can be matched using simple linear functions. We therefore use the activations before the attention block in the same layer, together with 2 linear functions and a ranking-aware loss, to achieve accurate prediction; this also supports prefetching in the first layer. Our lightweight pre-attention expert routers achieve 93.03% accuracy on DeepSeek V2 Lite, 94.69% on Qwen3-30B, and 97.62% on Phi-mini-MoE, an improvement of about 15% in absolute accuracy over state-of-the-art methods.
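
The idea in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The function names (`linear`, `top_k`, `predict_experts`, `exact_match`), the toy weights, the absence of a nonlinearity between the two linear maps, and the set-based definition of an exact match are all assumptions made for illustration. The sketch shows two things: a strictly increasing function such as softmax leaves the top-$k$ selection unchanged (the "ranking-preserving" property), and a predictor made of two stacked linear maps can output a top-$k$ expert set from a pre-attention activation.

```python
import math

def linear(x, weight, bias):
    """Compute weight @ x + bias for one input vector (plain Python lists)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def softmax(scores):
    """Numerically stable softmax; strictly increasing in each score."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(scores, k):
    """Indices of the k largest scores, highest first (ties: lower index)."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]

def predict_experts(pre_attn, w1, b1, w2, b2, k):
    """Hypothetical predictor: two stacked linear maps, then top-k.

    Assumption: no nonlinearity between the two maps; the abstract
    only states that 2 linear functions are used.
    """
    hidden = linear(pre_attn, w1, b1)
    scores = linear(hidden, w2, b2)
    return top_k(scores, k)

def exact_match(predicted, actual):
    """Assumed metric: the predicted expert set equals the router's set."""
    return set(predicted) == set(actual)

# Ranking preservation: softmax does not change which experts are top-k.
logits = [2.0, -1.0, 0.5, 3.0]
same_selection = top_k(logits, 2) == top_k(softmax(logits), 2)

# Toy predictor: 2-dim activation, 3 experts, top-2 selection.
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]
predicted = predict_experts([1.0, 2.0], w1, b1, w2, b2, k=2)
```

In a real deployment the predicted indices would drive the prefetch of expert weights while self-attention runs; that scheduling step is outside this sketch.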

Paper Structure

This paper contains 32 sections, 5 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Token generation pipeline in typical MoE architectures (profiled on DeepSeek-V2-Lite on an NVIDIA V100 GPU).
  • Figure 2: (a) Aggregated expert invocation heatmap and (b) Distribution of expert activation frequencies on DeepSeek-V2-Lite across the first 9 layers when generating 300 tokens.
  • Figure 3: Example MoE layers with and without expert prediction.
  • Figure 4: Expert Selector Architecture Comparison.
  • Figure 5: Sample of sorted true affinity scores.
  • ...and 3 more figures