MoE-Spec: Expert Budgeting for Efficient Speculative Decoding

Bradley McDanel, Steven Li, Sruthikesh Surineni, Harshit Khaitan

TL;DR

MoE-Spec tackles the memory-bandwidth bottleneck of speculative decoding in Mixture-of-Experts models by enforcing a fixed expert budget $B$ per layer during verification. It ranks and loads only the top-$B$ experts, using router-based aggregation for the shortlist and applying truncation or substitution to handle missing experts, all without additional training. Across three MoE architectures and five benchmarks, MoE-Spec delivers 10–30% higher throughput than EAGLE-3 at comparable quality and reveals a Pareto frontier that allows tuning accuracy versus latency. The approach leverages the heavy-tailed routing distribution to decouple verification cost from draft-tree size, offering a practical, training-free enhancement to speculative decoding in MoE models.

Abstract

Speculative decoding accelerates Large Language Model (LLM) inference by verifying multiple drafted tokens in parallel. However, for Mixture-of-Experts (MoE) models, this parallelism introduces a severe bottleneck: large draft trees activate many unique experts, significantly increasing memory pressure and diminishing speedups from speculative decoding relative to autoregressive decoding. Prior methods reduce speculation depth when MoE verification becomes expensive. We propose MoE-Spec, a training-free verification-time expert budgeting method that decouples speculation depth from memory cost by enforcing a fixed expert capacity limit at each layer, loading only the experts that contribute most to verification and dropping the long tail of rarely used experts that drive bandwidth overhead. Experiments across multiple model scales and datasets show that this method yields 10–30% higher throughput than state-of-the-art speculative decoding baselines (EAGLE-3) at comparable quality, with flexibility to trade accuracy for further latency reductions through tighter budgets.
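The budgeting step described above can be illustrated with a minimal sketch for a single layer. It is not the authors' implementation: the function names, the choice to aggregate only each token's natural top-$k$ routing probabilities, and the renormalization of surviving weights are all assumptions made for illustration, and the routing probabilities are toy values.

```python
def natural_topk(probs, k):
    """Indices of the k highest-probability experts for one token."""
    return sorted(range(len(probs)), key=lambda e: -probs[e])[:k]

def shortlist_experts(router_probs, k, budget):
    """Rank experts by aggregate routing probability and keep the top `budget`."""
    num_experts = len(router_probs[0])
    scores = [0.0] * num_experts
    for probs in router_probs:             # one row per draft-tree token
        for e in natural_topk(probs, k):   # assumption: only natural top-k entries count
            scores[e] += probs[e]
    ranked = sorted(range(num_experts), key=lambda e: -scores[e])
    return set(ranked[:budget])

def truncate(probs, k, shortlist):
    """Truncation (assumed form): drop missing experts, renormalize the rest."""
    kept = [e for e in natural_topk(probs, k) if e in shortlist]
    total = sum(probs[e] for e in kept)
    return {e: probs[e] / total for e in kept} if kept else {}

def substitute(probs, k, shortlist):
    """Substitution (assumed form): route to the best k experts inside the shortlist."""
    kept = sorted(shortlist, key=lambda e: -probs[e])[:k]
    total = sum(probs[e] for e in kept)
    return {e: probs[e] / total for e in kept}

# Toy example: 3 draft tokens, 4 experts, top-2 routing, budget B = 2.
router_probs = [
    [0.50, 0.30, 0.15, 0.05],
    [0.45, 0.10, 0.40, 0.05],
    [0.05, 0.35, 0.10, 0.50],
]
shortlist = shortlist_experts(router_probs, k=2, budget=2)
for probs in router_probs:
    print(truncate(probs, 2, shortlist), substitute(probs, 2, shortlist))
```

In this toy case the shortlist is experts 0 and 1. The second token naturally routes to experts 0 and 2, so truncation leaves it with expert 0 only, while substitution swaps in expert 1 for the missing expert 2. Either way, at most $B$ experts are loaded for the layer regardless of how many tokens the draft tree contains.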

Paper Structure

This paper contains 43 sections, 9 equations, 10 figures, 4 tables, 1 algorithm.

Figures (10)

  • Figure 1: Overview of MoE-Spec. Standard MoE speculative decoding loads all $N$ experts ($N = 8$) activated across the draft tree (top path). MoE-Spec ranks experts by aggregate routing probability and enforces a budget $B$ (here $B=4$), loading only top-scoring experts (bottom path). Tokens are color-coded by the number of missing experts from their natural routing. In this example, both paths accept the same three tokens despite MoE-Spec loading half as many experts.
  • Figure 2: Motivation for expert budgeting. (a) During verification, each token routes to $k$ experts; the target model must load all unique experts $\mathcal{E}$ across the draft tree. (b) As tree size $M$ grows, the number of unique experts per layer approaches $N$, negating sparse activation benefits. (c) Routing probabilities are heavy-tailed: the top 32 of 64 experts capture 93% of routing weight for a tree size of 63.
  • Figure 3: Quality-speedup tradeoff at $T=1$, averaged across five benchmarks. For each benchmark, quality and speedup are normalized relative to AR (100%), then averaged. Error bars combine cross-benchmark variance with seed-to-seed variance from 5 runs. Individual per-benchmark curves appear in the appendix.
  • Figure 4: Expert activation and speedup on OLMoE-1B-7B. (a) Unique experts activated during verification as tree size increases. EAGLE-3 loads over 50 of 64 experts at large trees; MoE-Spec with $B=32$ saturates at the budget. (b) Speedup relative to AR. EAGLE-3 peaks at tree size 31 then declines as expert loading dominates; MoE-Spec continues improving at larger trees. (c) Speedup vs. active experts. At tree size 255, EAGLE-3 loads 54 experts for 1.85$\times$ speedup; MoE-Spec loads 32 experts for 2.1$\times$ speedup.
  • Figure 5: Design ablations on OLMoE-1B-7B at $T=1$, normalized to autoregressive (AR) baseline. (a) Selection methods: Static ranking fails at low budgets while Router-based ranking tracks the Oracle upper bound. (b) Coverage policies: Truncation and Substitution perform comparably across budgets.
  • ...and 5 more figures
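
The two quantities behind Figures 2 and 4, the number of unique experts a draft tree activates and the share of routing weight held by the top-$B$ experts, can be measured with a short sketch like the one below. The router here is a synthetic, deliberately skewed stand-in rather than a real model, and the helper names, the value $k = 8$, and the tree sizes are illustrative assumptions, so the printed numbers are not the paper's results.

```python
import random

def unique_experts(router_probs, k):
    """Number of distinct experts hit by any token's top-k routing."""
    active = set()
    for probs in router_probs:
        ranked = sorted(range(len(probs)), key=lambda e: -probs[e])
        active.update(ranked[:k])
    return len(active)

def captured_mass(router_probs, budget):
    """Fraction of total routing probability held by the top-`budget` experts."""
    num_experts = len(router_probs[0])
    totals = [sum(p[e] for p in router_probs) for e in range(num_experts)]
    top = sorted(totals, reverse=True)[:budget]
    return sum(top) / sum(totals)

def synthetic_router(num_tokens, num_experts, rng):
    """Hypothetical stand-in for a real router: skewed, normalized random rows."""
    rows = []
    for _ in range(num_tokens):
        raw = [rng.expovariate(1.0) ** 3 for _ in range(num_experts)]
        total = sum(raw)
        rows.append([r / total for r in raw])
    return rows

rng = random.Random(0)
for tree_size in (1, 7, 31, 63):
    probs = synthetic_router(tree_size, 64, rng)
    # Columns: tree size M, unique experts activated, routing mass in top 32.
    print(tree_size, unique_experts(probs, k=8), round(captured_mass(probs, 32), 3))
```

Replacing `synthetic_router` with routing probabilities logged from an actual MoE layer would give the kind of curves the captions describe: unique experts rising toward $N$ as $M$ grows, while a fixed budget still covers most of the routing weight.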