MoE-Spec: Expert Budgeting for Efficient Speculative Decoding
Bradley McDanel, Steven Li, Sruthikesh Surineni, Harshit Khaitan
TL;DR
MoE-Spec tackles the memory-bandwidth bottleneck of speculative decoding in Mixture-of-Experts (MoE) models by enforcing a fixed expert budget $B$ per layer during verification. It ranks experts via router-based aggregation, loads only the top-$B$, and handles experts that fall outside the budget by truncation or substitution, all without additional training. Across three MoE architectures and five benchmarks, MoE-Spec delivers 10–30% higher throughput than EAGLE-3 at comparable quality and reveals a Pareto frontier along which accuracy can be traded for latency. The approach exploits the heavy-tailed routing distribution to decouple verification cost from draft-tree size, offering a practical, training-free enhancement to speculative decoding for MoE models.
Abstract
Speculative decoding accelerates Large Language Model (LLM) inference by verifying multiple drafted tokens in parallel. However, for Mixture-of-Experts (MoE) models, this parallelism introduces a severe bottleneck: large draft trees activate many unique experts, significantly increasing memory pressure and diminishing speedups from speculative decoding relative to autoregressive decoding. Prior methods reduce speculation depth when MoE verification becomes expensive. We propose MoE-Spec, a training-free verification-time expert budgeting method that decouples speculation depth from memory cost by enforcing a fixed expert capacity limit at each layer, loading only the experts that contribute most to verification and dropping the long tail of rarely used experts that drive bandwidth overhead. Experiments across multiple model scales and datasets show that this method yields 10–30% higher throughput than state-of-the-art speculative decoding baselines (EAGLE-3) at comparable quality, with flexibility to trade accuracy for further latency reductions through tighter budgets.
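To make the budgeting step concrete, the following is a minimal sketch of per-layer expert budgeting during verification, not the authors' implementation. It rests on several assumptions that the text above does not specify: experts are scored by summing each draft token's router probability over its usual top-k selection, truncation renormalizes the surviving weights, and substitution fills each vacated slot with the token's highest-probability in-budget expert. The function names `select_budgeted_experts`, `truncate`, and `substitute` are hypothetical.

```python
from collections import defaultdict


def select_budgeted_experts(router_probs, top_k, budget):
    """Pick a per-layer expert shortlist for one verification pass.

    router_probs: one list of router probabilities per draft token.
    top_k: number of experts each token would normally activate.
    budget: maximum number of unique experts to load (B).
    """
    # Each token's usual top-k experts, stored as {expert_id: probability}.
    per_token = []
    for probs in router_probs:
        ranked = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)
        per_token.append({e: probs[e] for e in ranked[:top_k]})

    # Assumed scoring rule: total router mass an expert receives across tokens.
    score = defaultdict(float)
    for chosen in per_token:
        for e, p in chosen.items():
            score[e] += p

    # Keep the B highest-scoring experts; ties break toward the lower id.
    shortlist = sorted(score, key=lambda e: (-score[e], e))[:budget]
    return per_token, set(shortlist)


def _normalize(weights):
    """Rescale weights to sum to 1; return an empty dict if nothing is left."""
    total = sum(weights.values())
    return {e: p / total for e, p in weights.items()} if total > 0 else {}


def truncate(chosen, shortlist):
    """Assumed truncation: drop out-of-budget experts, renormalize the rest."""
    return _normalize({e: p for e, p in chosen.items() if e in shortlist})


def substitute(probs, chosen, shortlist):
    """Assumed substitution: refill vacated slots from the in-budget shortlist."""
    kept = {e: p for e, p in chosen.items() if e in shortlist}
    missing = len(chosen) - len(kept)
    # In-budget experts this token did not already select, best first.
    spare = sorted((e for e in shortlist if e not in kept),
                   key=lambda e: (-probs[e], e))
    for e in spare[:missing]:
        kept[e] = probs[e]
    return _normalize(kept)


# Toy example: 3 draft tokens, 4 experts, top-2 routing, budget B = 2.
router_probs = [
    [0.50, 0.30, 0.10, 0.10],
    [0.60, 0.10, 0.25, 0.05],
    [0.40, 0.35, 0.05, 0.20],
]
per_token, shortlist = select_budgeted_experts(router_probs, top_k=2, budget=2)
truncated = truncate(per_token[1], shortlist)
substituted = substitute(router_probs[1], per_token[1], shortlist)
```

In this toy case the three tokens jointly request experts 0, 1, and 2, but only experts 0 and 1 fit the budget. The second token loses expert 2: truncation leaves it with expert 0 alone, while substitution gives it expert 1 in place of expert 2. In both cases the number of experts loaded stays at $B$ however many draft tokens are verified.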
