Variational Speculative Decoding: Rethinking Draft Training from Token Likelihood to Sequence Acceptance
Xiandong Zou, Jianshu Li, Jing Huang, Pan Zhou
TL;DR
This work tackles the mismatch between training and decoding in speculative decoding for large language and multimodal models by reframing draft path generation as a latent-variable problem. The authors introduce Variational Speculative Decoding (VSD), which optimizes an ELBO over latent draft paths and uses an EM–MCMC procedure with path-level utility, Adaptive Rejection Weighting, and Confidence-Aware Regularization to align the draft distribution with the target model's acceptance behavior. Theoretical results show that maximizing the VSD objective increases the expected accepted length and thus decoding speed, and experiments across LLMs and MLLMs demonstrate consistent, lossless improvements in acceptance length and wall-clock speed (up to 9.6% over EAGLE-3 and 7.9% over ViSpec). Overall, VSD provides a principled, scalable way to bridge training and decoding in speculative decoding, yielding practical speedups without compromising output quality.
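To make the link between acceptance and accepted length concrete, here is a minimal Python sketch of a standard simplified model of speculative decoding. It is not the paper's theoretical analysis. It assumes a single draft chain of `gamma` tokens (not a draft tree) and an independent per-token acceptance probability `alpha`; both names, and both function names, are hypothetical.

```python
def expected_accepted_length(alpha: float, gamma: int) -> float:
    """Expected number of accepted draft tokens in one verification step.

    Assumes (for illustration only) a single chain of `gamma` draft tokens,
    each accepted independently with probability `alpha`.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    # The k-th draft token counts only if all k leading tokens are accepted,
    # which happens with probability alpha**k under the i.i.d. assumption.
    return sum(alpha ** k for k in range(1, gamma + 1))


def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    The target model always contributes one extra token (a correction after
    a rejection, or a bonus token when every draft token is accepted).
    """
    return 1.0 + expected_accepted_length(alpha, gamma)
```

Under these assumptions, with `gamma = 5`, raising `alpha` from 0.7 to 0.8 lifts the expected accepted length from about 1.94 to about 2.69 draft tokens per step, which illustrates why a draft model trained for acceptance rather than token likelihood can decode faster.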
Abstract
Speculative decoding accelerates inference for large language models (LLMs) and multimodal LLMs (MLLMs), yet a training-decoding discrepancy persists: while existing methods optimize single greedy trajectories, decoding involves verifying and ranking multiple sampled draft paths. We propose Variational Speculative Decoding (VSD), formulating draft training as variational inference over latent proposals (draft paths). VSD maximizes the marginal probability of target-model acceptance, yielding an ELBO that promotes high-quality latent proposals while minimizing divergence from the target distribution. To enhance quality and reduce variance, we incorporate a path-level utility and optimize via an Expectation-Maximization procedure. The E-step draws MCMC samples from an oracle-filtered posterior, while the M-step maximizes weighted likelihood using Adaptive Rejection Weighting (ARW) and Confidence-Aware Regularization (CAR). Theoretical analysis shows that VSD increases expected acceptance length and speedup. Extensive experiments across LLMs and MLLMs show that VSD achieves up to a 9.6% speedup over EAGLE-3 and 7.9% over ViSpec.
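The ELBO described above can be sketched in generic latent-variable form. The notation is an assumption made for illustration and is not taken from the paper: $x$ is the decoding prefix, $z$ a latent draft path, $A = 1$ the event that the target model accepts the path, $p_\theta(z \mid x)$ the draft model, and $q(z)$ a variational distribution over paths.

```latex
\begin{align}
\log p_\theta(A = 1 \mid x)
  &= \log \sum_{z} p_\theta(z \mid x)\, p(A = 1 \mid z, x) \\
  &\ge \mathbb{E}_{q(z)}\!\left[\log p(A = 1 \mid z, x)\right]
     - \mathrm{KL}\!\left(q(z) \,\|\, p_\theta(z \mid x)\right)
\end{align}
```

In this generic form the bound follows from Jensen's inequality and is tight when $q(z) \propto p_\theta(z \mid x)\, p(A = 1 \mid z, x)$, which is the kind of acceptance-filtered posterior an E-step would sample from. With $q$ fixed, the only term that depends on $\theta$ is $\mathbb{E}_{q(z)}[\log p_\theta(z \mid x)]$, so the M-step reduces to a weighted likelihood update. The paper's actual objective also involves a path-level utility, ARW, and CAR, and its divergence term may be defined differently; this sketch omits those details.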
