Variational Speculative Decoding: Rethinking Draft Training from Token Likelihood to Sequence Acceptance

Xiandong Zou; Jianshu Li; Jing Huang; Pan Zhou

TL;DR

This work tackles the mismatch between training and decoding in speculative decoding for large language and multimodal models by reframing draft path generation as a latent-variable problem. The authors introduce Variational Speculative Decoding (VSD), which optimizes an ELBO over latent draft paths and uses an EM–MCMC procedure with path-level utility, Adaptive Rejection Weighting, and Confidence-Aware Regularization to align the draft distribution with the target-model acceptance behavior. Theoretical results show that maximizing the VSD objective increases the expected accepted length and thus decoding speed, and experiments across LLMs and MLLMs demonstrate consistent, lossless improvements in acceptance length and wall-clock speedups (e.g., up to ~9.6% over EAGLE-3 and ~7.9% over ViSpec). Overall, VSD provides a principled, scalable way to bridge training and decoding in speculative decoding, yielding practical speedups without compromising output quality.
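To make "accepted length" and the training-decoding mismatch concrete, here is a minimal Python sketch. It assumes greedy (temperature-0) verification, where a draft path is accepted up to the first token that differs from the target model's own greedy continuation. The function names, token ids, and the two example paths are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def accepted_length(draft_path: Sequence[int], target_tokens: Sequence[int]) -> int:
    """Length of the longest prefix of draft_path that matches target_tokens."""
    n = 0
    for d, t in zip(draft_path, target_tokens):
        if d != t:
            break
        n += 1
    return n


def best_path(draft_paths: List[Sequence[int]],
              target_tokens: Sequence[int]) -> Tuple[int, int]:
    """Return (index, accepted length) of the path with the longest accepted prefix."""
    lengths = [accepted_length(p, target_tokens) for p in draft_paths]
    idx = max(range(len(lengths)), key=lengths.__getitem__)
    return idx, lengths[idx]


# Toy example: token ids are arbitrary integers chosen for illustration.
target = [5, 8, 2, 9]          # target model's greedy continuation (assumed)
paths = [
    [5, 3, 2, 9],              # hypothetical greedy draft path, diverges at position 1
    [5, 8, 2, 1],              # hypothetical sampled alternative, matches 3 tokens
]

greedy_len = accepted_length(paths[0], target)
idx, best_len = best_path(paths, target)
print(greedy_len, idx, best_len)  # prints: 1 1 3
```

In this toy case the greedy draft path is accepted for only one token, while a non-greedy path in the same draft set is accepted for three. A training signal computed on the greedy path alone would not reflect the path that decoding actually accepts.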

Abstract

Speculative decoding accelerates inference for (M)LLMs, yet a training-decoding discrepancy persists: while existing methods optimize single greedy trajectories, decoding involves verifying and ranking multiple sampled draft paths. We propose Variational Speculative Decoding (VSD), formulating draft training as variational inference over latent proposals (draft paths). VSD maximizes the marginal probability of target-model acceptance, yielding an ELBO that promotes high-quality latent proposals while minimizing divergence from the target distribution. To enhance quality and reduce variance, we incorporate a path-level utility and optimize via an Expectation-Maximization procedure. The E-step draws MCMC samples from an oracle-filtered posterior, while the M-step maximizes weighted likelihood using Adaptive Rejection Weighting (ARW) and Confidence-Aware Regularization (CAR). Theoretical analysis confirms that VSD increases expected acceptance length and speedup. Extensive experiments across LLMs and MLLMs show that VSD achieves up to a 9.6% speedup over EAGLE-3 and 7.9% over ViSpec, significantly improving decoding efficiency.
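The latent-variable view in the abstract can be illustrated with the standard evidence lower bound. The derivation below is a generic sketch, not the paper's exact objective: the symbols ($x$ for the context, $z$ for a draft path, $A=1$ for acceptance by the target model, $q_\theta$ for the draft model, $r$ for a variational distribution over paths) are assumed here for illustration, and the paper's ELBO, which the abstract describes in terms of divergence from the target distribution, may be arranged differently.

```latex
% Assumed notation: x = context, z = draft path, A = 1 means the target model accepts z,
% q_\theta = draft model, r = any distribution over paths with p(A=1 | z, x) > 0 on its support.
\begin{align}
\log p_\theta(A{=}1 \mid x)
  &= \log \sum_{z} q_\theta(z \mid x)\, p(A{=}1 \mid z, x) \\
  &= \log \sum_{z} r(z)\, \frac{q_\theta(z \mid x)\, p(A{=}1 \mid z, x)}{r(z)} \\
  &\ge \mathbb{E}_{z \sim r}\big[\log p(A{=}1 \mid z, x)\big]
     - \mathrm{KL}\big(r(z) \,\|\, q_\theta(z \mid x)\big).
\end{align}
```

The inequality is Jensen's inequality applied to the concave logarithm, and it is tight when $r(z) \propto q_\theta(z \mid x)\, p(A{=}1 \mid z, x)$. In a generic EM reading, the E-step approximates this posterior (the abstract states that VSD draws MCMC samples from an oracle-filtered posterior), and the M-step updates $\theta$ by maximizing a weighted likelihood of the sampled paths.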

Paper Structure (25 sections, 4 theorems, 38 equations, 1 figure, 5 tables, 2 algorithms)

Introduction
Related Work
Preliminary
Methodology
Variational Speculative Decoding
Expectation step
Maximization Step
Theoretic Analysis
Experiment
Main Results
Ablation Study
Conclusion
Algorithm
Algorithm Detail
Algorithm Rationale
...and 10 more sections

Key Result

Theorem 1

By optimizing the VSD objective in Eqn. eq:vsd_eq5, the expected accepted length satisfies

Figures (1)

Figure 1: (a) Fraction of cases in which the accepted path coincides with the greedy path, and fraction of training-time greedy paths that are pruned during draft tree construction. (b) Comparison of acceptance length between the greedy path and the optimal high-confidence path.

Theorems & Definitions (7)

Theorem 1
Theorem 2
proof
Theorem
proof
Theorem: Guaranteed lower bound improvement over KL-only
proof
