Think Before You Accept: Semantic Reflective Verification for Faster Speculative Decoding

Yixuan Wang, Yijun Liu, Shiyu Ji, Yuzhuang Xu, Yang Xu, Qingfu Zhu, Wanxiang Che

TL;DR

Autoregressive decoding makes large language model inference slow. The authors propose Reflective Verification, a training-free, semantics-aware verification method for speculative decoding: prompt-driven self-reflection produces a reflective distribution over the draft tokens, which is fused with the original verification distribution so that more draft tokens are accepted without harming task performance. The approach is plug-and-play and orthogonal to existing statistical verification methods; combining the two yields an additional 5–15% improvement in decoding speed, with larger gains on bigger models and higher-quality drafts. On MT-Bench, GSM8K, and HumanEval, Reflective Verification provides robust semantic guidance for draft acceptance and generalizes across baselines and draft configurations.

Abstract

Large language models (LLMs) suffer from high inference latency due to the auto-regressive decoding process. Speculative decoding accelerates inference by generating multiple draft tokens using a lightweight model and verifying them in parallel. However, existing verification methods rely heavily on distributional consistency while overlooking semantic correctness, thereby limiting the potential speedup of speculative decoding. While some methods employ additional models for relaxed verification of draft tokens, they often fail to generalize effectively to more diverse or open-domain settings. In this work, we propose Reflective Verification, a training-free and semantics-aware approach that achieves a better trade-off between correctness and efficiency. Specifically, we leverage the inherent reflective capacity of LLMs to semantically assess the correctness of draft tokens in parallel during verification. Using prompt-based probing, we obtain both the original and reflective distributions of draft tokens in a single forward pass. The fusion of these distributions enables semantic-level verification of draft tokens that incorporates both consistency and correctness. Experiments across multiple domain benchmarks and model scales demonstrate that our method significantly increases the acceptance length of draft tokens without compromising model performance. Furthermore, we find that the proposed Reflective Verification is orthogonal to existing statistical verification methods, and their combination yields an additional 5–15% improvement in decoding speed.
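
To make the idea of fusing the original and reflective distributions concrete, here is a minimal toy sketch. It is not the authors' implementation: the abstract does not state the fusion rule, so the linear interpolation with a weight `alpha`, the greedy acceptance criterion, the function names, and the hand-written logits are all assumptions chosen for illustration. The prompt-based probing that would produce the reflective logits in a single forward pass is not modeled.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(p_orig, p_reflect, alpha):
    """ASSUMPTION: fuse by linear interpolation; alpha=0 ignores the reflective view."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(p_orig, p_reflect)]

def accepted_length(draft_tokens, orig_logits, reflect_logits, alpha):
    """ASSUMPTION: greedy verification. Accept draft tokens left to right while
    each one equals the argmax of the fused distribution at its position."""
    n = 0
    for tok, lo, lr in zip(draft_tokens, orig_logits, reflect_logits):
        fused = fuse(softmax(lo), softmax(lr), alpha)
        best = max(range(len(fused)), key=fused.__getitem__)
        if best != tok:
            break  # first mismatch ends the accepted prefix
        n += 1
    return n

# Toy example over a 3-token vocabulary (made-up numbers, not real model outputs).
draft = [0, 1, 2]
orig = [[2.0, 0.0, 0.0], [1.0, 0.5, 0.0], [0.0, 0.0, 2.0]]     # position 1 prefers token 0
reflect = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 2.0]]  # reflection endorses token 1

baseline_len = accepted_length(draft, orig, reflect, alpha=0.0)
fused_len = accepted_length(draft, orig, reflect, alpha=0.5)
print(baseline_len, fused_len)
```

In this toy setup the second draft token is rejected when only the original distribution is used, but accepted once the reflective distribution is given weight, which lengthens the accepted prefix. How the weight trades off task quality against speed in practice is an empirical question that this sketch does not address.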

Paper Structure

This paper contains 32 sections, 2 equations, 7 figures, 4 tables, 3 algorithms.

Figures (7)

  • Figure 1: An illustration of draft tokens rejected by standard speculative decoding. Self-reflection enables the acceptance of semantically correct drafts that would otherwise be rejected.
  • Figure 2: Overall structural diagram of Reflective Verification. Compared to vanilla speculative decoding using only base outputs (yellow), we fuse them with reflective outputs (purple) as the final distribution.
  • Figure 3: Effect of $\alpha$ on task and acceleration performance.
  • Figure 4: Impact of draft quality on Reflective Verification.
  • Figure 5: An illustration of reflective verification on MT-Bench.
  • ...and 2 more figures