Draft, Verify, and Improve: Toward Training-Aware Speculative Decoding

Shrenik Bhansali, Larry Heck

TL;DR

This work tackles autoregressive (AR) decoding latency in LLMs by combining speculative decoding with training-aware online adaptation. The proposed Draft, Verify, & Improve (DVI) framework splits a single backbone into a shallow drafter head and a frozen verifier head, and uses the verifier's accept/reject decisions as supervision to update the drafter online via a KL→RL schedule, with no offline drafter training. On Spec-Bench, DVI delivers about $2\times$ end-to-end speedups while using orders of magnitude less training data than prior methods, and ablations show that KL-only online distillation falls short of the full KL→RL objective. The approach yields lossless speedups with minimal training overhead and performs well on tasks with strong lexical structure and grounding. Training-aware self-speculation thus offers a practical path to fast inference for large language models under dynamic, live traffic.

Abstract

Autoregressive (AR) decoding is a major latency bottleneck for large language models. Speculative decoding (SD) accelerates AR by letting a drafter propose multi-token blocks that a verifier accepts or rejects. However, many SD systems require heavy offline training or extra components. These choices raise data/compute cost and can yield brittle drafters under distribution drift. We introduce \emph{Draft, Verify, \& Improve (DVI)}, a training-aware self-speculative framework that combines inference with continual online learning. We partition an LLM into a drafter and a verifier, and during generation, verifier accept/reject decisions are converted into supervision signals and used to update the drafter head. A simple \emph{KL$\rightarrow$RL} schedule bootstraps calibration via online distillation and then adds reward-masked cross-entropy with an on-policy policy-gradient term, preserving lossless, single-model deployment. On Spec-Bench, DVI achieves a $2.16\times$ wall-time speedup, on par with SoTA approaches like EAGLE-2, while using orders of magnitude less training data, and ablations show that DVI outperforms KL-only online distillation. DVI demonstrates that \emph{training-aware} self-speculation can deliver state-of-the-art, lossless speedups with minimal training overhead.

Paper Structure

This paper contains 27 sections, 14 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Left: Multi-token speculation, where the drafter proposes a block of tokens and the verifier accepts the longest agreeing prefix before emitting the first mismatch. We log one tuple per drafted position up to and including the first reject, $(h_k, a, \text{logits}^{\phi}, r, \text{prev\_id})$, with $r{=}1$ for accepted tokens and $r{=}0$ for the first reject. This converts verifier feedback into continual self-supervision. Right: DVI architecture, where the backbone is split at layer $k$, with shallow drafting layers (purple) feeding the LoRA draft head $p_\theta(\cdot\mid h_k)$ and deep verification layers (blue) feeding the frozen verifier head $p_\phi(\cdot\mid h_L)$. The logged tuples from the rollout buffer drive updates to the draft head, while the verifier and backbone remain fixed. This closes the loop between online speculation and training, ensuring adaptation without additional models or offline data.
  • Figure 2: Objective ablations: Batch acceptance rate vs. training steps. Curves computed on the same data stream, split, and $k_{\text{spec}}$ as the main setup.
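The accept/reject logging described in the Figure 1 caption can be illustrated with a small sketch. It is not the authors' implementation: it assumes greedy token matching between drafter and verifier, assumes the verifier's own token is emitted at the first mismatch, and omits the hidden state $h_k$ and the verifier logits from the logged tuple for brevity. The names `LoggedStep` and `verify_block` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class LoggedStep:
    """Simplified stand-in for the caption's tuple (h_k, a, logits^phi, r, prev_id).

    The hidden state and verifier logits are left out of this sketch.
    """
    position: int  # index inside the drafted block
    action: int    # drafted token id (a)
    reward: int    # r = 1 if accepted, r = 0 for the first reject
    prev_id: int   # token id that preceded this drafted position


def verify_block(
    drafted: Sequence[int],
    verifier_tokens: Sequence[int],
    last_token: int,
) -> Tuple[List[int], List[LoggedStep]]:
    """Accept the longest agreeing prefix and log one step per drafted position,
    up to and including the first reject.

    Assumption: verifier_tokens[i] is the verifier's greedy choice at position i
    given the accepted prefix, so it is only trusted up to the first mismatch.
    """
    emitted: List[int] = []
    log: List[LoggedStep] = []
    prev = last_token
    for i, (a, v) in enumerate(zip(drafted, verifier_tokens)):
        accepted = a == v
        log.append(LoggedStep(position=i, action=a, reward=int(accepted), prev_id=prev))
        emitted.append(v)  # equals a on a match; the verifier's correction otherwise
        if not accepted:
            break  # positions after the first reject are neither emitted nor logged
        prev = a
    return emitted, log


if __name__ == "__main__":
    out, steps = verify_block([5, 7, 9, 11], [5, 7, 8, 11], last_token=3)
    print(out)
    print([s.reward for s in steps])
```

With the drafted block `[5, 7, 9, 11]` and verifier choices `[5, 7, 8, 11]`, the first two tokens are accepted, the third is replaced by the verifier's token, and the fourth is discarded, so three steps are logged with rewards `[1, 1, 0]`. In DVI these logged steps would be appended to the rollout buffer that drives updates to the draft head.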