Draft, Verify, and Improve: Toward Training-Aware Speculative Decoding

Shrenik Bhansali; Larry Heck

TL;DR

This work tackles autoregressive (AR) decoding latency in LLMs by combining speculative decoding with training-aware online adaptation. The proposed Draft, Verify, & Improve (DVI) framework splits a single backbone into a shallow drafter head and a frozen verifier head, and uses the verifier's accept/reject decisions as supervision to update the drafter online via a KL→RL schedule, without any offline drafter training. On Spec-Bench, DVI delivers about $2\times$ end-to-end speedups while using orders of magnitude less training data than prior methods, and ablations show that the KL-only baseline falls short without the RL component. The approach yields lossless speedups with minimal training overhead and performs well on tasks with strong lexical structure and grounding. Training-aware self-speculation thus offers a practical path to fast inference for large language models under live, shifting traffic.

Abstract

Autoregressive (AR) decoding is a major latency bottleneck for large language models. Speculative decoding (SD) accelerates AR by letting a drafter propose multi-token blocks that a verifier accepts or rejects. However, many SD systems require heavy offline training or extra components. These choices raise data and compute costs and can yield brittle drafters under distribution drift. We introduce \emph{Draft, Verify, \& Improve (DVI)}, a training-aware self-speculative framework that combines inference with continual online learning. We partition an LLM into a drafter and a verifier; during generation, verifier accept/reject decisions are converted into supervision signals and used to update the drafter head. A simple \emph{KL$\rightarrow$RL} schedule bootstraps calibration via online distillation and then adds reward-masked cross-entropy with an on-policy policy-gradient term, preserving lossless, single-model deployment. On Spec-Bench, DVI achieves a $2.16\times$ wall-time speedup, on par with SoTA approaches like EAGLE-2, while using orders of magnitude less training data, and ablations show that DVI outperforms KL-only online distillation. DVI demonstrates that \emph{training-aware} self-speculation can deliver state-of-the-art, lossless speedups with minimal training overhead.
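To make the draft/verify/feedback loop described above concrete, here is a minimal sketch of greedy speculative decoding on toy models. It is an illustration under stated assumptions, not the DVI implementation: the names `speculative_step`, `generate`, `toy_drafter`, and `toy_verifier` are hypothetical, verification is done sequentially rather than in one parallel pass, and the drafter update itself (the KL→RL schedule) is not shown. The sketch only records the accept/reject decisions that such an update could consume.

```python
from typing import Callable, List, Tuple

Token = int
# Hypothetical interface: a greedy next-token function over a token context.
NextTokenFn = Callable[[List[Token]], Token]
Feedback = Tuple[List[Token], Token, bool]  # (context, drafted token, accepted?)


def speculative_step(
    prefix: List[Token],
    drafter: NextTokenFn,
    verifier: NextTokenFn,
    block_size: int,
) -> Tuple[List[Token], List[Feedback]]:
    """Draft one block, verify it, and return (emitted tokens, feedback)."""
    # 1) Draft: the drafter proposes block_size tokens autoregressively.
    drafted: List[Token] = []
    for _ in range(block_size):
        drafted.append(drafter(prefix + drafted))

    # 2) Verify: accept drafted tokens until the first disagreement.
    emitted: List[Token] = []
    feedback: List[Feedback] = []
    for tok in drafted:
        context = prefix + emitted
        target = verifier(context)
        accepted = tok == target
        feedback.append((context, tok, accepted))
        # Emitting the verifier's token keeps the output identical to
        # verifier-only greedy decoding (the "lossless" property).
        emitted.append(target)
        if not accepted:
            break  # everything drafted after a rejection is discarded
    return emitted, feedback


def generate(
    prefix: List[Token],
    drafter: NextTokenFn,
    verifier: NextTokenFn,
    block_size: int,
    max_new_tokens: int,
) -> Tuple[List[Token], List[Feedback]]:
    """Run draft/verify steps until max_new_tokens have been produced."""
    out = list(prefix)
    log: List[Feedback] = []
    while len(out) - len(prefix) < max_new_tokens:
        emitted, feedback = speculative_step(out, drafter, verifier, block_size)
        out.extend(emitted)
        log.extend(feedback)  # supervision signal for a (not shown) drafter update
    return out[: len(prefix) + max_new_tokens], log


def toy_verifier(context: List[Token]) -> Token:
    # Toy "ground truth" model: count upward modulo 10.
    return (context[-1] + 1) % 10


def toy_drafter(context: List[Token]) -> Token:
    # Toy drafter: agrees with the verifier except after token 4.
    return 0 if context[-1] == 4 else (context[-1] + 1) % 10


if __name__ == "__main__":
    tokens, log = generate([0], toy_drafter, toy_verifier, block_size=4, max_new_tokens=8)
    accept_rate = sum(1 for _, _, ok in log if ok) / len(log)
    print(tokens, f"accept rate = {accept_rate:.2f}")
```

In this toy setting the only rejection happens where the drafter and verifier disagree (after token 4), and the final sequence matches what the verifier alone would produce. A training-aware scheme would treat each logged `(context, drafted token, accepted?)` record as a training signal for the drafter rather than discarding it.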
