SpecPV: Improving Self-Speculative Decoding for Long-Context Generation via Partial Verification

Zhendong Tan, Xingjun Zhang, Chaoyi Hu, Junjie Peng, Kun Xia

TL;DR

SpecPV targets the verification bottleneck in long-context speculative decoding. It verifies most draft tokens against a small on-device partial KV cache and periodically runs full verification to correct accumulated drift. Drafting is self-speculative: the draft model reuses target-model features, so it adds little forward-pass cost. On LLaMA-3.1-8B-Instruct and the Qwen3 series, it achieves up to 6x decoding speedup over standard autoregressive decoding, with minor degradation in QA and summarization quality on contexts of up to 60–64K tokens. The method integrates with EAGLE-3 via YaRN adaptations and supports memory-constrained setups through selective KV-cache offloading.

Abstract

Growing demands from tasks like code generation, deep reasoning, and long-document understanding have made long-context generation a crucial capability for large language models (LLMs). Speculative decoding is one of the most direct and effective approaches for accelerating generation. It follows a draft-verify paradigm, where a lightweight draft model proposes several candidate tokens and the target model verifies them. However, we find that as the context length grows, verification becomes the dominant bottleneck. To further accelerate speculative decoding in long-context generation, we introduce SpecPV, a self-speculative decoding approach that performs fast verification using partial key-value (KV) states and periodically applies full verification to eliminate accumulated errors. We validate SpecPV across multiple long-context benchmarks and models, including LLaMA-3.1-8B-Instruct and the Qwen3 series. Experimental results show that SpecPV achieves up to 6x decoding speedup over standard autoregressive decoding with minor degradation.
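
To make the draft-verify loop and the switch between verification modes concrete, here is a minimal toy sketch. It is not the authors' implementation. Everything in it is an assumption for illustration: the names `verify_greedy`, `choose_mode`, and `decode` are hypothetical, the "models" are plain Python callables, verification uses greedy exact matching, and the `long_threshold` and `full_period` parameters are stand-ins for whatever schedule the paper actually uses. The sketch models only which verifier runs at each step; it does not model KV-cache selection or refresh.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
# Hypothetical stand-in for a model: maps a context to its greedy next token.
NextToken = Callable[[Sequence[Token]], Token]
# Hypothetical stand-in for a drafter: maps a context to a list of draft tokens.
Drafter = Callable[[Sequence[Token]], List[Token]]


def verify_greedy(context: Sequence[Token], draft: Sequence[Token],
                  target: NextToken) -> List[Token]:
    """Accept the longest draft prefix the target agrees with, plus one target token.

    A real system scores all draft positions in one forward pass; this toy
    version calls the target once per position for clarity.
    """
    accepted: List[Token] = []
    for tok in draft:
        expected = target(list(context) + accepted)
        if tok != expected:
            accepted.append(expected)  # target's token replaces the first mismatch
            return accepted
        accepted.append(tok)
    accepted.append(target(list(context) + accepted))  # bonus token after a full match
    return accepted


def choose_mode(context_len: int, steps_since_full: int,
                long_threshold: int, full_period: int) -> str:
    """Pick a verification mode (assumed schedule, for illustration only)."""
    if context_len < long_threshold:
        return "full"            # short context: classic full verification
    if steps_since_full >= full_period:
        return "periodic_full"   # long context: occasional full pass
    return "partial"             # long context: fast pass on the partial cache


def decode(prompt: Sequence[Token], drafter: Drafter, full_target: NextToken,
           partial_target: NextToken, max_new: int, long_threshold: int,
           full_period: int) -> Tuple[List[Token], List[str]]:
    """Run the toy loop and return the tokens plus the mode used at each step."""
    out = list(prompt)
    modes: List[str] = []
    steps_since_full = 0
    while len(out) - len(prompt) < max_new:
        mode = choose_mode(len(out), steps_since_full, long_threshold, full_period)
        target = partial_target if mode == "partial" else full_target
        out.extend(verify_greedy(out, drafter(out), target))
        steps_since_full = steps_since_full + 1 if mode == "partial" else 0
        modes.append(mode)
    return out[:len(prompt) + max_new], modes


if __name__ == "__main__":
    count_up: NextToken = lambda ctx: (ctx[-1] + 1) % 10
    two_ahead: Drafter = lambda ctx: [(ctx[-1] + 1) % 10, (ctx[-1] + 2) % 10]
    tokens, modes = decode([0], two_ahead, count_up, count_up,
                           max_new=12, long_threshold=3, full_period=2)
    print(tokens)
    print(modes)
```

In this toy run the partial and full targets are the same function, so every draft is accepted. In practice a partial verifier can disagree with the full one, which is why the abstract describes periodic full verification as the step that removes accumulated errors.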

Paper Structure

This paper contains 21 sections, 3 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: Drafting and verification time of EAGLE-3 speculative decoding on LLaMA-3.1-8B-Instruct. As the context length increases, verification gradually becomes the dominant bottleneck.
  • Figure 2: Illustration of three verification processes in SpecPV. For short context, we adopt classic full verification, whereas for long context, we use partial verification to improve efficiency. Periodic full verification eliminates accumulated errors and refreshes the partial KV cache. Taken together, these modes balance efficiency and accuracy across different context lengths.
  • Figure 3: Illustration of the generation process in self-speculative decoding. A key characteristic of self-speculation is that the draft model reuses the layer features from the target LLM. For clarity, we present the naive single-sequence drafting.
  • Figure 4: Decoding throughput of LLaMA-3.1-8B-Instruct on a single RTX 4090 GPU with KV cache offloading. Since SpecPV’s partial cache is small and does not require offloading to host memory, partial verification yields significant speedup.
  • Figure 5: Accuracy on QA tasks under different partial KV cache budgets. For LongBench v2, samples exceeding 64K context length are excluded. For all datasets, we first generate a chain of thought and then prompt the model to produce a standardized final answer. For most datasets, SpecPV achieves performance comparable to full verification under a 4096-token KV budget.
  • ...and 3 more figures
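
The captions for Figures 4 and 5 contrast a full KV cache that has to be offloaded with a partial cache of a few thousand tokens. As a rough back-of-the-envelope check, the sketch below estimates KV-cache size from model shape. The shape values (32 layers, 8 KV heads, head dimension 128) are assumptions taken from the publicly released LLaMA-3.1-8B configuration, not from this page, and 16-bit storage is also an assumption. The function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int = 32, num_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Estimate KV-cache size in bytes for one sequence.

    Defaults are assumed LLaMA-3.1-8B shape values with 16-bit storage.
    The leading factor 2 counts one key and one value vector per
    token, layer, and KV head.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


full_cache = kv_cache_bytes(64 * 1024)   # full cache at a 64K-token context
partial_cache = kv_cache_bytes(4096)     # 4096-token partial budget

print(f"{full_cache / 2**30:.1f} GiB")    # 8.0 GiB
print(f"{partial_cache / 2**20:.0f} MiB") # 512 MiB
```

Under these assumptions the full cache at 64K tokens takes about 8 GiB, while a 4096-token budget takes 512 MiB, a 16x reduction. That is consistent with the Figure 4 caption's point that the partial cache can stay in GPU memory.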