Block Verification Accelerates Speculative Decoding

Ziteng Sun, Uri Mendlovic, Yaniv Leviathan, Asaf Aharoni, Jae Hun Ro, Ahmad Beirami, Ananda Theertha Suresh

TL;DR

Block Verification changes the verification step of speculative decoding: it verifies a drafted block of tokens jointly rather than token by token. This preserves the target model's distribution while increasing the number of tokens accepted per iteration. The authors prove that block verification is distribution-preserving and achieves the optimal expected decoded length among valid verification schemes. Empirically, it delivers consistent wall-clock improvements of roughly 5-8% over token verification across PALM-2 and Vicuna-based tasks and datasets, with larger gains for longer draft blocks and higher-quality draft models. The approach adds no extra computation or code complexity, so it can serve as a solid default in speculative decoding deployments and complements improvements to the drafting phase.

Abstract

Speculative decoding is an effective method for lossless acceleration of large language models during inference. It uses a fast model to draft a block of tokens which are then verified in parallel by the target model, and provides a guarantee that the output is distributed identically to a sample from the target model. In prior works, draft verification is performed independently token-by-token. Surprisingly, we show that this approach is not optimal. We propose Block Verification, a simple draft verification algorithm that verifies the entire block jointly and provides additional wall-clock speedup. We prove that the proposed mechanism is optimal in the expected number of tokens produced each iteration and specifically is never worse than the standard token-level verification. Empirically, block verification provides modest but consistent wall-clock speedups over the standard token verification algorithm of 5%-8% in a range of tasks and datasets. Given that block verification does not increase code complexity, maintains the strong lossless guarantee of the standard speculative decoding verification algorithm, cannot deteriorate performance, and, in fact, consistently improves it, it can be used as a good default in speculative decoding implementations.
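
To make the baseline concrete, the standard token-by-token verification step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `token_verify` and `sample`, and the representation of model outputs as per-position dictionaries of probabilities, are assumptions made for the example.

```python
import random


def sample(weights, rng):
    """Draw one token with probability proportional to its weight."""
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


def token_verify(draft_tokens, p_target, p_draft, rng=random):
    """Token-level verification for one speculative decoding iteration.

    draft_tokens: tokens X_1..X_gamma sampled from the draft model.
    p_target[i], p_draft[i]: dicts mapping token -> probability at position i,
        conditioned on the prefix plus draft_tokens[:i] (assumed layout).
        p_target has one extra entry, used when every draft token is accepted.
    Returns the accepted draft tokens followed by one extra token.
    """
    out = []
    for i, x in enumerate(draft_tokens):
        ratio = p_target[i][x] / p_draft[i][x]
        if rng.random() < min(1.0, ratio):
            out.append(x)  # accept this draft token and move on
            continue
        # First rejection: sample from the residual max(target - draft, 0)
        # and discard the remaining draft tokens.
        residual = {
            t: max(p_target[i][t] - p_draft[i].get(t, 0.0), 0.0)
            for t in p_target[i]
        }
        out.append(sample(residual, rng))
        return out
    # All draft tokens accepted: sample one more token from the target model.
    out.append(sample(p_target[len(draft_tokens)], rng))
    return out
```

Each position is accepted or rejected using only its own probability ratio, which is the independence that block verification removes.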

Paper Structure (27 sections, 13 theorems, 102 equations, 6 figures, 10 tables, 6 algorithms)

Key Result

Lemma 1

The standard token verification algorithm of speculative decoding is not optimal.
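
A small exact calculation illustrates why token-level verification can be suboptimal. The sketch below enumerates every draft block for a hypothetical context-independent two-token model and computes the expected number of accepted draft tokens under both schemes. The toy probabilities are chosen for illustration only, and the block-level acceptance rule is an assumed form that is not stated in this summary; the paper's block verification algorithm is the authoritative version.

```python
from fractions import Fraction as F
from itertools import product
from math import prod

# Hypothetical toy models (assumed for illustration, context-independent).
TARGET = {"A": F(1, 3), "B": F(2, 3)}
DRAFT = {"A": F(2, 3), "B": F(1, 3)}


def expected_accepted_token(gamma):
    """E[accepted draft tokens] under token-level verification."""
    total = F(0)
    for xs in product(DRAFT, repeat=gamma):
        prob_draft = prod(DRAFT[x] for x in xs)
        survive, exp_tau = F(1), F(0)
        for x in xs:
            # P(tau >= k) is the product of the first k acceptance probabilities.
            survive *= min(F(1), TARGET[x] / DRAFT[x])
            exp_tau += survive
        total += prob_draft * exp_tau
    return total


def expected_accepted_block(gamma):
    """E[accepted draft tokens] under the assumed block-level rule."""
    total = F(0)
    for xs in product(DRAFT, repeat=gamma):
        prob_draft = prod(DRAFT[x] for x in xs)
        # p[i] = min(1, p[i-1] * target(x_i) / draft(x_i)), with p[0] = 1.
        p = [F(1)]
        for x in xs:
            p.append(min(F(1), p[-1] * TARGET[x] / DRAFT[x]))
        not_stopped, exp_tau = F(1), F(0)
        for i in range(gamma, -1, -1):  # try the longest prefix first
            if i == gamma:
                h = p[i]
            else:
                res = sum(max(p[i] * TARGET[t] - DRAFT[t], F(0)) for t in TARGET)
                # Denominator is nonzero here because the toy models differ.
                h = res / (res + 1 - p[i])
            exp_tau += not_stopped * h * i
            not_stopped *= 1 - h
        total += prob_draft * exp_tau
    return total


print(expected_accepted_token(2), expected_accepted_block(2))  # prints "10/9 11/9"
```

For a draft length of 2 in this toy setting, the block-level rule accepts 11/9 draft tokens in expectation versus 10/9 for the token-level rule, while both still emit tokens distributed according to the target model.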

Figures (6)

  • Figure 1: One iteration of speculative decoding (the speculative decoding framework algorithm). The prefixes and verified tokens are in blue, the unverified tokens from the draft model are in red, and the tokens sampled from the residual distribution are underlined.
  • Figure 2: The acceptance probabilities and residual distributions in the token verification and block verification algorithms.
  • Figure 3: Empirical complementary CDF of $\tau$ for both algorithms with draft length $\gamma = 10$. The draft and target models are the context-independent toy models introduced in the paper's toy-model equation.
  • Figure 4: Table on average block efficiency (BE) and wall clock speedup (WS) across all datasets for token verification (TokenV) and block verification (BlockV) with different $\gamma$. The large model is PALM-2-S and the drafter model is either PALM-2-XXS (XXS) or PALM-2-XXXS (XXXS).
  • Figure 5: Average relative improvement of block verification over token verification in block efficiency (BE) and wall clock speedup (WS) across all datasets for different drafters and draft lengths.
  • ...and 1 more figure
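
The empirical complementary CDF mentioned in the Figure 3 caption can be computed from sampled values of $\tau$ along the following lines. This is a sketch under stated assumptions: the helper name `empirical_ccdf` is hypothetical, and the complementary CDF is taken to mean $P(\tau \ge k)$, which may differ from the convention used in the paper's plot.

```python
from collections import Counter


def empirical_ccdf(taus, gamma):
    """Return [P(tau >= k) for k in 0..gamma], estimated from samples of tau."""
    n = len(taus)
    counts = Counter(taus)
    ccdf = []
    remaining = n  # number of samples with tau >= k
    for k in range(gamma + 1):
        ccdf.append(remaining / n)
        remaining -= counts.get(k, 0)
    return ccdf


# Four hypothetical iterations with draft length 2.
print(empirical_ccdf([0, 1, 2, 2], gamma=2))  # prints "[1.0, 0.75, 0.5]"
```

Under this convention, summing the entries for $k \ge 1$ gives the sample mean of $\tau$ (here 0.75 + 0.5 = 1.25), so a curve that lies higher corresponds to more accepted tokens per iteration.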

Theorems & Definitions (17)

  • Lemma 1
  • Definition 1: Valid draft verification algorithm
  • Theorem 1
  • Theorem 2
  • Lemma 2
  • proof
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Lemma 6
  • ...and 7 more