Block Verification Accelerates Speculative Decoding
Ziteng Sun, Uri Mendlovic, Yaniv Leviathan, Asaf Aharoni, Jae Hun Ro, Ahmad Beirami, Ananda Theertha Suresh
TL;DR
Block Verification revisits speculative decoding by verifying a drafted block of tokens jointly rather than token-by-token, preserving the target model's distribution while increasing the number of tokens accepted per iteration. The authors prove that block verification is distribution-preserving and achieves the optimal expected decoded length among valid verification schemes. Empirically, block verification delivers consistent wall-clock improvements (roughly 5-8%) across PALM-2 and Vicuna-based tasks and datasets, with larger gains for longer draft blocks and higher-quality draft models. The approach adds no extra computation or code complexity and can serve as a solid default in speculative decoding deployments, complementing improvements in the drafting phase.
Abstract
Speculative decoding is an effective method for lossless acceleration of large language models during inference. It uses a fast model to draft a block of tokens which are then verified in parallel by the target model, and provides a guarantee that the output is distributed identically to a sample from the target model. In prior works, draft verification is performed independently token-by-token. Surprisingly, we show that this approach is not optimal. We propose Block Verification, a simple draft verification algorithm that verifies the entire block jointly and provides additional wall-clock speedup. We prove that the proposed mechanism is optimal in the expected number of tokens produced each iteration and specifically is never worse than the standard token-level verification. Empirically, block verification provides modest but consistent wall-clock speedups over the standard token verification algorithm of 5%-8% in a range of tasks and datasets. Given that block verification does not increase code complexity, maintains the strong lossless guarantee of the standard speculative decoding verification algorithm, cannot deteriorate performance, and, in fact, consistently improves it, it can be used as a good default in speculative decoding implementations.
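For reference, the token-by-token verification used in prior work can be sketched as below. This is a minimal illustration of the standard accept/reject rule from the speculative decoding literature, not the Block Verification algorithm proposed in this paper. The function names `sample` and `token_verify`, and the representation of each distribution as a plain list of probabilities, are assumptions made for illustration.

```python
import random


def sample(weights, rng):
    """Draw an index with probability proportional to its (unnormalized) weight."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def token_verify(draft, q, p, rng):
    """Standard token-level verification (baseline, not block verification).

    draft: list of drafted token ids, length gamma
    q:     q[i] is the draft model's distribution at position i (gamma entries)
    p:     p[i] is the target model's distribution at position i (gamma + 1 entries)
    Returns the accepted prefix followed by exactly one extra token.
    """
    out = []
    for i, x in enumerate(draft):
        # Accept drafted token x with probability min(1, p(x) / q(x)).
        # q[i][x] > 0 is assumed because x was sampled from q[i].
        accept_prob = min(1.0, p[i][x] / q[i][x])
        if rng.random() < accept_prob:
            out.append(x)
            continue
        # On the first rejection, resample from the residual max(0, p - q)
        # (normalized implicitly by `sample`) and stop.
        residual = [max(0.0, pt - qt) for pt, qt in zip(p[i], q[i])]
        out.append(sample(residual, rng))
        return out
    # Every drafted token was accepted: draw one bonus token from the target.
    out.append(sample(p[len(draft)], rng))
    return out
```

Each position is accepted or rejected using only that position's probabilities, and the first rejection discards the rest of the draft. This per-token independence is what the abstract identifies as suboptimal and what joint verification of the block is meant to improve on.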
