Exploring and Improving Drafts in Blockwise Parallel Decoding
Taehyeon Kim, Ananda Theertha Suresh, Kishore Papineni, Michael Riley, Sanjiv Kumar, Adrian Benton
TL;DR
This work analyzes blockwise parallel decoding (BPD) as a means of reducing autoregressive LM latency by generating and verifying blocks of tokens in parallel. It reveals predictive-dynamics phenomena such as consecutive repetition of draft tokens and head-wise confidence patterns, and introduces two rescoring strategies, local neural rescoring and global $p$-best $n$-gram rescoring, that operate on top-$k$ lattices to produce higher-quality drafts without changing the base model. Empirical results show up to about $+21.30\%$ improvement in block efficiency on diverse tasks, along with reduced KV cache I/O and favorable FLOP trade-offs; oracle analyses indicate substantial headroom for improvement, especially on tasks with lower baseline block efficiency. The work demonstrates practical acceleration benefits for BPD with modest computational overhead, informs the design of drafting heads, and points to future work on scaling to larger models and extending to non-greedy decoding regimes.
Abstract
Despite the remarkable strides made by autoregressive language models, their potential is often hampered by the slow inference speeds inherent in sequential token generation. Blockwise parallel decoding (BPD) was proposed by Stern et al. as a method to improve the inference speed of language models by simultaneously predicting multiple future tokens, termed block drafts, which are subsequently verified and conditionally accepted by the autoregressive model. This paper contributes to the understanding and improvement of block drafts in two ways. First, we analyze the token distributions produced by multiple prediction heads. Second, we leverage this analysis to develop algorithms that improve BPD inference speed by refining the block drafts using n-gram and neural language models. Experiments demonstrate that refined block drafts yield a +5-21% increase in block efficiency (i.e., the number of accepted tokens from the block draft) across diverse datasets.
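To make the verify-and-accept step and the block efficiency metric concrete, the following is a minimal sketch, not the authors' implementation. It assumes greedy decoding, that the base model's own predictions at each draft position are already available, and that block efficiency is reported as the mean accepted length over decoding steps. The function names and token IDs are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def accepted_prefix_length(draft: Sequence[int], verified: Sequence[int]) -> int:
    """Count draft tokens accepted under greedy verification.

    A draft token is accepted only if it matches the base model's greedy
    token at the same position and all earlier draft tokens were accepted.
    """
    accepted = 0
    for draft_token, model_token in zip(draft, verified):
        if draft_token != model_token:
            break  # first mismatch ends the accepted prefix
        accepted += 1
    return accepted


def block_efficiency(drafts: List[Sequence[int]],
                     verifications: List[Sequence[int]]) -> float:
    """Mean number of accepted draft tokens per decoding step.

    Assumption: averaging over steps is how per-step counts are aggregated.
    """
    lengths = [accepted_prefix_length(d, v) for d, v in zip(drafts, verifications)]
    return sum(lengths) / len(lengths)


def relative_improvement(baseline: float, refined: float) -> float:
    """Relative gain of a refined draft over the baseline, in percent."""
    return 100.0 * (refined - baseline) / baseline


# Toy example with made-up token IDs: two decoding steps, block size 4.
toy_drafts = [[5, 7, 7, 2], [3, 3, 9, 1]]
toy_verified = [[5, 7, 8, 2], [3, 3, 9, 1]]
toy_efficiency = block_efficiency(toy_drafts, toy_verified)
```

In the toy example, the first draft repeats token 7 where the base model would emit 8, so only two tokens are accepted. That is the kind of consecutive repetition that rescoring the draft aims to remove before verification.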
