
Exploring and Improving Drafts in Blockwise Parallel Decoding

Taehyeon Kim, Ananda Theertha Suresh, Kishore Papineni, Michael Riley, Sanjiv Kumar, Adrian Benton

TL;DR

This work analyzes blockwise parallel decoding (BPD) as a means to reduce autoregressive LM latency by generating and verifying blocks of tokens in parallel. It reveals predictive-dynamics phenomena such as consecutive draft repetition and head-wise confidence patterns, and introduces two rescoring strategies (local neural rescoring and global $p$-best $n$-gram rescoring) that operate on top-$k$ lattices to produce higher-quality drafts without changing the base model. Empirical results show up to about a $21.30\%$ improvement in block efficiency on diverse tasks, along with reduced KV cache I/O and favorable FLOP trade-offs; oracle analyses indicate substantial headroom for improvement, especially on lower-baseline tasks. The work demonstrates practical acceleration benefits for BPD with modest computational overhead, informs the design of drafting heads, and points to future work on scaling to larger models and extending to non-greedy decoding regimes.

Abstract

Despite the remarkable strides made by autoregressive language models, their potential is often hampered by the slow inference speeds inherent in sequential token generation. Blockwise parallel decoding (BPD) was proposed by Stern et al. as a method to improve the inference speed of language models by simultaneously predicting multiple future tokens, termed block drafts, which are subsequently verified and conditionally accepted by the autoregressive model. This paper contributes to the understanding and improvement of block drafts in two ways. First, we analyze the token distributions produced by multiple prediction heads. Second, we leverage this analysis to develop algorithms to improve BPD inference speed by refining the block drafts using n-gram and neural language models. Experiments demonstrate that refined block drafts yield a +5-21% increase in block efficiency (i.e., the number of accepted tokens from the block draft) across diverse datasets.
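
The verify-and-accept step described in the abstract can be illustrated with a minimal sketch. It assumes greedy decoding, so a draft token is accepted only while it matches the token the autoregressive model itself would have chosen at that position. The function names `accepted_prefix_length` and `block_efficiency` are hypothetical and do not come from the paper; block efficiency is computed here following the abstract's parenthetical definition, as the mean number of accepted draft tokens per decoding step.

```python
from typing import Sequence


def accepted_prefix_length(draft: Sequence[int], verified: Sequence[int]) -> int:
    """Count the leading draft tokens that match the base model's greedy tokens.

    Acceptance stops at the first mismatch, because every later draft token
    was conditioned on a token the base model would not have produced.
    """
    accepted = 0
    for draft_token, verified_token in zip(draft, verified):
        if draft_token != verified_token:
            break
        accepted += 1
    return accepted


def block_efficiency(
    drafts: Sequence[Sequence[int]], verifications: Sequence[Sequence[int]]
) -> float:
    """Mean number of accepted draft tokens per decoding step (illustrative)."""
    lengths = [accepted_prefix_length(d, v) for d, v in zip(drafts, verifications)]
    return sum(lengths) / len(lengths) if lengths else 0.0


if __name__ == "__main__":
    # Two decoding steps with block size 3: the first draft is fully accepted,
    # the second diverges at its second token.
    drafts = [[1, 2, 3], [4, 5, 6]]
    verifications = [[1, 2, 3], [4, 0, 6]]
    print(block_efficiency(drafts, verifications))
```

Under this definition, a better draft raises the accepted prefix length without altering the final output, since only tokens the base model agrees with are kept.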

Paper Structure (45 sections, 4 equations, 10 figures, 13 tables, 2 algorithms)

Figures (10)

  • Figure 1: (a) Illustration of two tokens that are decoded by autoregressive decoding vs. two tokens drafted by BPD. (b) Outputs from our proposed algorithms, where the top-$k$ token-level predictions are refined using local neural and global n-gram rescoring, which selects the $p$ most probable sequences by dynamic programming, for batched verification.
  • Figure 1: Per-task test performance of each finetuned model and block efficiency over language modeling (LM), extractive question answering (QA), and both long and short summarization (L-Sum & S-Sum).
  • Figure 2: Performance of our methods relative to standard BPD with a 1.5B parameter blockwise parallel LM on the NewsRoom dataset. Details are described in \ref{sec:app_memory}.
  • Figure 3: (a) Entropy distributions across block draft heads on LAMBADA. The density plots illustrate the entropy distribution for each head in the model. (b) Correlation between block efficiency and $h_{\max}$, the head until which the average entropy in a task increases monotonically.
  • Figure 4: An example of a top-5 sausage lattice generated on a NewsRoom example. Edge weights correspond to (rescored) logits. Edges at each time step are ordered in descending weight and green, bolded edges correspond to candidates matching the greedy decode over the next nine tokens: "... desktop computers with new Intel Corp processors that it ...". The initial node in this graph is state 0 and the final node is 9.
  • ...and 5 more figures
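
The captions for Figure 1(b) and Figure 4 describe selecting the $p$ most probable sequences from a top-$k$ sausage lattice by dynamic programming. The sketch below shows one way this could work, under stated assumptions: the rescoring model is a bigram LM supplied as a hypothetical callable `bigram_logp(prev, tok)`, the interpolation `weight` and the `start_token` sentinel are illustrative, and the paper's actual $n$-gram order and implementation may differ. Because a bigram score depends only on the previous token, keeping the $p$ best partial paths per last token is enough to recover the exact global $p$-best list.

```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple

# A sausage lattice: one list of (token, score) candidates per time step.
Lattice = Sequence[Sequence[Tuple[int, float]]]
Hypothesis = Tuple[float, List[int]]  # (total score, token path)


def p_best_paths(
    lattice: Lattice,
    bigram_logp: Callable[[int, int], float],
    p: int,
    weight: float = 1.0,
    start_token: int = -1,
) -> List[Hypothesis]:
    """Return the p highest-scoring paths through the lattice, best first."""
    # hyps[tok] holds up to p best partial hypotheses that end in `tok`.
    hyps: Dict[int, List[Hypothesis]] = {start_token: [(0.0, [])]}
    for candidates in lattice:
        new_hyps: Dict[int, List[Hypothesis]] = {}
        for tok, tok_score in candidates:
            # Extend every surviving hypothesis with this candidate token.
            extended = [
                (score + tok_score + weight * bigram_logp(prev, tok), path + [tok])
                for prev, prev_hyps in hyps.items()
                for score, path in prev_hyps
            ]
            new_hyps[tok] = heapq.nlargest(p, extended, key=lambda h: h[0])
        hyps = new_hyps
    final = [h for per_token in hyps.values() for h in per_token]
    return heapq.nlargest(p, final, key=lambda h: h[0])


if __name__ == "__main__":
    # Top-2 lattice over two steps; the toy bigram model penalizes "1 then 3".
    lattice = [[(1, 0.0), (2, -1.0)], [(3, 0.0), (4, -0.5)]]
    penalty = lambda prev, tok: -2.0 if (prev, tok) == (1, 3) else 0.0
    for score, path in p_best_paths(lattice, penalty, p=2):
        print(score, path)
```

In this toy lattice the per-step greedy path `[1, 3]` is penalized by the bigram model, so the rescored list prefers `[1, 4]` and `[2, 3]`; these $p$ candidates would then be verified together in one batch.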