The N-Grammys: Accelerating Autoregressive Inference with Learning-Free Batched Speculation

Lawrence Stewart, Matthew Trager, Sujan Kumar Gonugondla, Stefano Soatto

TL;DR

This work explores the effectiveness of learning-free, negligible-cost draft strategies, namely $N$-grams obtained from the model weights and the context, and shows that combinations of simple strategies can achieve significant inference speedups over different tasks.

Abstract

Speculative decoding aims to speed up autoregressive generation of a language model by verifying in parallel the tokens generated by a smaller draft model. In this work, we explore the effectiveness of learning-free, negligible-cost draft strategies, namely $N$-grams obtained from the model weights and the context. While the predicted next token of the base model is rarely the top prediction of these simple strategies, we observe that it is often within their top-$k$ predictions for small $k$. Based on this, we show that combinations of simple strategies can achieve significant inference speedups over different tasks. The overall performance is comparable to that of more complex methods, yet the approach requires no expensive preprocessing or modification of the base model and allows for seamless "plug-and-play" integration into pipelines.
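To make the idea of batched, learning-free speculation concrete, the following is a minimal sketch and not the authors' implementation. It assumes a context-derived bigram as the only draft strategy and emulates the base model with a hypothetical `toy_greedy_next` function. In a real system the $k$ drafts of length $w$ would be verified in a single batched forward pass, whereas here verification is emulated token by token. All function names are illustrative.

```python
from collections import Counter, defaultdict


def context_bigram_drafts(context, k, w):
    """Propose up to k drafts of at most w tokens from bigram counts in the context."""
    successors = defaultdict(Counter)
    for prev, nxt in zip(context, context[1:]):
        successors[prev][nxt] += 1
    drafts = []
    # The top-k successors of the last token seed one draft each.
    for first, _ in successors[context[-1]].most_common(k):
        draft = [first]
        # Extend each draft greedily with the most frequent successor.
        while len(draft) < w and successors[draft[-1]]:
            draft.append(successors[draft[-1]].most_common(1)[0][0])
        drafts.append(draft)
    return drafts


def accepted_length(draft, greedy_next, context):
    """Count the leading draft tokens that match the base model's greedy choices."""
    seq = list(context)
    n = 0
    for tok in draft:
        if greedy_next(seq) != tok:
            break
        seq.append(tok)
        n += 1
    return n


def speculative_step(context, greedy_next, k, w):
    """One decoding step: keep the longest verified draft prefix plus one model token."""
    drafts = context_bigram_drafts(context, k, w)
    best = max(drafts, key=lambda d: accepted_length(d, greedy_next, context), default=[])
    accepted = best[:accepted_length(best, greedy_next, context)]
    # The base model always contributes one token after the accepted prefix,
    # so the output matches plain greedy decoding.
    return accepted + [greedy_next(list(context) + accepted)]


def toy_greedy_next(seq):
    """Hypothetical stand-in for the base model: counts upward modulo 5."""
    return (seq[-1] + 1) % 5


context = [0, 1, 2, 3, 4, 0, 1, 2]
new_tokens = speculative_step(context, toy_greedy_next, k=2, w=3)
print(new_tokens)  # [3, 4, 0, 1]
```

In this toy run the context already contains the pattern the model follows, so all three drafted tokens are accepted and the step yields four tokens for one emulated model call. When no draft matches, the step falls back to the single token that greedy decoding would have produced.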

Paper Structure

This paper contains 32 sections, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Memory-bound to compute-bound transition: The heatmaps depict the slowdown of a model call for varied batch size $k\in \{1, \ldots, 32\}$ and speculation length $w \in\{0, \ldots, 15\}$. The slowdowns are relative to standard greedy decoding with no speculation, i.e. $(k, w)= (1, 0)$. The leftmost plot corresponds to a context length of $\ell=25$, the middle to $\ell=100$, and the rightmost to $\ell=500$. The model used was Mistral 7B at standard bfloat16 precision, on a single NVIDIA A100 GPU with 40GB of memory. Each square in the heatmaps corresponds to the average slowdown over five model calls.
  • Figure 2: Tokens per call as a function of $k$, using the top-$k$ speculations of the model-derived unigram and bigram. The plot also shows the extended bigram (described below) for $w=2$ and $w=3$: there are gains going from $w=1$ to $w=2$, but diminishing gains going to $w=3$. The results were obtained on the first 50 examples of MT-Bench and HumanEval, using a 7B model (Mistral Instruct; Jiang et al., 2023).
  • Figure 3: Average wall-time speedup across datasets for Mistral 7B Instruct for varied $(k, w)$.
  • Figure 4: Ablations. Top: distribution of acceptance length for mixed strategies. Middle: distribution of the rank of accepted speculations among the top-$k$. Bottom: allocation distribution of strategies, i.e. the number of speculations for each strategy.
  • Figure 5: Tokens per call across datasets for Mistral 7B Instruct for varied $(k, w)$.
  • ...and 4 more figures