DReSD: Dense Retrieval for Speculative Decoding

Milan Gritta, Huiyin Xue, Gerasimos Lampouras

TL;DR

The paper addresses the latency bottleneck in autoregressive LLM generation by improving retrieval-based speculative decoding (SD). It introduces DReSD, a dense retrieval framework that drafts the next tokens using approximate nearest neighbour search over contextualised token embeddings, combined with normalisation, PCA-based dimensionality reduction, and batch verification. On CodeAlpaca and MT-Bench, DReSD yields about $87\%$ higher acceptance rates, drafts that are $65\%$ longer, and up to $4.64\times$ faster decoding than a strong sparse baseline. The gains are attributed to three factors: effective dense retrieval, datastore alignment (especially with the ID datastore), and an optimised draft shape. The results indicate that semantic retrieval can substantially outperform exact-string retrieval for SD and offers a scalable, plug-and-play way to accelerate LLM decoding.

Abstract

Speculative decoding (SD) accelerates Large Language Model (LLM) generation by using an efficient draft model to propose the next few tokens, which are verified by the LLM in a single forward call, reducing latency while preserving its outputs. We focus on retrieval-based SD where the draft model retrieves the next tokens from a non-parametric datastore. Sparse retrieval (REST), which operates on the surface form of strings, is currently the dominant paradigm due to its simplicity and scalability. However, its effectiveness is limited by its reliance on short contexts and exact string matching. Instead, we introduce Dense Retrieval for Speculative Decoding (DReSD), a novel framework that uses approximate nearest neighbour search with contextualised token embeddings to retrieve the most semantically relevant token sequences for SD. Extensive experiments show that DReSD achieves (on average) 87% higher acceptance rates, 65% longer accepted token sequences and 19% faster generation speeds compared to sparse retrieval (REST).
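
To make the retrieval step concrete, the following is a minimal sketch of dense retrieval for drafting, not the authors' implementation. It assumes toy 2-dimensional embeddings, uses exact brute-force cosine search as a stand-in for approximate nearest neighbour search, and omits the PCA step mentioned in the TL;DR. The function names (`l2_normalise`, `build_datastore`, `retrieve_drafts`) are hypothetical.

```python
import math


def l2_normalise(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def build_datastore(embeddings, token_ids, draft_len):
    """Pair each normalised context embedding with the tokens that followed it.

    Assumption: embeddings[i] is the contextualised embedding at position i,
    and the stored value is the next `draft_len` token ids after position i.
    """
    keys, values = [], []
    for i, emb in enumerate(embeddings):
        continuation = token_ids[i + 1 : i + 1 + draft_len]
        if continuation:  # the last position has no continuation to store
            keys.append(l2_normalise(emb))
            values.append(continuation)
    return keys, values


def retrieve_drafts(query, keys, values, k):
    """Return the k continuations whose keys are most similar to the query.

    Keys and query are unit length, so the dot product is cosine similarity.
    Exact search is used here; a real system would use an ANN index instead.
    """
    q = l2_normalise(query)
    scored = [(sum(a * b for a, b in zip(q, key)), idx)
              for idx, key in enumerate(keys)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, stable ties
    return [values[idx] for _, idx in scored[:k]]


# Toy usage with made-up embeddings and token ids.
keys, values = build_datastore(
    embeddings=[[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [0.0, 1.0]],
    token_ids=[10, 11, 12, 13],
    draft_len=2,
)
drafts = retrieve_drafts([0.0, 5.0], keys, values, k=2)
```

Because both keys and queries are normalised, ranking by dot product depends only on direction and not on vector magnitude, which is one plausible reason to normalise before nearest neighbour search.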

Paper Structure

This paper contains 37 sections, 4 equations, 9 figures, 8 tables, 1 algorithm.

Figures (9)

  • Figure 1: Fastest configurations for selected SD methods (greedy decoding), relative to autoregressive generation (LLM). CL = CodeLlama, LC = Llama2-Chat.
  • Figure 2: A flowchart of the DReSD framework.
  • Figure 3: An illustration of batch verification with 5 drafts (rows), each of length 8 (columns). The $\mathsf{EOS}$ id (0 in this example) is used as padding. The green sequence is accepted; the blue sequences are discarded.
  • Figure 4: Mean Acceptance Rates (MAR) for the Code Assistant. Suffix "-I" denotes the ID datastore setting.
  • Figure 5: Cumulative Explained Variance Ratio for a 256-dimensional PCA model. We use the first 64 dims.
  • ...and 4 more figures
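
The batch verification described in the Figure 3 caption can be sketched as follows, under stated assumptions. The drafts are padded with the EOS id (0, as in the caption) into a rectangular batch, and the draft with the longest prefix agreeing with the target model's greedy choices is accepted. The callable `greedy_next` is a hypothetical stand-in for the LLM; a real system would score the whole batch in one forward call rather than looping token by token.

```python
EOS_ID = 0  # padding id, as in the Figure 3 example


def pad_drafts(drafts, length, pad_id=EOS_ID):
    """Right-pad (or truncate) each draft so the batch is rectangular."""
    batch = []
    for draft in drafts:
        row = list(draft[:length])
        batch.append(row + [pad_id] * (length - len(row)))
    return batch


def accepted_length(prefix, draft, greedy_next):
    """Count the leading draft tokens that match the model's greedy choice."""
    context = list(prefix)
    count = 0
    for token in draft:
        if greedy_next(context) != token:
            break  # first mismatch ends the accepted prefix
        context.append(token)
        count += 1
    return count


def batch_verify(prefix, drafts, length, greedy_next):
    """Return the longest accepted prefix among all padded drafts."""
    batch = pad_drafts(drafts, length)
    if not batch:
        return []
    lengths = [accepted_length(prefix, row, greedy_next) for row in batch]
    best = max(range(len(batch)), key=lambda i: lengths[i])
    return batch[best][: lengths[best]]


def toy_greedy_next(context):
    """Toy 'model' (an assumption for illustration): predicts last token + 1."""
    return context[-1] + 1


accepted = batch_verify(
    prefix=[5],
    drafts=[[6, 7, 9], [6, 7, 8, 9, 2], [1]],
    length=5,
    greedy_next=toy_greedy_next,
)
```

With greedy decoding, accepting only tokens that equal the target model's own choice is what keeps the output identical to plain autoregressive generation; the other rows are simply discarded.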