Accelerating Retrieval-Augmented Language Model Serving with Speculation

Zhihao Zhang, Alan Zhu, Lijie Yang, Yihua Xu, Lanting Li, Phitchaya Mangpo Phothilimthana, Zhihao Jia

TL;DR

RaLMSpec introduces a speculative retrieval framework with batched verification that accelerates iterative retrieval-augmented language model serving while preserving model outputs. It exploits temporal and spatial locality through a per-request local retrieval cache, and adds cache prefetching, an adaptive speculation stride scheduler (OS$^3$), and asynchronous verification to reduce retrieval overhead across diverse models and retrievers. Evaluations with GPT-2, OPT, and LLaMA-2 on four QA datasets, and with KNN-LM serving, show consistent end-to-end speed-ups: up to $7.59\times$ for KNN-LM serving with an exact dense retriever, with more modest gains for approximate dense and sparse retrievers. The approach provides a general acceleration framework for iterative RaLM tasks, with practical impact on low-latency, knowledge-intensive inference.

Abstract

Retrieval-augmented language models (RaLM) have demonstrated the potential to solve knowledge-intensive natural language processing (NLP) tasks by combining a non-parametric knowledge base with a parametric language model. Compared with fine-tuning a fully parametric model, RaLM offers low-cost adaptation to the latest data and better source attribution mechanisms. Among various RaLM approaches, iterative RaLM delivers better generation quality because the retriever and the language model interact more frequently. Despite these benefits, iterative RaLM usually incurs high overhead from the frequent retrieval steps. To this end, we propose RaLMSpec, a speculation-inspired framework that provides generic speed-up over iterative RaLM while preserving the same model outputs through speculative retrieval and batched verification. By further incorporating prefetching, an optimal speculation stride scheduler, and asynchronous verification, RaLMSpec can automatically exploit the acceleration potential to the fullest. For naive iterative RaLM serving, extensive evaluations over three language models on four downstream QA datasets demonstrate that RaLMSpec achieves a speed-up ratio of 1.75-2.39x, 1.04-1.39x, and 1.31-1.77x over the baseline when the retriever is an exact dense retriever, approximate dense retriever, and sparse retriever, respectively. For KNN-LM serving, RaLMSpec achieves a speed-up ratio of up to 7.59x and 2.45x over the baseline when the retriever is an exact dense retriever and approximate dense retriever, respectively.
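
The speculate-then-verify loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`retrieve`, `sequential_ralm`, `speculative_ralm`), the string-based "documents", the fixed `stride`, and the toy `score` and `generate` functions are all assumptions made for illustration. The sketch only shows why verifying a batch of speculated retrievals, and rolling back at the first mismatch, reproduces the output of the sequential baseline.

```python
from typing import Callable, List, Tuple

Score = Callable[[str, str], float]
Generate = Callable[[str, str], str]


def retrieve(docs: List[str], query: str, score: Score) -> str:
    """Top-1 lookup; over the full knowledge base this is the expensive call."""
    return max(docs, key=lambda doc: score(query, doc))


def sequential_ralm(prompt: str, kb: List[str], score: Score,
                    generate: Generate, num_steps: int) -> str:
    """Baseline: one knowledge-base retrieval before every generation step."""
    context = prompt
    for _ in range(num_steps):
        context = generate(context, retrieve(kb, context, score))
    return context


def speculative_ralm(prompt: str, kb: List[str], score: Score,
                     generate: Generate, num_steps: int,
                     stride: int = 3) -> Tuple[str, int]:
    """Speculate from a local cache, then verify each stride in one batch.

    Returns the final context and the number of knowledge-base round trips.
    """
    context, round_trips, step = prompt, 0, 0
    cache: List[str] = []
    while step < num_steps:
        if not cache:
            # Nothing to speculate from yet: do one ordinary retrieval.
            doc = retrieve(kb, context, score)
            round_trips += 1
            cache.append(doc)
            context = generate(context, doc)
            step += 1
            continue

        # Speculation: rank only the cached documents with the same metric.
        window = min(stride, num_steps - step)
        contexts, guesses = [context], []
        for _ in range(window):
            guess = retrieve(cache, contexts[-1], score)
            guesses.append(guess)
            contexts.append(generate(contexts[-1], guess))

        # Batched verification: all speculated queries are checked together.
        truths = [retrieve(kb, query, score) for query in contexts[:-1]]
        round_trips += 1

        # Keep the verified prefix; at the first mismatch, redo that step
        # with the ground-truth document and discard everything after it.
        accepted = 0
        while accepted < window and guesses[accepted] == truths[accepted]:
            accepted += 1
        if accepted == window:
            context = contexts[-1]
            step += window
        else:
            context = generate(contexts[accepted], truths[accepted])
            step += accepted + 1
        for doc in truths[: accepted + 1]:
            if doc not in cache:
                cache.append(doc)
    return context, round_trips


# Toy setup (hypothetical): the best document drifts slowly as the context
# grows, which mimics the locality that speculation relies on.
KB = ["A", "B", "C"]
POSITION = {"A": 0, "B": 6, "C": 12}


def toy_score(query: str, doc: str) -> float:
    return -abs(len(query) - POSITION[doc])


def toy_generate(context: str, doc: str) -> str:
    return context + doc.lower()


baseline = sequential_ralm("", KB, toy_score, toy_generate, num_steps=12)
speculated, trips = speculative_ralm("", KB, toy_score, toy_generate, num_steps=12)
```

In this toy run the speculative loop produces the same string as the baseline while issuing fewer knowledge-base round trips than generation steps. The saving in a real system depends on how expensive a batched lookup is relative to a single one, which is what the stride scheduler is meant to balance.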

Paper Structure (26 sections, 2 equations, 7 figures, 8 tables, 1 algorithm)

Figures (7)

  • Figure 1: $\{q_0, q_1, q_2\}$ denotes context-dependent query embeddings and A, B, C are document entries. The iterative RaLM panel shows the workflow of existing iterative RaLM, which suffers from high retrieval overhead. The RaLMSpec panel shows an overview of RaLMSpec, which enables faster speculative retrieval steps (➀, ➂, ➄) followed by a batched verification step (➅) to guarantee correctness. Consequently, RaLMSpec achieves a lower latency while preserving model quality, as shown in the timeline panel.
  • Figure 2: For speculative retrieval, we maintain a local cache for each request and use the same scoring metric as the original retriever to rank the entries within the local cache for a given query. In the verification step, we populate the local cache with either the top-1 or the top-k retrieved documents from the knowledge base; the latter is referred to as prefetching.
  • Figure 3: Asynchronous verification obtains latency saving by hiding the verification latency behind a valid speculation step. In case a mismatch is detected between the speculated document and ground truth document, the language model will regenerate outputs using the ground truth document.
  • Figure 4: Latency comparison between RaLMSeq, RaLMSpec, and RaLMSpec+PSA on GPT2-medium, OPT-1.3B, and LLaMA-2-7B over four QA datasets with three different types of retrievers, where EDR, ADR, SR stand for exact dense retriever, approximate dense retriever, and sparse retriever respectively. We decompose the overall latency into the language model generation latency (G) and retrieval latency (R) to demonstrate the trade-off.
  • Figure 5: Speed-up ratio results for RaLMSpec on kNN-LMs using Wikipedia-QA. k stands for the number of nearest neighbors in kNN-LMs, s stands for the stride size, and OS$^3$ stands for the optimal speculation stride scheduler.
  • ...and 2 more figures
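
The local cache and prefetching described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the `LocalCache` class, its `speculate` and `verify_and_populate` methods, the inner-product metric, and the three-document knowledge base are all hypothetical. It shows the cache ranking its entries with the same metric as the retriever, and how populating it with the top-k results rather than the top-1 result can turn a later miss into a hit.

```python
from typing import Dict, List, Optional, Sequence

Vector = Sequence[float]


def inner_product(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def kb_top_k(kb: Dict[str, Vector], query: Vector, k: int) -> List[str]:
    """Exact dense retrieval: rank every document by inner product."""
    ranked = sorted(kb, key=lambda d: inner_product(query, kb[d]), reverse=True)
    return ranked[:k]


class LocalCache:
    """Per-request cache that is ranked with the retriever's own metric."""

    def __init__(self, prefetch_k: int = 1) -> None:
        self.prefetch_k = prefetch_k          # 1 = no prefetching
        self.entries: Dict[str, Vector] = {}

    def speculate(self, query: Vector) -> Optional[str]:
        """Return the best cached document, or None if the cache is empty."""
        if not self.entries:
            return None
        return max(self.entries,
                   key=lambda d: inner_product(query, self.entries[d]))

    def verify_and_populate(self, kb: Dict[str, Vector], query: Vector) -> str:
        """Query the knowledge base, cache the top-k, return the true top-1."""
        top = kb_top_k(kb, query, self.prefetch_k)
        for doc_id in top:
            self.entries[doc_id] = kb[doc_id]
        return top[0]


# Hypothetical three-document knowledge base and two consecutive queries.
kb = {"A": (1.0, 0.0), "B": (0.8, 0.6), "C": (0.0, 1.0)}
q0, q1 = (1.0, 0.1), (0.6, 0.8)

no_prefetch = LocalCache(prefetch_k=1)
with_prefetch = LocalCache(prefetch_k=2)
for cache in (no_prefetch, with_prefetch):
    cache.verify_and_populate(kb, q0)         # verification for the first query

truth_q1 = kb_top_k(kb, q1, 1)[0]
guess_without = no_prefetch.speculate(q1)     # only "A" is cached
guess_with = with_prefetch.speculate(q1)      # "A" and "B" are cached
```

Here the second query's true top-1 document is the runner-up of the first query, so only the prefetching cache speculates it correctly. Larger `prefetch_k` values raise the chance of such hits at the cost of a bigger cache to rank on every speculation step.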