Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection

Yun Zhu, Jia-Chen Gu, Caitlin Sikora, Ho Ko, Yinxiao Liu, Chu-Cheng Lin, Lei Shu, Liangchen Luo, Lei Meng, Bang Liu, Jindong Chen

TL;DR

Sparse RAG addresses the latency bottleneck in retrieval-augmented generation by decoupling context prefill from decoding. It encodes retrieved documents in parallel and uses prompt-driven per-context assessment to selectively load only highly relevant caches during generation, integrating assessment and generation in one process. On PopQA and QMSum, Sparse RAG achieves comparable or better generation quality with substantially lower decoding latency than dense baselines, while reducing the number of loaded contexts and filtering noisy content. The method demonstrates robust performance across short- and long-form tasks and offers practical benefits for on-device or resource-constrained inference, with clear avenues for extension to multimodal contexts.

Abstract

Large language models (LLMs) augmented with retrieval exhibit robust performance and extensive versatility by incorporating external contexts. However, the input length grows linearly with the number of retrieved documents, causing a dramatic increase in latency. In this paper, we propose a novel paradigm named Sparse RAG, which seeks to cut computation costs through sparsity. Specifically, Sparse RAG encodes retrieved documents in parallel, which eliminates the latency introduced by long-range attention across retrieved documents. The LLM then decodes the output auto-regressively while attending only to highly relevant caches, which are chosen by prompting the LLM with special control tokens. Notably, Sparse RAG combines the assessment of each individual document and the generation of the response into a single process. This sparse mechanism reduces the number of documents loaded during decoding, which accelerates inference of the RAG system. Additionally, filtering out undesirable contexts enhances the model's focus on relevant context, inherently improving its generation quality. Evaluation results on two datasets (PopQA and QMSum) show that Sparse RAG can strike an optimal balance between generation quality and computational efficiency, demonstrating its generalizability across both short- and long-form generation tasks.
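
The latency argument in the abstract can be made concrete with a back-of-envelope count. This is a rough sketch and not a derivation from the paper. It assumes standard full self-attention, k retrieved documents of equal length n, and m ≤ k documents kept after assessment (the symbols k, n and m are introduced here for illustration). It counts only attention score computations and ignores the question tokens and the feed-forward layers.

```latex
% Prefill: attention score computations per layer
\text{dense (one concatenated input):}\quad C_{\text{dense}} \propto (k\,n)^2 = k^2 n^2
\qquad
\text{parallel (each document alone):}\quad C_{\text{par}} \propto k \cdot n^2

% Ratio of the two prefill costs
\frac{C_{\text{dense}}}{C_{\text{par}}} = k

% Decoding: cached tokens attended to by each generated token
\text{all caches loaded:}\quad k\,n
\qquad
\text{only } m \text{ caches loaded:}\quad m\,n
```

Under these assumptions, with k = 20 documents the dense prefill performs about 20 times as many attention score computations as independent encoding, and keeping m = 5 caches means each decoded token attends to a quarter of the cached context.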

Paper Structure (39 sections, 2 figures, 13 tables)

Figures (2)

  • Figure 1: An overview of Sparse RAG at inference. Each retrieved document is assessed by the LLM, which estimates a relevance score to decide whether to keep or drop it. Only the documents marked to keep are then loaded for generation.
  • Figure 2: Inference Efficiency Comparison
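
The keep-or-drop flow summarized in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `CachedDoc`, `encode_in_parallel`, `select_caches`, `toy_overlap_score` and the 0.5 threshold are all hypothetical names and values, the "cache" is a plain token list standing in for key/value tensors, and the word-overlap scorer stands in for the relevance score that Sparse RAG obtains by prompting the LLM with control tokens.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class CachedDoc:
    """A retrieved document plus a stand-in for its precomputed cache."""
    text: str
    kv_cache: list  # hypothetical placeholder; a real system holds key/value tensors


def encode_in_parallel(docs: Sequence[str]) -> List[CachedDoc]:
    # Each document is encoded on its own, independent of the others,
    # so no document needs to attend to any other document.
    return [CachedDoc(text=d, kv_cache=d.split()) for d in docs]


def select_caches(
    cached: Sequence[CachedDoc],
    score_fn: Callable[[CachedDoc], float],
    threshold: float = 0.5,  # assumed value, chosen only for this example
) -> List[CachedDoc]:
    # Keep a document's cache only if its relevance score clears the threshold.
    return [doc for doc in cached if score_fn(doc) >= threshold]


def toy_overlap_score(question: str) -> Callable[[CachedDoc], float]:
    # Toy scorer: fraction of question words that appear in the document.
    # It replaces the LLM-based assessment purely for illustration.
    q_terms = set(question.lower().split())

    def score(doc: CachedDoc) -> float:
        d_terms = set(doc.text.lower().split())
        return len(q_terms & d_terms) / max(len(q_terms), 1)

    return score


question = "who wrote hamlet"
docs = [
    "Hamlet is a tragedy that Shakespeare wrote",
    "Paris is the capital of France",
    "Who wrote Hamlet is a common quiz question",
]
kept = select_caches(encode_in_parallel(docs), toy_overlap_score(question))
print([d.text for d in kept])
```

Only the caches in `kept` would be loaded for decoding, which is where the reduction in attended context comes from. In the paper's setting the assessment and the generation happen in a single process, whereas this sketch separates them into two function calls for readability.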