REFRAG: Rethinking RAG based Decoding

Xiaoqiang Lin, Aritra Ghosh, Bryan Kian Hsiang Low, Anshumali Shrivastava, Vijai Mohan

TL;DR

REFRAG tackles the problem of long-context latency in retrieval-augmented generation by exploiting the sparsity and structure of RAG contexts. It introduces a decoding framework that compresses, senses, and expands chunk embeddings, enabling pre-computation and selective expansion via RL, while preserving autoregressive decoding. Empirical results show up to $30.85\times$ TTFT acceleration and up to $16\times$ context expansion without perplexity loss, across RAG, multi-turn conversation, and long-document summarization tasks, outperforming prior baselines like CEPE. The work demonstrates that specialized decoding strategies for RAG can dramatically improve throughput and context efficiency, enabling larger contexts without sacrificing accuracy.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in leveraging extensive external knowledge to enhance responses in multi-turn and agentic applications, such as retrieval-augmented generation (RAG). However, processing long-context inputs introduces significant system latency and demands substantial memory for the key-value cache, resulting in reduced throughput and a fundamental trade-off between knowledge enrichment and system efficiency. While minimizing latency for long-context inputs is a primary objective for LLMs, we contend that RAG requires specialized consideration. In RAG, much of the LLM context consists of concatenated passages from retrieval, with only a small subset directly relevant to the query. These passages often exhibit low semantic similarity due to diversity or deduplication during re-ranking, leading to block-diagonal attention patterns that differ from those in standard LLM generation tasks. Based on this observation, we argue that most computations over the RAG context during decoding are unnecessary and can be eliminated with minimal impact on performance. To this end, we propose REFRAG, an efficient decoding framework that compresses, senses, and expands to improve latency in RAG applications. By exploiting the sparsity structure, we demonstrate a $30.85\times$ time-to-first-token acceleration ($3.75\times$ improvement over previous work) without loss in perplexity. In addition, our optimization framework for large context enables REFRAG to extend the context size of LLMs by $16\times$. We provide rigorous validation of REFRAG across diverse long-context tasks, including RAG, multi-turn conversations, and long document summarization, spanning a wide range of datasets. Experimental results confirm that REFRAG delivers substantial speedup with no loss in accuracy compared to LLaMA models and other state-of-the-art baselines across various context sizes.

Paper Structure

This paper contains 45 sections, 6 equations, 12 figures, 21 tables.

Figures (12)

  • Figure 1: The main design of REFRAG. The input context is chunked and processed by the light-weight encoder to produce chunk embeddings, which are precomputable for efficient reuse. A light-weight RL policy decides which few chunks to expand. These chunk embeddings, along with the token embeddings of the question input, are fed to the decoder.
  • Figure 2: Empirical verification of inference acceleration of REFRAG with $k=16$.
  • Figure 3: Log-Perplexity on $x_{s+1:s+o}$ under varying compression rates by selectively compressing different percentages of chunks. We compare four selection methods: RL (trained policy), Perplexity-desc (heuristic: lower perplexity), Perplexity-asc (heuristic: higher perplexity), and Random (random selection).
  • Figure 4: RAG performance comparison under a strong retriever scenario (left) and a weak retriever scenario (right). REFRAG performs similarly to the LLaMA model under the same retrieved passages (slightly better in the weak retriever case) while significantly outperforming it under the same latency.
  • Figure 5: A demonstration of selective token compression. For all chunks, by default, we compress them to a single token, while for crucial chunks, we expand them.
  • ...and 7 more figures
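
The captions for Figures 1 and 5 describe a pipeline in which the context is chunked, each chunk is compressed to a single embedding by default, and a few chunks chosen by a policy are expanded back to their tokens before being passed to the decoder with the question. The following is a minimal sketch of that bookkeeping only, not the authors' implementation. The function names, the mean-based `encode_chunk`, and the score-based `select_chunks_to_expand` are hypothetical stand-ins for the light-weight encoder and the RL policy; the chunk size `k = 16` is borrowed from the Figure 2 caption.

```python
from typing import List, Sequence, Set, Union

Token = int
Embedding = List[float]


def chunk_tokens(tokens: Sequence[Token], k: int) -> List[List[Token]]:
    """Split the context into consecutive chunks of at most k tokens."""
    return [list(tokens[i:i + k]) for i in range(0, len(tokens), k)]


def encode_chunk(chunk: Sequence[Token], dim: int = 4) -> Embedding:
    """Hypothetical stand-in for the light-weight encoder: one vector per chunk."""
    mean = sum(chunk) / len(chunk)
    return [mean] * dim


def select_chunks_to_expand(scores: Sequence[float], budget: int) -> Set[int]:
    """Hypothetical stand-in for the RL policy: pick the `budget` highest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:budget])


def build_decoder_input(
    context: Sequence[Token],
    question: Sequence[Token],
    k: int,
    scores: Sequence[float],
    budget: int,
) -> List[Union[Token, Embedding]]:
    """Return the decoder input: one slot per compressed chunk, k slots per expanded chunk."""
    chunks = chunk_tokens(context, k)
    if len(scores) != len(chunks):
        raise ValueError("need exactly one score per chunk")
    expand = select_chunks_to_expand(scores, budget)
    sequence: List[Union[Token, Embedding]] = []
    for idx, chunk in enumerate(chunks):
        if idx in expand:
            sequence.extend(chunk)                 # expanded: keep the original tokens
        else:
            sequence.append(encode_chunk(chunk))   # compressed: a single embedding slot
    sequence.extend(question)                      # the question is never compressed
    return sequence


if __name__ == "__main__":
    context = list(range(64))        # 64 context tokens -> 4 chunks of 16
    question = list(range(100, 108))  # 8 question tokens
    seq = build_decoder_input(context, question, k=16,
                              scores=[0.1, 0.9, 0.2, 0.3], budget=1)
    # 3 compressed slots + 16 expanded tokens + 8 question tokens = 27 positions,
    # compared with 64 + 8 = 72 positions without compression.
    print(len(seq))
```

In this toy setting the decoder sees 27 positions instead of 72. Because the chunk embeddings depend only on the retrieved passages, they could in principle be computed ahead of time and reused, which is the precomputation the Figure 1 caption refers to.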