In Defense of RAG in the Era of Long-Context Language Models

Tan Yu, Anbang Xu, Rama Akkiraju

TL;DR

The paper challenges the notion that extremely long context windows render RAG obsolete, arguing that excessive context can dilute the model's focus on relevant information. It introduces OP-RAG, an order-preserving retrieval-augmented generation mechanism that keeps retrieved chunks in their original document order and exhibits an inverted-U relationship between the number of retrieved chunks and answer quality. In experiments on the $\infty$Bench long-context QA benchmarks, OP-RAG achieves higher F1/accuracy with substantially fewer input tokens than long-context LLMs without RAG, showing that carefully retrieved and ordered context can outperform brute-force long-context processing. The work highlights the continued relevance of retrieval-based methods for long-context QA and points to practical sweet spots for efficient, high-quality answers.

Abstract

By overcoming the limited context windows of early-generation LLMs, retrieval-augmented generation (RAG) has been a reliable solution for context-based answer generation. The recent emergence of long-context LLMs allows models to incorporate much longer text sequences, making RAG less attractive. Recent studies show that long-context LLMs significantly outperform RAG in long-context applications. Unlike existing works that favor long-context LLMs over RAG, we argue that an extremely long context diminishes the model's focus on relevant information and can degrade answer quality. This paper revisits RAG in long-context answer generation. We propose an order-preserve retrieval-augmented generation (OP-RAG) mechanism, which significantly improves the performance of RAG for long-context question-answer applications. With OP-RAG, as the number of retrieved chunks increases, the answer quality initially rises and then declines, forming an inverted U-shaped curve. There exist sweet spots where OP-RAG achieves higher answer quality with far fewer tokens than a long-context LLM taking the whole context as input. Extensive experiments on a public benchmark demonstrate the superiority of OP-RAG.

Paper Structure

This paper contains 9 sections, 2 equations, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Comparisons between the proposed order-preserve retrieval-augmented generation (OP-RAG) and approaches using long-context LLMs without RAG on En.QA dataset of $\infty$Bench. Our OP-RAG uses Llama3.1-70B as generator, which significantly outperforms its counterpart using Llama3.1-70B without RAG.
  • Figure 2: Vanilla RAG versus the proposed order-preserve RAG. As shown in the example, a long document is split into $13$ chunks, $\{c_i\}_{i=1}^{13}$. The similarity score is appended to each chunk. We retrieve the top 4 chunks with the highest similarity scores. Vanilla RAG places the chunks in score-descending order, whereas the proposed order-preserve RAG places the chunks in the order in which they appear in the original document.
  • Figure 3: The influence of context length on the performance of RAG. The evaluations are conducted on the En.QA and En.MC datasets of $\infty$Bench.
  • Figure 4: Comparisons between the proposed order-preserve RAG and vanilla RAG. The evaluations are conducted on the En.QA and En.MC datasets of $\infty$Bench, using the Llama3.1-70B model.
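
The ordering difference described in the Figure 2 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names and the 13 similarity scores are hypothetical and chosen only for illustration, and the scores are assumed to have already been computed by some retriever. Both variants select the same top-k chunks and differ only in the order in which those chunks are placed in the prompt.

```python
from typing import List, Sequence


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring chunks, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def vanilla_rag_order(scores: Sequence[float], k: int) -> List[int]:
    """Vanilla RAG: keep the selected chunks in score-descending order."""
    return top_k_indices(scores, k)


def op_rag_order(scores: Sequence[float], k: int) -> List[int]:
    """Order-preserve RAG: same selection, restored to document order."""
    return sorted(top_k_indices(scores, k))


def build_context(chunks: Sequence[str], order: Sequence[int]) -> str:
    """Concatenate the selected chunks in the given order."""
    return "\n".join(chunks[i] for i in order)


# Hypothetical similarity scores for 13 chunks c_1..c_13 (made-up values,
# not taken from the paper). Index 0 corresponds to c_1.
example_scores = [0.20, 0.70, 0.10, 0.30, 0.90, 0.15, 0.40,
                  0.80, 0.05, 0.25, 0.60, 0.35, 0.12]
example_chunks = [f"c_{i}" for i in range(1, 14)]

vanilla = vanilla_rag_order(example_scores, k=4)
op = op_rag_order(example_scores, k=4)

print(build_context(example_chunks, vanilla).split("\n"))
print(build_context(example_chunks, op).split("\n"))
```

With these made-up scores, the vanilla ordering is c_5, c_8, c_2, c_11, while the order-preserve ordering is c_2, c_5, c_8, c_11. Varying `k` in such a sketch is also how one would sweep the number of retrieved chunks when looking for the sweet spot on the inverted U-shaped curve.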