Stronger Baselines for Retrieval-Augmented Generation with Long-Context Language Models

Alex Laitenberger, Christopher D. Manning, Nelson F. Liu

TL;DR

This work asks whether complex multi-stage retrieval-augmented generation (RAG) pipelines remain advantageous when long-context language models can process tens of thousands of tokens. It runs a controlled comparison across token budgets on three long-context QA benchmarks, evaluating two multi-stage pipelines (ReadAgent, RAPTOR) against three baselines, including DOS RAG, and finds that DOS RAG often matches or surpasses the complex methods. The authors identify four factors underlying DOS RAG’s success: preserving original passages, prioritizing recall within the model’s effective context, maintaining document order, and favoring simplicity over pipeline complexity. They advocate adopting DOS RAG as a simple, strong baseline for future RAG evaluations and emphasize benchmarking under matched token budgets as models advance.

Abstract

With the rise of long-context language models (LMs) capable of processing tens of thousands of tokens in a single context window, do multi-stage retrieval-augmented generation (RAG) pipelines still offer measurable benefits over simpler, single-stage approaches? To assess this question, we conduct a controlled evaluation for QA tasks under systematically scaled token budgets, comparing two recent multi-stage pipelines, ReadAgent and RAPTOR, against three baselines, including DOS RAG (Document's Original Structure RAG), a simple retrieve-then-read method that preserves original passage order. Despite its straightforward design, DOS RAG consistently matches or outperforms more intricate methods on multiple long-context QA benchmarks. We trace this strength to a combination of maintaining source fidelity and document structure, prioritizing recall within effective context windows, and favoring simplicity over added pipeline complexity. We recommend establishing DOS RAG as a simple yet strong baseline for future RAG evaluations, paired with state-of-the-art embedding and language models, and benchmarked under matched token budgets, to ensure that added pipeline complexity is justified by clear performance gains as models continue to improve.
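
As a rough illustration of the retrieve-then-read idea described above, the following Python sketch selects the highest-scoring chunks that fit a token budget and then restores their original document order before they are passed to the reader. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Chunk` class, the `dos_rag_context` function, the whitespace-based `count_tokens` helper, the precomputed similarity scores, and the choice to stop at the first chunk that exceeds the budget are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    position: int  # index of the chunk in the original document
    text: str
    score: float   # query similarity (assumed precomputed; higher is better)


def count_tokens(text: str) -> int:
    # Assumption: a crude whitespace count stands in for a real tokenizer.
    return len(text.split())


def dos_rag_context(chunks: List[Chunk], token_budget: int) -> List[str]:
    """Pick top-scoring chunks within the budget, then restore document order."""
    selected: List[Chunk] = []
    used = 0
    # Walk chunks from most to least relevant.
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = count_tokens(chunk.text)
        # Assumption: stop at the first chunk that would exceed the budget.
        if used + cost > token_budget:
            break
        selected.append(chunk)
        used += cost
    # The step that distinguishes this from relevance-ordered retrieval:
    # re-sort the selected chunks by their position in the source document.
    selected.sort(key=lambda c: c.position)
    return [c.text for c in selected]


if __name__ == "__main__":
    demo = [
        Chunk(0, "a b c", 0.1),
        Chunk(1, "d e", 0.9),
        Chunk(2, "f g h", 0.5),
        Chunk(3, "i j", 0.8),
    ]
    print(dos_rag_context(demo, token_budget=7))
```

Running every method through the same `token_budget` argument is one way to keep a comparison matched on reader input size; a relevance-ordered variant would differ only by omitting the final sort.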

Paper Structure

This paper contains 35 sections, 3 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: $\infty$Bench En.MC performance of various multi-stage RAG systems and long-context baselines (mean $\pm$ standard deviation over five runs). All methods use GPT-4o as the underlying reader. For token budgets greater than 5K, DOS RAG outperforms the complex multi-stage methods (ReadAgent and RAPTOR) by 2--8 points.
  • Figure 2: Comparison of single-stage vs. multi-stage RAG pipelines. Vanilla RAG and DOS RAG use a minimal retrieve-then-read setup, while RAPTOR and ReadAgent add preprocessing and LM-based steps (e.g., clustering, iterative summarization, pagination, gisting, lookup), increasing pipeline complexity and cost.
  • Figure 3: QuALITY performance of various multi-stage RAG systems and long-context baselines. All methods use GPT-4o as the underlying reader. Prompting long-context language models with entire documents (the full-document baseline) outperforms retrieval-augmented approaches, while DOS RAG performs the best under token budget constraints.
  • Figure 4: NarrativeQA performance of various multi-stage RAG systems and long-context baselines. All methods use GPT-4o-mini as the underlying reader. At each evaluated token budget, DOS RAG outperforms multi-stage retrieval systems and Vanilla RAG.
  • Figure 5: $\infty$Bench En.MC performance of various multi-stage RAG systems and long-context baselines (mean $\pm$ standard deviation over five runs). Comparison between GPT-4o-mini (left) and GPT-4o (right) as the reader. GPT-4o generally achieves higher accuracy, with DOS RAG peaking at a higher LM input token count, suggesting a larger effective context size. The ReadAgent results further indicate that GPT-4o can better utilize large context sizes, reaching performance levels generally comparable to the DOS RAG results despite using an excessive number of input tokens.
  • ...and 1 more figure