Stronger Baselines for Retrieval-Augmented Generation with Long-Context Language Models
Alex Laitenberger, Christopher D. Manning, Nelson F. Liu
TL;DR
This work examines whether complex multi-stage retrieval-augmented generation (RAG) pipelines remain advantageous when long-context language models can process tens of thousands of tokens. It runs a controlled comparison across token budgets on three long-context QA benchmarks, pitting two multi-stage pipelines (ReadAgent, RAPTOR) against three baselines, including DOS RAG, and finds that DOS RAG often matches or surpasses the more complex methods. The authors identify four factors underlying DOS RAG’s success: preserving original passages, prioritizing recall within the model’s effective context, maintaining document order, and favoring simplicity over pipeline complexity. They advocate adopting DOS RAG as a simple, strong baseline for future RAG evaluations and emphasize benchmarking under matched token budgets as models advance.
Abstract
With the rise of long-context language models (LMs) capable of processing tens of thousands of tokens in a single context window, do multi-stage retrieval-augmented generation (RAG) pipelines still offer measurable benefits over simpler, single-stage approaches? To assess this question, we conduct a controlled evaluation for QA tasks under systematically scaled token budgets, comparing two recent multi-stage pipelines, ReadAgent and RAPTOR, against three baselines, including DOS RAG (Document's Original Structure RAG), a simple retrieve-then-read method that preserves original passage order. Despite its straightforward design, DOS RAG consistently matches or outperforms more intricate methods on multiple long-context QA benchmarks. We trace this strength to a combination of maintaining source fidelity and document structure, prioritizing recall within effective context windows, and favoring simplicity over added pipeline complexity. We recommend establishing DOS RAG as a simple yet strong baseline for future RAG evaluations, paired with state-of-the-art embedding and language models, and benchmarked under matched token budgets, to ensure that added pipeline complexity is justified by clear performance gains as models continue to improve.
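As a rough illustration of the retrieve-then-read idea behind DOS RAG described above, the following minimal Python sketch selects the most relevant chunks under a token budget and then restores their original document order. The word-overlap scorer, the whitespace token count, the greedy budget fill, and all function names are illustrative assumptions, not the authors' implementation; a real system would use an embedding model for retrieval and the language model's tokenizer for counting.

```python
from typing import List


def overlap_score(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query words that appear in the chunk.

    Assumption: this stands in for an embedding similarity score.
    """
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0


def dos_rag_context(query: str, chunks: List[str], token_budget: int) -> List[str]:
    """Pick the highest-scoring chunks that fit the budget, then restore document order.

    `chunks` is assumed to be listed in original document order.
    """
    # Rank chunk indices by relevance, most relevant first (ties keep document order).
    ranked = sorted(range(len(chunks)), key=lambda i: -overlap_score(query, chunks[i]))

    selected = []
    used = 0
    for i in ranked:
        cost = len(chunks[i].split())  # crude token count (assumption)
        if used + cost > token_budget:
            continue  # this chunk does not fit; a smaller one further down might
        selected.append(i)
        used += cost

    # The step that gives the method its name: reorder the retrieved chunks by
    # their original position instead of by relevance score.
    return [chunks[i] for i in sorted(selected)]


if __name__ == "__main__":
    document = [
        "the treaty was signed in 1648",
        "weather was mild that year",
        "the treaty ended the long war",
    ]
    # The third chunk is the most relevant, but it is still presented after the first.
    print(dos_rag_context("treaty ended war", document, token_budget=12))
```

Raising `token_budget` admits more chunks (higher recall) while the final sort keeps the context in the document's original order, which is the combination the abstract credits for the method's strength.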
