SAGE: Benchmarking and Improving Retrieval for Deep Research Agents

Tiansheng Hu, Yilun Zhao, Canyu Zhang, Arman Cohan, Chen Zhao

TL;DR

SAGE introduces a reasoning-intensive benchmark for scientific literature retrieval and shows that, within deep-research agent workflows, traditional lexical retrieval (BM25) often surpasses modern LLM-based retrievers on both short-form and open-ended tasks. A corpus-level test-time scaling framework enriches documents with metadata and keywords, improving retrieval performance, particularly for short-form queries. The work highlights a misalignment between agent-generated sub-queries (often keyword-like) and the semantic strengths of LLM-based retrievers, suggesting that retriever-agent co-design and corpus augmentation are critical for practical deep-research systems. Overall, SAGE provides a controlled, up-to-date evaluation environment and demonstrates that document-side signals can boost retrieval when paired with off-the-shelf retrievers, while acknowledging limitations and avenues for future alignment between agents and retrieval backends.

Abstract

Deep research agents have emerged as powerful systems for addressing complex queries. Meanwhile, LLM-based retrievers have demonstrated strong capability in following instructions or reasoning. This raises a critical question: can LLM-based retrievers effectively contribute to deep research agent workflows? To investigate this, we introduce SAGE, a benchmark for scientific literature retrieval comprising 1,200 queries across four scientific domains, with a 200,000-paper retrieval corpus. We evaluate six deep research agents and find that all systems struggle with reasoning-intensive retrieval. Using DR Tulu as the backbone, we further compare BM25 and LLM-based retrievers (i.e., ReasonIR and gte-Qwen2-7B-instruct) as alternative search tools. Surprisingly, BM25 significantly outperforms LLM-based retrievers by approximately 30%, as existing agents generate keyword-oriented sub-queries. To improve performance, we propose a corpus-level test-time scaling framework that uses LLMs to augment documents with metadata and keywords, making retrieval easier for off-the-shelf retrievers. This yields 8% and 2% gains on short-form and open-ended questions, respectively.
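
To make the document-side augmentation idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a toy two-paper corpus, hand-written `keywords`, `venue`, and `year` fields standing in for LLM-generated output, and a small from-scratch Okapi BM25 scorer; all function names and field names are hypothetical. It illustrates how a keyword-style sub-query can fail to match the raw text yet match once metadata and keywords are appended.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def augment(doc):
    """Append metadata and keywords to the document text.

    In the paper these signals come from an LLM; here they are
    hand-written placeholders (an assumption for illustration).
    """
    extras = [doc.get("venue", ""), str(doc.get("year", "")),
              " ".join(doc.get("keywords", []))]
    return " ".join([doc["text"]] + [e for e in extras if e])


def bm25_scores(query, texts, k1=1.5, b=0.75):
    """Score every text against the query with Okapi BM25."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of texts containing each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (invented for illustration, not taken from the benchmark).
corpus = [
    {"text": "We propose heuristics guided by conservation laws for mesh refinement.",
     "keywords": ["physics-informed", "adaptive meshing"],
     "venue": "VenueA", "year": 2024},
    {"text": "A survey of transformer architectures for language modeling.",
     "keywords": ["transformers", "survey"],
     "venue": "VenueB", "year": 2023},
]

sub_query = "physics-informed 2024"  # keyword-style sub-query
plain = bm25_scores(sub_query, [d["text"] for d in corpus])
augmented = bm25_scores(sub_query, [augment(d) for d in corpus])
print("plain:", plain)
print("augmented:", augmented)
```

In this toy setting the raw abstract never mentions "physics-informed" or the year, so the lexical scorer can only find the paper after augmentation. The retriever itself is unchanged, which is the sense in which the scaling happens on the corpus side.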

Paper Structure

This paper contains 51 sections, 1 equation, 16 figures, 5 tables.

Figures (16)

  • Figure 1: SAGE task overview. Given a complex question, the deep research agent (e.g., DR Tulu) iteratively reasons, generates keyword-based sub-queries, searches for relevant papers, and outputs a final answer. We first evaluate the agents with their native web-search tool, and then modify DR Tulu's MCP service to replace web search with retrievers that perform corpus search over our paper collection.
  • Figure 2: Overview of short-form questions that require intensive reasoning over metadata, paper details and inter-paper relationships. Each question consists of three parts and has only one ground-truth answer.
  • Figure 3: Overview of open-ended questions that are grounded in real-world scenarios. Each question consists of three parts and has multiple ground-truth papers weighted by their relevance.
  • Figure 4: An illustrative case where LLM-based retrieval fails due to semantic drift. The query seeks a paper that uses physics-informed heuristics. ReasonIR over-emphasizes title-level keywords (highlighted in red) and thus retrieves wrong papers. The retrieved content then reinforces this focus in subsequent retrieval steps, creating a feedback loop that increasingly prioritizes "physics-informed" in title. In contrast, BM25 remains anchored by lexical matching in similar sub-queries and avoids this drift.
  • Figure 5: Example of a Short-Form question.
  • ...and 11 more figures