SAGE: Benchmarking and Improving Retrieval for Deep Research Agents
Tiansheng Hu, Yilun Zhao, Canyu Zhang, Arman Cohan, Chen Zhao
TL;DR
SAGE introduces a reasoning-intensive benchmark for scientific literature retrieval and shows that, within deep research agent workflows, traditional lexical retrieval (BM25) often surpasses modern LLM-based retrievers on both short-form and open-ended tasks. The authors propose a corpus-level test-time scaling framework that enriches documents with metadata and keywords, improving retrieval performance, particularly for short-form queries. The work highlights a mismatch between the sub-queries agents generate (often keyword-like) and the semantic strengths of LLM-based retrievers, suggesting that retriever-agent co-design and corpus augmentation both matter for practical deep research systems. Overall, SAGE provides a controlled, up-to-date evaluation environment and demonstrates that document-side signals can boost retrieval when paired with off-the-shelf retrievers, while leaving closer alignment between agents and retrieval backends to future work.
Abstract
Deep research agents have emerged as powerful systems for addressing complex queries. Meanwhile, LLM-based retrievers have demonstrated strong capability in following instructions and reasoning. This raises a critical question: can LLM-based retrievers effectively contribute to deep research agent workflows? To investigate this, we introduce SAGE, a benchmark for scientific literature retrieval comprising 1,200 queries across four scientific domains, with a retrieval corpus of 200,000 papers. We evaluate six deep research agents and find that all systems struggle with reasoning-intensive retrieval. Using DR Tulu as the backbone, we further compare BM25 and LLM-based retrievers (i.e., ReasonIR and gte-Qwen2-7B-instruct) as alternative search tools. Surprisingly, BM25 significantly outperforms LLM-based retrievers by approximately 30%, because existing agents generate keyword-oriented sub-queries. To improve performance, we propose a corpus-level test-time scaling framework that uses LLMs to augment documents with metadata and keywords, making retrieval easier for off-the-shelf retrievers. This yields 8% and 2% gains on short-form and open-ended questions, respectively.
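To make the idea of corpus-level augmentation concrete, the following is a minimal sketch, not the authors' implementation. It assumes a plain Okapi BM25 scorer written from scratch, a whitespace tokenizer, a three-document toy corpus, and hand-written keyword lists (`hypothetical_keywords`) standing in for the metadata and keywords that an LLM would generate. All names and parameter values (`k1`, `b`) are illustrative assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplifying assumption: lowercase whitespace tokenization.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


def augment(doc, extra_terms):
    # Append document-side signals to the text before indexing.
    # Here the terms are hand-written; in the paper an LLM produces them.
    return doc + " " + " ".join(extra_terms)


corpus = [
    "we study how agents search scientific literature",
    "a new method for protein structure prediction",
    "scaling behavior of language model pretraining",
]

# Hypothetical stand-ins for LLM-generated keywords (illustration only).
hypothetical_keywords = {
    0: ["retrieval", "benchmark"],
    1: ["biology", "folding"],
    2: ["compute", "transformer"],
}

augmented = [augment(doc, hypothetical_keywords[i]) for i, doc in enumerate(corpus)]

# A keyword-style sub-query of the kind an agent might issue.
query = "retrieval benchmark"
before = bm25_scores(query, corpus)
after = bm25_scores(query, augmented)
best_after = max(range(len(after)), key=after.__getitem__)
```

In this toy setup the query shares no terms with any original document, so every BM25 score in `before` is zero. Once the keywords are appended, the first document becomes the only one with a positive score. The sketch only illustrates why document-side enrichment can help a lexical retriever when agents issue keyword-oriented sub-queries; it says nothing about the size of the gains reported in the abstract.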
