
Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

Yilun Zhao, Jinbiao Wei, Tingyu Song, Siyue Zhang, Chen Zhao, Arman Cohan

Abstract

Reasoning-intensive retrieval aims to surface evidence that supports downstream reasoning rather than merely matching topical similarity. This capability is increasingly important for agentic search systems, where retrievers must provide complementary evidence across iterative search and synthesis. However, existing work remains limited on both evaluation and training: benchmarks such as BRIGHT provide narrow gold sets and evaluate retrievers in isolation, while synthetic training corpora often optimize single-passage relevance rather than evidence portfolio construction. We introduce BRIGHT-Pro, an expert-annotated benchmark that expands each query with multi-aspect gold evidence and evaluates retrievers under both static and agentic search protocols. We further construct RTriever-Synth, an aspect-decomposed synthetic corpus that generates complementary positives and positive-conditioned hard negatives, and use it to LoRA fine-tune RTriever-4B from Qwen3-Embedding-4B. Experiments across lexical, general-purpose, and reasoning-intensive retrievers show that aspect-aware and agentic evaluation expose behaviors hidden by standard metrics, while RTriever-4B substantially improves over its base model.

Paper Structure

This paper contains 56 sections, 6 equations, 11 figures, 16 tables.

Figures (11)

  • Figure 1: Overview of our work. Left: BRIGHT-Pro augments BRIGHT with re-audited gold passages and reasoning-aspect-level labels, enabling retriever evaluation under both static and agentic search protocols. Right: RTriever is trained on RTriever-Synth. RTriever-Synth rewrites MS MARCO queries into DeepResearch-style queries, generates reference answers and decomposes them into non-overlapping reasoning aspects, then synthesizes complementary positives for each aspect along with positive-conditioned hard negatives for LoRA fine-tuning.
  • Figure 2: An overview of the Bright-Pro benchmark construction pipeline.
  • Figure 3: Prompt to run deep research agent.
  • Figure 4: Prompt to generate the final response after a fixed number of retrieval rounds. At each fixed round $r\!\in\!\{1,2,3\}$, {EvidenceDocuments} is the concatenation of all documents retrieved through round $r$.
  • Figure 5: Prompt for reference answer generation, showing input structure and output specification.
  • ...and 6 more figures