AgentIR: Reasoning-Aware Retrieval for Deep Research Agents

Zijian Chen, Xueguang Ma, Shengyao Zhuang, Jimmy Lin, Akari Asai, Victor Zhong

TL;DR

This paper introduces Reasoning-Aware Retrieval, a retrieval paradigm that jointly embeds the agent's reasoning trace alongside its query, and DR-Synth, a data synthesis method that generates Deep Research retriever training data from standard QA datasets.

Abstract

Deep Research agents are rapidly emerging as primary consumers of modern retrieval systems. Unlike human users who issue and refine queries without documenting their intermediate thought processes, Deep Research agents generate explicit natural language reasoning before each search call, revealing rich intent and contextual information that existing retrievers entirely ignore. To exploit this overlooked signal, we introduce: (1) Reasoning-Aware Retrieval, a retrieval paradigm that jointly embeds the agent's reasoning trace alongside its query; and (2) DR-Synth, a data synthesis method that generates Deep Research retriever training data from standard QA datasets. We demonstrate that both components are independently effective, and their combination yields a trained embedding model, AgentIR-4B, with substantial gains. On the challenging BrowseComp-Plus benchmark, AgentIR-4B achieves 68% accuracy with the open-weight agent Tongyi-DeepResearch, compared to 50% with conventional embedding models twice its size, and 37% with BM25. Code and data are available at: https://texttron.github.io/AgentIR/.
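
To make the idea of jointly embedding the reasoning trace with the query concrete, here is a minimal sketch of how the retriever input might be assembled before it is passed to an embedding model. It is an illustration under stated assumptions, not the authors' implementation: the template string, the function name `build_retriever_input`, and the parameter `k` (how many recent reasoning turns to include) are all hypothetical, and the actual AgentIR-4B prompt template is not reproduced here.

```python
from typing import Sequence

# Hypothetical template for illustration only; the prompt actually used
# by AgentIR-4B is not reproduced in this sketch.
TEMPLATE = "Reasoning: {reasoning}\nQuery: {query}"


def build_retriever_input(query: str, reasonings: Sequence[str], k: int = 1) -> str:
    """Join the k most recent reasoning traces with the current query.

    With k = 0 (or no reasoning available) this falls back to the bare
    query, which corresponds to conventional query-only retrieval.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    # Guard k == 0 explicitly: reasonings[-0:] would return the whole list.
    recent = list(reasonings[-k:]) if k > 0 else []
    if not recent:
        return query
    return TEMPLATE.format(reasoning="\n".join(recent), query=query)


if __name__ == "__main__":
    # Made-up agent trajectory, used only to show the call shape.
    history = [
        "The question mentions a Swedish artist; I should find candidates.",
        "One candidate fits the country clue; I need to verify the release year.",
    ]
    print(build_retriever_input("artist debut single release year", history, k=1))
```

The resulting string would then be embedded by whichever model serves as the retriever, in place of the bare query; documents are embedded as usual.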

Paper Structure (42 sections, 2 equations, 14 figures, 3 tables)

Figures (14)

  • Figure 1: Reasoning-Aware Retrieval (AgentIR-4B) vs. conventional retrieval (Qwen3-Embedding-4B) for a task from BrowseComp-Plus, paired with the Tongyi-DR agent. The task has been simplified for display.
  • Figure 2: Oracle reranking procedure used in DR-Synth (Section \ref{sec:labels}).
  • Figure 3: Effect of embedding $k$ history turns, with the agent fixed to Tongyi-DR. Plot (a) shows the end-to-end accuracy when embedding the past $k$ turns, where "None" denotes the "AgentIR-4B w/o Reasoning" entry in Table \ref{tab:ablation_components}. Plot (b) shows the ratio of unique clues covered by the $k$ most recent reasonings, among all clues that have been covered. This is averaged across all trajectories for the $k=\text{all}$ setting (Section \ref{sec:num-turns}).
  • Figure 4: (a) Reasoning for the query in Figure \ref{fig:teaser} after identifying the candidate artist, Otto Knows from Sweden. (b) Average number of correct vs. incorrect claims (hypotheses) using the $k$ most recent reasonings. This is averaged across trajectories for the $k=\text{all}$ setting (Section \ref{sec:num-turns}).
  • Figure 5: The prompt template used to embed inputs for AgentIR-4B. At turn $t$, we fill in {reasoning} with $\tau_t$ and {query} with $q_t$. Note that the duplicate "Query:" is intentional, due to Qwen3-Embedding-4B's instruction format.
  • ...and 9 more figures