InfoFlow: Reinforcing Search Agent Via Reward Density Optimization

Kun Luo, Hongjin Qian, Zheng Liu, Ziyi Xia, Shitao Xiao, Siqi Bao, Jun Zhao, Kang Liu

TL;DR

Reward sparsity is a core barrier to scalable learning in reinforcement-learning-based agentic deep search. InfoFlow tackles this by combining sub-goal scaffolding, adaptive pathfinding hints, and a dual-agent trajectory refinement framework that splits reasoning from evidence synthesis, with the policy initialized via rejection sampling fine-tuning. The method yields denser, process-level supervision and more stable policy optimization. It generalizes better on QA tasks and performs strongly on the complex BrowseComp-Plus benchmarks, in some cases matching much larger proprietary models with smaller backbones. Ablation and depth analyses indicate that the dual-agent setup, sub-goal rewards, and hints each contribute to learning efficiency and robust long-horizon reasoning.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is a promising approach for enhancing agentic deep search. However, its application is often hindered by low Reward Density in deep search scenarios, where agents incur significant exploration costs for infrequent and often null final rewards. In this paper, we formalize this challenge as the Reward Density Optimization problem, which aims to improve the reward obtained per unit of exploration cost. We introduce InfoFlow, a systematic framework that tackles this problem from three aspects. 1) Subproblem decomposition: breaking down long-range tasks to assign process rewards, thereby providing denser learning signals. 2) Failure-guided hints: injecting corrective guidance into stalled trajectories to increase the probability of successful outcomes. 3) Dual-agent refinement: employing a dual-agent architecture to offload the cognitive burden of deep exploration. A refiner agent synthesizes the search history, which effectively compresses the researcher's perceived trajectory, thereby reducing exploration cost and increasing the overall reward density. We evaluate InfoFlow on multiple agentic search benchmarks, where it significantly outperforms strong baselines, enabling lightweight LLMs to achieve performance comparable to advanced proprietary LLMs.
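
As a rough illustration of "reward obtained per unit of exploration cost", the sketch below computes a toy reward density over a batch of rollouts. It is not the paper's formalization: the `Trajectory` fields, the use of context tokens as the cost proxy, the reward values, and the assumption that a refiner halves the context are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Trajectory:
    """One search rollout (all fields are illustrative assumptions)."""
    final_reward: float  # 1.0 if the final answer is verified, else 0.0
    subgoal_rewards: List[float] = field(default_factory=list)  # process rewards
    context_tokens: int = 0  # assumed proxy for exploration cost


def reward_density(trajectories: List[Trajectory], use_subgoals: bool = False) -> float:
    """Total reward divided by total exploration cost over a batch of rollouts."""
    total_cost = sum(t.context_tokens for t in trajectories)
    if total_cost <= 0:
        raise ValueError("exploration cost must be positive")
    total_reward = sum(t.final_reward for t in trajectories)
    if use_subgoals:
        # Sub-goal (process) rewards add signal even when the final answer fails.
        total_reward += sum(sum(t.subgoal_rewards) for t in trajectories)
    return total_reward / total_cost


# Toy batch: one failed and one successful rollout, 1000 context tokens each.
batch = [
    Trajectory(final_reward=0.0, subgoal_rewards=[0.5, 0.5], context_tokens=1000),
    Trajectory(final_reward=1.0, subgoal_rewards=[0.5], context_tokens=1000),
]

outcome_only = reward_density(batch)                      # sparse final reward only
with_subgoals = reward_density(batch, use_subgoals=True)  # adds process rewards

# Same rewards, but a refiner is assumed to halve the context the researcher sees.
compressed = [
    Trajectory(t.final_reward, t.subgoal_rewards, t.context_tokens // 2)
    for t in batch
]
with_refiner = reward_density(compressed, use_subgoals=True)

print(outcome_only, with_subgoals, with_refiner)
```

In this toy setting, process rewards raise the numerator and trajectory compression lowers the denominator, so each step increases the density. Failure-guided hints are not modeled here; under the same assumptions they would act by making a nonzero `final_reward` more frequent.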

Paper Structure

This paper contains 44 sections, 8 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: The InfoFlow framework and an example DSQA task. The researcher agent focuses on reasoning and planning, while the refiner agent synthesizes the large volume of retrieved content into condensed information.
  • Figure 2: Dual agent framework enhances reward density: achieving higher accuracy with less context.
  • Figure 3: Analysis of Reasoning Depth.
  • Figure 4: Case study 1 (Sensation novel): An example from the enriched InfoSeek dataset. The hints decompose the main question into more manageable, high-leverage search queries that serve as off-policy guidance.
  • Figure 5: Case study 2 (Contemporary Concepts): An example from the enriched InfoSeek dataset. The hints decompose the main question into more manageable, high-leverage search queries that serve as off-policy guidance.
  • ...and 1 more figure