
SynPlanResearch-R1: Encouraging Tool Exploration for Deep Research with Synthetic Plans

Hansi Zeng, Zoey Li, Yifan Gao, Chenwei Zhang, Xiaoman Pan, Tao Yang, Fengran Mo, Jiacheng Lin, Xian Li, Jingbo Shang

TL;DR

This work proposes SynPlanResearch-R1, a framework that synthesizes tool-use trajectories designed to encourage deeper exploration. These trajectories are used during cold-start supervised fine-tuning to shape the agent's exploration behavior, providing a strong initialization for subsequent RL.

Abstract

Research agents enable models to gather information from the web using tools to answer user queries, which requires them to dynamically interleave internal reasoning with tool use. While such capabilities can in principle be learned via reinforcement learning with verifiable rewards (RLVR), we observe that agents often exhibit poor exploration behaviors, including premature termination and biased tool usage. As a result, RLVR alone yields limited improvements. We propose SynPlanResearch-R1, a framework that synthesizes tool-use trajectories designed to encourage deeper exploration; these trajectories are used during cold-start supervised fine-tuning to shape the agent's exploration behavior, providing a strong initialization for subsequent RL. Across seven multi-hop and open-web benchmarks, SynPlanResearch-R1 improves performance over SOTA baselines by up to 6.0% with a Qwen3-8B backbone and by up to 5.8% with a Qwen3-4B backbone. Further analyses of tool-use patterns and training dynamics, compared against baselines, shed light on the factors underlying these gains. Our code is publicly available at https://github.com/HansiZeng/syn-plan-research.

Paper Structure (43 sections, 11 equations, 3 figures, 10 tables, 1 algorithm)


Figures (3)

  • Figure 1: Overview of our data synthesis pipeline, which progresses from tool-plan generation, to cue-injected multi-step reasoning, and finally to trajectory filtering and quality control.
  • Figure 2: Relationship between average tool calls (x-axis) and task performance (y-axis; Exact Match, EM) across seven benchmarks; the last panel reports the macro average. Each method is shown at four RL checkpoints (20, 40, 60, and 80 steps).
  • Figure 3: Training dynamics of policy entropy (left) and reward (right) over RL steps for three regimes.
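The y-axis of Figure 2 reports Exact Match (EM), with the last panel showing the macro average over the seven benchmarks. As a rough illustration of how such numbers are typically computed, here is a minimal Python sketch. It assumes the common SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles); the paper's exact normalization is not stated in this summary, and the function names and benchmark names below are hypothetical.

```python
import re
import string
from statistics import mean


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization, not confirmed by the source.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> float:
    """Return 1.0 if the normalized prediction equals any normalized gold answer."""
    target = normalize_answer(prediction)
    return float(any(target == normalize_answer(g) for g in gold_answers))


def benchmark_em(examples: list[tuple[str, list[str]]]) -> float:
    """Average EM over (prediction, gold_answers) pairs of one benchmark."""
    return mean(exact_match(pred, golds) for pred, golds in examples)


def macro_average(per_benchmark: dict[str, float]) -> float:
    """Unweighted mean of per-benchmark scores, so each benchmark counts equally."""
    return mean(per_benchmark.values())


# Hypothetical usage with two made-up benchmarks.
scores = {
    "bench_a": benchmark_em([("Paris", ["paris"]), ("Rome", ["Milan"])]),
    "bench_b": benchmark_em([("The Eiffel Tower.", ["eiffel tower"])]),
}
print(macro_average(scores))
```

The macro average weights every benchmark equally regardless of its size, which is why a small benchmark can move the final panel as much as a large one.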