SynPlanResearch-R1: Encouraging Tool Exploration for Deep Research with Synthetic Plans

Hansi Zeng; Zoey Li; Yifan Gao; Chenwei Zhang; Xiaoman Pan; Tao Yang; Fengran Mo; Jiacheng Lin; Xian Li; Jingbo Shang

TL;DR

This work proposes SynPlanResearch-R1, a framework that synthesizes tool-use trajectories encouraging deeper exploration. These trajectories shape the agent's exploration behavior during cold-start supervised fine-tuning (SFT), providing a strong initialization for subsequent reinforcement learning (RL).

Abstract

Research agents enable models to gather information from the web using tools to answer user queries, which requires them to dynamically interleave internal reasoning with tool use. While such capabilities can in principle be learned via reinforcement learning with verifiable rewards (RLVR), we observe that agents often exhibit poor exploration behaviors, including premature termination and biased tool usage. As a result, RLVR alone yields limited improvements. We propose SynPlanResearch-R1, a framework that synthesizes tool-use trajectories encouraging deeper exploration and uses them during cold-start supervised fine-tuning (SFT) to shape exploration behavior, providing a strong initialization for subsequent RL. Across seven multi-hop and open-web benchmarks, SynPlanResearch-R1 improves performance over SOTA baselines by up to 6.0% on the Qwen3-8B backbone and 5.8% on the Qwen3-4B backbone. Further analyses of tool-use patterns and training dynamics, compared against baselines, shed light on the factors underlying these gains. Our code is publicly available at https://github.com/HansiZeng/syn-plan-research.

Paper Structure (43 sections, 11 equations, 3 figures, 10 tables, 1 algorithm)


Introduction
Related Works
Reinforcement Learning with Verifiable Reward
Research Agents
Research Agent Learning
Problem Definition
Challenges in Agent Learning
SynPlanResearch-R1 Framework
Plan-Guided Data Synthesis
Step 1: Tool-Plan Construction.
Step 2: Cue-Injected Thoughts.
Step 3: Filtering and Quality Control.
Step 4: Thought Rewriting.
Reinforcement Learning with Cold-Start SFT
Cold-start SFT.
...and 28 more sections

Figures (3)

Figure 1: Overview of our data synthesis pipeline, which progresses from tool-plan generation, to cue-injected multi-step reasoning, and finally to trajectory filtering and quality control.
Figure 2: Relationship between average tool calls (x-axis) and task performance (y-axis; Exact Match, EM) across seven benchmarks; the last panel reports the macro average. Each method is shown at four RL checkpoints (20, 40, 60, and 80 steps).
Figure 3: Training dynamics of policy entropy (left) and reward (right) over RL steps for three regimes.
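
The Figure 2 caption refers to Exact Match (EM) and a macro average over benchmarks. As a rough illustration of how such numbers are commonly computed, here is a minimal sketch. The normalization rules (lowercasing, stripping punctuation and articles) follow a widespread QA convention and are an assumption, not necessarily the paper's evaluation script; the function names and benchmark labels are hypothetical.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: a common QA-style normalization, not taken from the paper.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> float:
    """Return 1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize(prediction)
    return float(any(pred == normalize(gold) for gold in gold_answers))


def benchmark_em(examples: list[tuple[str, list[str]]]) -> float:
    """Mean EM over one benchmark's (prediction, gold answers) pairs."""
    scores = [exact_match(pred, golds) for pred, golds in examples]
    return sum(scores) / len(scores)


def macro_average(per_benchmark: dict[str, float]) -> float:
    """Unweighted mean of per-benchmark scores, so each benchmark counts equally."""
    return sum(per_benchmark.values()) / len(per_benchmark)


# Hypothetical benchmark labels and toy scores, for illustration only.
toy_scores = {"bench_a": 0.5, "bench_b": 1.0}
toy_macro = macro_average(toy_scores)
```

A macro average weights every benchmark equally regardless of its size, which is why a single panel can summarize benchmarks with very different numbers of questions.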
