How to Train Your Deep Research Agent? Prompt, Reward, and Policy Optimization in Search-R1

Yinuo Xu; Shuo Lu; Jianjie Cheng; Meng Wang; Qianlong Xie; Xingxing Wang; Ran He; Jian Liang

TL;DR

A systematic study along three decoupled dimensions (prompt template, reward function, and policy optimization) reveals that the Fast Thinking template yields greater stability and better performance than the Slow Thinking template used in prior work.

Abstract

Deep Research agents tackle knowledge-intensive tasks through multi-round retrieval and decision-oriented generation. While reinforcement learning (RL) has been shown to improve performance in this paradigm, its contributions remain underexplored. To fully understand the role of RL, we conduct a systematic study along three decoupled dimensions: prompt template, reward function, and policy optimization. Our study reveals that: 1) the Fast Thinking template yields greater stability and better performance than the Slow Thinking template used in prior work; 2) the F1-based reward underperforms the exact-match (EM) reward because of training collapse driven by answer avoidance; this collapse can be mitigated by incorporating action-level penalties, after which the F1-based reward surpasses EM; 3) REINFORCE outperforms PPO while requiring fewer search actions, whereas GRPO shows the poorest stability among the policy optimization methods. Building on these insights, we introduce Search-R1++, a strong baseline that improves the performance of Search-R1 from 0.403 to 0.442 (Qwen2.5-7B) and from 0.289 to 0.331 (Qwen2.5-3B). We hope that our findings pave the way for more principled and reliable RL training strategies in Deep Research systems.
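To make the reward comparison in the abstract concrete, the following is a minimal sketch of an EM reward, a token-level F1 reward, and an F1 reward with a penalty for rollouts that never produce an answer. It is an illustration under stated assumptions, not the paper's implementation: the answer normalization follows common QA evaluation conventions, and the function names, the `penalty` value, and the choice to penalize a missing answer are all hypothetical stand-ins for the "action-level penalties" mentioned above.

```python
import re
import string
from collections import Counter
from typing import Optional


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: a common QA normalization, not necessarily the paper's.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def em_reward(prediction: str, gold: str) -> float:
    """Exact match: 1.0 only if the normalized strings are identical."""
    return float(normalize(prediction) == normalize(gold))


def f1_reward(prediction: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty scores zero.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def f1_with_answer_penalty(
    prediction: Optional[str], gold: str, penalty: float = 0.5
) -> float:
    """Hypothetical action-level penalty: a rollout that emits no answer
    (prediction is None) receives a negative reward instead of zero, so
    avoiding the answer is never the cheapest strategy."""
    if prediction is None:
        return -penalty
    return f1_reward(prediction, gold)
```

For example, `f1_reward("Paris France", "Paris")` gives partial credit (about 0.67) where `em_reward` gives 0.0. The sketch only shows how the reward signals differ in shape and how a penalty term can make skipping the answer costly.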

Paper Structure (21 sections, 2 equations, 11 figures, 11 tables)

This paper contains 21 sections, 2 equations, 11 figures, 11 tables.

Introduction
Deep Research
Prompt Template
Setup
The Less Thinking, the Better Performance
Fast vs. Slow Thinking Templates
Reward Function
Setup
Is F1 really better than EM?
Why Does Training Collapse?
Revitalizing F1 through Action Supervision
Policy Optimization
Setup
REINFORCE vs PPO vs GRPO
A strong baseline: Search-R1++
...and 6 more sections

Figures (11)

Figure 1: An example of the Search-R1 generation pipeline.
Figure 2: (a) demonstrates Deep Research's RL training pipeline; (b) shows an overview of the three key aspects explored in our work: prompt template, reward function, and policy optimization.
Figure 3: (a) Accuracy under varying information tokens; (b) Accuracy under varying reasoning tokens.
Figure 4: (a) compares the training score under Fast and Slow Thinking templates; (b) shows the average response length evolution over training steps; (c) illustrates the surge in <think> tags coinciding with the performance collapse.
Figure 5: Overall accuracy, answered-only accuracy, and answer rate (shaded area) under F1 reward.
...and 6 more figures
