Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training

Tianle Xia; Ming Xu; Lingxiang Hu; Yiding Sun; Wenwei Li; Linfang Shang; Liqun Liu; Peng Shu; Huan Yu; Jie Jiang

TL;DR

This work proposes Search-P1, a framework that introduces path-centric reward shaping for agentic RAG training. It comprises two key components: (1) Path-Centric Reward, which evaluates the structural quality of reasoning trajectories through order-agnostic step coverage and uses soft scoring to extract learning signals even from failed samples, and (2) Dual-Track Path Scoring, which assesses paths from both self-consistency and reference-alignment perspectives.

Abstract

Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, yet traditional single-round retrieval struggles with complex multi-step reasoning. Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed samples contribute nothing. We propose Search-P1, a framework that introduces path-centric reward shaping for agentic RAG training, comprising two key components: (1) Path-Centric Reward, which evaluates the structural quality of reasoning trajectories through order-agnostic step coverage and soft scoring that extracts learning signals even from failed samples, and (2) Dual-Track Path Scoring with offline-generated reference planners that assesses paths from both self-consistency and reference-alignment perspectives. Experiments on multiple QA benchmarks demonstrate that Search-P1 achieves significant improvements over Search-R1 and other strong baselines, with an average accuracy gain of 7.7 points.
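
To make "order-agnostic step coverage" and the two scoring tracks concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes that steps are plain strings matched exactly (a real system would likely need fuzzy or model-based matching), and that the two tracks are combined by taking the maximum. The function names `step_coverage` and `dual_track_score` are hypothetical.

```python
from typing import Iterable, Set


def step_coverage(plan_steps: Iterable[str], executed_steps: Iterable[str]) -> float:
    """Fraction of planned steps that appear anywhere in the executed path.

    Order is ignored: both sequences are reduced to sets before comparison.
    Exact string matching is an assumption made for this sketch.
    """
    plan: Set[str] = set(plan_steps)
    if not plan:
        return 0.0  # an empty plan gives no signal
    executed: Set[str] = set(executed_steps)
    return len(plan & executed) / len(plan)


def dual_track_score(
    self_plan: Iterable[str],
    reference_plan: Iterable[str],
    executed_steps: Iterable[str],
) -> float:
    """Score a path against the model's own plan and an offline reference plan.

    Taking the max of the two tracks is an assumption of this sketch,
    not a rule stated in the source.
    """
    executed = list(executed_steps)
    self_track = step_coverage(self_plan, executed)
    reference_track = step_coverage(reference_plan, executed)
    return max(self_track, reference_track)


if __name__ == "__main__":
    executed = ["find birthplace", "find director"]
    reference = ["find director", "find birthplace"]
    # Same steps in a different order still count as full coverage.
    print(step_coverage(reference, executed))
    # A path that covers only half of the plan still earns partial credit.
    print(step_coverage(reference, ["find director"]))
```

Because the score is a fraction rather than a pass/fail flag, a trajectory that ends in a wrong answer but covers some of the planned steps still yields a non-zero signal, which is the intuition behind extracting learning signals from failed samples.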

Paper Structure (53 sections, 8 equations, 10 figures, 13 tables)

Introduction
Related Work
Prompt-Based Agentic RAG.
RL-Based Agentic RAG.
Methodology
Problem Formulation
Path-Centric Reward
Reference Planner Generation
Dual-Track Path Scoring
Soft Outcome Scoring
Experiments
Experimental Setup
Datasets.
Models.
Evaluation Metric.
...and 38 more sections

Figures (10)

Figure 1: Performance comparison of Search-P1 against baselines on QA benchmarks. Our method achieves the highest average accuracy across all datasets on both (a) Qwen2.5-7B and (b) Qwen2.5-3B models.
Figure 2: Overview of Search-P1 framework. Our approach introduces path-centric reward shaping for agentic RAG training, comprising: (1) Dual-Track Path Scoring that evaluates trajectories from both self-consistency and reference-alignment perspectives, and (2) Soft Outcome Scoring that extracts training signals even from incorrect answers.
Figure 3: Training dynamics comparison of different format reward strategies. Soft Format (our buffered design) achieves faster ACC improvement and higher stable rewards compared to Strict Format (zero reward for invalid format) and Without Format baseline.
Figure 4: Effect of soft outcome scoring across datasets. Gray bars show accuracy without soft scoring (binary outcome), blue bars show accuracy with soft scoring. Per-dataset results are in Appendix \ref{sec:appendix_soft_outcome}.
Figure 5: Hyperparameter sensitivity analysis. All rewards are averaged over steps 195--205. (a) Effect of path reward weight $\lambda_p$. (b) Effect of accuracy weight $\lambda_a$. Per-dataset results are in Appendix \ref{sec:appendix_hyperparameter}.
...and 5 more figures
