DSPO: Stable and Efficient Policy Optimization for Agentic Search and Reasoning

Chenyang Gu, Yewen Pu, Bruce Yang, Xiaofan Li, Huan Gao

TL;DR

DSPO tackles instability and sample inefficiency in RL-trained agentic search for LLMs by combining sequence-level optimization with dynamic outcome-based filtering. By aligning the unit of optimization with the trajectory-level reward and ensuring that every batch contains diverse learning signals, DSPO achieves strong performance on multi-turn QA benchmarks with a 7B model, even rivaling larger baselines. Empirical results show substantial gains over prior methods (e.g., 34.1% relative over a comparable 7B model, and nearly 9% relative over a 14B model on HotpotQA) and indicate a training stability that token-level approaches lack. The approach relies on a BM25 retriever and demonstrates the viability of RL-only training for autonomous search and reasoning, with promising implications for scalable, data-efficient agentic systems.

Abstract

Enhancing LLMs with the ability to actively search external knowledge is crucial for complex and real-world tasks. Current approaches either rely on prompting to elicit the model's innate agent capabilities, or suffer from performance ceilings and collapse when applying RL to complex interactive tasks, leaving their true agentic potential untapped. To address this, we introduce Dynamic-filter Sequence-level Policy Optimization (DSPO), an improved RL algorithm designed for robust agent training through sequence-level optimization and dynamic sample filtering. We train our model purely through RL to interleave multi-turn search and reasoning, obviating the need for supervised demonstration data. Across multiple QA benchmarks, our 7B model improves over a comparable previous work by 34.1%, and even outperforms the 14B model from previous work in complex multihop QA such as HotpotQA by nearly 9% relative, maintaining exceptional training stability.

Paper Structure

This paper contains 25 sections, 9 equations, 3 figures, 2 tables, 1 algorithm.

Figures (3)

  • Figure 1: An overview of the DSPO training loop. For a given query, the policy model generates a group of $G$ trajectories by interacting with the search environment. Each trajectory is assigned a sparse terminal reward. The dynamic filter discards groups with homogeneous outcomes and keeps sampling until a batch is filled, ensuring that every training batch provides an effective advantage signal. Advantages are computed and used to update the policy model via a sequence-level objective.
  • Figure 2: Validation performance of DSPO across seven benchmarks during training. The steady, monotonic increase in accuracy confirms that DSPO's reward improvement translates directly to enhanced generalization and that our method learns a robust search-and-reasoning policy.
  • Figure 3: Training reward dynamics of DSPO and its ablations. Comparative view of learning curves. DSPO (red) demonstrates stable and monotonic improvement. In contrast, token-level variants (green, blue) suffer catastrophic policy collapse, while the sequence-level variant without our filter (purple) plateaus at a suboptimal level.
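
The dynamic filter described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes binary terminal rewards and a group-normalized advantage (reward minus group mean, divided by group standard deviation), which the text above does not specify. The names `is_informative`, `group_advantages`, `fill_batch`, `sample_query`, and `rollout_rewards` are hypothetical, and the `max_attempts` cap is an added safeguard.

```python
from statistics import mean, pstdev
from typing import Callable, List, Tuple


def is_informative(rewards: List[float]) -> bool:
    """A group carries a learning signal only if its outcomes differ."""
    return max(rewards) != min(rewards)


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Assumed group-normalized advantage: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def fill_batch(
    sample_query: Callable[[], str],
    rollout_rewards: Callable[[str, int], List[float]],
    batch_size: int,
    group_size: int,
    max_attempts: int = 1000,
) -> List[Tuple[str, List[float]]]:
    """Keep sampling groups until `batch_size` informative groups are collected.

    `rollout_rewards(query, G)` stands in for generating G trajectories
    against the search environment and scoring each with a terminal reward.
    """
    batch: List[Tuple[str, List[float]]] = []
    for _ in range(max_attempts):
        if len(batch) == batch_size:
            break
        query = sample_query()
        rewards = rollout_rewards(query, group_size)
        # Homogeneous groups (all correct or all wrong) would give zero
        # advantage for every trajectory, so they are discarded.
        if is_informative(rewards):
            batch.append((query, group_advantages(rewards)))
    return batch
```

Under this assumed normalization, a group whose trajectories all succeed or all fail yields identically zero advantages, so filtering such groups keeps every slot in the batch contributing to the policy update.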