Improving Search Agent with One Line of Code

Jian Li, Dongsheng Chen, Zhenhua Xu, Yizhang Jin, Jiafu Wu, Chengjie Wang, Xiaotong Yuan, Yabiao Wang

TL;DR

This work proposes SAPO, which stabilizes training with a conditional token-level KL constraint. The constraint selectively penalizes the KL divergence between the current and old policies, preventing distribution drift while preserving gradient flow.

Abstract

Tool-based Agentic Reinforcement Learning (TARL) has emerged as a promising paradigm for training search agents to interact autonomously with external tools over multi-turn information-seeking processes. However, we identify a critical training instability that leads to catastrophic model collapse: Importance Sampling Distribution Drift (ISDD). In Group Relative Policy Optimization (GRPO), a widely adopted TARL algorithm, ISDD manifests as a precipitous decline in the importance sampling ratios, which nullifies gradient updates and triggers irreversible training failure. To address this, we propose Search Agent Policy Optimization (SAPO), which stabilizes training via a conditional token-level KL constraint. Unlike hard clipping, which ignores distributional divergence, SAPO selectively penalizes the KL divergence between the current and old policies. Crucially, this penalty is applied only to positive tokens with low probabilities where the policy has shifted excessively, thereby preventing distribution drift while preserving gradient flow. Remarkably, SAPO requires only a one-line code modification to standard GRPO, ensuring immediate deployability. Extensive experiments across seven QA benchmarks demonstrate that SAPO achieves a +10.6% absolute improvement (+31.5% relative) over Search-R1, yielding consistent gains across varying model scales (1.5B, 14B) and families (Qwen, LLaMA).
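The abstract describes the constraint only in words, so the following is a minimal sketch of one possible reading, not the authors' implementation. It adds a gated per-token KL estimate to the usual clipped surrogate loss. The gate (positive advantage and an importance sampling ratio below a threshold `tau`), the coefficient `beta`, the choice of `log p_old - log p_new` as the KL estimate, and all function names are assumptions made for illustration.

```python
import math


def grpo_token_loss(logp_new, logp_old, adv, eps=0.2):
    """Clipped surrogate loss for a single token (a quantity to minimize)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return -min(ratio * adv, clipped * adv)


def sapo_like_token_loss(logp_new, logp_old, adv, eps=0.2, beta=0.1, tau=0.8):
    """Clipped loss plus a conditional KL-style penalty.

    Hypothetical reading of the abstract: the penalty is active only for
    tokens with positive advantage whose ratio has fallen below `tau`.
    `beta` and `tau` are illustrative values, not taken from the paper.
    """
    ratio = math.exp(logp_new - logp_old)
    base = grpo_token_loss(logp_new, logp_old, adv, eps)
    # Single-sample estimate of the old-vs-current KL on the sampled token;
    # it equals -log(ratio), so it grows as the ratio shrinks.
    kl_estimate = logp_old - logp_new
    gated = adv > 0 and ratio < tau
    # The conditional penalty is the only change relative to the base loss.
    return base + (beta * kl_estimate if gated else 0.0)


# A positive-advantage token whose probability dropped from 0.4 to 0.2.
drifted = sapo_like_token_loss(math.log(0.2), math.log(0.4), adv=1.0)
plain = grpo_token_loss(math.log(0.2), math.log(0.4), adv=1.0)
extra_penalty = drifted - plain  # beta * log(2) under the assumptions above
```

In this sketch the penalty leaves negative-advantage tokens and tokens whose ratio stays near 1 untouched, so the loss differs from the clipped baseline only where the gate fires.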

Paper Structure (35 sections, 1 theorem, 13 equations, 4 figures, 7 tables)

This paper contains 35 sections, 1 theorem, 13 equations, 4 figures, 7 tables.

Key Result

Proposition 1

Let the IS ratios $r_t$ follow a log-normal distribution $\log r_t \sim \mathcal{N}(\mu, \sigma^2)$, with the drift parameter $\lambda = \mu + {\sigma^2}/{2}$. Due to the low-entropy and bottleneck nature of tool selection, action tokens $a_i$ exhibit significantly higher sensitivity to policy shift, where $L$ denotes the total number of tokens. ISDD is more severe in agent tasks than in QA tasks be
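As a small numerical aid for the log-normal assumption in Proposition 1: a standard property of the log-normal distribution is that its mean is $\exp(\mu + \sigma^2/2)$, which here equals $\exp(\lambda)$. A negative $\lambda$ therefore corresponds to an average IS ratio below 1. The sketch below checks this by simulation; the values of `mu` and `sigma` are illustrative assumptions, not numbers from the paper.

```python
import math
import random


def expected_ratio(mu, sigma):
    """Closed-form mean of r when log r ~ N(mu, sigma^2)."""
    return math.exp(mu + sigma ** 2 / 2.0)


def simulated_mean_ratio(mu, sigma, n=200_000, seed=0):
    """Monte Carlo estimate of the mean ratio from n log-normal samples."""
    rng = random.Random(seed)
    return sum(rng.lognormvariate(mu, sigma) for _ in range(n)) / n


mu, sigma = -0.5, 0.3              # illustrative values only
lam = mu + sigma ** 2 / 2.0        # drift parameter as defined in the proposition
closed_form = expected_ratio(mu, sigma)
empirical = simulated_mean_ratio(mu, sigma)
print(f"lambda={lam:.3f} closed_form={closed_form:.4f} empirical={empirical:.4f}")
```

The simulated mean should agree closely with the closed form, which makes $\lambda$ a convenient single number for tracking how far the average ratio has moved from 1.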

Figures (4)

  • Figure 1: Comparison of training dynamics between SAPO and GRPO regarding (a) Importance Sampling Ratio, (b) Clip Ratio, (c) Entropy, and (d) Reward.
  • Figure 2: (a) Hyperparameter sensitivity analysis. (b,c) Scaling trends of Qwen2.5-Instruct with SAPO across different model sizes (1.5B to 14B). We report both EM and F1 scores.
  • Figure 3: Prompt template for SAPO.
  • Figure 4: Evolution of EM accuracy over training steps for SAPO across seven benchmarks.

Theorems & Definitions (2)

  • Definition 1: ISDD
  • Proposition 1: ISDD Amplification in Interleaved Multi-Step Interactions