Improving Search Agent with One Line of Code

Jian Li; Dongsheng Chen; Zhenhua Xu; Yizhang Jin; Jiafu Wu; Chengjie Wang; Xiaotong Yuan; Yabiao Wang

TL;DR

This work proposes SAPO, which stabilizes training with a conditional token-level KL constraint: it selectively penalizes the KL divergence between the current and old policies, preventing distribution drift while preserving gradient flow.

Abstract

Tool-based Agentic Reinforcement Learning (TARL) has emerged as a promising paradigm for training search agents to interact autonomously with external tools over a multi-turn information-seeking process. However, we identify a critical training instability that leads to catastrophic model collapse: Importance Sampling Distribution Drift (ISDD). In Group Relative Policy Optimization (GRPO), a widely adopted TARL algorithm, ISDD manifests as a precipitous decline in the importance sampling ratios, which nullifies gradient updates and triggers irreversible training failure. To address this, we propose Search Agent Policy Optimization (SAPO), which stabilizes training via a conditional token-level KL constraint. Unlike hard clipping, which ignores distributional divergence, SAPO selectively penalizes the KL divergence between the current and old policies. Crucially, this penalty is applied only to positive tokens with low probabilities where the policy has shifted excessively, thereby preventing distribution drift while preserving gradient flow. Remarkably, SAPO requires only a one-line code modification to standard GRPO, ensuring immediate deployability. Extensive experiments across seven QA benchmarks demonstrate that SAPO achieves a +10.6% absolute improvement (+31.5% relative) over Search-R1, yielding consistent gains across varying model scales (1.5B, 14B) and families (Qwen, LLaMA).
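To make the idea of a conditional token-level KL penalty concrete, here is a minimal per-token sketch in plain Python. It is not the authors' implementation: the abstract does not specify the exact trigger condition or KL estimator, so the condition used here (positive advantage and importance sampling ratio below a threshold `tau`), the single-sample KL estimate `logp_old - logp_new`, the weight `beta`, and the function names are all assumptions chosen for illustration.

```python
import math

def grpo_token_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate loss for one token, as in PPO/GRPO (to be minimized)."""
    ratio = math.exp(logp_new - logp_old)  # importance sampling ratio
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return -min(ratio * advantage, clipped * advantage)

def sapo_like_token_loss(logp_new, logp_old, advantage,
                         clip_eps=0.2, tau=0.8, beta=0.1):
    """GRPO token loss plus a conditional KL penalty.

    ASSUMPTIONS (not taken from the paper): the penalty fires when the
    advantage is positive and the ratio has dropped below `tau`; the KL
    term is the single-sample estimate logp_old - logp_new, valid for a
    token sampled from the old policy; `beta` weights the penalty.
    """
    loss = grpo_token_loss(logp_new, logp_old, advantage, clip_eps)
    ratio = math.exp(logp_new - logp_old)
    if advantage > 0 and ratio < tau:
        # The one added term: pull the current policy back toward the old one.
        loss += beta * (logp_old - logp_new)
    return loss
```

In this sketch the penalty's gradient with respect to `logp_new` is a constant `-beta`, so it does not shrink as the ratio shrinks, whereas the surrogate term's gradient for a positive-advantage token scales with the ratio itself. Tokens with negative advantage, or whose ratio stays above `tau`, receive the unmodified GRPO loss.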

Paper Structure (35 sections, 1 theorem, 13 equations, 4 figures, 7 tables)

This paper contains 35 sections, 1 theorem, 13 equations, 4 figures, 7 tables.

Introduction
Preliminaries
Proximal Policy Optimization
Group Relative Policy Optimization
Approach
Search Agent Formulation
Importance Sampling Distribution Drift
Why ISDD Causes Collapse.
Search Agent Policy Optimization
Gradient Analysis
Experiments
Experiment Settings
Datasets.
Baselines.
Implementation Details.
...and 20 more sections

Key Result

Proposition 1

Let the IS ratios $r_t$ follow a log-normal distribution $\log r_t \sim \mathcal{N}(\mu, \sigma^2)$, with the drift parameter $\lambda = \mu + \sigma^2/2$. Due to the low-entropy and bottleneck nature of tool selection, action tokens $a_i$ exhibit significantly higher sensitivity to policy shift, where $L$ denotes the total number of tokens. ISDD is more severe in agent tasks than in QA tasks be
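One way to read the drift parameter: under the log-normal model stated in the proposition, the mean of a log-normal variable is $\exp(\mu + \sigma^2/2)$, so $\mathbb{E}[r_t] = e^{\lambda}$, and a negative $\lambda$ corresponds to an expected IS ratio below 1. The following sketch checks this identity by simulation using only the standard library; the function names and the example values of `mu` and `sigma` are illustrative assumptions, not values from the paper.

```python
import math
import random

def drift(mu, sigma):
    """Drift parameter lambda = mu + sigma^2 / 2 from the proposition."""
    return mu + sigma ** 2 / 2

def mean_is_ratio(mu, sigma, n=100_000, seed=0):
    """Monte Carlo estimate of E[r_t] when log r_t ~ N(mu, sigma^2)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    total = sum(math.exp(rng.gauss(mu, sigma)) for _ in range(n))
    return total / n

# Illustrative values only: a negative drift gives an expected ratio below 1.
mu, sigma = -0.3, 0.5
empirical = mean_is_ratio(mu, sigma)
analytic = math.exp(drift(mu, sigma))
```

With these example values, `empirical` and `analytic` should agree up to Monte Carlo error, and both fall below 1 because the drift is negative.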

Figures (4)

Figure 1: Comparison of training dynamics between SAPO and GRPO regarding (a) Importance Sampling Ratio, (b) Clip Ratio, (c) Entropy, and (d) Reward.
Figure 2: (a) Hyperparameter sensitivity analysis. (b,c) Scaling trends of Qwen2.5-Instruct with SAPO across different model sizes (1.5B to 14B). We report both EM and F1 scores.
Figure 3: Prompt template for SAPO.
Figure 4: Evolution of EM accuracy over training steps for SAPO across seven benchmarks.

Theorems & Definitions (2)

Definition 1: ISDD
Proposition 1: ISDD Amplification in Interleaved Multi-Step Interactions
