Beneficial Reasoning Behaviors in Agentic Search and Effective Post-training to Obtain Them

Jiahe Jin, Abhijay Paladugu, Chenyan Xiong

TL;DR

This paper identifies four beneficial reasoning behaviors for agentic search—Information Verification, Authority Evaluation, Adaptive Search, and Error Recovery—via an automatic trajectory-analysis pipeline. It introduces Behavior Priming, a supervised fine-tuning strategy that injects these behaviors into models before RL, achieving significantly higher post-RL performance than baselines across web and multi-hop QA benchmarks. Critically, the authors show that the presence of reasoning behaviors, rather than mere final answer correctness, drives RL gains, as evidenced by ablations where incorrect-but-behavior-rich trajectories yield comparable improvements. The work also demonstrates that behavior priming enhances exploration and test-time scaling, scales with more SFT data, and benefits from composite behavior priming rather than single-behavior priming, offering a practical pathway to more capable agentic search systems.

Abstract

Agentic search leverages LLMs to solve complex user information needs by executing a multi-step process of planning, searching, and synthesizing information to provide answers. This paradigm introduces unique challenges for LLMs' agentic reasoning capabilities when interacting with search systems. In this paper, we propose an LLM-based pipeline to study effective reasoning behavior patterns in agentic search by analyzing agentic search trajectories. Using this pipeline, we identify four beneficial reasoning behaviors: Information Verification, Authority Evaluation, Adaptive Search, and Error Recovery. Based on these findings, we propose a technique called Behavior Priming to train agentic search models. It synthesizes trajectories that exhibit these four behaviors and integrates them into the agentic search model through SFT, followed by standard reinforcement learning. Experiments on Qwen3-1.7B and Llama3.2-3B-Instruct across three web benchmarks and seven multi-hop QA benchmarks demonstrate that behavior priming 1) yields significant performance gains compared to training with direct RL, and 2) outperforms other SFT-then-RL baselines, such as models fine-tuned on randomly selected trajectories or on trajectories with merely correct outcomes. Crucially, we demonstrate that the reasoning behaviors, rather than the correctness of the final answer, are the critical factor for achieving strong performance in RL: SFT on trajectories with reasoning behaviors but incorrect answers leads to performance comparable to SFT on trajectories with reasoning behaviors and correct answers. Our analysis further reveals that the introduced reasoning behaviors endow models with more effective exploration (higher pass@k and entropy) and test-time scaling (longer trajectories) capabilities, providing a strong foundation for RL. Our code is available at https://github.com/cxcscmu/Behavior_Priming_For_Agentic_Search.
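
The SFT variants compared in the abstract (random trajectories, correct-outcome trajectories, and behavior-rich trajectories with correct or incorrect answers) differ only in how the fine-tuning pool is filtered. The following is a minimal sketch of that filtering, not the authors' implementation. The `Trajectory` record, the behavior label strings, the requirement that a trajectory show all four behaviors, and the function name `select_sft_subsets` are all assumptions made for illustration; how behaviors are actually detected and labeled is not specified here.

```python
import random
from dataclasses import dataclass

# Hypothetical label names for the four behaviors named in the abstract.
BEHAVIORS = (
    "information_verification",
    "authority_evaluation",
    "adaptive_search",
    "error_recovery",
)


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record; labels are assumed to come from an upstream judge."""
    question: str
    correct: bool          # whether the final answer was judged correct
    behaviors: frozenset   # subset of BEHAVIORS observed in the trajectory


def has_all_behaviors(traj: Trajectory) -> bool:
    # Assumption: "behavior-rich" means all four behaviors are present.
    return all(b in traj.behaviors for b in BEHAVIORS)


def select_sft_subsets(trajs, size, seed=0):
    """Build the four SFT pools contrasted in the abstract, each capped at `size`."""
    rng = random.Random(seed)

    def sample(pool):
        pool = list(pool)
        return rng.sample(pool, min(size, len(pool)))

    return {
        "random": sample(trajs),
        "correct_only": sample(t for t in trajs if t.correct),
        "behavior_correct": sample(
            t for t in trajs if has_all_behaviors(t) and t.correct
        ),
        "behavior_incorrect": sample(
            t for t in trajs if has_all_behaviors(t) and not t.correct
        ),
    }
```

Capping every pool at the same `size` keeps the comparison about trajectory content rather than data volume, which is the contrast the abstract draws between behavior presence and answer correctness.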

Paper Structure

This paper contains 39 sections, 4 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Comparison of different LLMs as the underlying agentic search model of our agent framework across four benchmarks. (a): the frequency of four behaviors in trajectories. (b): scores on benchmarks. Abbreviations: IV = Information Verification, AE = Authority Evaluation, AS = Adaptive Search, ER = Error Recovery.
  • Figure 2: Behavior frequencies, pass@8 accuracy, and trajectory statistics (average number of steps and average number of search actions per trajectory) of Qwen3-1.7B + SFT (Random) and Qwen3-1.7B + Behavior Prime on the WebWalkerQA benchmark during the SFT process.
  • Figure 3: Trends in entropy, validation accuracy, and valid action ratio during RL training of Qwen3-1.7B and of Qwen3-1.7B with behavior priming (SFT on the Behavior Prime dataset). The valid action ratio is the percentage of steps in which the model generates a syntactically valid action.
  • Figure 4: Performance of Qwen3-1.7B after fine-tuning on subsets of the Behavior Priming data of different sizes, and the corresponding performance after subsequent RL training.
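
Two of the metrics named in the captions above, pass@8 and the valid action ratio, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the paper's evaluation code: `pass_at_k` uses the standard unbiased combinatorial estimator, which may differ from how the authors compute pass@8, and `valid_action_ratio` takes a caller-supplied `is_valid` predicate because the agent's action syntax is not described here. Both function names are hypothetical.

```python
from math import comb
from typing import Callable, Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n attempts of which c are correct, is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect attempts: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def valid_action_ratio(
    steps: Sequence[str], is_valid: Callable[[str], bool]
) -> float:
    """Percentage of steps whose generated action passes the `is_valid` check."""
    if not steps:
        return 0.0
    return 100.0 * sum(1 for s in steps if is_valid(s)) / len(steps)
```

When exactly 8 trajectories are sampled per question (`n == k == 8`), `pass_at_k` reduces to 1.0 if any trajectory is correct and 0.0 otherwise, which is the simplest reading of pass@8.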