An Empirical Study on Reinforcement Learning for Reasoning-Search Interleaved LLM Agents

Bowen Jin, Jinsung Yoon, Priyanka Kargupta, Sercan O. Arik, Jiawei Han

TL;DR

This paper empirically investigates design choices for reinforcement learning of LLM-based reasoning–search agents. It systematically evaluates reward designs (format vs. intermediate retrieval rewards), backbone LLM characteristics (general-purpose vs. reasoning-specialized) and model scale, and the impact of search engine quality during training and inference. Key findings show format rewards significantly boost performance and convergence, while intermediate retrieval rewards provide limited or negative gains; general-purpose LLMs and larger models generally perform better, with diminishing returns at scale; and stronger search engines during training and inference lead to more stable learning and better downstream results. These insights offer practical guidelines for building robust, real-world LLM-based search agents and point to promising directions like learned reward functions and broader tool-use RL.

Abstract

Reinforcement learning (RL) has demonstrated strong potential in training large language models (LLMs) capable of complex reasoning for real-world problem solving. More recently, RL has been leveraged to create sophisticated LLM-based search agents that adeptly combine reasoning with search engine use. While the use of RL for training search agents is promising, the optimal design of such agents remains not fully understood. In particular, key factors -- such as (1) reward formulation, (2) the choice and characteristics of the underlying LLM, and (3) the role of the search engine in the RL process -- require further investigation. In this work, we conduct comprehensive empirical studies to systematically investigate these and offer actionable insights. We highlight several key findings: format rewards are effective in improving final performance, whereas intermediate retrieval rewards have limited impact; the scale and initialization of the LLM (general-purpose vs. reasoning-specialized) significantly influence RL outcomes; and the choice of search engine plays a critical role in shaping RL training dynamics and the robustness of the trained agent during inference. These establish important guidelines for successfully building and deploying LLM-based search agents in real-world applications. Code is available at https://github.com/PeterGriffinJin/Search-R1.

Paper Structure

This paper contains 29 sections, 4 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Empirical analyses on format reward and intermediate retrieval reward. (a) Training reward curves with varying format reward scaling factors ($\lambda_f$); larger $\lambda_f$ values lead to faster convergence. (b) Impact of $\lambda_f$ on final model performance; a small $\lambda_f$ is ineffective, while an excessively large $\lambda_f$ may cause overfitting to format reward. (c) Training reward curves under different intermediate retrieval reward scaling factors ($\lambda_r$); varying $\lambda_r$ has limited effect on learning dynamics. (d) Effect of $\lambda_r$ on final model performance; increasing $\lambda_r$ degrades performance, suggesting that intermediate retrieval rewards are unnecessary, as the outcome reward sufficiently encourages effective query formulation. (LLM: Qwen2.5-7B-Base; RL Algorithm: PPO)
  • Figure 2: Study of the underlying pretrained LLM for developing LLM-based search agents with RL. (a) Training reward with different types of LLMs: a general-purpose LLM (Qwen2.5-7B-Base) and a reasoning LLM (DeepSeek-R1-Distill-Qwen-7B). The general-purpose LLM performs better than the reasoning LLM with both PPO and GRPO. (b) Number of search engine calls with different types of LLMs: the general-purpose LLM learns to call the search engine faster than the reasoning LLM, potentially because general-purpose LLMs are better at following instructions. (c) Training reward with different LLM sizes: larger LLMs can reach higher training reward. (d) Test accuracy with different LLM sizes: on the challenging Bamboogle dataset (Press et al., 2022), performance increases consistently with LLM size.
  • Figure 3: Effect of Search Engine Choice on RL Training Dynamics. (a) Retrieval Quality Ranking: E5 (Exact) > E5 (HNSW) > BM25 > Random. (b) Training Stability and Performance: Stronger search engines (e.g., E5 (Exact), E5 (HNSW)) lead to more stable training and higher final performance, while weaker engines (e.g., Random, BM25) achieve suboptimal outcomes. (c) Search Engine Usage Behavior: With the Random retriever, the agent quickly learns to avoid using the search engine. With BM25, the agent gradually increases search calls to compensate for limited retrieval quality. With E5, the agent issues search calls more strategically, reflecting more efficient search behavior.
  • Figure 4: Data scaling effects in RL training for search agents. (a) Training reward under PPO with varying dataset sizes: Smaller training sets result in faster convergence and higher training rewards, likely due to overfitting. (b) Number of search engine calls under PPO: Training with a single example fails to induce search behavior, while 10 samples lead to unstable learning. In contrast, using 100 or 1,000 samples enables the model to learn stable search behavior, and training with 10,000 samples further improves performance. (c) Training reward under GRPO with varying dataset sizes: Similar to PPO, smaller datasets yield faster convergence and higher rewards, again suggesting potential overfitting. (d) Number of search engine calls under GRPO: A single training sample is insufficient for search behavior to emerge, whereas larger datasets facilitate stable learning of search interactions.
  • Figure 5: Study of the underlying pretrained LLM for developing search agents with RL. (a) Training reward with different types of LLMs: a general-purpose LLM (Qwen2.5-14B-Base) and a reasoning LLM (DeepSeek-R1-Distill-Qwen-14B). The general-purpose LLM performs better than the reasoning LLM with both PPO and GRPO. (b) Number of search engine calls with different types of LLMs: both the general-purpose LLM and the reasoning-specialized LLM learn when to call the search engine. However, the general-purpose LLM achieves better final performance, potentially due to its superior capability in formulating effective search queries.
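
To make the roles of the scaling factors $\lambda_f$ and $\lambda_r$ from Figure 1 concrete, the sketch below combines an outcome reward with a format bonus and an intermediate retrieval bonus. It is a minimal illustration, not the paper's exact formulation: the additive form, the `<answer>...</answer>` tag convention, the substring-based retrieval check, and all function names are assumptions introduced here.

```python
import re
from typing import List

# Hypothetical tag convention for the final answer (an assumption of this sketch).
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", flags=re.DOTALL)


def extract_answer(rollout: str) -> str:
    """Return the text of the first <answer> block, or '' if none exists."""
    match = ANSWER_PATTERN.search(rollout)
    return match.group(1).strip() if match else ""


def has_valid_format(rollout: str) -> bool:
    """Assumed format check: the rollout contains exactly one answer block."""
    return len(ANSWER_PATTERN.findall(rollout)) == 1


def retrieval_hit(retrieved_docs: List[str], gold: str) -> bool:
    """Assumed retrieval check: some retrieved passage contains the gold answer."""
    return any(gold.lower() in doc.lower() for doc in retrieved_docs)


def composite_reward(
    rollout: str,
    retrieved_docs: List[str],
    gold: str,
    lambda_f: float = 0.2,
    lambda_r: float = 0.0,
) -> float:
    """Outcome reward (exact match) plus scaled format and retrieval bonuses."""
    outcome = 1.0 if extract_answer(rollout).lower() == gold.strip().lower() else 0.0
    format_bonus = lambda_f if has_valid_format(rollout) else 0.0
    retrieval_bonus = lambda_r if retrieval_hit(retrieved_docs, gold) else 0.0
    return outcome + format_bonus + retrieval_bonus
```

Setting `lambda_r=0.0` in this sketch corresponds to training on the outcome and format signals alone, which is the configuration the Figure 1 results favor; the default `lambda_f=0.2` is an arbitrary illustrative value, not one reported in the paper.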