Beyond Outcome Reward: Decoupling Search and Answering Improves LLM Agents

Yiding Wang, Zhepei Wei, Xinyu Zhu, Yu Meng

TL;DR

The paper tackles the inadequacy of outcome-only reinforcement learning in training search-enabled LLM agents. It introduces DeSA, a two-stage framework that first optimizes search recall (Stage 1 with $R_{\text{recall}}$) and then optimizes final answers (Stage 2 with $R_{\text{EM}}$) via GRPO, showing reduced deficient search behaviors and improved QA accuracy across seven benchmarks. Empirical results demonstrate substantial gains over single-stage baselines and highlight the importance of decoupling search and answering for robust tool use and information gathering. The work suggests a broader implication: process-based rewards can significantly enhance agentic capabilities beyond QA tasks, with potential extensions to code generation and long-context reasoning.

Abstract

Enabling large language models (LLMs) to utilize search tools offers a promising path to overcoming fundamental limitations such as knowledge cutoffs and hallucinations. Recent work has explored reinforcement learning (RL) for training search-augmented agents that interleave reasoning and retrieval before answering. These approaches usually rely on outcome-based rewards (e.g., exact match), implicitly assuming that optimizing for final answers will also yield effective intermediate search behaviors. Our analysis challenges this assumption: we uncover multiple systematic deficiencies in search that arise under outcome-only training and ultimately degrade final answer quality, including failure to invoke tools, invalid queries, and redundant searches. To address these shortcomings, we introduce DeSA (Decoupling Search-and-Answering), a simple two-stage training framework that explicitly separates search optimization from answer generation. In Stage 1, agents are trained to improve search effectiveness with retrieval recall-based rewards. In Stage 2, outcome rewards are employed to optimize final answer generation. Across seven QA benchmarks, DeSA-trained agents consistently improve search behaviors, delivering substantially higher search recall and answer accuracy than outcome-only baselines. Notably, DeSA outperforms single-stage training approaches that simultaneously optimize recall and outcome rewards, underscoring the necessity of explicitly decoupling the two objectives.
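
The two-stage reward schedule described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`normalize`, `recall_reward`, `em_reward`, `stage_reward`), the SQuAD-style normalization, and the substring-based definition of recall are all assumptions made for illustration. The only detail taken from the text is that Stage 1 rewards retrieval recall and Stage 2 rewards exact match of the final answer.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def recall_reward(retrieved_docs: list[str], gold_answers: list[str]) -> float:
    """Assumed Stage 1 reward: 1.0 if any gold answer occurs in any retrieved document."""
    docs = [normalize(d) for d in retrieved_docs]
    golds = [normalize(g) for g in gold_answers]
    hit = any(g and g in d for g in golds for d in docs)
    return 1.0 if hit else 0.0


def em_reward(prediction: str, gold_answers: list[str]) -> float:
    """Assumed Stage 2 reward: 1.0 if the normalized prediction equals a gold answer."""
    pred = normalize(prediction)
    return 1.0 if any(pred == normalize(g) for g in gold_answers) else 0.0


def stage_reward(stage: int, retrieved_docs: list[str],
                 prediction: str, gold_answers: list[str]) -> float:
    """Select the reward by training stage; the two objectives are never mixed."""
    if stage == 1:
        return recall_reward(retrieved_docs, gold_answers)
    if stage == 2:
        return em_reward(prediction, gold_answers)
    raise ValueError(f"unknown stage: {stage}")
```

Under these assumptions, a trajectory that retrieves a passage containing the gold answer but then answers incorrectly would score 1.0 in Stage 1 and 0.0 in Stage 2, which is the separation of search quality from answer quality that the decoupled design relies on.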

Paper Structure

This paper contains 31 sections, 4 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: An overview of deficient search behaviors of agents trained with outcome-only supervision and an illustration of our DeSA (Decoupling Search-and-Answering). (Left) The results shown are collected from an agent trained solely based on a final-answer exact match (EM) reward with Qwen2.5-3B-Instruct as the backbone, and evaluated across seven QA datasets. This agent exhibits a variety of deficient search behaviors, including "Fail to Search", "w/ Invalid Searches", and "w/ Duplicate Queries". Compared to "Effective Search", these behaviors lead to significantly lower search recall and EM rate. (Right) DeSA decouples training into two stages to address these issues.
  • Figure 2: Impact of deficient search behaviors on agent performance. Both recall rate (left) and Exact Match (EM) rate (right) are significantly lower for trajectories exhibiting deficient behaviors compared to those with only effective behaviors.
  • Figure 3: Deficient Search Behaviors in Recall-Failure Cases. This figure displays the distribution of the three defined deficient search behaviors, as well as their combinations, within all search trajectories that failed to recall the ground-truth answer.
  • Figure 4: Final performance comparison of DeSA vs. the single-stage Search-R1 baseline on the 3B model.
  • Figure 5: Performance breakdown across DeSA's two stages, compared with Search-R1 baseline.
  • ...and 2 more figures
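
The three deficient search behaviors named in the Figure 1 caption could be detected along the following lines. This is a rough sketch under stated assumptions, not the paper's procedure: the function name `tag_search_behaviors` and the tag strings are hypothetical, an "invalid" search is assumed here to mean an empty or whitespace-only query, and duplicates are assumed to be queries that match after lowercasing and whitespace collapsing. The paper's actual definitions may differ.

```python
def tag_search_behaviors(queries: list[str]) -> set[str]:
    """Tag one trajectory, given the list of search queries the agent issued."""
    # No search call at all corresponds to "Fail to Search".
    if not queries:
        return {"fail_to_search"}

    # Assumed normalization: lowercase and collapse whitespace.
    normalized = [" ".join(q.lower().split()) for q in queries]

    tags: set[str] = set()
    # Assumption: an empty query stands in for an "invalid" search.
    if any(not q for q in normalized):
        tags.add("invalid_search")

    # Repeating the same normalized query counts as a duplicate.
    non_empty = [q for q in normalized if q]
    if len(set(non_empty)) < len(non_empty):
        tags.add("duplicate_queries")

    # A trajectory with no deficiency tag is treated as effective.
    return tags or {"effective_search"}
```

Because the tags are returned as a set, a single trajectory can carry several deficiencies at once, which mirrors the combinations of behaviors reported for Figure 3.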