Beyond Outcome Reward: Decoupling Search and Answering Improves LLM Agents

Yiding Wang; Zhepei Wei; Xinyu Zhu; Yu Meng

TL;DR

The paper tackles the inadequacy of outcome-only reinforcement learning in training search-enabled LLM agents. It introduces DeSA, a two-stage framework that first optimizes search recall (Stage 1 with $R_{\text{recall}}$) and then optimizes final answers (Stage 2 with $R_{\text{EM}}$) via GRPO, showing reduced deficient search behaviors and improved QA accuracy across seven benchmarks. Empirical results demonstrate substantial gains over single-stage baselines and highlight the importance of decoupling search and answering for robust tool use and information gathering. The work suggests a broader implication: process-based rewards can significantly enhance agentic capabilities beyond QA tasks, with potential extensions to code generation and long-context reasoning.

Abstract

Enabling large language models (LLMs) to utilize search tools offers a promising path to overcoming fundamental limitations such as knowledge cutoffs and hallucinations. Recent work has explored reinforcement learning (RL) for training search-augmented agents that interleave reasoning and retrieval before answering. These approaches usually rely on outcome-based rewards (e.g., exact match), implicitly assuming that optimizing for final answers will also yield effective intermediate search behaviors. Our analysis challenges this assumption: we uncover multiple systematic deficiencies in search that arise under outcome-only training and ultimately degrade final answer quality, including failure to invoke tools, invalid queries, and redundant searches. To address these shortcomings, we introduce DeSA (Decoupling Search-and-Answering), a simple two-stage training framework that explicitly separates search optimization from answer generation. In Stage 1, agents are trained to improve search effectiveness with retrieval recall-based rewards. In Stage 2, outcome rewards are employed to optimize final answer generation. Across seven QA benchmarks, DeSA-trained agents consistently improve search behaviors, delivering substantially higher search recall and answer accuracy than outcome-only baselines. Notably, DeSA outperforms single-stage training approaches that simultaneously optimize recall and outcome rewards, underscoring the necessity of explicitly decoupling the two objectives.
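
To make the two-stage reward schedule concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that the Stage 1 recall reward is 1 when any retrieved document contains a gold answer string after normalization, and that the Stage 2 outcome reward is a normalized exact match. The function names (`normalize`, `recall_reward`, `em_reward`, `desa_reward`) and the normalization rules are hypothetical choices for illustration; the paper's exact definitions may differ.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization, chosen here for illustration only.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def recall_reward(retrieved_docs: list[str], gold_answers: list[str]) -> float:
    """Stage 1 (assumed form): 1.0 if any retrieved document contains a gold answer."""
    docs = [normalize(d) for d in retrieved_docs]
    golds = [normalize(g) for g in gold_answers]
    hit = any(g and g in d for d in docs for g in golds)
    return 1.0 if hit else 0.0


def em_reward(prediction: str, gold_answers: list[str]) -> float:
    """Stage 2 (assumed form): 1.0 if the final answer exactly matches a gold answer."""
    pred = normalize(prediction)
    return 1.0 if any(pred == normalize(g) for g in gold_answers) else 0.0


def desa_reward(stage: int, retrieved_docs: list[str],
                prediction: str, gold_answers: list[str]) -> float:
    """Select the reward by training stage: search recall first, answer accuracy second."""
    if stage == 1:
        return recall_reward(retrieved_docs, gold_answers)
    if stage == 2:
        return em_reward(prediction, gold_answers)
    raise ValueError("stage must be 1 or 2")


if __name__ == "__main__":
    docs = ["Paris is the capital of France."]
    # The search found the evidence, but the final answer is wrong.
    print(desa_reward(1, docs, "London", ["Paris"]))
    print(desa_reward(2, docs, "London", ["Paris"]))
```

In this sketch a rollout that retrieves the right evidence but answers incorrectly is still rewarded in Stage 1 and penalized only in Stage 2, which is the separation of objectives the abstract describes.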
