CriticSearch: Fine-Grained Credit Assignment for Search Agents via a Retrospective Critic

Yaocheng Zhang, Haohuan Huang, Zijun Song, Yuanheng Zhu, Qichao Zhang, Zijie Zhao, Dongbin Zhao

TL;DR

CriticSearch tackles the sparsity of outcome rewards in Tool-Integrated Reasoning with search by introducing a retrospective critique LLM that assigns dense, turn-level rewards to each action in a multi-turn reasoning trajectory. A frozen critique model inspects the full trajectory and gold answer, producing per-turn judgments that complement the global reward, yielding a hybrid advantage that guides policy optimization via Group Relative Policy Optimization (GRPO). Empirical results on four diverse multi-hop QA benchmarks show CriticSearch achieves faster convergence, improved training stability, and higher accuracy than prior dense-reward and sparse-reward baselines, across multiple model scales. The approach demonstrates strong generalization and robustness, with practical implications for more efficient and reliable tool-using reasoning in language models, albeit with increased memory/compute overhead and scope limited to iterative search settings.

Abstract

Tool-Integrated Reasoning (TIR) with search engines enables large language models to iteratively retrieve up-to-date external knowledge, enhancing adaptability and generalization in complex question-answering tasks. However, existing search agent pipelines typically depend on reinforcement-learning-based optimization, which often suffers from sparse outcome rewards, leading to inefficient exploration and unstable training. We introduce CriticSearch, a fine-grained credit-assignment framework that supplies dense, turn-level feedback via a retrospective critic mechanism. During training, a frozen, asymmetric critique LLM retrospectively evaluates each turn using privileged information from the full trajectory and gold answers, converting these assessments into stable, dense rewards that guide policy improvement. Experimental results across diverse multi-hop reasoning benchmarks demonstrate that CriticSearch consistently outperforms existing baselines, achieving faster convergence, improved training stability, and higher performance.

Paper Structure

This paper contains 32 sections, 11 equations, 10 figures, 8 tables, 1 algorithm.

Figures (10)

  • Figure 1: CriticSearch achieves leading performance on most of the datasets compared with RL methods.
  • Figure 2: Overview of CriticSearch. The policy LLM interacts with external tools during multi-turn reasoning to generate a rollout trajectory. A frozen LLM with privileged information retrospectively evaluates each action, producing dense rewards that complement the sparse outcome reward. The resulting hybrid advantage signal provides fine-grained feedback, effectively mitigating reward sparsity in agentic RL.
  • Figure 3: An example from CriticSearch illustrating the evaluation process of the critique model. In this trajectory, the final answer is correct but involves redundant search actions. A frozen critique LLM, after deliberation, provides accurate binary rewards for each turn.
  • Figure 4: Overview of the Critique LLM. The entire reasoning trajectory, along with the gold answer, is fed into the critique LLM, which evaluates each action as either Good or Bad and generates a corresponding turn-level reward sequence after thinking.
  • Figure 5: Ablation study on the weight of the turn-level advantage $\alpha$. A larger $\alpha$ leads to faster convergence during the training phase.
  • ...and 5 more figures
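
The captions for Figures 2, 4, and 5 describe Good/Bad turn labels being turned into a reward sequence and then mixed with the outcome reward through a weight $\alpha$. The sketch below is one way such a hybrid advantage could be computed, not the paper's actual formulation. Everything specific in it is an assumption made for illustration: the function names, the mapping Good → 1 and Bad → 0, normalizing turn rewards over all turns in the rollout group, and the additive combination `outcome_advantage + alpha * turn_advantage`.

```python
from statistics import mean, pstdev


def group_normalize(values, eps=1e-6):
    """GRPO-style normalization: subtract the group mean, divide by the group std."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def turn_rewards(labels):
    """Map critique labels to binary rewards (assumed mapping: Good -> 1, Bad -> 0)."""
    return [1.0 if label == "Good" else 0.0 for label in labels]


def hybrid_advantages(outcome_rewards, turn_labels, alpha=0.5, eps=1e-6):
    """Return one advantage per turn for every trajectory in a rollout group.

    outcome_rewards: one scalar per trajectory (the sparse outcome reward).
    turn_labels: one list of "Good"/"Bad" labels per trajectory.
    alpha: weight of the turn-level term (hypothetical additive form).
    """
    # Trajectory-level advantage, shared by every turn of that trajectory.
    outcome_adv = group_normalize(outcome_rewards, eps)

    # Turn-level rewards, normalized over all turns in the group (an assumption).
    per_traj = [turn_rewards(labels) for labels in turn_labels]
    flat = [r for traj in per_traj for r in traj]
    mu, sigma = mean(flat), pstdev(flat)

    return [
        [a_out + alpha * (r - mu) / (sigma + eps) for r in traj]
        for a_out, traj in zip(outcome_adv, per_traj)
    ]


if __name__ == "__main__":
    # Two rollouts for the same question: the first answers correctly
    # but contains one redundant (Bad) search turn.
    rewards = [1.0, 0.0]
    labels = [["Good", "Bad", "Good"], ["Bad", "Bad"]]
    for traj_adv in hybrid_advantages(rewards, labels, alpha=0.5):
        print([round(a, 3) for a in traj_adv])
```

With `alpha=0` this reduces to a plain group-relative outcome advantage, so every turn in a trajectory receives the same signal. A positive `alpha` lets a redundant turn inside a correct trajectory receive a lower advantage than its neighbors, which is the kind of fine-grained credit assignment the captions describe.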