SRR-Judge: Step-Level Rating and Refinement for Enhancing Search-Integrated Reasoning in Search Agents

Chen Zhang, Kuicai Dong, Dexun Li, Wenjun Li, Qu Yang, Wei Han, Yong Liu

TL;DR

SRR-Judge addresses the lack of intermediate-step supervision in deep-search agents by introducing a step-level evaluator that scores and refines each thought–action pair within long-horizon trajectories. Integrated into a ReAct-based rate-and-refine workflow, it distills judgment capabilities from a large agentic model to a smaller, efficient judge (QwQ-32B), enabling both inference-time refinements and offline trajectory alignment via iterative rejection sampling. The approach demonstrates strong correlations between step-level ratings and final trajectory correctness, and yields substantial improvements in pass@1 across multiple real-world benchmarks when used for alignment, outperforming outcome-only supervision in several settings. These results highlight the practical value of step-level supervision for reliable search-integrated reasoning and scalable post-training improvement of agentic systems.

Abstract

Recent deep search agents built on large reasoning models (LRMs) excel at complex question answering by iteratively planning, acting, and gathering evidence, a capability known as search-integrated reasoning. However, mainstream approaches often train this ability using only outcome-based supervision, neglecting the quality of intermediate thoughts and actions. We introduce SRR-Judge, a framework for reliable step-level assessment of reasoning and search actions. Integrated into a modified ReAct-style rate-and-refine workflow, SRR-Judge provides fine-grained guidance for search-integrated reasoning and enables efficient post-training annotation. Using SRR-annotated data, we apply an iterative rejection sampling fine-tuning procedure to enhance the deep search capability of the base agent. Empirically, SRR-Judge delivers more reliable step-level evaluations than much larger models such as DeepSeek-V3.1, with its ratings showing strong correlation with final answer correctness. Moreover, aligning the policy with SRR-Judge annotated trajectories leads to substantial performance gains, yielding over a 10 percent average absolute pass@1 improvement across challenging deep search benchmarks.
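The rate-and-refine workflow and the rejection-sampling filter described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `policy` and `judge` callables, the 0-to-1 rating scale, the `threshold` value, the `answer:` action prefix, and the mean-rating acceptance rule are all hypothetical choices made for illustration, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    thought: str
    action: str
    rating: float  # judge's rating of the step as originally proposed


# Hypothetical interfaces (assumptions, not taken from the paper):
# policy(question, history) -> (thought, action)
# judge(question, history, thought, action) -> (rating, refined_thought, refined_action)
Policy = Callable[[str, List[Step]], Tuple[str, str]]
Judge = Callable[[str, List[Step], str, str], Tuple[float, str, str]]


def rate_and_refine(question: str, policy: Policy, judge: Judge,
                    max_steps: int = 8, threshold: float = 0.5) -> List[Step]:
    """Run a ReAct-style loop in which every step is rated and, if weak, refined."""
    history: List[Step] = []
    for _ in range(max_steps):
        thought, action = policy(question, history)
        rating, new_thought, new_action = judge(question, history, thought, action)
        if rating < threshold:
            # Assumed rule: replace a low-rated step with the judge's refinement.
            thought, action = new_thought, new_action
        history.append(Step(thought, action, rating))
        if action.startswith("answer:"):  # assumed convention for a final answer
            break
    return history


def keep_for_finetuning(trajectory: List[Step], is_correct: bool,
                        min_mean_rating: float = 0.5) -> bool:
    """Assumed acceptance rule: correct final answer and adequate mean step rating."""
    if not trajectory or not is_correct:
        return False
    mean_rating = sum(s.rating for s in trajectory) / len(trajectory)
    return mean_rating >= min_mean_rating
```

In an iterative setup, trajectories accepted by `keep_for_finetuning` would be collected for fine-tuning the policy, and the sampling would then be repeated with the updated policy. The acceptance criterion used in the paper may differ from the one assumed here.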

Paper Structure

This paper contains 29 sections, 1 equation, 9 figures, 5 tables.

Figures (9)

  • Figure 1: The rate-and-refine inference workflow.
  • Figure 2: Distribution of step-level ratings.
  • Figure 3: Correlation performance of the judge models on first-round, last-round, and averaged step-level ratings.
  • Figure 4: Artist example from BrowseComp. Step 1 of a search-integrated reasoning trajectory generated by DeepSeek-R1, evaluated and refined by SRR-Judge.
  • Figure 5: Artist example, Step 10, showing SRR-Judge intervening after prolonged search failure. Highlighted text (red) marks a degeneration into speculative enumeration and an under-specified query; SRR-Judge redirects the agent to pivot on a more searchable institutional clue (the art school) and then to use the remaining biographical constraints for verification.
  • ...and 4 more figures