ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards

Shiyu Li, Yang Tang, Yifan Wang, Peiming Li, Xi Chen

TL;DR

ReSeek tackles the reliability gap in RL-based search agents by introducing a self-correcting loop centered on a JUDGE action and a dense, instructive reward that separately incentivizes factual correctness and the relevance of retrieved information. It pairs this with a structured prompting scheme to enforce iterative self-assessment, and a contamination-resistant FictionalHot benchmark to evaluate reasoning over memorization. Empirically, ReSeek achieves state-of-the-art performance across eight open-domain QA benchmarks and demonstrates robust self-correction, especially in multi-hop settings, with additional gains from larger LLMs and live web search. Overall, the work advances practical, faithful search agents and calls for standardized, contamination-aware evaluation via the Hot Benchmark principle.

Abstract

Search agents powered by Large Language Models (LLMs) have demonstrated significant potential in tackling knowledge-intensive tasks. Reinforcement learning (RL) has emerged as a powerful paradigm for training these agents to perform complex, multi-step reasoning. However, prior RL-based methods often rely on sparse or rule-based rewards, which can lead agents to commit to suboptimal or erroneous reasoning paths without the ability to recover. To address these limitations, we propose ReSeek, a novel self-correcting framework for training search agents. Our framework introduces a self-correction mechanism that empowers the agent to dynamically identify and recover from erroneous search paths during an episode. By invoking a special JUDGE action, the agent can judge the retrieved information and re-plan its search strategy. To guide this process, we design a dense, instructive process reward function, which decomposes into a correctness reward for retrieving factual information and a utility reward for finding information genuinely useful for the query. Furthermore, to mitigate the risk of data contamination in existing datasets, we introduce FictionalHot, a new and challenging benchmark with recently curated questions requiring complex reasoning. ReSeek is intuitively reasonable and practically simple, and extensive experiments show that agents trained with it significantly outperform SOTA baselines in task success rate and path faithfulness.
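The search, judge, and re-plan cycle described in the abstract can be pictured with a short sketch. This is only an illustration under stated assumptions, not the authors' implementation: `run_episode`, `toy_search`, `toy_judge`, `toy_replan`, the toy corpus, and the default turn budget of 4 are all hypothetical, and in the actual framework these decisions are produced by an LLM policy rather than hand-written functions.

```python
from typing import Callable, List, Tuple

NO_RESULT = "no relevant result"


def run_episode(
    question: str,
    search: Callable[[str], str],
    judge: Callable[[str, str], bool],
    replan: Callable[[str, List[str]], str],
    max_turns: int = 4,  # assumed turn budget, not taken from the paper's method
) -> Tuple[List[str], List[str]]:
    """Search, judge each observation, and re-plan when the judgment is negative.

    Returns (accepted_evidence, queries_issued).
    """
    query = question
    evidence: List[str] = []
    queries: List[str] = []
    for _ in range(max_turns):
        queries.append(query)
        observation = search(query)
        if judge(question, observation):
            # Observation judged useful: keep it and stop searching.
            evidence.append(observation)
            break
        # Observation judged unhelpful: revise the query and try again.
        query = replan(question, queries)
    return evidence, queries


# Toy stand-ins (all hypothetical) so the loop can be run end to end.
CORPUS = {"Freedonia capital city": "The capital of Freedonia is Fredville."}


def toy_search(query: str) -> str:
    return CORPUS.get(query, NO_RESULT)


def toy_judge(question: str, observation: str) -> bool:
    return observation != NO_RESULT


def toy_replan(question: str, tried: List[str]) -> str:
    return "Freedonia capital city"


evidence, queries = run_episode(
    "What is the capital of Freedonia?", toy_search, toy_judge, toy_replan
)
print(evidence)
print(queries)
```

In this toy run the first query retrieves nothing useful, the judge rejects it, and the re-planned query succeeds on the second turn. An agent without the judge step would have no signal telling it to abandon the first path.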

Paper Structure

This paper contains 24 sections, 3 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: A comparison of reasoning processes on a multi-hop question about an obscure entity. Standard RAG (a) fails because it cannot perform sequential reasoning. A vanilla agent like Search-R1 (b) reasons sequentially but gets stuck on its initial path. In contrast, our agent (c) demonstrates robust self-correction: it uses a low process reward $(r_p)$ to identify the unproductive intermediate step, triggers a judge action to revise its strategy, and successfully navigates to the correct answer. The full trace for this example is provided in Appendix \ref{appendix_case2}.
  • Figure 2: Training the agent's self-evaluation capability. We train the agent via policy optimization to master the judge action. A reward signal is generated by comparing the agent's judgment against an "ideal" one, which is determined by the rerank score between the current search observation and the GT answer. This reward guides the policy to learn effective self-correction.
  • Figure 3: The FictionalHot benchmark construction process: transforming a real-world question answer sample into a fictional sample with fictional question and documents.
  • Figure 4: Ablation study on the effect of the number of turns on model performance. We evaluate multiple methods with turn budgets from 1 to 4 using qwen2.5-3b-instruct, reporting the average performance across all datasets.
  • Figure 5: Ablation study on search-embedding choice and base/instruction models. We evaluate our method on the Wiki18 corpus across different backbone and embedding models over all datasets. The dashed line denotes the mean performance (excluding BM25).
  • ...and 5 more figures
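
The judge-reward idea in the Figure 2 caption (compare the agent's judgment with an "ideal" one derived from a relevance score between the observation and the ground-truth answer) might look roughly like the sketch below. Several parts are assumptions made for illustration: a token-overlap score stands in for the rerank score, the 0.5 threshold is arbitrary, the reward is a simple 0/1 match, and the names `overlap_score`, `ideal_judgment`, and `judge_reward` are hypothetical rather than taken from the paper.

```python
import re


def overlap_score(observation: str, answer: str) -> float:
    """Stand-in for a rerank score: fraction of answer tokens found in the observation."""
    answer_tokens = set(re.findall(r"\w+", answer.lower()))
    if not answer_tokens:
        return 0.0
    observation_tokens = set(re.findall(r"\w+", observation.lower()))
    return len(answer_tokens & observation_tokens) / len(answer_tokens)


def ideal_judgment(observation: str, answer: str, threshold: float = 0.5) -> bool:
    """Assumed rule: the observation counts as useful if its score reaches the threshold."""
    return overlap_score(observation, answer) >= threshold


def judge_reward(
    agent_judgment: bool, observation: str, answer: str, threshold: float = 0.5
) -> float:
    """Assumed 0/1 reward: 1.0 when the agent's judgment matches the ideal one."""
    return 1.0 if agent_judgment == ideal_judgment(observation, answer, threshold) else 0.0


useful = "The capital of Freedonia is Fredville."
useless = "No relevant result found."

print(judge_reward(True, useful, "Fredville"))    # agent correctly accepts
print(judge_reward(False, useful, "Fredville"))   # agent wrongly rejects
print(judge_reward(False, useless, "Fredville"))  # agent correctly rejects
```

The point of the design is that the agent is rewarded for calibrated self-assessment at each step, including for correctly rejecting an unhelpful observation, rather than only for the final answer.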