SIGHT: Reinforcement Learning with Self-Evidence and Information-Gain Diverse Branching for Search Agent

Wenlin Zhong, Jinluan Yang, Yiquan Wu, Yi Liu, Jianhang Yao, Kun Kuang

TL;DR

SIGHT addresses the tunnel vision problem in multi-turn QA by coupling Self-Evidence Support (SES) with Information-Gain Driven Diverse Branching, using Dynamic Prompting Interventions to decide when and how to branch. The IG score, defined as $\text{Score}_{IG}(o_t) = \log P(y^* \mid \mathcal{H}_t, o_t) - \log P(y^* \mid \mathcal{H}_t)$, identifies pivotal states, while SES distills noisy observations into evidence. The system uses hierarchical rewards within Group Relative Policy Optimization to internalize robust exploration without external verifiers. Through SES rollout generation, IG-guided branching, and carefully designed prompts, SIGHT achieves higher accuracy with fewer search steps across single-hop and multi-hop QA benchmarks, markedly reducing tool calls in complex reasoning tasks. The framework scales across 3B and 7B backbones and remains robust in long-horizon reasoning, offering a practical path to more reliable open-domain QA systems under noisy retrieval conditions.
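The IG score above is a difference of two log-likelihoods, so it can be illustrated with a few lines of Python. The sketch below is not the authors' implementation: it assumes $y^*$ is the reference answer and $\mathcal{H}_t$ the interaction history, and that the answer's log-probability is the sum of per-token log-probabilities from the policy model. The function names and the toy probabilities are hypothetical.

```python
import math
from typing import Sequence


def sequence_logprob(token_logprobs: Sequence[float]) -> float:
    """Log-probability of an answer as the sum of its per-token log-probs.

    Assumption: the answer y* is scored token by token under the policy model.
    """
    return sum(token_logprobs)


def information_gain_score(logp_with_obs: float, logp_without_obs: float) -> float:
    """Score_IG(o_t) = log P(y* | H_t, o_t) - log P(y* | H_t)."""
    return logp_with_obs - logp_without_obs


# Toy numbers (hypothetical): the observation raises P(y*) from 0.1 to 0.4.
logp_without = sequence_logprob([math.log(0.5), math.log(0.2)])
logp_with = sequence_logprob([math.log(0.8), math.log(0.5)])
ig = information_gain_score(logp_with, logp_without)

# A misleading observation that lowers P(y*) yields a negative score.
ig_noisy = information_gain_score(math.log(0.05), logp_without)
```

A positive score means the observation made the reference answer more likely given the history; a score near zero or below suggests a redundant or misleading retrieval.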

Abstract

Reinforcement Learning (RL) has empowered Large Language Models (LLMs) to master autonomous search for complex question answering. However, particularly within multi-turn search scenarios, this interaction introduces a critical challenge: search results often suffer from high redundancy and low signal-to-noise ratios. Consequently, agents easily fall into "Tunnel Vision," where the forced interpretation of early noisy retrievals leads to irreversible error accumulation. To address these challenges, we propose SIGHT, a framework that enhances search-based reasoning through Self-Evidence Support (SES) and Information-Gain Driven Diverse Branching. SIGHT distills search results into high-fidelity evidence via SES and calculates an Information Gain score to pinpoint pivotal states where observations maximally reduce uncertainty. This score guides Dynamic Prompting Interventions - including de-duplication, reflection, or adaptive branching - to spawn new branches with SES. Finally, by integrating SES and correctness rewards via Group Relative Policy Optimization, SIGHT internalizes robust exploration strategies without external verifiers. Experiments on single-hop and multi-hop QA benchmarks demonstrate that SIGHT significantly outperforms existing approaches, particularly in complex reasoning scenarios, using fewer search steps.
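As a rough illustration of how SES and correctness rewards could feed Group Relative Policy Optimization, the sketch below normalizes each rollout's reward against the other rollouts sampled for the same question. The group-relative normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO advantage; the additive `combined_reward`, its `ses_weight`, and the use of the population standard deviation are assumptions made for the example, not the paper's reward design.

```python
from statistics import mean, pstdev
from typing import List


def combined_reward(correct: bool, ses_score: float, ses_weight: float = 0.2) -> float:
    """Hypothetical reward: correctness plus a weighted SES term.

    The additive form and the weight are illustrative assumptions only.
    """
    return (1.0 if correct else 0.0) + ses_weight * ses_score


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: standardize rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a zero-variance group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one question: (answer correct?, SES quality in [0, 1]).
rollouts = [(True, 1.0), (True, 0.0), (False, 1.0), (False, 0.0)]
rewards = [combined_reward(c, s) for c, s in rollouts]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group, no learned value model or external verifier is needed to rank rollouts against each other.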

Paper Structure (23 sections, 10 equations, 10 figures, 7 tables)

Figures (10)

  • Figure 1: Contrast between (top) noise-sensitive entropy exploration and (bottom) SIGHT's noise-resilient, Information Gain-driven framework.
  • Figure 2: The overall framework of SIGHT. (a) Self-Evidence Support (SES) based rollout generation for active noise filtration: After every search action, the agent autonomously distills the raw observation $o_t$ (e.g., a single-turn search result) into a noise-free evidence snippet $e_t$ within a <self-evidence> tag; (b) Information-Gain Driven Branching for Exploration: The system first performs pivotal state evaluation, calculating an Information Gain (IG) score to quantify the value of the current state. Based on the IG score and interaction history, specific Dynamic Prompting Interventions are injected to guide the next reasoning step, triggering de-duplication, reflection, or adaptive branching. SIGHT employs a continuous filtering and monitoring mechanism at each interaction step $t$.
  • Figure 3: Training dynamics comparison among Search-R1, ARPO and our proposed SIGHT, where we report the Tool Calls and Response Length of trained models.
  • Figure 4: Training evolution analysis of Reasoning Performance (EM) and Search Cost (TC) across Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct on evaluation datasets including Single-Hop and Multi-Hop questions.
  • Figure 5: Training and Inference Prompts for SIGHT
  • ...and 5 more figures