RhinoInsight: Improving Deep Research through Control Mechanisms for Model Behavior and Context

Yu Lei, Shuzheng Si, Wei Wang, Yifei Wu, Gang Chen, Fanchao Qi, Maosong Sun

TL;DR

RhinoInsight targets two failure modes of long-horizon deep research, error propagation and context rot, by introducing two control mechanisms that regulate model behavior and context. The Verifiable Checklist constrains planning to executable, verifiable sub-goals, while the Evidence Audit structures and audits context so that high-quality evidence is bound to claims and visuals, all without updating model parameters. The framework extends ReAct with a five-component loop and a memory-reconstruction strategy. It achieves state-of-the-art performance on deep research benchmarks (e.g., $R=50.92$ on the DeepResearch Bench) and competitive results on deep search tasks, including GAIA text-only. These results suggest that principled control over actions and context can substantially improve robustness, traceability, and accuracy in AI-assisted deep research systems, and point toward more reliable real-world deployment. Future work includes adaptive control policies and human-in-the-loop refinements to further improve reliability and efficiency.

Abstract

Large language models are evolving from single-turn responders into tool-using agents capable of sustained reasoning and decision-making for deep research. Prevailing systems adopt a linear pipeline (plan, search, then write a report), which suffers from error accumulation and context rot because it lacks explicit control over both model behavior and context. We introduce RhinoInsight, a deep research framework that adds two control mechanisms to enhance robustness, traceability, and overall quality without parameter updates. First, a Verifiable Checklist module transforms user requirements into traceable and verifiable sub-goals, incorporates human or LLM critics for refinement, and compiles a hierarchical outline to anchor subsequent actions and prevent non-executable planning. Second, an Evidence Audit module structures search content, iteratively updates the outline, and prunes noisy context, while a critic ranks and binds high-quality evidence to drafted content to ensure verifiability and reduce hallucinations. Our experiments demonstrate that RhinoInsight achieves state-of-the-art performance on deep research tasks while remaining competitive on deep search tasks.
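
To make the two mechanisms more concrete, the following is a minimal Python sketch of the general idea: sub-goals from a checklist receive ranked evidence, and low-quality context is pruned before binding. It is an illustration under stated assumptions, not the authors' implementation. The names `Evidence`, `SubGoal`, and `audit`, the keyword-matching rule, the `[0, 1]` score scale, and the `min_score` and `top_k` parameters are all hypothetical; in the framework described above, ranking is performed by a critic rather than by fixed scores.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Evidence:
    source: str   # hypothetical identifier for where the snippet came from
    text: str     # the snippet itself
    score: float  # assumed critic-assigned quality score in [0, 1]


@dataclass
class SubGoal:
    description: str
    keywords: Tuple[str, ...]  # assumed stand-in for a verifiable criterion
    evidence: List[Evidence] = field(default_factory=list)

    def is_verified(self) -> bool:
        # In this sketch, a sub-goal is verified once any evidence is bound to it.
        return bool(self.evidence)


def audit(goals: List[SubGoal], pool: List[Evidence],
          min_score: float = 0.5, top_k: int = 2) -> List[SubGoal]:
    """Prune low-score evidence, then bind the best matches to each sub-goal."""
    # Prune noisy context: drop anything below the quality threshold.
    kept = [e for e in pool if e.score >= min_score]
    for goal in goals:
        # Match evidence to the sub-goal by case-insensitive keyword lookup.
        matches = [e for e in kept
                   if any(k.lower() in e.text.lower() for k in goal.keywords)]
        # Rank by score (highest first) and bind only the top-k items.
        matches.sort(key=lambda e: e.score, reverse=True)
        goal.evidence = matches[:top_k]
    return goals


if __name__ == "__main__":
    goals = [
        SubGoal("Compare rendering performance", ("render", "fps")),
        SubGoal("Summarize licensing terms", ("license",)),
    ]
    pool = [
        Evidence("doc-a", "Measured render time per frame", 0.9),
        Evidence("doc-b", "Forum rumor about FPS", 0.2),
        Evidence("doc-c", "FPS benchmark table", 0.7),
    ]
    for goal in audit(goals, pool):
        print(goal.description, [e.source for e in goal.evidence], goal.is_verified())
```

Sub-goals that end the audit unverified are the ones a planner would need to revisit, which is one simple way to keep planning tied to checkable outcomes.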

Paper Structure

This paper contains 30 sections, 10 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: RhinoInsight achieves state-of-the-art performance on deep research tasks compared with proprietary systems, while remaining competitive on deep search tasks.
  • Figure 2: (a) Linear Deep Research Workflow. (b) RhinoInsight with two enhanced modules: (1) a Verifiable Checklist that turns the query into goals and constrains planning; (2) an Evidence Audit that organizes memory, summarizes context, and extracts evidence for the outline and the writing stage.
  • Figure 3: RhinoInsight in practice: comparative evaluation of Flutter versus other cross‑platform frameworks.