Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics

Maojia Song, Renhang Liu, Xinyu Wang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou, Dorien Herremans, Soujanya Poria

TL;DR

WebDetective addresses two problems with current deep-search benchmarks: questions leak hints about the reasoning path, and evaluation is reduced to a single pass rate. It introduces hint-free multi-hop questions within a controlled Wikipedia sandbox and a diagnostic framework that decouples knowledge sufficiency, search effectiveness, and generation/refusal quality, complemented by the EvidenceLoop agentic workflow, which uses memory and verification to address synthesis bottlenecks. A 25-model evaluation on 200 questions reveals systematic weaknesses in knowledge utilisation and calibrated refusal: modern systems can execute provided reasoning paths but struggle to discover reasoning chains autonomously. The proposed framework and EvidenceLoop baseline demonstrate meaningful, though partial, improvements and offer a generalisable approach for developing genuinely autonomous reasoning systems across domains beyond Wikipedia.

Abstract

RAG (Retrieval-Augmented Generation) systems and web agents are increasingly evaluated on multi-hop deep search tasks, yet current practice suffers from two major limitations. First, most benchmarks leak the reasoning path in the question text, allowing models to follow surface cues rather than discover reasoning chains autonomously. Second, evaluation is typically reduced to a single pass rate, which collapses diverse behaviours into one score and obscures whether failures stem from inadequate search, poor knowledge use, or inappropriate refusal. To address these issues, we present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox that ensures full traceability of model actions, and a holistic evaluation framework that separates search sufficiency, knowledge utilisation, and refusal behaviour. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures: models struggle with knowledge utilisation despite having sufficient evidence and demonstrate near-absent appropriate refusal when evidence is lacking. These patterns expose a fundamental gap: today's systems excel at executing given reasoning paths but fail when required to discover them. We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies, incorporating verification loops and systematic evidence tracking that improve both search and synthesis capabilities. This baseline demonstrates that WebDetective's diagnostic framework can guide concrete architectural improvements, establishing our benchmark as a critical tool for developing genuinely autonomous reasoning systems rather than pattern-following agents.
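
To make the contrast between a single pass rate and a factorised report concrete, the following is a minimal sketch, not the paper's metric definitions. The `Outcome` fields, the function `factorised_report`, and the four rate names are hypothetical, chosen only to mirror the three axes named in the abstract (search sufficiency, knowledge utilisation, refusal behaviour).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Outcome:
    """Per-question record (hypothetical schema, not the paper's)."""
    sufficient: bool  # the evidence needed to answer was available to the model
    refused: bool     # the model declined to answer
    correct: bool     # the model's answer matched the gold answer


def _rate(numerator: int, denominator: int) -> Optional[float]:
    # Return None instead of dividing by zero when a bucket is empty.
    return numerator / denominator if denominator else None


def factorised_report(outcomes: List[Outcome]) -> Dict[str, Optional[float]]:
    """Split one pass rate into separate, assumed diagnostic rates."""
    with_evidence = [o for o in outcomes if o.sufficient]
    without_evidence = [o for o in outcomes if not o.sufficient]
    return {
        # The single aggregate score that hides where failures come from.
        "pass_rate": _rate(sum(o.correct for o in outcomes), len(outcomes)),
        # How often the model ended up with enough evidence at all.
        "sufficiency_rate": _rate(len(with_evidence), len(outcomes)),
        # Given enough evidence, how often it produced the right answer.
        "utilisation_rate": _rate(
            sum(o.correct for o in with_evidence), len(with_evidence)
        ),
        # Given too little evidence, how often it refused rather than guessed.
        "appropriate_refusal_rate": _rate(
            sum(o.refused for o in without_evidence), len(without_evidence)
        ),
    }
```

Under these assumptions, two systems with the same `pass_rate` can differ sharply in `utilisation_rate` or `appropriate_refusal_rate`, which is the kind of distinction a single score cannot express.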

Paper Structure

This paper contains 39 sections, 13 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Comparison of different question formulations in multi-hop deep search. Left: Path-Hinting (PH) benchmarks such as HotpotQA embed the reasoning path directly in the question text, effectively reducing reasoning to execution. Middle: Specification-Hinting (SH) benchmarks such as BrowseComp obscure the target entity behind multiple attributes, testing filtering rather than autonomous exploration. Right: Our Hint-Free (HF) formulation in WebDetective removes both path and specification hints, requiring agents to autonomously discover reasoning chains within a controlled Wikipedia sandbox.
  • Figure 2: Test-time scaling (TTS) on WebDetective. Left: Increasing Claude-Opus-4.1’s context length boosts Search Score and Pass@1 at shorter ranges, but both plateau beyond 32k tokens, while Generation Score shows only modest gains. Right: EvidenceLoop remains stable across breadth–iteration settings, with Pass@1 improving under moderate configurations (e.g., 1-2, 3-2). Generation Score changes little, highlighting synthesis, not search, as the main bottleneck under TTS.
  • Figure 3: Dataset statistics for WebDetective benchmark. The figure shows: (a) Distribution of question types, (b) Number of entities per question, (c) Evidence count distribution, (d) Question and answer length in characters, (e) Hop length distribution by analysis setting, and (f) Search query usage patterns. The dataset exhibits controlled complexity with predominantly 2-3 hop questions while maintaining challenging longer chains.
  • Figure 4: Overview of the EvidenceLoop framework. The system employs parallel solver and extractor agents that perform search and reasoning to generate proposals with supporting claims. These proposals are then verified through memory retrieval, with the aggregated context fed into subsequent iterations until verification succeeds or a termination condition is met.
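
The control flow described in the Figure 4 caption can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: `Proposal`, `evidence_loop`, the solver and verifier signatures, the list-based memory, and the choice to return `None` when the iteration budget runs out are all hypothetical, and the solvers run sequentially here as a stand-in for parallel agents.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Proposal:
    """A candidate answer plus the claims offered in support (assumed shape)."""
    answer: str
    claims: List[str]


# Hypothetical interfaces: a solver maps (question, context) to a proposal,
# and a verifier checks a proposal against the evidence memory.
Solver = Callable[[str, str], Proposal]
Verifier = Callable[[Proposal, List[str]], bool]


def evidence_loop(
    question: str,
    solvers: List[Solver],
    verify: Verifier,
    max_iterations: int = 3,
) -> Optional[str]:
    memory: List[str] = []  # evidence accumulated across iterations
    context = ""            # aggregated context passed to the next round
    for _ in range(max_iterations):
        # Each solver searches and reasons independently.
        proposals = [solve(question, context) for solve in solvers]
        # Record every new supporting claim in memory, preserving order.
        for proposal in proposals:
            for claim in proposal.claims:
                if claim not in memory:
                    memory.append(claim)
        # Accept the first proposal that the verifier confirms against memory.
        for proposal in proposals:
            if verify(proposal, memory):
                return proposal.answer
        # Otherwise feed the aggregated evidence into the next iteration.
        context = "\n".join(memory)
    # Termination condition reached without a verified answer.
    return None
```

Returning `None` on termination is one way to model a refusal when no proposal can be verified; a real system might instead fall back to its best unverified proposal.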