Evaluating the Search Agent in a Parallel World

Jiawei Chen; Xintian Shen; Lihao Zheng; Lifu Mu; Haoyi Sun; Ning Mao; Hao Ma; Tao Wei; Pan Zhou; Kun Zhan

TL;DR

Experiments across three evaluation settings show that, while search agents are strong at evidence synthesis given complete information, their performance is limited not only by evidence collection and coverage in unfamiliar search environments, but also by unreliable evidence sufficiency judgment and when-to-stop decisions, which act as bottlenecks.

Abstract

Integrating web search tools has significantly extended the capability of LLMs to address open-world, real-time, and long-tail problems. However, evaluating these Search Agents presents formidable challenges. First, constructing high-quality deep search benchmarks is prohibitively expensive, while unverified synthetic data often suffers from unreliable sources. Second, static benchmarks face dynamic obsolescence: as internet information evolves, complex queries requiring deep research often degrade into simple retrieval tasks due to increased popularity, and ground truths become outdated due to temporal shifts. Third, attribution ambiguity confounds evaluation, as an agent's performance is often dominated by its parametric memory rather than its actual search and reasoning capabilities. Finally, reliance on specific commercial search engines introduces variability that hampers reproducibility. To address these issues, we propose a novel framework, Mind-ParaWorld (MPW), for evaluating Search Agents in a Parallel World. Specifically, MPW samples real-world entity names to synthesize future scenarios and questions situated beyond the model's knowledge cutoff. A ParaWorld Law Model then constructs a set of indivisible Atomic Facts and a unique ground truth for each question. During evaluation, instead of retrieving real-world results, the agent interacts with a ParaWorld Engine Model that dynamically generates SERPs grounded in these inviolable Atomic Facts. We release MPW-Bench, an interactive benchmark spanning 19 domains with 1,608 instances. Experiments across three evaluation settings show that, while search agents are strong at evidence synthesis given complete information, their performance is limited not only by evidence collection and coverage in unfamiliar search environments, but also by unreliable evidence sufficiency judgment and when-to-stop decisions, which act as bottlenecks.

Paper Structure (46 sections, 3 equations, 3 figures, 6 tables)

Introduction
1. Dynamic Obsolescence of Static Benchmarks:
2. Attribution Ambiguity:
3. The Cost-Quality Paradox:
The Mind-ParaWorld Framework
Definition of the Four Stages of Agent Development
Overview of Mind-ParaWorld Framework
Construction of the ParaWorld Questions
Multi-fact dependency.
Parametric-memory isolation.
Anti-shortcut.
Constructions of the ParaWorld Laws
Representation of Atomic Facts.
Generation of the ParaWorld News
Query type classification: atomic vs. compound queries.
...and 31 more sections

Figures (3)

Figure 1: Overview of Mind-ParaWorld Framework.
Figure 2: Process-level analysis under Setting C on the relationship between tool-call budget and evidence coverage for three representative search agents (GPT-5, MindWatcher, and MiniMax-m2.1). Upper: mean marginal newly covered atomic facts at the k-th tool call (left axis) and cohort size n(k) (right axis), computed over trajectories with $\mathrm{ToolCalls}\ge k_{\text{cohort}}$; regions with n(k)<50 are shown with reduced opacity. Lower: truncated cumulative curves of factual coverage FCR(k) (solid, left axis) and cumulative hit precision HitPrec(k) (dashed, right axis) computed over trajectories with $\mathrm{ToolCalls}\ge k_{\text{trunc}}$, revealing diminishing marginal gains and saturation under longer interactions.
Figure 3: Relationship between FCR and Pass@1. The curve shows sample-level correlation from 4,824 samples across Settings B and C. Scatter points represent model-level performance: Setting B Guidance, Setting B Few-shot, and Setting C. The dashed line indicates Setting A performance (FCR=1.0). The width of the shaded band reflects sample density.
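
The process-level quantities named in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: each trajectory is represented as a list of sets of atomic-fact identifiers hit at each tool call plus the total number of atomic facts for that question; the cohort at step k is taken to be all trajectories with at least k tool calls; and FCR(k) is taken to be the fraction of a question's atomic facts covered after k calls, averaged over that cohort. The function name `coverage_curves`, the data layout, and the fact identifiers are hypothetical. HitPrec(k) is omitted because its definition is not given here.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical layout: (atomic-fact IDs hit at each tool call, total atomic facts)
Trajectory = Tuple[List[Set[str]], int]


def coverage_curves(
    trajectories: List[Trajectory], k_max: int
) -> Dict[int, Dict[str, float]]:
    """Per-step cohort size, mean marginal new facts, and mean coverage.

    Assumption: the cohort at step k is every trajectory with >= k tool calls.
    """
    curves: Dict[int, Dict[str, float]] = {}
    for k in range(1, k_max + 1):
        cohort = [(hits, total) for hits, total in trajectories if len(hits) >= k]
        if not cohort:
            break  # no trajectory is long enough to reach step k
        marginal_sum = 0
        fcr_sum = 0.0
        for hits, total in cohort:
            seen_before = set().union(*hits[: k - 1])  # facts covered by calls 1..k-1
            seen_now = seen_before | hits[k - 1]       # facts covered by calls 1..k
            marginal_sum += len(seen_now) - len(seen_before)
            fcr_sum += len(seen_now) / total
        n = len(cohort)
        curves[k] = {"n": n, "marginal": marginal_sum / n, "fcr": fcr_sum / n}
    return curves


# Toy data: two trajectories with made-up fact IDs.
trajectories: List[Trajectory] = [
    ([{"f1"}, {"f1", "f2"}, set()], 4),  # 3 tool calls, 4 atomic facts in total
    ([{"f1", "f2"}], 2),                 # 1 tool call, 2 atomic facts in total
]
curves = coverage_curves(trajectories, k_max=3)
for k, row in curves.items():
    print(k, row)
```

In this toy input the cohort shrinks from two trajectories to one after the first call, which is the effect the caption's n(k) axis is meant to expose: averages at large k are computed over fewer and longer trajectories.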
