WideSearch: Benchmarking Agentic Broad Info-Seeking

Ryan Wong, Jiawei Wang, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, Wenhao Huang, Yang Wang, Ke Wang

TL;DR

WideSearch addresses the bottleneck of wide-scale information seeking by introducing a bilingual, multi-domain benchmark (200 tasks: 100 English, 100 Chinese) that requires agents to gather, verify, and organize large-scale atomic data into structured tables from live web sources. The framework combines a rigorous five-stage data-curation pipeline with an automated, hybrid evaluation system to measure table-level completeness and fidelity, validated against human judgments. Across 10+ agent systems, including single- and multi-agent configurations and end-to-end commercial tools, results show near-zero table-level success, with item-level recall improvable through retries and humans performing better than machines. Analyses reveal fundamental deficiencies in planning, reflection, and evidence grounding, suggesting multi-agent collaboration as a promising direction to improve large-scale, high-fidelity information gathering. The benchmark and evaluation pipeline are publicly available to drive future progress in robust agentic search.

Abstract

From professional research to everyday planning, many tasks are bottlenecked by wide-scale information seeking, which is more repetitive than cognitively complex. With the rapid development of Large Language Models (LLMs), automated search agents powered by LLMs offer a promising solution to liberate humans from this tedious work. However, the capability of these agents to perform such "wide-context" collection reliably and completely remains largely unevaluated due to a lack of suitable benchmarks. To bridge this gap, we introduce WideSearch, a new benchmark engineered to evaluate agent reliability on these large-scale collection tasks. The benchmark features 200 manually curated questions (100 in English, 100 in Chinese) from over 15 diverse domains, grounded in real user queries. Each task requires agents to collect large-scale atomic information, each item of which can be verified objectively one by one, and arrange it into a well-organized output. A rigorous five-stage quality control pipeline ensures the difficulty, completeness, and verifiability of the dataset. We benchmark over 10 state-of-the-art agentic search systems, including single-agent and multi-agent frameworks and end-to-end commercial systems. Most systems achieve overall success rates near 0%, with the best performer reaching just 5%. However, given sufficient time, cross-validation by multiple human testers can achieve a near 100% success rate. These results demonstrate that present search agents have critical deficiencies in large-scale information seeking, underscoring urgent areas for future research and development in agentic search. Our dataset, evaluation pipeline, and benchmark results have been publicly released at https://widesearch-seed.github.io/
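To make the gap between near-zero overall success and partially correct outputs concrete, the following is a minimal sketch of how a strict table-level success criterion can differ from item-level scores. It is an illustration under stated assumptions, not the benchmark's evaluation pipeline: the function names `item_scores` and `table_success`, the cell-keyed dictionary representation, and the exact-match comparison after lowercasing are all hypothetical stand-ins (the benchmark describes its own evaluation as an automated, hybrid system).

```python
from typing import Dict, Tuple

# Assumption: a table is a mapping from (row key, column name) to a cell value.
Cell = Tuple[str, str]


def _norm(value: str) -> str:
    """Hypothetical normalization: trim whitespace and lowercase."""
    return value.strip().lower()


def item_scores(pred: Dict[Cell, str], gold: Dict[Cell, str]) -> Tuple[float, float, float]:
    """Return item-level (precision, recall, F1) using exact match per cell."""
    correct = sum(
        1 for key, value in pred.items()
        if key in gold and _norm(gold[key]) == _norm(value)
    )
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def table_success(pred: Dict[Cell, str], gold: Dict[Cell, str]) -> bool:
    """Strict criterion: every gold cell is present and correct, with no extras."""
    precision, recall, _ = item_scores(pred, gold)
    return precision == 1.0 and recall == 1.0


# Toy example (invented data): one wrong cell out of four fails the whole table.
gold = {
    ("alpha", "year"): "2001", ("alpha", "city"): "Oslo",
    ("beta", "year"): "2005", ("beta", "city"): "Lima",
}
pred = dict(gold)
pred[("beta", "city")] = "Quito"

precision, recall, f1 = item_scores(pred, gold)
print(precision, recall, table_success(pred, gold))
```

Under this all-or-nothing criterion a single incorrect cell fails the entire table even when most items are right, which is one way a system can show reasonable item-level recall while its table-level success rate stays near zero.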

Paper Structure

This paper contains 27 sections, 12 figures, 5 tables.

Figures (12)

  • Figure 1: A conceptual comparison of manual and agent-based approaches for WideSearch tasks. The diagram illustrates the operational workflow and inherent limitations associated with two distinct methodologies for large-scale information seeking. It contrasts the labor-intensive nature of the traditional manual approach with the potential efficiencies and novel failure modes of automated search agents. This comparison underscores the necessity for a systematic evaluation to quantify agent performance and reliability.
  • Figure 2: An overview and detailed comparison of DeepSearch, DeepResearch, and our WideSearch. The conceptual map on the left (a) illustrates the high-level relationships and operational domains of the three paradigms. The table on the right (b) provides a detailed breakdown, contrasting them across key dimensions including core tasks, evaluation methods, and primary value propositions.
  • Figure 3: An overview of our integrated data pipeline, detailing the five-stage data curation and validation pipeline (left), and the automated evaluation pipeline (right).
  • Figure 4: A visually enhanced example of a task from our benchmark. The task is separated into a styled Task Prompt box, a Ground-Truth box, and an Evaluation Criteria box.
  • Figure 5: Distribution of the 18 distinct topics across the 200 tasks in the WideSearch benchmark, ensuring broad domain coverage.
  • ...and 7 more figures