WideSearch: Benchmarking Agentic Broad Info-Seeking
Ryan Wong, Jiawei Wang, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, Wenhao Huang, Yang Wang, Ke Wang
TL;DR
WideSearch targets the bottleneck of wide-scale information seeking with a bilingual, multi-domain benchmark (200 tasks: 100 English, 100 Chinese) that requires agents to gather, verify, and organize large volumes of atomic facts from live web sources into structured tables. The benchmark combines a rigorous five-stage data-curation pipeline with an automated, hybrid evaluation system that measures table-level completeness and fidelity and is validated against human judgments. Across more than 10 agent systems, including single-agent and multi-agent configurations and end-to-end commercial tools, table-level success is near zero; item-level recall improves with retries, and human testers still outperform the agents. Analyses point to fundamental deficiencies in planning, reflection, and evidence grounding, and suggest multi-agent collaboration as a promising direction for large-scale, high-fidelity information gathering. The benchmark and evaluation pipeline are publicly available to drive future progress in robust agentic search.
Abstract
From professional research to everyday planning, many tasks are bottlenecked by wide-scale information seeking, which is more repetitive than cognitively complex. With the rapid development of Large Language Models (LLMs), automated search agents powered by LLMs offer a promising way to free humans from this tedious work. However, the ability of these agents to perform such "wide-context" collection reliably and completely remains largely unevaluated because suitable benchmarks are lacking. To bridge this gap, we introduce WideSearch, a new benchmark engineered to evaluate agent reliability on these large-scale collection tasks. The benchmark features 200 manually curated questions (100 in English, 100 in Chinese) from over 15 diverse domains, grounded in real user queries. Each task requires agents to collect large-scale atomic information, each item of which can be objectively verified one by one, and arrange it into a well-organized output. A rigorous five-stage quality control pipeline ensures the difficulty, completeness, and verifiability of the dataset. We benchmark over 10 state-of-the-art agentic search systems, including single-agent and multi-agent frameworks and end-to-end commercial systems. Most systems achieve overall success rates near 0%, with the best performer reaching just 5%. However, given sufficient time, cross-validation by multiple human testers can achieve a success rate of nearly 100%. These results demonstrate that present search agents have critical deficiencies in large-scale information seeking, underscoring urgent areas for future research and development in agentic search. Our dataset, evaluation pipeline, and benchmark results have been publicly released at https://widesearch-seed.github.io/
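
To make the gap between the overall (table-level) success rate and item-level scores concrete, the following minimal Python sketch scores one predicted table against a reference table. It is an illustration under stated assumptions, not the released WideSearch evaluation pipeline: the names `Table`, `normalize`, `to_items`, and `score` are hypothetical, and cells are compared by exact match after simple normalization, whereas the actual evaluation is described as hybrid.

```python
from typing import Dict, Set, Tuple

# Hypothetical representation: row key -> {column name -> cell value}.
Table = Dict[str, Dict[str, str]]
Item = Tuple[str, str, str]


def normalize(value: str) -> str:
    """Assumed normalization: collapse whitespace and lowercase."""
    return " ".join(value.split()).lower()


def to_items(table: Table) -> Set[Item]:
    """Flatten a table into a set of atomic (row key, column, value) items."""
    return {
        (normalize(key), column, normalize(value))
        for key, row in table.items()
        for column, value in row.items()
    }


def score(predicted: Table, reference: Table) -> Dict[str, float]:
    """Return table-level success and item-level precision, recall, and F1."""
    pred_items = to_items(predicted)
    ref_items = to_items(reference)
    correct = len(pred_items & ref_items)
    precision = correct / len(pred_items) if pred_items else 0.0
    recall = correct / len(ref_items) if ref_items else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        # Success requires every item to match, with nothing missing or extra.
        "success": float(pred_items == ref_items),
        "item_precision": precision,
        "item_recall": recall,
        "item_f1": f1,
    }


# Toy data invented for illustration only.
reference = {
    "Alpha": {"year": "2020", "country": "France"},
    "Beta": {"year": "2021", "country": "Japan"},
}
predicted = {
    "Alpha": {"year": "2020", "country": "France"},
    "Beta": {"year": "2021", "country": "Korea"},  # one wrong cell
}
result = score(predicted, reference)
print(result)
```

In this toy case three of the four cells are correct, so item-level recall is 0.75 while table-level success is 0. A single wrong cell fails the whole table, which is consistent with agents collecting many correct items yet almost never producing a fully correct table.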
