Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge
Boyu Gou, Zanming Huang, Yuting Ning, Yu Gu, Michael Lin, Weijian Qi, Andrei Kopanev, Botao Yu, Bernal Jiménez Gutiérrez, Yiheng Shu, Chan Hee Song, Jiaman Wu, Shijie Chen, Hanane Nour Moussa, Tianshu Zhang, Jian Xie, Yifei Li, Tianci Xue, Zeyi Liao, Kai Zhang, Boyuan Zheng, Zhaowei Cai, Viktor Rozgic, Morteza Ziyadi, Huan Sun, Yu Su
TL;DR
Mind2Web 2 addresses the challenge of evaluating long-horizon, agentic web-search systems by introducing a 130-task benchmark and an automated Agent-as-a-Judge framework that uses task-specific rubric trees to assess answer correctness and source attribution. It combines a thorough task-collection pipeline with a software-enabled judge-agent system (Extractor/Verifier) to enable scalable, automated evaluation of real-time, citation-backed outputs across frontier agents and humans. Empirical results show Deep Research systems outperform LLM-only agents on time-varying tasks, with OpenAI Deep Research reaching about half to two-thirds of human performance while delivering substantially faster responses; however, time-varying information and attribution remain points of weakness. The work provides a rigorous basis for developing and benchmarking next-generation agentic search, including detailed error analyses and human evaluation of judge agents to ensure reliability and generalizability.
Abstract
Agentic search such as Deep Research systems-where agents autonomously browse the web, synthesize information, and return comprehensive citation-backed answers-represents a major shift in how users interact with web-scale information. While promising greater efficiency and cognitive offloading, the growing complexity and open-endedness of agentic search have outpaced existing evaluation benchmarks and methodologies, which largely assume short search horizons and static answers. In this paper, we introduce Mind2Web 2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that require real-time web browsing and extensive information synthesis, constructed with over 1000 hours of human labor. To address the challenge of evaluating time-varying and complex answers, we propose a novel Agent-as-a-Judge framework. Our method constructs task-specific judge agents based on a tree-structured rubric design to automatically assess both answer correctness and source attribution. We conduct a comprehensive evaluation of ten frontier agentic search systems and human performance, along with a detailed error analysis to draw insights for future development. The best-performing system, OpenAI Deep Research, can already achieve 50-70% of human performance while spending half the time, highlighting its great potential. Altogether, Mind2Web 2 provides a rigorous foundation for developing and benchmarking the next generation of agentic search systems.
