BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent

Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, Sahel Sharifymoghaddam, Yanxi Li, Haoran Hong, Xinyu Shi, Xuye Liu, Nandan Thakur, Crystina Zhang, Luyu Gao, Wenhu Chen, Jimmy Lin

TL;DR

BrowseComp-Plus addresses fairness, transparency, and reproducibility in evaluating Deep-Research agents by replacing dynamic live-web evaluation with a fixed, human-verified corpus containing supporting and hard-negative documents. It enables disentangled analysis of retriever and LLM contributions and systematically analyzes how retrieval quality, reasoning effort, and document access strategies affect end-to-end performance. Across a wide set of open- and closed-source models and retrievers, stronger retrieval substantially improves accuracy and can reduce the need for excessive search, while oracle-like retrieval reveals substantial headroom for progress. The benchmark supports rigorous, component-level benchmarking and offers a pathway toward co-optimizing retrievers and agents for more reliable, cost-effective Deep-Research systems.

Abstract

Deep-Research agents, which integrate large language models (LLMs) with search tools, have shown success in handling complex queries that require iterative search planning and reasoning over search results. Evaluations on current benchmarks like BrowseComp, which rely on black-box live web search APIs, have notable limitations in (1) fairness: dynamic and opaque web APIs hinder fair comparisons and reproducibility of deep research methods; and (2) transparency: lack of control over the document corpus makes it difficult to isolate retriever contributions. In other words, current evaluations may compare complete deep research systems at a given point in time, but they do not foster well-controlled experiments that provide insight into the capability of the underlying deep research LLMs. To address these challenges, we introduce BrowseComp-Plus, a benchmark derived from BrowseComp that employs a fixed, carefully curated corpus. Each query in BrowseComp-Plus includes human-verified supporting documents and mined challenging negatives, enabling controlled experimentation. The benchmark is shown to be effective in distinguishing the performance of deep research systems. For instance, the open-source model Search-R1, when paired with the BM25 retriever, achieves 3.86% accuracy, whereas GPT-5 achieves 55.9%. Integrating GPT-5 with the Qwen3-Embedding-8B retriever further raises its accuracy to 70.1% with fewer search calls. This benchmark allows comprehensive evaluation and disentangled analysis of deep research agents and retrieval methods, fostering insights into retrieval effectiveness, citation accuracy, and context engineering in Deep-Research systems.

Paper Structure

This paper contains 39 sections, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Accuracy vs. number of search calls for Deep-Research agents with different retrievers. GPT-5, o3, and gpt-oss are evaluated with high reasoning effort. The figure shows that Deep-Research agents mostly improve final accuracy at the cost of more search calls, whereas better retrieval systems not only improve overall accuracy but also reduce the number of search calls.
  • Figure 2: The two-stage pipeline of collecting evidence documents in the corpus (Section \ref{section:buildingDocuemntCorpus}).
  • Figure 3: The pipeline of collecting hard negative documents in the corpus (Section \ref{section:hardNegativeMining}).
  • Figure 4: (a) Distribution of corpus document length in tokens, shown up to the 90th percentile for display; (b) distribution of tokens needed to include the answer in gold documents per query, shown up to the 90th percentile for display.
  • Figure 5: A screenshot of the annotation interface.