ResearcherBench: Evaluating Deep AI Research Systems on the Frontiers of Scientific Inquiry

Tianze Xu, Pengrui Lu, Lyumanshan Ye, Xiangkun Hu, Pengfei Liu

TL;DR

ResearcherBench is a benchmark for evaluating Deep AI Research Systems (DARS) on frontier AI questions, which the authors describe as the first of its kind. It pairs a curated dataset of 65 questions across 35 AI subjects with a dual evaluation framework: rubric assessment of insight quality and factual assessment of faithfulness and groundedness. The study finds that leading DARS outperform basic web-search LLMs on frontier tasks, though groundedness remains a challenge and correlates only weakly with overall quality; these systems are strongest on open consulting questions. By open-sourcing the benchmark, the authors aim to shift AI evaluation toward genuine research partnership and to support standards for AI-driven scientific discovery and AI self-improvement, in line with their vision of ASI for AI.

Abstract

The emergence of deep research systems presents significant capabilities in problem-solving, extending from basic queries to sophisticated research tasks. However, existing benchmarks primarily evaluate these systems as agents for web retrieval and report generation, overlooking their potential to discover novel insights on the frontiers of scientific research. To address this gap, we introduce ResearcherBench, the first benchmark focused on evaluating the capabilities of these advanced, agentic systems - which we refer to as Deep AI Research Systems (DARS) - on frontier AI scientific questions. We compiled a dataset of 65 research questions expertly selected from real-world scientific scenarios such as laboratory discussions and interviews, spanning 35 different AI subjects and categorized into three types: technical details, literature review, and open consulting. Our dual evaluation framework combines rubric assessment, which uses expert-designed criteria to evaluate insight quality, with factual assessment, which measures citation accuracy (faithfulness) and coverage (groundedness). We evaluated several leading commercial DARS and baseline systems. Results show that OpenAI Deep Research and Gemini Deep Research significantly outperform other systems, with particular strength in open-ended consulting questions. Such capabilities represent a meaningful step toward AI self-improvement, aligning with the vision of ASI for AI. We open-source ResearcherBench to provide a standardized platform for promoting the development of next-generation AI research assistants, hoping to foster a new perspective in AI research evaluation for a novel pattern of scientific collaboration: https://github.com/GAIR-NLP/ResearcherBench.
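
The abstract names three scores (rubric coverage, faithfulness, groundedness) without giving their formulas. The following is a minimal sketch of how such scores could be computed, not the authors' implementation. It assumes that coverage is the weighted fraction of rubric items a report satisfies, that faithfulness is the share of cited claims actually supported by their sources, and that groundedness is the share of all claims that carry a citation. The class names, field names, and weights are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    """One expert-written criterion (hypothetical structure)."""
    description: str
    weight: float   # assumed importance weight
    covered: bool   # judged: does the report satisfy this criterion?


@dataclass
class Claim:
    """One factual claim extracted from a report (hypothetical structure)."""
    text: str
    has_citation: bool
    supported: bool = False  # only meaningful when has_citation is True


def coverage(items: list[RubricItem]) -> float:
    """Weighted fraction of rubric items covered (assumed definition)."""
    total = sum(item.weight for item in items)
    if total == 0:
        return 0.0
    return sum(item.weight for item in items if item.covered) / total


def faithfulness(claims: list[Claim]) -> float:
    """Share of cited claims that their sources support (assumed definition)."""
    cited = [c for c in claims if c.has_citation]
    if not cited:
        return 0.0
    return sum(1 for c in cited if c.supported) / len(cited)


def groundedness(claims: list[Claim]) -> float:
    """Share of all claims that carry a citation (assumed definition)."""
    if not claims:
        return 0.0
    return sum(1 for c in claims if c.has_citation) / len(claims)


if __name__ == "__main__":
    rubric = [
        RubricItem("explains the core mechanism", 3.0, True),
        RubricItem("compares with prior work", 2.0, False),
        RubricItem("notes open problems", 1.0, True),
    ]
    claims = [
        Claim("claim A", has_citation=True, supported=True),
        Claim("claim B", has_citation=True, supported=False),
        Claim("claim C", has_citation=True, supported=True),
        Claim("claim D", has_citation=False),
    ]
    print(f"coverage     = {coverage(rubric):.3f}")
    print(f"faithfulness = {faithfulness(claims):.3f}")
    print(f"groundedness = {groundedness(claims):.3f}")
```

Under these assumed definitions, faithfulness and groundedness are independent: a report can cite few claims but cite them accurately, or cite many claims poorly.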

Paper Structure

This paper contains 60 sections, 6 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: ResearcherBench Framework Overview. The framework consists of three main components from top to bottom: (1) Dataset collection from authentic research scenarios leading to expert-generated rubrics, (2) Rubric assessment to evaluate coverage against rubrics, and (3) Factual assessment to measure faithfulness and groundedness scores.
  • Figure 2: AI Benchmark Topic Distribution with Representative Examples. Left Side: Pie chart showing the distribution of AI subjects in the benchmark. Right Side: Concrete question examples from major subjects.
  • Figure 3: Performance Analysis by Question Type (Rubric Assessment Coverage). Performance comparison across different question types for Deep AI Research Systems. Each system shows varying strengths across open consulting, technical details, and literature review categories.