ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks

Minghao Li, Ying Zeng, Zhihao Cheng, Cong Ma, Kai Jia

TL;DR

ReportBench introduces a scalable benchmark for assessing the content quality of research reports generated by Deep Research (DR) agents, focusing on the quality of cited literature and the factual fidelity of statements. It automatically constructs evaluation tasks from expert-authored arXiv surveys via reverse prompt engineering and validates outputs with a two-track verification pipeline: citation-based checks for cited content and web-based fact verification for non-cited claims. Experimental results show that commercial DR agents outperform base models in coverage and grounding but still suffer from hallucination and over-citation, which marks clear areas for improvement. The framework and accompanying data and tooling are released to support reproducible evaluation and future enhancements in AI-assisted scholarly reporting.

Abstract

The advent of Deep Research agents has substantially reduced the time required for conducting extensive research tasks. However, these tasks inherently demand rigorous standards of factual accuracy and comprehensiveness, necessitating thorough evaluation before widespread adoption. In this paper, we propose ReportBench, a systematic benchmark designed to evaluate the content quality of research reports generated by large language models (LLMs). Our evaluation focuses on two critical dimensions: (1) the quality and relevance of cited literature, and (2) the faithfulness and veracity of the statements within the generated reports. ReportBench leverages high-quality published survey papers available on arXiv as gold-standard references, from which we apply reverse prompt engineering to derive domain-specific prompts and establish a comprehensive evaluation corpus. Furthermore, we develop an agent-based automated framework within ReportBench that systematically analyzes generated reports by extracting citations and statements, checking the faithfulness of cited content against original sources, and validating non-cited claims using web-based resources. Empirical evaluations demonstrate that commercial Deep Research agents such as those developed by OpenAI and Google consistently generate more comprehensive and reliable reports than standalone LLMs augmented with search or browsing tools. However, there remains substantial room for improvement in terms of the breadth and depth of research coverage, as well as factual consistency. The complete code and data will be released at the following link: https://github.com/ByteDance-BandAI/ReportBench
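
The two evaluation dimensions described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Statement` class, the `route` and `citation_overlap` functions, and the choice of set-based precision and recall as the overlap measure are all assumptions made for illustration. The sketch only shows the general shape of (1) comparing a report's citations against the reference list of a gold-standard survey and (2) routing each extracted statement to either the citation-based or the web-based verification track.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Statement:
    """Hypothetical container for one statement extracted from a report."""
    text: str
    cited_refs: Tuple[str, ...] = ()  # identifiers of sources the statement cites


def route(statement: Statement) -> str:
    """Pick a verification track (assumed rule: any citation -> citation check)."""
    return "citation_check" if statement.cited_refs else "web_check"


def citation_overlap(cited: Iterable[str], gold: Iterable[str]) -> Tuple[float, float]:
    """Return (precision, recall) of cited references against a gold reference list.

    Assumption: references are already normalized to comparable identifiers.
    """
    cited_set, gold_set = set(cited), set(gold)
    hits = cited_set & gold_set
    precision = len(hits) / len(cited_set) if cited_set else 0.0
    recall = len(hits) / len(gold_set) if gold_set else 0.0
    return precision, recall


if __name__ == "__main__":
    # Placeholder identifiers, not real papers.
    report_refs = ["ref-a", "ref-b", "ref-c", "ref-d"]
    survey_refs = ["ref-a", "ref-b", "ref-e"]
    print(citation_overlap(report_refs, survey_refs))
    print(route(Statement("A cited claim.", ("ref-a",))))
    print(route(Statement("An uncited claim.")))
```

Under this assumed measure, a report that cites many sources absent from the survey's reference list scores low on precision, which is one way the over-citation behavior mentioned in the TL;DR could show up numerically.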

Paper Structure

This paper contains 30 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Overall benchmark data construction workflow.
  • Figure 2: Application domain distribution of the 678 filtered ReportBench prompts: (a) a pie chart showing the proportion of each application domain, (b) a bar chart illustrating the total task counts across all 11 categories.
  • Figure 3: Evaluation Process.