Understanding DeepResearch via Reports

Tianyu Fan, Xinyao Niu, Yuxiang Zheng, Fengji Zhang, Chengen Huang, Bei Chen, Junyang Lin, Chao Huang

TL;DR

This paper introduces DeepResearch-ReportEval, a report-centric evaluation framework for DeepResearch systems. It defines three evaluation dimensions (quality, redundancy, and factuality) and uses an iterative LLM–human alignment pipeline to bring automated judgments in line with expert judgments. The authors release a curated benchmark of 100 queries across 12 real-world categories and evaluate four commercial DeepResearch systems, revealing distinct design philosophies and trade-offs. The framework aims to standardize end-to-end assessment of DeepResearch outputs, with an emphasis on robust evidence grounding, and to guide future development toward more interpretable, proactive, and credible AI research partners for knowledge-intensive tasks.

Abstract

DeepResearch agents represent a transformative AI paradigm, conducting expert-level research through sophisticated reasoning and multi-tool integration. However, evaluating these systems remains critically challenging due to open-ended research scenarios and existing benchmarks that focus on isolated capabilities rather than holistic performance. Unlike traditional LLM tasks, DeepResearch systems must synthesize diverse sources, generate insights, and present coherent findings, which are capabilities that resist simple verification. To address this gap, we introduce DeepResearch-ReportEval, a comprehensive framework designed to assess DeepResearch systems through their most representative outputs: research reports. Our approach systematically measures three dimensions: quality, redundancy, and factuality, using an innovative LLM-as-a-Judge methodology achieving strong expert concordance. We contribute a standardized benchmark of 100 curated queries spanning 12 real-world categories, enabling systematic capability comparison. Our evaluation of four leading commercial systems reveals distinct design philosophies and performance trade-offs, establishing foundational insights as DeepResearch evolves from information assistants toward intelligent research partners. Source code and data are available at: https://github.com/HKUDS/DeepResearch-Eval.

Paper Structure

This paper contains 25 sections, 4 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: DeepResearch systems operate through Interaction, Investigation, and Synthesis.
  • Figure 2: Current search queries (upper), from BrowseComp (Wei et al., 2025) and HotpotQA (Yang et al., 2018), seek specific answers through multi-hop reasoning. DeepResearch queries (lower) demand comprehensive investigation and synthesis, producing detailed analytical reports as shown in the figure. Additional examples appear in the appendix.
  • Figure 3: Visualization of category distribution of DeepResearch queries.
  • Figure 4: Overview of the DeepResearch-ReportEval framework. An LLM-as-a-Judge approach evaluates reports along the dimensions of quality, redundancy, and factuality, while LLM–human alignment is used to ensure the reliability of the automated judgments.