A Rigorous Benchmark with Multidimensional Evaluation for Deep Research Agents: From Answers to Reports

Yang Yao, Yixu Wang, Yuxuan Zhang, Yi Lu, Tianle Gu, Lingyu Li, Dingyi Zhao, Keming Wu, Haozhe Wang, Ping Nie, Yan Teng, Yingchun Wang

TL;DR

The paper introduces Rigorous Bench, a high-complexity benchmark of 214 report-style queries across 10 domains, designed to rigorously evaluate Deep Research Agents (DRAs) on long-form outputs with expert-constructed reference bundles. It defines a multidimensional evaluation framework based on semantic quality, topical focus, and retrieval trustworthiness, combining QSRs, GRRs, TSLs, and keyword drift signals into an IntegratedScore via $\text{IntegratedScore} = \text{Quality} \times (1 - \text{SemanticDrift}) \times \text{TrustworthyBoost} \times 100$. The construction pipeline uses multi-stage expert design, LLM auditing, and cross-validation to ensure reliability, while experiments across 13 models demonstrate DRAs generally outperform web-enhanced baselines but reveal persistent architectural and behavioral trade-offs. The framework offers a scalable, transferable approach to evaluating structured, long-form outputs and supports future capability assessment, architectural refinement, and paradigm advancement in DRA systems.
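As a rough illustration of how the IntegratedScore formula above combines its three components, here is a minimal Python sketch. The function name `integrated_score`, the argument names, and the value ranges (quality and drift in [0, 1], a non-negative multiplicative boost) are assumptions made for this example. The summary does not specify how each component is computed or bounded, so this only shows the final multiplication.

```python
def integrated_score(quality: float, semantic_drift: float, trustworthy_boost: float) -> float:
    """Combine the three components as Quality * (1 - SemanticDrift) * TrustworthyBoost * 100.

    Assumed ranges (not stated in the summary): quality and semantic_drift lie
    in [0, 1], and trustworthy_boost is a non-negative multiplier.
    """
    if not 0.0 <= quality <= 1.0:
        raise ValueError("quality is assumed to lie in [0, 1]")
    if not 0.0 <= semantic_drift <= 1.0:
        raise ValueError("semantic_drift is assumed to lie in [0, 1]")
    if trustworthy_boost < 0.0:
        raise ValueError("trustworthy_boost is assumed to be non-negative")
    # Drift acts as a penalty (1 - drift); the boost scales the result up or down.
    return quality * (1.0 - semantic_drift) * trustworthy_boost * 100.0


if __name__ == "__main__":
    # Hypothetical component values, chosen only to show the arithmetic.
    print(integrated_score(quality=0.5, semantic_drift=0.25, trustworthy_boost=1.0))
```

Because the components multiply, a report with high semantic quality still scores low if it drifts far from the topic, and a drift of 1 drives the score to zero regardless of the other two terms.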

Abstract

Artificial intelligence is undergoing a paradigm shift from closed language models to interconnected agent systems capable of external perception and information integration. As a representative embodiment, Deep Research Agents (DRAs) systematically exhibit capabilities for task decomposition, cross-source retrieval, multi-stage reasoning, and structured output, which markedly enhance performance on complex and open-ended tasks. However, existing benchmarks remain deficient in evaluation dimensions, response formatting, and scoring mechanisms, limiting their capacity to assess such systems effectively. This paper introduces a rigorous benchmark and a multidimensional evaluation framework tailored to DRAs and report-style responses. The benchmark comprises 214 expert-curated challenging queries distributed across 10 broad thematic domains, each accompanied by manually constructed reference bundles to support composite evaluation. The framework enables comprehensive evaluation of long-form reports generated by DRAs, incorporating integrated scoring metrics for semantic quality, topical focus, and retrieval trustworthiness. Extensive experimentation confirms the superior performance of mainstream DRAs over web-search-tool-augmented reasoning models, yet reveals considerable scope for further improvement. This study provides a robust foundation for capability assessment, architectural refinement, and paradigm advancement in DRA systems.

Paper Structure

This paper contains 34 sections, 8 equations, 11 figures, 5 tables, 1 algorithm.

Figures (11)

  • Figure 1: Distribution of benchmark entries.
  • Figure 2: Pipeline for benchmark construction and overview of the evaluation framework.
  • Figure 3: Taxonomy of domains.
  • Figure 4: Detailed criteria for General-Report Rubrics.
  • Figure 5: Example ID 040216 of benchmark entries.
  • ...and 6 more figures