DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks

Haiyuan Wan, Chen Yang, Junchi Yu, Meiqi Tu, Jiaxuan Lu, Di Yu, Jianbao Cao, Ben Gao, Jiaqing Xie, Aoran Wang, Wenlong Zhang, Philip Torr, Dongzhan Zhou

TL;DR

DeepResearch Arena is a seminar-grounded benchmark for evaluating deep research agents in authentic, ill-structured settings. The authors present MAHTG, a multi-agent system that automatically extracts research-worthy inspirations from seminar transcripts and turns them into 10k+ open-ended tasks across 12 disciplines, paired with a hybrid evaluation framework that combines Keypoint-Aligned Evaluation (KAE) and Adaptively-generated Checklist Evaluation (ACE). They curate a large, multidisciplinary seminar corpus and evaluate a diverse set of models, including leakage-detection experiments, revealing substantial performance gaps and differing strengths across models. The benchmark aims to close the gap between laboratory benchmarks and real-world research inquiry, offering a scalable, reproducible, and cognitively faithful platform for advancing autonomous research assistants.

Abstract

Deep research agents have attracted growing attention for their potential to orchestrate multi-stage research workflows, spanning literature synthesis, methodological design, and empirical verification. Despite these strides, evaluating their research capability faithfully is rather challenging due to the difficulty of collecting frontier research questions that genuinely capture researchers' attention and intellectual curiosity. To address this gap, we introduce DeepResearch Arena, a benchmark grounded in academic seminars that capture rich expert discourse and interaction, better reflecting real-world research environments and reducing the risk of data leakage. To automatically construct DeepResearch Arena, we propose a Multi-Agent Hierarchical Task Generation (MAHTG) system that extracts research-worthy inspirations from seminar transcripts. The MAHTG system further translates research-worthy inspirations into high-quality research tasks, ensuring the traceability of research task formulation while filtering noise. With the MAHTG system, we curate DeepResearch Arena with over 10,000 high-quality research tasks from over 200 academic seminars, spanning 12 disciplines, such as literature, history, and science. Our extensive evaluation shows that DeepResearch Arena presents substantial challenges for current state-of-the-art agents, with clear performance gaps observed across different models.
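The traceability idea in the abstract (seminar transcript, then inspiration, then task, with noise filtered along the way) can be illustrated with a minimal Python sketch. This is not the authors' MAHTG implementation: the class names, the `score` field, the `min_score` threshold, and the helper functions are all hypothetical, chosen only to show how a back-pointer from each task to its source inspiration keeps task formulation traceable. The category and task-type strings are taken from the Figure 1 caption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inspiration:
    inspiration_id: str
    seminar_id: str   # seminar transcript the inspiration was extracted from
    category: str     # e.g. "Methodology" (category name from the Figure 1 caption)
    text: str
    score: float      # hypothetical research-worthiness score in [0, 1]


@dataclass(frozen=True)
class ResearchTask:
    task_id: str
    inspiration_id: str  # back-pointer that keeps the task traceable
    task_type: str       # e.g. "Hypothesis Generation"
    prompt: str


def filter_inspirations(inspirations, min_score=0.5):
    """Drop low-scoring inspirations (a stand-in for the noise filtering step)."""
    return [i for i in inspirations if i.score >= min_score]


def to_tasks(inspirations, task_type):
    """Turn each kept inspiration into one task that records its origin."""
    return [
        ResearchTask(
            task_id=f"task-{n}",
            inspiration_id=insp.inspiration_id,
            task_type=task_type,
            prompt=f"[{task_type}] {insp.text}",
        )
        for n, insp in enumerate(inspirations)
    ]


def trace(task, inspirations):
    """Follow the back-pointer from a task to the seminar it came from."""
    by_id = {i.inspiration_id: i for i in inspirations}
    return by_id[task.inspiration_id].seminar_id


if __name__ == "__main__":
    # Placeholder records for illustration only; not data from the benchmark.
    raw = [
        Inspiration("insp-0", "seminar-A", "Methodology", "placeholder idea 1", 0.9),
        Inspiration("insp-1", "seminar-A", "Transdisciplinarity", "placeholder idea 2", 0.2),
    ]
    kept = filter_inspirations(raw)
    for task in to_tasks(kept, "Hypothesis Generation"):
        print(task.task_id, "<-", task.inspiration_id, "<-", trace(task, kept))
```

Storing only an identifier on each task, rather than copying seminar text into it, is one simple way to keep the link auditable; the actual system may represent this differently.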

Paper Structure

This paper contains 36 sections, 15 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Overview of seminar domains and task structures in MAHTG. Left: Distribution of academic seminars across diverse domains such as Science & Technology, Health, Finance, and others. The outer arc further decomposes each domain into representative research tasks. For instance, Science & Technology includes tasks such as Hypothesis Generation, Empirical Test, Prototype Specification, and Trend Scan. Right: Illustration of MAHTG's multi-agent pipeline, where seminar content is transformed into structured research tasks via intermediate inspirations (e.g., Methodology, Transdisciplinarity). Example outputs are shown for both stages.
  • Figure 2: Overview of our benchmark construction pipeline, including four stages: (a) Data generation from transcribed seminar videos, (b) extraction of research inspirations, (c) multi-phase task design, and (d) evaluation using both KAE and ACE metrics.
  • Figure 3: Comparison of current mainstream models on the DeepResearch Arena benchmark. (a) Performance across 12 research disciplines (e.g., Science & Technology, Art, Finance). (b) Performance across 10 research task types (e.g., Hypothesis Generation, Method Blueprint, Evaluation Metric Design), highlighting task-specific capabilities.
  • Figure 4: Comparison of DeepResearch agents in terms of Keypoint-Aligned Evaluation (KAE) metrics and efficiency.