DEER: A Comprehensive and Reliable Benchmark for Deep-Research Expert Reports

Janghoon Han, Heegyu Kim, Changho Lee, Dahm Lee, Min Hyung Park, Hosung Song, Stanley Jungkyu Choi, Moontae Lee, Honglak Lee

TL;DR

DEER tackles the challenge of evaluating expert-level deep research reports by introducing a comprehensive benchmark with a hierarchical Deep Research Report Evaluation Taxonomy, 130 fixed rubric items, and task-specific expert guidance, paired with a document-wide fact-verification module. It combines holistic rubric-based scoring with claim-level verification to assess both report quality and the quality of external evidence, addressing two limitations of prior work: LLM-based judges that miss issues requiring expert judgment, and source checks that cover only a narrow subset of cited statements. Empirical results show that current systems perform strongly on formatting and ethics but fall short on evidence validity and information sufficiency, while the verification module improves reliability and diagnostic power. By demonstrating close alignment with human judgments and offering interpretable diagnostics, DEER provides a scalable framework for advancing autonomous deep-research agents toward trustworthy, expert-level reporting.

Abstract

As large language models (LLMs) advance, deep research systems can generate expert-level reports via multi-step reasoning and evidence-based synthesis, but evaluating such reports remains challenging. Existing benchmarks often lack systematic criteria for expert reporting; evaluations that rely heavily on LLM judges can fail to capture issues that require expert judgment; and source verification typically covers only a limited subset of explicitly cited statements rather than report-wide factual reliability. We introduce DEER, a benchmark for evaluating expert-level deep research reports. DEER comprises 50 report-writing tasks spanning 13 domains and an expert-grounded evaluation taxonomy (7 dimensions, 25 sub-dimensions) operationalized into 130 fine-grained rubric items. DEER further provides task-specific expert guidance to help LLM judges assess expert-level report quality more consistently. Complementing rubric-based assessment, we propose a document-level fact-checking architecture that extracts and verifies all claims across the entire report, including both cited and uncited ones, and quantifies external-evidence quality. DEER correlates closely with human expert judgments and yields interpretable diagnostics of system strengths and weaknesses.

Paper Structure

This paper contains 82 sections, 8 equations, 4 figures, 16 tables.

Figures (4)

  • Figure 1: Deep Research System Performance Comparison. Shows performance by type for five major models on the proposed benchmark.
  • Figure 2: Overview of the DEER evaluation framework. (a) Research question and expert guidance generation from real-world deep research queries. (b) Construction of the Deep Research Evaluation Taxonomy consisting of 7 dimensions, 26 criteria, and 130 granular rubrics. (c) The DEER evaluation pipeline, integrating expert-guided LLM-as-a-judge scoring with claim extraction and information verification to assess deep research reports.
  • Figure 3: Heatmap visualizations of expert report evaluation results. (a) Criteria-wise scores across detailed evaluation categories. (b) Domain-wise scores averaged over two representative tasks randomly sampled from each domain.
  • Figure 4: Topic domains extracted from real-world Deep Research service logs.
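
The two evaluation signals described above, hierarchical rubric scoring and report-wide claim verification, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `RubricScore`, `aggregate`, and `supported_ratio`, the numeric score scale, the verdict labels, the placeholder dimension and criterion names, and the choice of an unweighted mean at each level are all assumptions made for illustration. The source does not state how DEER aggregates rubric items or summarizes verification results.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RubricScore:
    """One judged rubric item (hypothetical structure, not from the paper)."""
    dimension: str   # top level of the taxonomy
    criterion: str   # sub-dimension under that dimension
    rubric: str      # fine-grained rubric item
    score: float     # judge score on an assumed numeric scale


def aggregate(scores):
    """Average rubric scores per criterion, then criterion means per dimension.

    Unweighted means are an assumption; the benchmark may weight items differently.
    """
    by_criterion = defaultdict(list)
    for s in scores:
        by_criterion[(s.dimension, s.criterion)].append(s.score)
    criterion_means = {key: mean(vals) for key, vals in by_criterion.items()}

    by_dimension = defaultdict(list)
    for (dimension, _criterion), value in criterion_means.items():
        by_dimension[dimension].append(value)
    dimension_means = {dim: mean(vals) for dim, vals in by_dimension.items()}
    return criterion_means, dimension_means


def supported_ratio(verdicts):
    """Fraction of extracted claims labeled 'supported' (assumed label set).

    Covers every claim in the report, cited or not; returns 0.0 for no claims.
    """
    if not verdicts:
        return 0.0
    return sum(v == "supported" for v in verdicts) / len(verdicts)


if __name__ == "__main__":
    # Placeholder names and scores, invented purely for the example.
    scores = [
        RubricScore("Dimension A", "Criterion A1", "rubric 1", 8.0),
        RubricScore("Dimension A", "Criterion A1", "rubric 2", 6.0),
        RubricScore("Dimension A", "Criterion A2", "rubric 3", 4.0),
        RubricScore("Dimension B", "Criterion B1", "rubric 4", 9.0),
    ]
    criterion_means, dimension_means = aggregate(scores)
    print(criterion_means)
    print(dimension_means)
    print(supported_ratio(["supported", "unsupported", "supported", "supported"]))
```

Averaging criteria before dimensions keeps a criterion with many rubric items from dominating its dimension score. A flat mean over all rubric items would be an equally plausible design with different trade-offs.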