DREAM: Deep Research Evaluation with Agentic Metrics

Elad Ben Avraham; Changhao Li; Ron Dorfman; Roy Ganz; Oren Nuriel; Amir Dudai; Aviad Aberdam; Noah Flynn; Elman Mansimov; Adi Kalyanpur; Ron Litman

TL;DR

DREAM (Deep Research Evaluation with Agentic Metrics) is a framework that instantiates the principle of capability parity by making evaluation itself agentic, offering a scalable, reference-free evaluation paradigm.

Abstract

Deep Research Agents generate analyst-grade reports, yet evaluating them remains challenging due to the absence of a single ground truth and the multidimensional nature of research quality. Recent benchmarks propose distinct methodologies, yet they suffer from the Mirage of Synthesis, where strong surface-level fluency and citation alignment can obscure underlying factual and reasoning defects. We characterize this gap by introducing a taxonomy across four verticals that exposes a critical capability mismatch: static evaluators inherently lack the tool-use capabilities required to assess temporal validity and factual correctness. To address this, we propose DREAM (Deep Research Evaluation with Agentic Metrics), a framework that instantiates the principle of capability parity by making evaluation itself agentic. DREAM structures assessment through an evaluation protocol combining query-agnostic metrics with adaptive metrics generated by a tool-calling agent, enabling temporally aware coverage, grounded verification, and systematic reasoning probes. Controlled evaluations demonstrate DREAM is significantly more sensitive to factual and temporal decay than existing benchmarks, offering a scalable, reference-free evaluation paradigm.
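As a rough illustration of the protocol execution step, the sketch below routes each metric to an evaluator type, following the grouping given in the Figure 3 caption (WQ and KIC to the LLM Evaluator, RQ to the Agent Evaluator, factuality, CI, and DA to the Workflow Evaluator). The names `Evaluator`, `ROUTING`, `route`, and `group_by_evaluator` are hypothetical and chosen for this example only; they are not taken from the paper's implementation.

```python
from enum import Enum


class Evaluator(Enum):
    LLM = "LLM Evaluator"
    AGENT = "Agent Evaluator"
    WORKFLOW = "Workflow Evaluator"


# Grouping follows the Figure 3 caption; the table itself is an illustrative
# assumption, not the authors' code.
ROUTING = {
    "WQ": Evaluator.LLM,               # writing quality
    "KIC": Evaluator.LLM,              # key-information coverage
    "RQ": Evaluator.AGENT,             # reasoning quality, checked with tools
    "Factuality": Evaluator.WORKFLOW,  # evidence retrieval
    "CI": Evaluator.WORKFLOW,          # citation integrity
    "DA": Evaluator.WORKFLOW,          # domain authoritativeness
}


def route(metric: str) -> Evaluator:
    """Return the evaluator type assigned to a metric abbreviation."""
    try:
        return ROUTING[metric]
    except KeyError:
        raise ValueError(f"unknown metric: {metric!r}") from None


def group_by_evaluator(metrics):
    """Group metric names by the evaluator that would handle them."""
    groups = {}
    for name in metrics:
        groups.setdefault(route(name), []).append(name)
    return groups
```

Calling `group_by_evaluator(["WQ", "RQ", "CI", "KIC"])` would batch the metrics per evaluator, so that each evaluator could be invoked once per report. Whether DREAM batches metrics this way is not stated in the text above.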

Paper Structure (58 sections, 5 equations, 10 figures, 12 tables)

Introduction
Deep Research Evaluation Landscape
A Unifying Taxonomy
Diagnosing the Evaluation Landscape
Human-Defined Evaluation Criteria.
Closed-Loop LLM-Based Evaluation.
Citation-Alignment Workflows.
Systematic Imbalance and Evaluator Capability Mismatch.
DREAM: DRE with Agentic Metrics
Phase 1: Protocol Creation
Static Metrics.
Adaptive Metrics.
Phase 2: Protocol Execution
LLM Evaluator.
Agent Evaluator.
...and 43 more sections

Figures (10)

Figure 1: Capturing Overlooked Dimensions of Research Quality. DREAM actively verifies the reasoning of generated reports by probing external sources (left), detects factual errors injected in a controlled experiment (middle), and captures time-sensitive validity gaps by penalizing outdated reports (right).
Figure 2: DREAM Overview. Our framework operates in two phases. Left: Protocol Creation, where query-independent Static Metrics are combined with Adaptive Metrics constructed by an agent equipped with web search tools and optional tools to access external data. Right: Protocol Execution, where each metric is routed to the appropriate evaluator: an LLM, an agent with tool access, or a workflow.
Figure 3: DREAM Protocol Execution Evaluators. (a) LLM Evaluator assesses writing quality (WQ) and key-information coverage (KIC); (b) Agent Evaluator evaluates reasoning quality (RQ) using external tools; (c) Workflow Evaluator performs factuality assessment via evidence retrieval, citation integrity (CI) verification through claim-source validation, and domain authoritativeness (DA) scoring via credibility assessment of extracted citations.
Figure 4: Temporal Awareness in KIC Evaluation. Comparison of evaluation criteria for a TikTok legal status query, showing DeepResearch Bench's static criteria (left) versus DREAM's KIC criteria (right) that incorporate time-sensitive facts (e.g., mid-December 2025 joint venture deal and January 23, 2026 deadline).
Figure 5: Reasoning flaw detection. Relative score degradations between well-reasoned and malformed reports. DREAM--RQ centers around $40.1\%$ degradation, while RACE centers around $9.1\%$, with several malformed reports outscoring well-reasoned ones.
...and 5 more figures
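The "relative score degradation" reported in the Figure 5 caption is not defined in the text above. A minimal sketch, assuming it is the percentage drop from the well-reasoned report's score to the malformed report's score, could look as follows. The function name `relative_degradation` and this exact formula are assumptions made for illustration.

```python
def relative_degradation(score_good: float, score_malformed: float) -> float:
    """Percentage drop from the well-reasoned score to the malformed score.

    Assumed definition: 100 * (good - malformed) / good.
    A negative result means the malformed report outscored the
    well-reasoned one.
    """
    if score_good <= 0:
        raise ValueError("score_good must be positive")
    return 100.0 * (score_good - score_malformed) / score_good
```

Under this assumed definition, a metric that is sensitive to reasoning flaws yields large positive values, while negative values correspond to the cases the caption describes as malformed reports outscoring well-reasoned ones.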
