TRACE: Trajectory-Aware Comprehensive Evaluation for Deep Research Agents

Yanyu Chen, Jiyue Jiang, Jiahong Liu, Yifei Zhang, Xiao Guo, Irwin King

TL;DR

Experiments show TRACE delivers a granular ranking that uncovers critical trade-offs between agent accuracy, efficiency, and robustness entirely missed by singular metrics.

Abstract

The evaluation of Deep Research Agents is a critical challenge, as conventional outcome-based metrics fail to capture the nuances of their complex reasoning. Current evaluation faces two primary challenges: 1) a reliance on singular metrics like Pass@1, creating a "high-score illusion" that ignores the quality, efficiency, and soundness of the reasoning process; and 2) the failure of static benchmarks to quantify crucial attributes like robustness and latent capability. To address these gaps, we introduce TRACE (Trajectory-Aware Comprehensive Evaluation), a framework that holistically assesses the entire problem-solving trajectory. To counter the "high-score illusion", we propose a Hierarchical Trajectory Utility Function that quantifies process efficiency and cognitive quality, including evidence grounding, alongside accuracy. To measure deeper attributes, TRACE introduces a Scaffolded Capability Assessment protocol, quantifying an agent's latent ability by determining the minimum guidance needed for success. Our contributions include the TRACE framework, its novel metrics, and the accompanying DeepResearch-Bench with controllable complexity. Experiments show TRACE delivers a granular ranking that uncovers critical trade-offs between agent accuracy, efficiency, and robustness entirely missed by singular metrics.
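
The idea behind the Scaffolded Capability Assessment, finding the minimum guidance an agent needs to succeed, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the integer guidance levels, the `min_guidance_level` helper, and the `toy_agent` stand-in are hypothetical and are not the paper's protocol or API.

```python
from typing import Callable, Optional, Sequence


def min_guidance_level(
    solves: Callable[[int], bool],
    levels: Sequence[int],
) -> Optional[int]:
    """Return the smallest guidance level at which the agent succeeds.

    `solves(level)` is a hypothetical callback that runs the agent on one task
    with `level` units of guidance and reports whether it succeeded. Levels are
    tried in ascending order; None means no level led to success.
    """
    for level in sorted(levels):
        if solves(level):
            return level
    return None


def toy_agent(level: int) -> bool:
    # Hypothetical stand-in: this agent needs at least two hints to succeed.
    return level >= 2


if __name__ == "__main__":
    print(min_guidance_level(toy_agent, levels=[0, 1, 2, 3]))
```

Averaging this value over many tasks would be one plausible way to obtain an aggregate such as the $\bar{\lambda}_{\text{min}}$ mentioned in Figure 2, though the paper's exact definition is not reproduced here.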

Paper Structure

This paper contains 41 sections, 18 equations, 4 figures, 7 tables, 1 algorithm.

Figures (4)

  • Figure 1: Traditional vs. TRACE Evaluation. Traditional methods (left) ignore flawed processes and can create an "Illusion of Competence" even when the trajectory earns only a low utility score ($U(\mathcal{H})=0.35$). TRACE (right) evaluates the entire trajectory, rewarding systematic planning and evidence grounding to reveal "True Competence" with a high utility score ($U(\mathcal{H})=0.88$).
  • Figure 2: An overview of the TRACE benchmark creation and evaluation pipeline. The process begins with generating a high-quality source corpus from expert-verified academic seminars (a), from which core research concepts like hypotheses and limitations are extracted (b). These concepts are then synthesized by a "TaskWeaver" agent into our High-Quality DeepResearch Bench (c). This benchmark is purpose-built for Evaluating Multiple Agents, specifically designed to expose the "High-Score Illusion" by comparing simple success rates with our holistic utility $U(\mathcal{H})$, and to Measure Latent Attributes such as an agent's latent capability ($\bar{\lambda}_{\text{min}}$) and strategic profile (TRS). Finally, the evaluation stage (d) applies our TRACE framework, utilizing a novel dual-pathway assessment: one path verifies claims against cited evidence for support, conflict, or omission, while separate Judge LLMs use adaptive criteria to score overall quality. These assessments are integrated to compute the final, comprehensive suite of TRACE metrics, including the overall Trajectory Utility ($U(\mathcal{H})$) and its components: Efficiency ($\mathcal{E}$), Cognitive Quality ($\mathcal{C}$), Evidence Grounding ($\mathcal{G}_E$), and Reasoning Robustness ($\mathcal{R}_R$).
  • Figure 3: Comparison of different methods on various evaluation benchmarks (BrowseComp-en, GAIA, and TRACE-Core) using Qwen-30B-Base.
  • Figure 4: Re-ranking of open-source agents based on different metrics highlights the need for a holistic evaluation.
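
The kind of re-ranking that Figure 4 describes can be sketched as follows. This is only an illustration under stated assumptions: the agent names, the metric keys `pass_at_1` and `utility`, and every score in `TOY_SCORES` are made-up toy values, not results from the paper, and the helpers `rank_by` and `rank_shift` are hypothetical.

```python
from typing import Dict, List

Scores = Dict[str, Dict[str, float]]


def rank_by(scores: Scores, metric: str) -> List[str]:
    """Return agent names sorted best to worst on `metric` (higher is better)."""
    return sorted(scores, key=lambda name: scores[name][metric], reverse=True)


def rank_shift(scores: Scores, base: str, other: str) -> Dict[str, int]:
    """Positions each agent gains (positive) or loses (negative) when the
    ranking metric changes from `base` to `other`."""
    base_rank = {name: i for i, name in enumerate(rank_by(scores, base))}
    other_rank = {name: i for i, name in enumerate(rank_by(scores, other))}
    return {name: base_rank[name] - other_rank[name] for name in scores}


# Toy values for illustration only; they are NOT taken from the paper.
TOY_SCORES: Scores = {
    "agent_a": {"pass_at_1": 0.60, "utility": 0.40},
    "agent_b": {"pass_at_1": 0.55, "utility": 0.70},
    "agent_c": {"pass_at_1": 0.50, "utility": 0.55},
}

if __name__ == "__main__":
    print(rank_by(TOY_SCORES, "pass_at_1"))
    print(rank_by(TOY_SCORES, "utility"))
    print(rank_shift(TOY_SCORES, "pass_at_1", "utility"))
```

In this toy setting the agent that leads on the outcome-only metric falls to last place once a trajectory-level score is used, which is the sort of reversal a single metric would hide.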