Evaluating Stochasticity in Deep Research Agents

Haotian Zhai; Elias Stengel-Eskin; Pratik Patil; Liu Leqi

TL;DR

This paper formalizes the study of stochasticity in Deep Research Agents (DRAs) by modeling them as information acquisition Markov Decision Processes. It introduces an evaluation framework that quantifies variance in the system and identifies three sources of that variance: information acquisition, information compression, and inference.

Abstract

Deep Research Agents (DRAs) are promising agentic systems that gather and synthesize information to support research across domains such as financial decision-making, medical analysis, and scientific discovery. Despite recent improvements in research quality (e.g., outcome accuracy when ground truth is available), DRA system design often overlooks a critical barrier to real-world deployment: stochasticity. Under identical queries, repeated executions of DRAs can exhibit substantial variability in terms of research outcome, findings, and citations. In this paper, we formalize the study of stochasticity in DRAs by modeling them as information acquisition Markov Decision Processes. We introduce an evaluation framework that quantifies variance in the system and identify three sources of it: information acquisition, information compression, and inference. Through controlled experiments, we investigate how stochasticity from these modules across different decision steps influences the variance of DRA outputs. Our results show that reducing stochasticity can improve research output quality, with inference and early-stage stochasticity contributing the most to DRA output variance. Based on these findings, we propose strategies for mitigating stochasticity while maintaining output quality via structured output and ensemble-based query generation. Our experiments on DeepSearchQA show that our proposed mitigation methods reduce average stochasticity by 22% while maintaining high research quality.

Paper Structure

This paper contains 36 sections, 3 theorems, 28 equations, 2 figures, and 5 tables.

Introduction
Related Work
Modeling Deep Research Agents
Evaluation Framework for DRA Stochasticity
Total Variance as a Measure of Stochasticity
Constructing Metrics from Agent Outputs
Analysis of Stochasticity via Variance Decomposition
Decomposing Stochasticity in Deep Research Agents
Empirical Investigation via Temperature Ablation
Finding 1: Early-stage stochasticity influences final-stage stochasticity more than late-stage stochasticity.
Finding 2: Findings, citations and answer stochasticity are positively correlated.
Finding 3: Variance magnitude increases monotonically with temperature.
Finding 4: Higher stochasticity does not imply higher accuracy.
Finding 5: Findings are more stochastic than citations.
Finding 6: The inference module has a greater impact on the final stochasticity than the information acquisition and compression modules.
...and 21 more sections

Key Result

Proposition 4.1

Let $\mathbf{X}$ be a random vector in $\mathbb{R}^d$ with finite mean $\boldsymbol{\mu} = \mathbb{E}[\mathbf{X}]$ and covariance matrix $\boldsymbol{\Sigma} = \operatorname{Var}(\mathbf{X})$. Let $\mathbf{X}_1$ and $\mathbf{X}_2$ be independent and identically distributed (i.i.d.) c

Figures (2)

Figure 1: Overview of the evaluation pipeline. The process begins with a user question, which triggers multiple independent Deep Research Agent (DRA) runs (this example uses $k = 2$ runs). The resulting reports are decomposed into answers, findings, and citations, which are then clustered. After clustering, each report's answers, findings, and citations are mapped to binary vectors and normalized to compute the Total Variance (TV) as a measure of stochasticity. Findings are expanded as an example to illustrate the whole process; in practice, answers and citations are processed in the same way, as described in the metrics section.
Figure 2: Comprehensive Analysis of Stochasticity Behavior. (a) Early-stage injections dominate propagation. (b) Strong positive correlations across answer, finding, and citation TV. (c) Higher sampling temperature increases total variance. (d) The Update module contributes the largest variance.
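
The last stage of the pipeline in Figure 1 (cluster memberships to binary vectors, normalization, then Total Variance) can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that each run is already reduced to a set of cluster ids, that normalization means scaling each vector to unit L2 norm, and that total variance is the trace of the sample covariance across runs. The function names are hypothetical. The final helper checks a standard identity: the trace of the unbiased sample covariance equals the mean of half the squared distance over all pairs of runs.

```python
import numpy as np


def to_binary_matrix(runs, n_clusters):
    """Map each run's set of cluster ids to a binary indicator row."""
    mat = np.zeros((len(runs), n_clusters), dtype=float)
    for i, cluster_ids in enumerate(runs):
        for c in cluster_ids:
            mat[i, c] = 1.0
    return mat


def l2_normalize(mat):
    """Scale each nonzero row to unit L2 norm (an assumed normalization)."""
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    safe = np.where(norms == 0.0, 1.0, norms)  # leave all-zero rows as zeros
    return mat / safe


def total_variance(mat):
    """Trace of the unbiased sample covariance across runs (rows)."""
    k = mat.shape[0]
    if k < 2:
        raise ValueError("need at least two runs to estimate variance")
    centered = mat - mat.mean(axis=0, keepdims=True)
    return float((centered ** 2).sum() / (k - 1))


def mean_pairwise_half_sq_dist(mat):
    """Mean over run pairs of 0.5 * squared Euclidean distance."""
    k = mat.shape[0]
    total = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            total += 0.5 * float(np.sum((mat[i] - mat[j]) ** 2))
    return total / (k * (k - 1) / 2)


# Hypothetical example: three runs, findings grouped into four clusters.
runs = [{0, 1, 2}, {0, 1, 3}, {0, 1, 2}]
vectors = l2_normalize(to_binary_matrix(runs, n_clusters=4))
tv = total_variance(vectors)
pairwise = mean_pairwise_half_sq_dist(vectors)
print(tv, pairwise)
```

Runs that land in identical clusters give a total variance of zero, and the value grows as runs disagree on which clusters they cover. The same computation would apply to answers and citations once they are clustered.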

Theorems & Definitions (5)

Proposition 4.1
Corollary 4.2: Total variance
proof
Proposition A.1: TV Decomposition
proof
