Evaluating Stochasticity in Deep Research Agents

Haotian Zhai, Elias Stengel-Eskin, Pratik Patil, Liu Leqi

TL;DR

This paper formalizes the study of stochasticity in DRAs by modeling them as information acquisition Markov Decision Processes, and introduces an evaluation framework that quantifies variance in the system and identifies three sources of it: information acquisition, information compression, and inference.

Abstract

Deep Research Agents (DRAs) are promising agentic systems that gather and synthesize information to support research across domains such as financial decision-making, medical analysis, and scientific discovery. Despite recent improvements in research quality (e.g., outcome accuracy when ground truth is available), DRA system design often overlooks a critical barrier to real-world deployment: stochasticity. Under identical queries, repeated executions of DRAs can exhibit substantial variability in terms of research outcome, findings, and citations. In this paper, we formalize the study of stochasticity in DRAs by modeling them as information acquisition Markov Decision Processes. We introduce an evaluation framework that quantifies variance in the system and identify three sources of it: information acquisition, information compression, and inference. Through controlled experiments, we investigate how stochasticity from these modules across different decision steps influences the variance of DRA outputs. Our results show that reducing stochasticity can improve research output quality, with inference and early-stage stochasticity contributing the most to DRA output variance. Based on these findings, we propose strategies for mitigating stochasticity while maintaining output quality via structured output and ensemble-based query generation. Our experiments on DeepSearchQA show that our proposed mitigation methods reduce average stochasticity by 22% while maintaining high research quality.

Paper Structure (36 sections, 3 theorems, 28 equations, 2 figures, 5 tables)

Key Result

Proposition 4.1

Let $\mathbf{X}$ be a random vector in $\mathbb{R}^d$ with finite mean $\boldsymbol{\mu} = \mathbb{E}[\mathbf{X}]$ and covariance matrix $\boldsymbol{\Sigma} = \operatorname{Var}(\mathbf{X})$. Let $\mathbf{X}_1$ and $\mathbf{X}_2$ be independent and identically distributed (i.i.d.) c

Figures (2)

  • Figure 1: Overview of the evaluation pipeline. The process begins with a user question, which triggers multiple independent Deep Research Agent (DRA) runs (this example uses $k = 2$ runs). The resulting reports are decomposed into answers, findings, and citations, which are then clustered. After clustering, each report's answers, findings, and citations are mapped to binary vectors and normalized to compute the Total Variance (TV) as a measure of stochasticity. The figure expands findings as an example to illustrate the whole process; answers and citations are processed in the same way, as described in the paper's metrics section.
  • Figure 2: Comprehensive Analysis of Stochasticity Behavior. (a) Early-stage injections dominate propagation. (b) Strong positive correlations across answer, finding, and citation TV. (c) Higher sampling temperature increases total variance. (d) The Update module contributes the largest variance.
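
The last step of the pipeline in Figure 1 (binary vectors, normalization, Total Variance) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: TV is taken to be the trace of the unbiased sample covariance across runs, normalization is taken to be L2 scaling to unit length, and the function names `l2_normalize`, `total_variance`, and `mean_pairwise_sq_dist` are hypothetical. Under these assumptions, the sketch also checks the standard identity that the mean squared distance between pairs of runs equals twice the total variance.

```python
from itertools import combinations
from math import sqrt


def l2_normalize(v):
    """Scale a vector to unit length (assumed normalization, not the paper's)."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def total_variance(vectors):
    """Trace of the unbiased sample covariance of k >= 2 run vectors."""
    k = len(vectors)
    d = len(vectors[0])
    mean = [sum(v[j] for v in vectors) / k for j in range(d)]
    squared_dev = sum((v[j] - mean[j]) ** 2 for v in vectors for j in range(d))
    return squared_dev / (k - 1)


def mean_pairwise_sq_dist(vectors):
    """Average squared Euclidean distance over all unordered pairs of runs."""
    pairs = list(combinations(vectors, 2))
    total = sum(sum((a - b) ** 2 for a, b in zip(u, v)) for u, v in pairs)
    return total / len(pairs)


# Hypothetical example: k = 2 runs, 3 finding clusters.
# Entry j is 1 if the run's report contains a finding from cluster j.
run_a = [1, 1, 0]
run_b = [1, 0, 1]

tv_raw = total_variance([run_a, run_b])
tv_norm = total_variance([l2_normalize(run_a), l2_normalize(run_b)])
tv_same = total_variance([run_a, run_a])  # identical runs: no stochasticity
```

With these inputs the two runs agree on one cluster and disagree on two, so the raw TV is positive, while two identical runs give a TV of zero. Using only the standard library keeps the sketch dependency-free; a vectorized version would be preferable for many runs or large cluster sets.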

Theorems & Definitions (5)

  • Proposition 4.1
  • Corollary 4.2: Total variance
  • proof
  • Proposition A.1: TV Decomposition
  • proof