Nonstandard Errors in AI Agents

Ruijiang Gao, Steven Chong Xiao

Abstract

We study whether state-of-the-art AI coding agents, given the same data and research question, produce the same empirical results. Deploying 150 autonomous Claude Code agents to independently test six hypotheses about market quality trends in NYSE TAQ data for SPY (2015–2024), we find that AI agents exhibit sizable nonstandard errors (NSEs), that is, uncertainty from agent-to-agent variation in analytical choices, analogous to those documented among human researchers. AI agents diverge substantially on measure choice (e.g., autocorrelation vs. variance ratio, dollar vs. share volume). Different model families (Sonnet 4.6 vs. Opus 4.6) exhibit stable "empirical styles," reflecting systematic differences in methodological preferences. In a three-stage feedback protocol, AI peer review (written critiques) has minimal effect on dispersion, whereas exposure to top-rated exemplar papers reduces the interquartile range of estimates by 80–99% within converging measure families. Convergence occurs both through within-family estimation tightening and through agents switching measure families entirely, but convergence reflects imitation rather than understanding. These findings have implications for the growing use of AI in automated policy evaluation and empirical research.

Paper Structure (50 sections, 4 figures, 11 tables)

Figures (4)

  • Figure 1: Experimental design overview. 150 AI agents independently analyze the same NYSE TAQ data (Stage 1), receive AI peer review (Stage 2), and see the five highest-rated exemplar papers (Stage 3). At each stage, agents face a "garden of forking paths" (Steegen et al., 2016): measure choice, functional form, and other decision forks. Peer review causes undirected movement (IQR unchanged); exemplar exposure causes correlated imitation (IQR collapses within measure families).
  • Figure 2: Effect size distributions (%/yr) by stage. Boxes show IQR (Q25–Q75) with IQR values annotated; whiskers extend to Q10 and Q90; horizontal line is the median. H2 and H5 have small IQR in S1, driven almost entirely by the log-vs-level specification fork (within either specification, agents agree to within 0.01%/yr). H4 is bimodal in S1 (dollar vs. share volume). H5 collapses in S3 as 96% of agents adopt year dummies. H6 converges in S3 as 99% of agents adopt trade-level price impact.
  • Figure 3: IQR by stage for each hypothesis. Blue = S1, orange = S2, green = S3. The transition from S1 to S2 (peer review) produces minimal change. The transition from S2 to S3 (top papers) drives convergence for H2, H3, and H6, and divergence for H1 and H5.
  • Figure 4: Methodology choice convergence across stages. Each stacked bar shows the fraction of agents using each method. H5 year dummies: adoption rises from 1% (S1) to 96% (S3). H6 impact measure: price impact rises from 56% to 99%. H1 measure: autocorrelation drops from 58% to 17% as agents switch to variance ratio after seeing top papers.
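
The dispersion statistics used in Figures 2 and 3 (median, IQR as Q75 minus Q25, and Q10/Q90 whiskers) and the percentage reduction in IQR between stages can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `dispersion_summary` and `iqr_reduction_pct` are hypothetical, and the input arrays are synthetic draws that do not reproduce any result from the paper.

```python
import numpy as np


def dispersion_summary(estimates):
    """Return the median, IQR (Q75 - Q25), and Q10/Q90 of a set of estimates."""
    q10, q25, q50, q75, q90 = np.percentile(estimates, [10, 25, 50, 75, 90])
    return {"median": q50, "iqr": q75 - q25, "q10": q10, "q90": q90}


def iqr_reduction_pct(before, after):
    """Percentage reduction in IQR when moving from one stage to another."""
    iqr_before = dispersion_summary(before)["iqr"]
    iqr_after = dispersion_summary(after)["iqr"]
    return 100.0 * (1.0 - iqr_after / iqr_before)


# Synthetic, illustrative effect sizes in %/yr (assumed values, not paper data).
rng = np.random.default_rng(0)
stage1 = rng.normal(loc=-2.0, scale=1.0, size=150)   # wide agent-to-agent spread
stage3 = rng.normal(loc=-2.0, scale=0.1, size=150)   # tight spread after exemplars

print(dispersion_summary(stage1))
print(f"IQR reduction S1 -> S3: {iqr_reduction_pct(stage1, stage3):.1f}%")
```

Because the IQR depends only on the middle half of the estimates, it is insensitive to a few outlying agents, which makes it a natural dispersion measure for comparing stages.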