On Randomness in Agentic Evals

Bjarni Haukur Bjarnason, André Silva, Martin Monperrus

TL;DR

The paper investigates the reliability of agentic evaluations by quantifying randomness across $60{,}000$ trajectories from three models and two scaffolds on SWE-Bench-Verified. It shows substantial run-to-run variance in single-run $pass@1$ scores, even at $T=0$: trajectories diverge early in the token stream, and the divergence cascades into different solution strategies. By analyzing $pass@1$, $pass@k$, and $pass^k$, the authors reveal gaps of up to $24.9$ percentage points between the optimistic bound ($pass@k$) and the pessimistic bound ($pass^k$), underscoring the influence of stochastic exploration on observed performance. They propose three concrete practices to distinguish genuine progress from noise: multiple independent runs per task, statistical power analysis to plan the number of runs, and reporting $pass@k$ and $pass^k$ with $k>1$. Overall, the work argues for more robust evaluation protocols to ensure that reported advances reflect true algorithmic improvements rather than evaluation variance.

Abstract

Agentic systems are evaluated on benchmarks where agents interact with environments to solve tasks. Most papers report a pass@1 score computed from a single run per task, assuming this gives a reliable performance estimate. We test this assumption by collecting 60,000 agentic trajectories on SWE-Bench-Verified, spanning three models and two scaffolds. We find substantial variance: single-run pass@1 estimates vary by 2.2 to 6.0 percentage points depending on which run is selected, with standard deviations exceeding 1.5 percentage points even at temperature 0. This variance has critical implications: reported improvements of 2 to 3 percentage points may reflect evaluation noise rather than genuine algorithmic progress. Through token-level analysis, we show that trajectories diverge early, often within the first few percent of tokens, and that these small differences cascade into different solution strategies. To enable reliable evaluation of agentic systems, we recommend three concrete practices: (1) estimate pass@1 from multiple independent runs per task, especially when measuring small improvements, (2) use statistical power analysis to determine the number of runs needed to detect expected effect sizes, and (3) consider metrics like pass@k (optimistic bound) and pass^k (pessimistic bound) with k>1 to better characterize the full performance envelope. While these practices increase evaluation cost, they are essential for distinguishing genuine scientific progress from statistical noise.
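
To make the two bounds concrete, here is a minimal sketch of how pass@k and pass^k could be estimated for one task from n recorded runs, of which c succeeded. It assumes the standard combinatorial estimators (sampling k runs without replacement); the abstract does not state which estimator the authors use, and the function names `pass_at_k` and `pass_hat_k` are hypothetical.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n = recorded runs for a task, c = successful runs, k = runs drawn.
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Optimistic bound: chance that at least one of k drawn runs succeeds."""
    _check(n, c, k)
    # comb(n - c, k) is 0 when fewer than k failing runs exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Pessimistic bound (pass^k): chance that all k drawn runs succeed."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Illustrative numbers only, not data from the paper.
    n, c = 10, 5
    for k in (1, 2, 5):
        print(k, round(pass_at_k(n, c, k), 4), round(pass_hat_k(n, c, k), 4))
```

At k=1 both quantities reduce to the plain success rate c/n. As k grows, pass@k rises and pass^k falls, and the distance between them reflects how much the outcome on a task depends on which run is drawn. A benchmark-level score would average these per-task values.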

Paper Structure (19 sections, 9 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Performance bounds revealed by pass@k and pass^k for DeepSWE-preview on r2e-gym and Devstral-2 on nano-agent. The vertical distance between curves quantifies how much performance depends on random choices. DeepSWE-preview exhibits wider gaps (high sensitivity to randomness), while Devstral-2 shows narrower gaps (more consistent solutions), though both demonstrate substantial dependence on stochastic exploration as $k$ increases.
  • Figure 2: Distribution of first token divergence across different models under nano-agent. In blue, we show the distributions with temperature 0, while in orange we show the distributions with the suggested temperatures. On top, we plot by absolute token position, while on bottom, we plot by relative position (percentage through the trajectory). The distributions are shown for all pairs of divergent runs, one per model-scaffold pair.
  • Figure 3: A subtle reasoning divergence at token 94 cascades into opposite outcomes. Both runs share identical reasoning through the first paragraph, understanding the task of adding an __iter__ method to Django's Paginator class. At token 94, the reasoning diverges: run 1 reasons "Let me search..." while run 2 reasons "Let me check...Using the shell tool...". This difference leads to a different first tool call, which propagates through subsequent steps, with only run 2 succeeding. Even at temperature 0, non-determinism causes trajectory divergence that compounds into fundamentally different problem-solving strategies.
  • Figure 4: Required number of runs per agent under test for detecting improvements of different magnitudes (1%, 2%, 5%, 10%) under three variance scenarios observed in our experiments, at significance level $p < 0.05$ and 80% statistical power. The minimum variance scenario ($\sigma = 0.7\%$) represents the most favorable case, while the maximum variance ($\sigma = 1.8\%$) represents the most challenging evaluation conditions. The steep increase in required runs for smaller improvements, particularly at higher variance levels, demonstrates that single-run evals cannot reliably distinguish small performance differences from random variations.
  • Figure 5: Required number of runs per agent under test for detecting improvements of different magnitudes (1%, 2%, 5%, 10%) at varying statistical power levels (70%, 80%, 90%, 95%), assuming median observed variance ($\sigma = 1.5\%$) and significance level $p < 0.05$. Higher desired statistical power requires substantially more runs, particularly for detecting small improvements. For example, detecting a 2% improvement with 80% power requires 9 runs per agent, while achieving 95% power for the same effect size requires 15 runs. The steep growth in required sample size for smaller effect sizes demonstrates why single-run evals are insufficient for reliably detecting small improvements.
  • ...and 2 more figures
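
The run counts quoted in the Figure 5 caption can be reproduced with a textbook sample-size formula. The sketch below assumes a two-sided, two-sample z-test with equal run-to-run standard deviation $\sigma$ for both agents, giving $n = 2\,(z_{1-\alpha/2} + z_{\text{power}})^2\,\sigma^2 / \delta^2$ runs per agent; the captions do not state which test the authors used, and `runs_per_agent` is a hypothetical name.

```python
from math import ceil
from statistics import NormalDist


def runs_per_agent(delta: float, sigma: float,
                   alpha: float = 0.05, power: float = 0.80) -> int:
    """Runs needed per agent to detect a pass@1 difference of `delta`.

    Assumes a two-sided two-sample z-test with equal standard deviation
    `sigma` (same units as `delta`, e.g. percentage points) for both agents.
    """
    if delta <= 0 or sigma <= 0:
        raise ValueError("delta and sigma must be positive")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = std_normal.inv_cdf(power)          # quantile for the target power
    return ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)


if __name__ == "__main__":
    # sigma = 1.5 is the median variance quoted in the Figure 5 caption.
    for power in (0.80, 0.95):
        print(power, runs_per_agent(delta=2.0, sigma=1.5, power=power))
```

Under these assumptions the formula yields 9 runs at 80% power and 15 runs at 95% power for a 2-point improvement with $\sigma = 1.5$, matching the caption. Because the required count scales with $\sigma^2 / \delta^2$, halving the effect size roughly quadruples the number of runs.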