Towards a Science of AI Agent Reliability

Stephan Rabanser; Sayash Kapoor; Peter Kirgis; Kangheng Liu; Saiteja Utpala; Arvind Narayanan

TL;DR

The paper confronts the gap between rising AI-agent accuracy and real-world reliability by introducing a safety-engineering-inspired framework that decomposes reliability into four dimensions: consistency, robustness, predictability, and safety. It defines twelve concrete metrics and demonstrates their application to 14 models across two benchmarks (GAIA and $\tau$-bench), using multi-run tests, prompt perturbations, fault injection, environment perturbations, confidence elicitation, and safety analyses. Across 18 months of model releases, reliability lags behind capability, with dimension-specific weaknesses in outcome consistency, prompt robustness, and per-task predictability, particularly in open-ended tasks. The paper argues for dynamic, generative benchmarks and governance practices that treat reliability as a core deployment criterion, not a byproduct of accuracy, and outlines future research directions for improving reliability in autonomous and augmentation settings.

Abstract

AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents still continue to fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Grounded in safety-critical engineering, we provide a holistic performance profile by proposing twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety. Evaluating 14 agentic models across two complementary benchmarks, we find that recent capability gains have only yielded small improvements in reliability. By exposing these persistent limitations, our metrics complement traditional evaluations while offering tools for reasoning about how agents perform, degrade, and fail.
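To make the abstract's point about a single success metric concrete, the following minimal sketch compares two hypothetical agents with identical mean accuracy but very different run-to-run behavior. The function `outcome_consistency`, the agent names, and the toy data are illustrative assumptions. The score shown is not one of the paper's twelve metrics, whose definitions are not reproduced in this summary.

```python
from statistics import mean

def accuracy(runs):
    """Mean success rate over all tasks and repeated runs."""
    return mean(mean(outcomes) for outcomes in runs.values())

def outcome_consistency(runs):
    """Illustrative score (an assumption, not the paper's definition).

    For a task with pass rate p, 4*p*(1-p) is 0 when the task always
    passes or always fails and 1 when it is a coin flip. The score is
    1 minus the average of that quantity over tasks.
    """
    pass_rates = (mean(outcomes) for outcomes in runs.values())
    return 1 - mean(4 * p * (1 - p) for p in pass_rates)

# Hypothetical data: task id -> pass/fail outcome for each of four runs.
agent_a = {"t1": [1, 1, 1, 1], "t2": [0, 0, 0, 0]}  # stable per task
agent_b = {"t1": [1, 0, 1, 0], "t2": [0, 1, 0, 1]}  # flips between runs

for name, runs in (("agent_a", agent_a), ("agent_b", agent_b)):
    print(name, accuracy(runs), outcome_consistency(runs))
```

Both toy agents score the same accuracy, yet only the first behaves identically across repeated runs, which is the kind of difference a single aggregate number hides.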

Paper Structure (119 sections, 2 equations, 26 figures, 7 tables)

Introduction
Scope.
A Cross-Domain Perspective of Reliability
Synthesis.
Operationalizing Reliability for AI Agents
Consistency ($\mathcal{R}_{\text{Con}}$)
Robustness ($\mathcal{R}_{\text{Rob}}$)
Predictability ($\mathcal{R}_{\text{Pred}}$)
Safety ($\mathcal{R}_{\text{Saf}}$)
Aggregation
Consistency.
Safety.
Overall reliability.
Disentangling Reliability & Capability
Experiments
...and 104 more sections

Figures (26)

Figure 1: Reliability gains lag behind capability progress. Overall reliability shows slow improvement over time. While accuracy rises steadily across both benchmarks (left), reliability trails behind (center), and the relationship between the two varies across benchmarks (right), indicating that accuracy gains do not automatically yield reliability.
Figure 2: Outcome consistency across models. Results show only modest consistency across the board; even current frontier models do not reliably improve across both benchmarks.
Figure 3: Prompt robustness across models. Many models remain susceptible to surface-level prompt reformulations. Latest frontier models generally show modest but not dependable improvements.
Figure 4: Calibration and discrimination across models. Calibration, the alignment between predicted confidence and accuracy, generally improves in frontier models. Discrimination performance, the ability to distinguish correct and incorrect predictions, is inconsistent across benchmarks and has in fact generally worsened on GAIA.
Figure 5: Safety analysis on $\tau$-bench. Top: Average violations per evaluation run stratified by severity level. Bottom: Breakdown of violations by constraint category. The most recent frontier models exhibit significantly lower overall violation rates. Financial accuracy (i.e., incorrect charges/refunds) remains the most common failure mode across all models.
...and 21 more figures
