Towards a Science of AI Agent Reliability
Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, Arvind Narayanan
TL;DR
The paper confronts the gap between rising AI-agent accuracy and real-world reliability by introducing a safety-engineering-inspired framework that decomposes reliability into four dimensions: consistency, robustness, predictability, and safety. It defines twelve concrete metrics and demonstrates their application to 14 models across two benchmarks (GAIA and $\tau$-bench), using multi-run tests, prompt perturbations, fault injection, environment perturbations, confidence elicitation, and safety analyses. Across 18 months of model releases, reliability lags behind capability, with dimension-specific weaknesses in outcome consistency, prompt robustness, and per-task predictability, particularly in open-ended tasks. The paper argues for dynamic, generative benchmarks and governance practices that treat reliability as a core deployment criterion, not a byproduct of accuracy, and outlines future research directions for improving reliability in autonomous and augmentation settings.
Abstract
AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents continue to fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, such a metric ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Grounded in safety-critical engineering, we provide a holistic performance profile by proposing twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety. Evaluating 14 agentic models across two complementary benchmarks, we find that recent capability gains have yielded only small improvements in reliability. By exposing these persistent limitations, our metrics complement traditional evaluations while offering tools for reasoning about how agents perform, degrade, and fail.
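To illustrate the point that a single success metric can hide inconsistency across runs, the following minimal sketch compares two hypothetical agents with identical average accuracy. The outcome data, the function names, and the all-runs-agree score are assumptions made for illustration; they are not the paper's twelve metrics or their definitions.

```python
from statistics import mean

def accuracy(runs):
    """Mean success rate over all tasks and all repeated runs."""
    return mean(mean(task) for task in runs)

def all_runs_agree(runs):
    """Fraction of tasks whose outcome is identical in every run.

    Illustrative consistency score only, not a metric from the paper.
    """
    return mean(1.0 if len(set(task)) == 1 else 0.0 for task in runs)

# Hypothetical outcomes: 4 tasks, 4 repeated runs each (1 = success, 0 = failure).
agent_a = [[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]]
agent_b = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]

for name, runs in [("agent_a", agent_a), ("agent_b", agent_b)]:
    print(name, "accuracy:", accuracy(runs), "agreement:", all_runs_agree(runs))
```

Both agents score 0.5 on accuracy, yet agent_a always succeeds or always fails on a given task, while agent_b's outcome on every task varies from run to run. An accuracy-only evaluation cannot tell the two apart.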
