Agentic Uncertainty Reveals Agentic Overconfidence
Jean Kaddour, Srijan Patel, Gbètondji Dovonon, Leo Richter, Pasquale Minervini, Matt J. Kusner
TL;DR
The paper investigates agentic uncertainty, asking whether AI agents can reliably predict their own multi-step task success. It formalizes the problem with $P(IS)=P(\text{agent}_{M}\text{ succeeds on } t \,\big|\, \mathcal{I})$ and compares pre-, mid-, and post-execution uncertainty agents, plus an adversarial post-execution variant, across 100 SWE-bench Pro tasks and three frontier models. Across regimes, agents exhibit pervasive overconfidence and calibration challenges, though adversarial framing and pre-execution estimates show improvements in calibration and discrimination, respectively. The findings emphasize the limits of self-assessment for autonomous coding workflows and advocate for hybrid deployment strategies that combine diverse uncertainty signals with human oversight to ensure safe and reliable decision-making. Overall, agentic self-assessment remains a critical safety challenge as AI systems scale to longer, more complex, and higher-stakes tasks.
Abstract
Can AI agents predict whether they will succeed at a task? We study agentic uncertainty by eliciting success probability estimates before, during, and after task execution. All results exhibit agentic overconfidence: some agents that succeed only 22% of the time predict 77% success. Counterintuitively, pre-execution assessment, which has strictly less information, tends to yield better discrimination than standard post-execution review, though differences are not always significant. Adversarial prompting, which reframes assessment as bug-finding, achieves the best calibration.
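To make the distinction between calibration and discrimination concrete, the following is a minimal Python sketch. It rests on assumptions not stated in this section: the Brier score and AUROC are used here as common stand-ins for calibration and discrimination, and the function names and toy data are hypothetical, not the paper's evaluation code or results. By the same arithmetic as the gap function below, the abstract's example of 77% predicted success against 22% actual success corresponds to an overconfidence gap of 0.77 - 0.22 = 0.55.

```python
from typing import Sequence


def overconfidence_gap(preds: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean predicted success probability minus the observed success rate.

    Positive values indicate overconfidence.
    """
    return sum(preds) / len(preds) - sum(outcomes) / len(outcomes)


def brier_score(preds: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(preds, outcomes)) / len(preds)


def auroc(preds: Sequence[float], outcomes: Sequence[int]) -> float:
    """Probability that a random success is ranked above a random failure.

    Ties count as half a win. Requires at least one success and one failure.
    """
    pos = [p for p, y in zip(preds, outcomes) if y == 1]
    neg = [p for p, y in zip(preds, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one success and one failure")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: four tasks, only the first one succeeds.
toy_preds = [0.9, 0.8, 0.7, 0.6]
toy_outcomes = [1, 0, 0, 0]

gap = overconfidence_gap(toy_preds, toy_outcomes)
brier = brier_score(toy_preds, toy_outcomes)
auc = auroc(toy_preds, toy_outcomes)
print(f"gap={gap:.3f} brier={brier:.3f} auroc={auc:.3f}")
```

In this toy case the agent ranks its one success above every failure, so discrimination is perfect, yet its average prediction far exceeds its success rate, so calibration is poor. This is why the two properties are reported separately.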
