Agentic Uncertainty Reveals Agentic Overconfidence

Jean Kaddour, Srijan Patel, Gbètondji Dovonon, Leo Richter, Pasquale Minervini, Matt J. Kusner

TL;DR

The paper investigates agentic uncertainty, asking whether AI agents can reliably predict their own multi-step task success. It formalizes the problem with $P(IS)=P(\text{agent}_{M}\text{ succeeds on } t \big| \mathcal{I})$ and compares pre-, mid-, and post-execution uncertainty agents, plus an adversarial post-execution variant, across 100 SWE-bench Pro tasks and three frontier models. Across regimes, agents exhibit pervasive overconfidence and calibration challenges, though adversarial framing and pre-execution estimates show improvements in calibration and discrimination, respectively. The findings emphasize the limits of self-assessment for autonomous coding workflows and advocate for hybrid deployment strategies that combine diverse uncertainty signals with human oversight to ensure safe and reliable decision-making. Overall, agentic self-assessment remains a critical safety challenge as AI systems scale to longer, more complex, and higher-stakes tasks.

Abstract

Can AI agents predict whether they will succeed at a task? We study agentic uncertainty by eliciting success probability estimates before, during, and after task execution. All results exhibit agentic overconfidence: some agents that succeed only 22% of the time predict 77% success. Counterintuitively, pre-execution assessment, which has strictly less information, tends to yield better discrimination than standard post-execution review, though differences are not always significant. Adversarial prompting, which reframes assessment as bug-finding, achieves the best calibration.
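
The two quantities the abstract refers to can be made concrete with a minimal sketch. This is not the authors' evaluation code. It assumes that overconfidence is the mean predicted success probability minus the empirical success rate, and that discrimination is measured by AUROC; the latter is an assumption, since this summary does not name the metric. The function names and the toy data are hypothetical, and predictions are assumed to be rescaled from $[0,100]$ to $[0,1]$.

```python
from typing import Sequence


def overconfidence_gap(preds: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean predicted success probability minus the empirical success rate."""
    if len(preds) != len(outcomes) or not preds:
        raise ValueError("preds and outcomes must be non-empty and equally long")
    return sum(preds) / len(preds) - sum(outcomes) / len(outcomes)


def auroc(preds: Sequence[float], outcomes: Sequence[int]) -> float:
    """Chance that a random success is scored above a random failure.

    Ties count as half a win, so a constant predictor scores 0.5.
    """
    pos = [p for p, y in zip(preds, outcomes) if y == 1]
    neg = [p for p, y in zip(preds, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one success and one failure")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: four tasks, one success (1) and three failures (0).
toy_preds = [0.9, 0.8, 0.9, 0.7]
toy_outcomes = [1, 0, 0, 0]

gap = overconfidence_gap(toy_preds, toy_outcomes)
disc = auroc(toy_preds, toy_outcomes)
print(f"overconfidence gap: {gap:.3f}, AUROC: {disc:.3f}")
```

Under this definition, the abstract's example of predicting 77% success while succeeding 22% of the time corresponds to a gap of 55 percentage points. An agent that reports the same confidence for every task would score an AUROC of 0.5 here, meaning its estimates carry no signal about which tasks succeed.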

Paper Structure (33 sections, 1 equation, 9 figures, 3 tables)


Figures (9)

  • Figure 1: Agentic overconfidence. We measure overconfidence as the difference between the estimated success probability and the true success probability (true rates: GPT-5.2 Codex 35%, Gemini-3-Pro 22%, Opus 4.5 27%). We plot three strategies: pre-, post-, and adversarial post-execution. All agents systematically overestimate their success.
  • Figure 2: Agentic Uncertainty Regimes. Each regime observes different information. Post-execution and adversarial post-execution occur at the same point but use different prompts.
  • Figure 3: Uncertainty Agent Prompt Excerpts. Pre-execution explores the codebase before any solution attempt. Mid-execution evaluates an agent's partial trajectory for signs of progress or struggle. Post-execution reviews a proposed patch. Adversarial post-execution explicitly prompts bug-finding before estimation. All agents output probability estimates in $[0,100]$.
  • Figure 4: Distribution of post-execution confidence estimates by model. Success cases shown above the axis (green), failure cases below (red); dashed lines indicate base rates. Mirror symmetry reveals indistinguishable distributions: where bars match above and below, the model assigns identical confidence regardless of outcome. Gemini exhibits the most extreme pattern: nearly all predictions cluster at 100% confidence, creating dramatic mirrored towers. This visual symmetry directly explains the poor discrimination: high-confidence predictions provide no signal about actual success.
  • Figure 5: Calibration curves reveal systematic overconfidence. Points below the diagonal (shaded region) indicate overconfidence: models predict higher success probability than achieved. All methods fall in this region across all models. Gemini shows the most severe miscalibration: predictions near 100% yield only $\sim$20% accuracy. The adversarial method (triangles) consistently shifts curves upward toward the diagonal, achieving the best calibration, while pre-execution (circles) shows less extreme overconfidence than standard post-execution (squares) for GPT and Claude.
  • ...and 4 more figures
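
The calibration curves described in the Figure 5 caption can be illustrated with a minimal sketch. It assumes equal-width confidence bins and predictions rescaled from $[0,100]$ to $[0,1]$; the bin count, the function name, and the toy data are hypothetical, and the paper's own binning scheme is not given in this summary.

```python
from typing import List, Sequence, Tuple


def calibration_curve(
    preds: Sequence[float], outcomes: Sequence[int], n_bins: int = 5
) -> List[Tuple[float, float, int]]:
    """Return (mean confidence, success rate, count) for each non-empty bin."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(preds, outcomes):
        # A prediction of exactly 1.0 goes into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    curve = []
    for b in bins:
        if b:
            mean_conf = sum(p for p, _ in b) / len(b)
            success_rate = sum(y for _, y in b) / len(b)
            curve.append((mean_conf, success_rate, len(b)))
    return curve


# Hypothetical toy data: every prediction is 100%, but only 1 of 5 tasks succeeds.
toy_curve = calibration_curve([1.0] * 5, [1, 0, 0, 0, 0])
for mean_conf, success_rate, count in toy_curve:
    side = "below" if success_rate < mean_conf else "on or above"
    print(f"conf={mean_conf:.2f} acc={success_rate:.2f} n={count} ({side} the diagonal)")
```

A point whose success rate is lower than its mean confidence lies below the diagonal, which is the overconfident region the caption describes. The toy data mimics the pattern of predictions near 100% paired with roughly 20% accuracy.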