When Should Users Check? A Decision-Theoretic Model of Confirmation Frequency in Multi-Step AI Agent Tasks

Jieyu Zhou, Aryan Roy, Sneh Gupta, Daniel Weitekamp, Christopher J. MacLellan

TL;DR

This work addresses when users should intervene during long-horizon, agentic AI tasks. It introduces a decision-theoretic, minimum-time scheduling model that places intermediate confirmation points, based on the observed Confirmation-Diagnosis-Correction-Redo (CDCR) pattern, to balance confirmation burden against the cost of diagnosing and redoing errors. The model is validated with a formative study (8 participants) and a larger within-subjects experiment (48 participants). In that experiment, 81% of participants preferred intermediate confirmation over confirm-at-end, and task completion time fell by 13.54%, with early-stage errors yielding the greatest gains. The work frames confirmation as a mixed-initiative interaction design problem, offers guidance for more reliable, user-supervised agent systems, and points to broader cost and trust considerations in practical deployments.

Abstract

Existing AI agents typically execute multi-step tasks autonomously and allow user confirmation only at the end. During execution, users have little control, making the confirm-at-end approach brittle: a single error can cascade and force a complete restart. Confirming every step avoids such failures, but imposes tedious overhead. Balancing excessive interruptions against costly rollbacks remains an open challenge. We address this problem by modeling confirmation as a minimum-time scheduling problem. We conducted a formative study with eight participants, which revealed a recurring Confirmation-Diagnosis-Correction-Redo (CDCR) pattern in how users monitor errors. Based on this pattern, we developed a decision-theoretic model to determine time-efficient confirmation point placement. We then evaluated our approach using a within-subjects study where 48 participants monitored AI agents and repaired their mistakes while executing tasks. Results show that 81 percent of participants preferred our intermediate confirmation approach over the confirm-at-end approach used by existing systems, and task completion time was reduced by 13.54 percent.

Paper Structure

This paper contains 48 sections, 2 equations, 6 figures.

Figures (6)

  • Figure 1: Representative Task Types and AI Agent Interface Example
  • Figure 2: An example of the interval between the last verified state $s_i$ and the next confirmation point $s_j$. The colors denote whether plan execution is correct at each state, but correctness is only observed when the user inspects a state (solid circles). When the user finds an error in state $s_j$, they must go back to $s_i$ and check each state one at a time until they find the first error. This requires them to inspect $m-1$ states. Once the first error is identified, the agent has to redo $j-m+1$ actions. Dashed circles mark states whose correctness is not known until diagnosis.
  • Figure 3: Dynamic programming algorithm and results for an agent action sequence of length $N=5$ with action success probabilities $p=[0.7,0.7,0.9,0.85,0.85]$. Costs are parameterized as $t_{\text{confirm}}[k] = 1$, $t_{\text{diagnose}}[k] = 1$, and $t_{\text{redo}}[k] = 1$, for all $k \in \{0,\dots,N\}$. Each cell $[\textit{start}, \textit{end}]$ specifies the expected user time to correctly execute the rest of the action sequence from $\textit{start}$ to $N$, if the system has a verification checkpoint at state $\textit{end}$. White checkmarks indicate the $\textit{next\_ckpt}$ locations for each $\textit{start}$. The right-hand annotations summarize the resulting checkpoint policy.
  • Figure 4: Example of the simulated shopping task interface with intermediate confirmation, the activity list on the left and the current action screenshot on the right, enabling users to compare and identify the first error.
  • Figure 5: Average completion time under different confirmation strategies. Left: Intermediate confirmation reduced total time compared to end confirmation. Right: Effects vary by error location, with large gains for early errors, minimal gains for mid-task errors, and a slight cost for late errors. Error bars show 95% CIs.
  • ...and 1 more figures
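
The checkpoint-placement idea described in the captions of Figures 2 and 3 can be illustrated with a small dynamic program. The sketch below is a toy model, not the paper's algorithm: the function name `plan_checkpoints` and the exact cost terms are assumptions made for illustration. It assumes that action successes are independent, that costs are scalars (Figure 3 sets every per-step cost to 1), that diagnosis inspects states one at a time from the last verified state up to the first erroneous one, and that the user's correction always succeeds, leaving that state verified.

```python
from typing import List, Tuple


def plan_checkpoints(
    p: List[float],
    t_confirm: float = 1.0,
    t_diagnose: float = 1.0,
    t_redo: float = 1.0,
) -> Tuple[float, List[int]]:
    """Toy checkpoint planner (hypothetical helper, not the paper's code).

    p[k-1] is the success probability of action k, which moves the agent
    from state k-1 to state k. Returns the expected time from state 0 and
    the confirmation states visited when no error occurs.
    """
    n = len(p)
    value = [0.0] * (n + 1)    # value[s]: expected time to finish from verified state s
    next_ckpt = [n] * (n + 1)  # next_ckpt[s]: best next confirmation state after s

    # Fill the table backwards: value[s] only depends on states after s.
    for start in range(n - 1, -1, -1):
        best_cost, best_end = float("inf"), start + 1
        for end in range(start + 1, n + 1):
            cost = t_confirm   # the user always pays for the confirmation at `end`
            all_ok = 1.0       # probability that actions start+1 .. m-1 all succeeded
            for m in range(start + 1, end + 1):
                first_error = all_ok * (1.0 - p[m - 1])
                # Assumed repair cost: inspect states start+1..m, then redo actions m..end.
                repair = (m - start) * t_diagnose + (end - m + 1) * t_redo
                # After correction, state m counts as verified and planning resumes there.
                cost += first_error * (repair + value[m])
                all_ok *= p[m - 1]
            cost += all_ok * value[end]  # no error in the interval: continue from `end`
            if cost < best_cost:
                best_cost, best_end = cost, end
        value[start], next_ckpt[start] = best_cost, best_end

    # Read off the schedule along the error-free path.
    schedule, s = [], 0
    while s < n:
        s = next_ckpt[s]
        schedule.append(s)
    return value[0], schedule


if __name__ == "__main__":
    expected_time, schedule = plan_checkpoints([0.7, 0.7, 0.9, 0.85, 0.85])
    print(f"expected time: {expected_time:.3f}, confirm after states: {schedule}")
```

Because the cost terms here are assumptions, the numbers this sketch prints for the Figure 3 parameters should not be expected to match the table in the paper. It is meant only to show the structure of the trade-off: longer intervals save confirmations but raise the expected diagnosis and redo cost.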