Overseeing Agents Without Constant Oversight: Challenges and Opportunities

Madeleine Grunde-McLaughlin; Hussein Mozannar; Maya Murad; Jingya Chen; Saleema Amershi; Adam Fourney

TL;DR

The paper investigates how to oversee agentic AI without constant oversight by examining trace presentation in Computer Use Agents. Through a formative study, three design probes, and a controlled experiment, it demonstrates that verbose, step-by-step traces hinder error detection, while an outcome-oriented, specification-based interface can reduce error-finding time but may inflate confidence without improving final accuracy. The findings highlight challenges in conveying agent process provenance, underspecified goals, and subjective notions of correctness, and they point to integrating verification into running workflows with improved navigation. Practically, the work advances trace design by balancing process visibility with outcome checks and by emphasizing navigation, assumptions, and binding between requirements and executed steps. These insights have implications for designing safer, more trustworthy human-AI collaboration in multi-step agentic systems.

Abstract

To enable human oversight, agentic AI systems often provide a trace of reasoning and action steps. Designing traces to have an informative, but not overwhelming, level of detail remains a critical challenge. In three user studies on a Computer Use Agent, we investigate the utility of basic action traces for verification, explore three alternatives via design probes, and test a novel interface's impact on error finding in question-answering tasks. As expected, we find that current practices are cumbersome, limiting their efficacy. Conversely, our proposed design reduced the time participants spent finding errors. However, although participants reported higher levels of confidence in their decisions, their final accuracy was not meaningfully improved. Overall, our study surfaces challenges for human verification of agentic systems, including managing built-in assumptions, users' subjective and changing correctness criteria, and the shortcomings, yet importance, of communicating the agent's process.

Paper Structure (42 sections, 9 figures, 9 tables)

Introduction
Related Work
Human oversight of Computer Use Agents
Verification of HIL agentic systems
Formative study
System
Procedure
Results
Participants envision multi-tasking while running CUA
Verbosity makes returning to a CUA run cumbersome to review
Participants missed small but impactful errors
Participants employ varied verification strategies, preferring error indications in the final output, trial and error testing, and visual review
Participants display subjective and changing correctness criteria
Implications from formative study informing design probes
Design probes
...and 27 more sections

Figures (9)

Figure 1: Magentic-UI interface. Our formative study illuminates verification practices with the HIL CUA Magentic-UI [mozannar2025magentic]. Users first input a task [A], then use various features to co-plan, co-task, and adapt to errors as the system runs. To verify, they can first review the plan summary that decomposes the task into proposed plan steps [B]. Each plan step is then executed, and each action is described in the trace for transparency [C]. Collapsing these plan steps [D] shows summaries of each group of actions. While the task runs, users may watch the live view as the agent interacts with the browser [E]. This visual trace is stored in a series of screenshots as the agent progresses [F]. Clicking the icon next to a particular action [C] links to the associated screenshot [F] to improve navigation. Users can open tabs between different tasks to multitask [G] and interact with chat [H]. A "Final Answer" summary appears upon completion (not visible).
Figure 2: Three design probes. Our design probes focus on the process (Flowchart), outcome (Specification), or both (Citation). This figure illustrates each design for the task "How much would I save by getting annual passes for my family (2 adults, 1 kid age 5, 1 kid age 2) for the Seattle Aquarium, compared to buying daily tickets, if we visit 4 times in a year?" Additional examples are in Supplementary Materials.
Figure 3: Design probe study interface. The interface displays the given task [A], task output [B], and one of the three design probes [C]. Clicking on the design probe surfaces the relevant screenshot(s) [D] that have highlighted annotations of the relevant information [E]. The original Magentic-UI trace is available for further review [F]. Participants then answer the questions "Is this result correct?" and "Why or why not?" below [G].
Figure 4: Controlled study interface. The interface provides the task [A] and the output to verify [B]. It integrates the Magentic-UI trace directly into the interface, adding highlights to both the text [C] and screenshots [D]. The screenshots can be paged through, with the ability to magnify for detail and to open a new tab with the associated website for further review [E]. For additional path context, there is a brief summary [F] provided in the Citation format. For guiding review, there are requirements and assumptions [G], with additional explanatory context and icons indicating if requirements are completed, unknown, or contradicted. Each requirement and assumption refers to annotation(s) in the text and image trace with underlying icons [H] and tooltips for easier scanning [I]. For the study, there is a 5-minute timer [J] and a series of questions [K].
Figure 5: Control study results. The chart shows the Hedges' $g$ effect size and the 95% confidence interval. The Duration and Confidence charts show subsets such that "No-Yes" displays when the ground truth answer is "No" and the user answers "Yes". Overall, we find no effect on accuracy (0.18), a small effect decreasing duration (-0.29), and a medium effect increasing confidence (0.56). The effect on duration is medium-sized when the user correctly finds an error ("No-No": -0.65). The largest confidence effect occurs when the answer is incorrect, but the user thinks it is correct ("No-Yes": 0.85). Exact values are in Table \ref{tab:s3adc}. Three dotted vertical lines show the cutoffs for small, medium, and large effect sizes.
...and 4 more figures
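
For readers unfamiliar with the effect size reported in Figure 5, the following is a minimal sketch of how Hedges' $g$ is commonly computed: a pooled-standard-deviation mean difference with the usual small-sample correction. It is not the authors' analysis code. The function names, the sample values, and the 0.2 / 0.5 / 0.8 cutoffs (the conventional small, medium, and large thresholds) are assumptions made for illustration; the page does not state the exact cutoffs used in the figure.

```python
import math
from statistics import mean, variance


def hedges_g(treatment, control):
    """Standardized mean difference with the small-sample correction.

    Hypothetical helper for illustration, not the paper's code.
    """
    n1, n2 = len(treatment), len(control)
    # Pooled variance from the two sample variances (n - 1 denominators).
    pooled_var = (
        (n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    cohens_d = (mean(treatment) - mean(control)) / math.sqrt(pooled_var)
    # Common approximation of the bias-correction factor.
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return correction * cohens_d


def label_effect(g):
    """Label |g| using conventional cutoffs (assumed: 0.2, 0.5, 0.8)."""
    size = abs(g)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


# Made-up numbers, only to exercise the functions.
example_g = hedges_g([2, 3, 4], [1, 2, 3])
print(round(example_g, 2), label_effect(example_g))
```

Under these assumed cutoffs, the values quoted in the Figure 5 caption receive the same labels the caption gives them: 0.18 falls below the small threshold, -0.29 is small, and 0.56 and -0.65 are medium.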
