Modeling Distinct Human Interaction in Web Agents

Faria Huq; Zora Zhiruo Wang; Zhanqiu Guo; Venu Arvind Arangarajan; Tianyue Ou; Frank Xu; Shuyan Zhou; Graham Neubig; Jeffrey P. Bigham

TL;DR

This work addresses the misalignment between autonomous web agents and user intent by modeling when users are likely to intervene during task execution. It introduces CowCorpus, a dataset of 400 real-user web navigation trajectories, and trains intervention-aware language models (general and style-conditioned) to predict user interventions at each step, yielding a 61.4-63.4% improvement in intervention prediction accuracy over base LMs. Deployed in live web navigation agents and evaluated in a user study, these models raise user-rated agent usefulness by 26.5%. Together, the results show that modeling human-intervention patterns leads to more adaptive, collaborative web agents.

Abstract

Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold. However, current agentic systems lack a principled understanding of when and why humans intervene, often proceeding autonomously past critical decision points or requesting unnecessary confirmation. In this work, we introduce the task of modeling human intervention to support collaborative web task execution. We collect CowCorpus, a dataset of 400 real-user web navigation trajectories containing over 4,200 interleaved human and agent actions. We identify four distinct patterns of user interaction with agents -- hands-off supervision, hands-on oversight, collaborative task-solving, and full user takeover. Leveraging these insights, we train language models (LMs) to anticipate when users are likely to intervene based on their interaction styles, yielding a 61.4-63.4% improvement in intervention prediction accuracy over base LMs. Finally, we deploy these intervention-aware models in live web navigation agents and evaluate them in a user study, finding a 26.5% increase in user-rated agent usefulness. Together, our results show that structured modeling of human intervention leads to more adaptive, collaborative agents.
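
To make the step-level framing concrete, the following is a minimal sketch of how a trajectory of interleaved human and agent actions could be turned into per-step intervention labels. The `Step` fields, the `intervention_labels` helper, and the labeling rule (an agent step is positive when the human acts next) are all assumptions made for illustration; they are not the CowCorpus schema or the paper's labeling procedure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One action in a trajectory (hypothetical field names)."""
    actor: str   # "agent" or "human"
    action: str  # free-text description of the action


def intervention_labels(trajectory: List[Step]) -> List[int]:
    """Return one label per agent step: 1 if the human acts next, else 0.

    Assumed labeling rule for illustration only, not the paper's definition.
    """
    labels = []
    for i, step in enumerate(trajectory):
        if step.actor != "agent":
            continue  # only agent steps receive a label
        next_is_human = (
            i + 1 < len(trajectory) and trajectory[i + 1].actor == "human"
        )
        labels.append(1 if next_is_human else 0)
    return labels


# Toy trajectory with made-up actions.
trajectory = [
    Step("agent", "open flight search page"),
    Step("agent", "type destination"),
    Step("human", "correct the travel date"),
    Step("agent", "click search"),
]
print(intervention_labels(trajectory))  # [0, 1, 0]
```

Under this assumed rule, an intervention-aware model would be trained to predict the label for each agent step from the task and the trajectory so far.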

Paper Structure (30 sections, 6 equations, 11 figures, 10 tables)

Introduction
Problem Formulation: Human Intervention Modeling
Evaluation Metrics
CowCorpus: Collecting Human-Agent Collaborative Web Activities
Data Collection
Step-Level User Intervention
When Do Users Intervene?
Why Do Users Intervene?
Task-Level Interaction Patterns
Experiments: Modeling Human Intervention
Experiment Setup
Benchmarking Intervention Awareness in Autonomous Agents
Interaction Pattern Customization
Deploying Collaborative Web Agents
Related Work
...and 15 more sections

Figures (11)

Figure 1: In this paper, we present CowCorpus, a dataset of 400 real-user collaborative web trajectories that captures when and how humans intervene during execution, enabling intervention-aware agents that engage users only when needed.
Figure 2: Visual illustration of how the Perfect Timing Score (PTS) is calculated. We measure the squared $L_2$ distance between the ground-truth intervention and each false-positive prediction. The score is then penalized according to this distance.
Figure 3: Four distinct types of human-agent interaction patterns: Takeover, Hands-on, Hands-off, and Collaborative. We visualize the user groups using PCA (left), and describe the interaction mechanism of each group (right).
Figure 4: Perfect Timing Score on CowCorpus. Among the proprietary models, Claude outperforms GPT-4o and Gemini-2.5. Among the fine-tuned models, Gemma 27B improves significantly when fine-tuned on CowCorpus.
Figure 5: The heatmap shows the PTS of the cluster-wise trained models on each of the three clusters. Models trained for the corresponding cluster generally outperform the others, with the only exception of the Takeover group, which the paper analyzes in its user-group subsection.
...and 6 more figures
