CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation

Faria Huq, Zora Zhiruo Wang, Frank F. Xu, Tianyue Ou, Shuyan Zhou, Jeffrey P. Bigham, Graham Neubig

TL;DR

The paper introduces CowPilot, a Chrome-extension framework that enables autonomous and human-in-the-loop web navigation by alternating actions between an LLM agent and a human supervisor. It formalizes a two-agent system with a suggest-then-execute workflow and a suite of evaluation metrics for task success and collaboration, capturing both end-to-end performance and interaction dynamics. Empirical results across five websites show Copilot mode with GPT-4o achieving up to 95% task accuracy with minimal human input, and demonstrate that agents can drive a substantial portion of task success. The framework also positions CowPilot as a tool for data collection and agent evaluation, enabling rigorous studies of human-agent collaboration in real-world web tasks.

Abstract

While much work on web agents emphasizes the promise of autonomously performing tasks on behalf of users, in reality, agents often fall short on complex tasks in real-world contexts and modeling user preference. This presents an opportunity for humans to collaborate with the agent and leverage the agent's capabilities effectively. We propose CowPilot, a framework supporting autonomous as well as human-agent collaborative web navigation, and evaluation across task success and task efficiency. CowPilot reduces the number of steps humans need to perform by allowing agents to propose next steps, while users are able to pause, reject, or take alternative actions. During execution, users can interleave their actions with the agent by overriding suggestions or resuming agent control when needed. We conducted case studies on five common websites and found that the human-agent collaborative mode achieves the highest success rate of 95% while requiring humans to perform only 15.2% of the total steps. Even with human interventions during task execution, the agent successfully drives up to half of task success on its own. CowPilot can serve as a useful tool for data collection and agent evaluation across websites, which we believe will enable research in how users and agents can work together. Video demonstrations are available at https://oaishi.github.io/cowpilot.html
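The collaboration loop described in the abstract (the agent proposes a next step, and the user either lets it run or substitutes their own action) can be sketched roughly as follows, along with the share of steps performed by the human. This is only an illustrative sketch under stated assumptions: `Step`, `run_episode`, `human_step_share`, and the scripted agent and human are hypothetical names invented here, not CowPilot's actual API, and the paper's exact metric definitions are not given in this section.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    actor: str   # "agent" or "human"
    action: str  # free-text description of the executed action


def run_episode(
    suggest: Callable[[List[Step]], Optional[str]],
    review: Callable[[str], Optional[str]],
    max_steps: int = 20,
) -> List[Step]:
    """Alternate between agent suggestions and human review (hypothetical sketch).

    suggest(history) returns the agent's proposed action, or None when it is done.
    review(proposal) returns None to accept the proposal, or a replacement
    action that the human performs instead.
    """
    trajectory: List[Step] = []
    for _ in range(max_steps):
        proposal = suggest(trajectory)
        if proposal is None:
            break
        override = review(proposal)
        if override is None:
            trajectory.append(Step("agent", proposal))   # suggestion accepted
        else:
            trajectory.append(Step("human", override))   # human took over this step
    return trajectory


def human_step_share(trajectory: List[Step]) -> float:
    """Fraction of executed steps that were performed by the human."""
    if not trajectory:
        return 0.0
    return sum(s.actor == "human" for s in trajectory) / len(trajectory)


# Scripted stand-ins for the agent and the user (assumptions for illustration).
plan = ["open forums", "click 'spice' forum", "open top post", "open comments"]


def scripted_agent(history: List[Step]) -> Optional[str]:
    return plan[len(history)] if len(history) < len(plan) else None


def scripted_human(proposal: str) -> Optional[str]:
    # Reject the one wrong suggestion and perform the corrected action instead.
    return "click 'space' forum" if "spice" in proposal else None


traj = run_episode(scripted_agent, scripted_human)
print(human_step_share(traj))  # 0.25: one of four steps was done by the human
```

In this toy run the human corrects a single step out of four, so the human share is 25%; the 15.2% figure reported in the abstract is the analogous quantity measured in the authors' case studies.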

Paper Structure

This paper contains 20 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: A step-by-step illustration of how human intervention enables the agent to overcome a failure point during task execution. Gray edges represent the agent's autonomous actions and blue edges indicate human intervention. The agent first attempts the task independently and navigates to the interface that lists available forums. At this stage, the agent gets stuck, unable to locate the desired 'space' forum. A human intervenes, guiding the agent to the correct forum. The user then resumes the agent's operation, allowing it to retrieve the required post and complete the task by navigating to the comments section.
  • Figure 2: Example of CowPilot's core interaction modules during task execution. First, the LLM agent generates a suggestion, highlighting the textual description and the UI element where the action will be performed. The user then identifies an erroneous action, pauses the LLM agent, and performs corrective actions manually (e.g., typing in the text field, highlighted in blue). Next, the user resumes the LLM agent, allowing it to continue generating actions. The agent resumes successfully and executes the subsequent steps autonomously.
  • Figure 3: Correlation between Human Step Count and End-to-End Task Accuracy.
  • Figure 4: Screenshot of the CowPilot evaluation result page. After each task is completed, the evaluation metric values are shown as a summary.
  • Figure 5: Prompt for Action Transformation from Raw Event to Agent Action Space.