CAAP: Context-Aware Action Planning Prompting to Solve Computer Tasks with Front-End UI Only

Junhee Cho, Jihoon Kim, Daseul Bae, Jinho Choo, Youngjune Gwon, Yeong-Dae Kwon

TL;DR

CAAP introduces a modular GUI agent that solves desktop tasks using only screenshots and keyboard/mouse actions, decoupling perception, reasoning, and execution. It uses Context-Aware Action Planning (CAAP) prompting to organize the surrounding context systematically and induce chain-of-thought reasoning in an LLM, reducing the need for large human demonstration datasets. On MiniWoB++ and WebShop, the agent achieves state-of-the-art performance among image-only agents, with a 94.5% average success rate across 73 MiniWoB++ tasks and a task score of 62.3 on WebShop, despite minimal supervision. The approach shows potential for use across multiple desktop applications and for cross-app coordination, offering a practical path toward robust, data-efficient automation agents.

Abstract

Software robots have long been used in Robotic Process Automation (RPA) to automate mundane and repetitive computer tasks. With the advent of Large Language Models (LLMs) and their advanced reasoning capabilities, these agents are now able to handle more complex or previously unseen tasks. However, LLM-based automation techniques in recent literature frequently rely on HTML source code for input or application-specific API calls for actions, limiting their applicability to specific environments. We propose an LLM-based agent that mimics human behavior in solving computer tasks. It perceives its environment solely through screenshot images, which are then converted into text for an LLM to process. By leveraging the reasoning capability of the LLM, we eliminate the need for the large-scale human demonstration data typically required for model training. The agent executes only keyboard and mouse operations on the Graphical User Interface (GUI), removing the need for pre-provided APIs. To further enhance the agent's performance in this setting, we propose a novel prompting strategy called Context-Aware Action Planning (CAAP) prompting, which enables the agent to thoroughly examine the task context from multiple perspectives. Our agent achieves an average success rate of 94.5% on MiniWoB++ and an average task score of 62.3 on WebShop, outperforming all previous studies of agents that rely solely on screen images. This method demonstrates potential for broader applications, particularly for tasks requiring coordination across multiple applications on desktops or smartphones, marking a significant advancement in the field of automation agents. Code and models are available at https://github.com/caap-agent/caap-agent.

Paper Structure (51 sections, 11 figures, 12 tables)

This paper contains 51 sections, 11 figures, 12 tables.

Figures (11)

  • Figure 1: The architecture and task-solving flow of the CAAP agent. The agent interprets a screenshot captured in the computer environment through the visual observer. The action proposer leverages the reasoning capabilities of the LLM to determine the next actions to take based on the observed state. Once actions are decided, the action executer applies the corresponding keyboard and mouse actions to the environment via the OS interface. This sequence of processes across the three modules continues until the task is completed.
  • Figure 2: Comparison of the masking methods for the original Pix2Struct and our UI element understanding model, and an example of the extracted features for our visual observer. While the original Pix2Struct outputs text in an HTML-like format, our model is fine-tuned to return JSON-style text.
  • Figure 3: Content design for the CAAP prompting.
  • Figure 4: The effects of different CAAP components on MiniWoB++ task performance.
  • Figure 5: A prompt example of generating a rationale for an agent action.
  • ...and 6 more figures
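
The Figure 1 caption describes a loop over three modules: a visual observer, an action proposer, and an action executer. The following is a minimal sketch of that control flow only, not the authors' implementation. Every name here (`UIElement`, `Action`, `run_agent`, the `"done"` action kind, `max_steps`) is a hypothetical chosen for illustration, and the observer, the LLM-backed proposer, and the OS interface are replaced by stub functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class UIElement:
    """One on-screen element as a visual observer might report it (hypothetical schema)."""
    kind: str
    text: str
    x: int
    y: int


@dataclass
class Action:
    """A keyboard/mouse action; 'done' is a made-up marker for task completion."""
    kind: str  # "click", "type", or "done" in this sketch
    x: int = 0
    y: int = 0
    text: str = ""


def run_agent(
    capture: Callable[[], bytes],
    observe: Callable[[bytes], List[UIElement]],
    propose: Callable[[str, List[UIElement], List[Action]], List[Action]],
    execute: Callable[[Action], None],
    task: str,
    max_steps: int = 10,
) -> List[Action]:
    """Repeat observe -> propose -> execute until the proposer signals completion."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture()                       # screenshot of the environment
        elements = observe(screenshot)               # image -> textual UI description
        actions = propose(task, elements, history)   # LLM reasoning would happen here
        for action in actions:
            if action.kind == "done":
                return history
            execute(action)                          # keyboard/mouse via the OS interface
            history.append(action)
    return history


def demo() -> Tuple[List[Action], List[Tuple[str, int, int]]]:
    """Run the loop against stubs standing in for the real modules."""
    log: List[Tuple[str, int, int]] = []

    def capture() -> bytes:
        return b"fake-screenshot"

    def observe(_: bytes) -> List[UIElement]:
        return [UIElement("button", "Submit", 40, 80)]

    def propose(task: str, elements: List[UIElement], history: List[Action]) -> List[Action]:
        if history:  # one click is enough for this toy task
            return [Action("done")]
        target = next(e for e in elements if e.text == "Submit")
        return [Action("click", target.x, target.y)]

    def execute(action: Action) -> None:
        log.append((action.kind, action.x, action.y))

    history = run_agent(capture, observe, propose, execute, "Click the Submit button")
    return history, log
```

Passing the three modules in as plain callables mirrors the decoupling of perception, reasoning, and execution that the caption describes: any one of them can be swapped without touching the loop.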