Dynamic Planning for LLM-based Graphical User Interface Automation

Shaoqing Zhang, Zhuosheng Zhang, Kehai Chen, Xinbei Ma, Muyun Yang, Tiejun Zhao, Min Zhang

TL;DR

This paper tackles the challenge of autonomous LLM-based smartphone GUI automation under dynamic interfaces by introducing Dynamic Planning of Thoughts (D-PoT), a two-stage framework that continually updates plans using environmental feedback and execution history. By integrating planning initialization with dynamic planning adjustment, D-PoT achieves substantial accuracy gains over a strong GPT-4V baseline on the AITW benchmark and demonstrates robust adaptation to unfamiliar tasks through fine-tuning. Key contributions include showing the effectiveness of dynamic planning in reducing hallucinations and enabling cross-domain generalization, and releasing code to facilitate adoption. The work highlights dynamic planning as a general mechanism for improving multi-step, multimodal GUI tasks in real-world settings with potential impact on accessibility and automated mobile workflows.

Abstract

The advent of large language models (LLMs) has spurred considerable interest in autonomous LLM-based agents, particularly for applications within smartphone graphical user interfaces (GUIs). When presented with a task goal, these agents typically emulate human actions within a GUI environment until the task is completed. However, a key challenge lies in devising effective plans to guide action prediction in GUI tasks, even though planning has been widely recognized as effective for decomposing complex tasks into a series of steps. Specifically, because the GUI changes after each action is executed, it is crucial to adapt plans dynamically based on environmental feedback and action history. We show that the widely used ReAct approach fails because of excessively long historical dialogues. To address this challenge, we propose a novel approach called Dynamic Planning of Thoughts (D-PoT) for LLM-based GUI agents. D-PoT dynamically adjusts its plan based on environmental feedback and execution history. Experimental results show that D-PoT surpasses the strong GPT-4V baseline by 12.7 percentage points in accuracy (34.66% $\rightarrow$ 47.36%). The analysis highlights the generality of dynamic planning across different backbone LLMs, as well as its benefits in mitigating hallucinations and adapting to unseen tasks. Code is available at https://github.com/sqzhang-lazy/D-PoT.
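
The plan, act, update cycle described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `run_dynamic_planning`, `AgentState`, `toy_observe`, `toy_plan`, and `toy_act` are hypothetical, and the toy functions stand in for the screenshot capture and LLM calls a real agent would make. The sketch only shows the control flow in which the plan is regenerated every turn from the current screen and the execution history, rather than being fixed once at the start.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AgentState:
    """Hypothetical container for what the agent carries between turns."""
    goal: str
    plan: List[str] = field(default_factory=list)
    history: List[str] = field(default_factory=list)


def run_dynamic_planning(
    goal: str,
    observe: Callable[[List[str]], str],
    plan_fn: Callable[[str, str, List[str], List[str]], List[str]],
    act_fn: Callable[[str, str, List[str]], str],
    max_turns: int = 10,
) -> AgentState:
    """Re-plan every turn, predict one action, then extend the history."""
    state = AgentState(goal=goal)
    for _ in range(max_turns):
        screen = observe(state.history)  # environmental feedback
        # Adjust the plan using the new screen and the execution history.
        state.plan = plan_fn(goal, screen, state.history, state.plan)
        action = act_fn(goal, screen, state.plan)  # action prediction
        state.history.append(action)  # update the execution history
        if action == "DONE":
            break
    return state


# Toy stand-ins (assumptions): a real agent would query a multimodal LLM.
SCREENS = ["home", "settings", "wifi"]
FULL_PLAN = ["open settings", "open wifi", "toggle wifi", "DONE"]


def toy_observe(history: List[str]) -> str:
    return SCREENS[min(len(history), len(SCREENS) - 1)]


def toy_plan(goal: str, screen: str, history: List[str],
             previous_plan: List[str]) -> List[str]:
    # Keep only the steps that have not been executed yet.
    return FULL_PLAN[len(history):]


def toy_act(goal: str, screen: str, plan: List[str]) -> str:
    return plan[0] if plan else "DONE"


final = run_dynamic_planning("turn on wifi", toy_observe, toy_plan, toy_act)
print(final.history)
```

Passing the planner and action predictor in as callables keeps the loop independent of any particular backbone model, which mirrors the abstract's claim that dynamic planning applies across different LLMs.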

Paper Structure

This paper contains 27 sections, 2 equations, 7 figures, and 10 tables.

Figures (7)

  • Figure 1: The proposed dynamic planning method incorporates the execution history to adjust the plan to predict the action and subsequently supplements the execution history with the predicted action.
  • Figure 2: ReAct is misled by incorrect decisions.
  • Figure 3: Overview of D-PoT. In turn $i$, D-PoT makes a plan based on the visual and textual input, predicts the action to be performed, updates the execution history, and then proceeds to turn $i+1$.
  • Figure 4: Examples of six types of available actions.
  • Figure 5: The first common error is a bias of GPT-4V on mobile tasks. The red circles are the steps that GPT-4V performs in a dynamic schedule.
  • ...and 2 more figures