
Building a Stable Planner: An Extended Finite State Machine Based Planning Module for Mobile GUI Agent

Fanglin Mo, Junzhe Chen, Haoxuan Zhu, Xuming Hu

TL;DR

This paper tackles the instability of multi-step task planning in mobile GUI agents by introducing SPlanner, an EFSM-based planning module that models individual apps to produce stable, actionable execution plans. The approach parses user instructions, solves app-specific EFSMs via BFS to derive execution paths, and polishes these paths into natural-language plans that guide a vision-language model through step-by-step GUI interactions. Evaluated on the AndroidWorld benchmark, SPlanner paired with a generalist VLM (Qwen2.5-VL-72B) achieves a 63.8% task success rate, a 28.8 percentage point improvement over the same VLM without planning, and is competitive with specialized GUI agents. The work highlights the benefits of integrating symbolic EFSMs with LLM-based reasoning in a plug-and-play framework, while noting two current limitations: EFSMs are constructed manually, and VLMs do not always adhere to the plan. Future work aims to automate EFSM generation and enhance instruction parsing to improve scalability and robustness in real-world deployments.

Abstract

Mobile GUI agents execute user commands by directly interacting with the graphical user interface (GUI) of mobile devices, demonstrating significant potential to enhance user convenience. However, these agents face considerable challenges in task planning, as they must continuously analyze the GUI and generate operation instructions step by step. Because GUI agents lack a deep understanding of how to use the target applications effectively, they often struggle to form accurate task plans and can become "lost" during task execution. To address this task planning issue, we propose SPlanner, a plug-and-play planning module that generates execution plans to guide vision language models (VLMs) in executing tasks. The proposed planning module uses extended finite state machines (EFSMs) to model the control logic and configurations of mobile applications. It then decomposes a user instruction into a sequence of primary functions modeled in the EFSMs and generates the execution path by traversing the EFSMs. We further refine the execution path into a natural language plan using an LLM. The final plan is concise and actionable, and effectively guides VLMs to generate interactive GUI actions to accomplish user tasks. SPlanner demonstrates strong performance on dynamic benchmarks reflecting real-world mobile usage. On the AndroidWorld benchmark, SPlanner achieves a 63.8% task success rate when paired with Qwen2.5-VL-72B as the VLM executor, a 28.8 percentage point improvement over Qwen2.5-VL-72B without planning assistance.
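
To make the traversal step concrete, the following is a minimal sketch of a breadth-first search over an EFSM. It is not the authors' implementation. The TL;DR states only that the EFSMs are solved via BFS; the `Transition` class, the `bfs_path` function, the guard/update callables, and the three-screen toy app with a `dark_mode` variable are all hypothetical names chosen for illustration. The search runs over pairs of (state, variable values), which is what distinguishes an EFSM search from a plain state graph search.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Transition:
    """One EFSM edge (hypothetical structure, not taken from the paper)."""
    source: str
    action: str
    target: str
    # Guard: may this edge fire under the current variable values?
    guard: Callable[[Dict[str, object]], bool] = lambda v: True
    # Update: returns the variable values after the edge fires.
    update: Callable[[Dict[str, object]], Dict[str, object]] = lambda v: v


def bfs_path(
    transitions: List[Transition],
    start_state: str,
    start_vars: Dict[str, object],
    goal: Callable[[str, Dict[str, object]], bool],
) -> Optional[List[str]]:
    """Return a shortest action sequence reaching the goal, or None."""
    # A search node is (state, frozen variables) so it can go in a set.
    start = (start_state, tuple(sorted(start_vars.items())))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (state, frozen), path = queue.popleft()
        variables = dict(frozen)
        if goal(state, variables):
            return path
        for t in transitions:
            if t.source != state or not t.guard(variables):
                continue
            new_vars = t.update(dict(variables))
            node = (t.target, tuple(sorted(new_vars.items())))
            if node not in seen:
                seen.add(node)
                queue.append((node, path + [t.action]))
    return None


# Toy app (assumed for illustration): three screens and one setting.
toy_transitions = [
    Transition("home", "tap Settings", "settings"),
    Transition("home", "tap New note", "editor"),
    Transition(
        "settings", "enable Dark mode", "settings",
        guard=lambda v: not v["dark_mode"],
        update=lambda v: {**v, "dark_mode": True},
    ),
    Transition("settings", "tap Back", "home"),
    Transition("editor", "tap Back", "home"),
]

# Goal: be in the editor with dark mode switched on.
plan = bfs_path(
    toy_transitions,
    start_state="home",
    start_vars={"dark_mode": False},
    goal=lambda s, v: s == "editor" and v["dark_mode"],
)
print(plan)
# ['tap Settings', 'enable Dark mode', 'tap Back', 'tap New note']
```

In this sketch the returned action list plays the role of the execution path; in SPlanner that path is subsequently polished by an LLM into a natural language plan. BFS is a natural fit here because it returns a shortest action sequence, and the `seen` set guarantees termination whenever the set of reachable (state, variables) pairs is finite.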

Paper Structure

This paper contains 15 sections, 6 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: The SPlanner workflow consists of three main stages. First, Application Modeling via EFSM: (a) Prior to deployment, each target application is manually modeled as an EFSM, described using a set of state tables and state transition tables. Second, Plan Generation: (b) Upon receiving a user instruction, SPlanner processes it through three subprocedures (Instruction Parsing, EFSM Solving, and Path Polishing) to generate a detailed execution plan, with the superscripts of $a$ and $T$ indicating their respective order of generation. Third, Task Execution with VLM: (c) We employ a VLM to execute the task by sequentially observing mobile device screenshots and following the generated plan, step by step, until the task is completed.
  • Figure 2: Task success rates of SPlanner and baseline methods on AndroidWorld. For clarity of presentation, darker colors are used to indicate higher success rates, and the exact values are annotated on the corresponding bars.