Building a Stable Planner: An Extended Finite State Machine Based Planning Module for Mobile GUI Agent
Fanglin Mo, Junzhe Chen, Haoxuan Zhu, Xuming Hu
TL;DR
This paper tackles the instability of multi-step task planning in mobile GUI agents by introducing SPlanner, an EFSM-based planning module that models individual apps to produce stable, actionable execution plans. The approach parses user instructions, solves app-specific EFSMs via BFS to derive execution paths, and polishes these paths into natural-language plans that guide a vision-language model through step-by-step GUI interactions. Evaluated on the AndroidWorld benchmark, SPlanner with a generalist VLM achieves 63.8% task success, a substantial 28.8-point improvement over non-planned baselines, and competitive performance relative to specialized GUI agents. The work highlights the benefits of integrating symbolic EFSMs with LLM-based reasoning in a plug-and-play framework, while noting current limitations in manual EFSM construction and adherence of VLMs to plans. Future work aims to automate EFSM generation and enhance instruction parsing to improve scalability and robustness in real-world deployments.
Abstract
Mobile GUI agents execute user commands by directly interacting with the graphical user interface (GUI) of mobile devices, demonstrating significant potential to enhance user convenience. However, these agents face considerable challenges in task planning, as they must continuously analyze the GUI and generate operation instructions step by step. This process often makes accurate task planning difficult: GUI agents lack a deep understanding of how to use the target applications effectively, which can cause them to become "lost" during task execution. To address the task planning issue, we propose SPlanner, a plug-and-play planning module that generates execution plans to guide vision language models (VLMs) in executing tasks. The proposed planning module uses extended finite state machines (EFSMs) to model the control logic and configurations of mobile applications. It then decomposes a user instruction into a sequence of primary functions modeled in the EFSMs, and generates the execution path by traversing the EFSMs. We further refine the execution path into a natural language plan using an LLM. The final plan is concise and actionable, and effectively guides VLMs to generate interactive GUI actions to accomplish user tasks. SPlanner demonstrates strong performance on dynamic benchmarks reflecting real-world mobile usage. On the AndroidWorld benchmark, SPlanner achieves a 63.8% task success rate when paired with Qwen2.5-VL-72B as the VLM executor, yielding a 28.8 percentage point improvement compared to using Qwen2.5-VL-72B without planning assistance.
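To make the EFSM traversal concrete, the following is a minimal sketch of a breadth-first search over a toy EFSM. It is not the authors' implementation: the `Transition` class, the `bfs_path` function, the note-taking app, its pages, and the `dark_mode` variable are all hypothetical and chosen only for illustration. The sketch assumes an EFSM whose states are app pages, whose variables are app configurations, and whose transitions carry a guard and an update over those variables.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Vars = Dict[str, bool]  # EFSM variables, e.g. app configuration flags


@dataclass(frozen=True)
class Transition:
    """One EFSM edge (hypothetical structure, for illustration only)."""
    source: str                     # page the action starts from
    action: str                     # GUI action label
    target: str                     # page reached after the action
    guard: Callable[[Vars], bool]   # must hold for the edge to fire
    update: Callable[[Vars], Vars]  # returns the new variable assignment


def bfs_path(
    transitions: List[Transition],
    start: str,
    start_vars: Vars,
    goal: Callable[[str, Vars], bool],
) -> Optional[List[str]]:
    """Return a shortest action sequence reaching a goal configuration."""
    def key(state: str, variables: Vars):
        # A search node is the pair (page, variable assignment).
        return (state, tuple(sorted(variables.items())))

    queue = deque([(start, start_vars, [])])
    seen = {key(start, start_vars)}
    while queue:
        state, variables, path = queue.popleft()
        if goal(state, variables):
            return path
        for t in transitions:
            if t.source != state or not t.guard(variables):
                continue
            new_vars = t.update(dict(variables))  # copy before updating
            k = key(t.target, new_vars)
            if k not in seen:
                seen.add(k)
                queue.append((t.target, new_vars, path + [t.action]))
    return None  # goal unreachable in this EFSM


always = lambda v: True
keep = lambda v: v
toggle_dark = lambda v: {**v, "dark_mode": not v["dark_mode"]}

# Toy EFSM for a hypothetical note-taking app.
app = [
    Transition("home", "tap Settings", "settings", always, keep),
    Transition("settings", "toggle Dark mode", "settings", always, toggle_dark),
    Transition("settings", "tap Back", "home", always, keep),
    Transition("home", "tap New note", "editor", always, keep),
]

# Task: open the editor with dark mode enabled.
plan = bfs_path(
    app,
    start="home",
    start_vars={"dark_mode": False},
    goal=lambda s, v: s == "editor" and v["dark_mode"],
)
print(plan)
```

Because the search runs over (page, variable assignment) pairs rather than pages alone, the path can include a detour through the settings page when the goal depends on a configuration value. In the pipeline described above, an action sequence like this would then be rewritten by an LLM into a natural language plan for the VLM executor.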
