CHOP: Mobile Operating Assistant with Constrained High-frequency Optimized Subtask Planning
Yuqi Zhou, Shuai Wang, Sunhao Dai, Qinglin Jia, Zhaocheng Du, Zhenhua Dong, Jun Xu
TL;DR
CHOP tackles the subtask planning bottleneck in VLM-driven mobile GUI assistants by introducing basis subtasks extracted from human-performed sequences. It constrains the plan agent to decompose tasks into a fixed, high-frequency basis space and uses a dedicated action agent to execute the steps of each basis subtask, with memory and Aria-UI integration to map actions to precise GUI coordinates. On English (CHOP-En) and Chinese (CHOP-ZH) datasets across 20 apps, CHOP achieves state-of-the-art effectiveness and efficiency, improving both subtask quality and overall task execution speed, while ablations confirm the value of basis subtasks and documentation. The work provides public data and code, offering a scalable approach to GUI task automation that generalizes across apps and languages, with potential impact on mobile automation, AI assistants, and HCI research.
Abstract
The advancement of visual language models (VLMs) has enhanced mobile device operations, allowing simulated human-like actions to address user requirements. Current VLM-based mobile operating assistants can be structured into three levels: task, subtask, and action. The subtask level, linking high-level goals with low-level executable actions, is crucial for task completion but faces two challenges: ineffective subtasks that lower-level agents cannot execute and inefficient subtasks that fail to contribute to the completion of the higher-level task. These challenges stem from VLMs' lack of experience in decomposing subtasks within GUI scenarios in a multi-agent architecture. To address these, we propose a new mobile assistant architecture with constrained high-frequency optimized planning (CHOP). Our approach overcomes the VLM's deficiency in GUI scenario planning by using human-planned subtasks as the basis vector. We evaluate our architecture in both English and Chinese contexts across 20 apps, demonstrating significant improvements in both effectiveness and efficiency. Our dataset and code are available at https://github.com/Yuqi-Zhou/CHOP
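
To make the idea of constraining a planner to a fixed set of basis subtasks concrete, the following is a minimal sketch and not the authors' implementation. The names `BASIS_SUBTASKS`, `Subtask`, and `constrain_plan`, as well as the example subtask labels, are hypothetical; the actual basis subtasks in CHOP are extracted from human-performed sequences, and the real plan agent is a VLM rather than a filter.

```python
from dataclasses import dataclass

# Hypothetical basis set, for illustration only. In the paper, basis subtasks
# come from high-frequency human-planned subtasks, not from a hand-written list.
BASIS_SUBTASKS = {"open_app", "search", "select_item", "enter_text", "confirm"}


@dataclass(frozen=True)
class Subtask:
    name: str           # label of the subtask proposed by the planner
    argument: str = ""  # free-form parameter, e.g. an app name or a query


def constrain_plan(proposed):
    """Split a proposed plan into subtasks inside and outside the basis set.

    Subtasks outside the basis set are treated as ones the lower-level
    action agent may not be able to execute, so they are rejected.
    """
    accepted, rejected = [], []
    for step in proposed:
        if step.name in BASIS_SUBTASKS:
            accepted.append(step)
        else:
            rejected.append(step)
    return accepted, rejected


if __name__ == "__main__":
    plan = [
        Subtask("open_app", "Maps"),
        Subtask("think_about_route"),  # not in the basis set
        Subtask("search", "coffee"),
    ]
    accepted, rejected = constrain_plan(plan)
    print([s.name for s in accepted])
    print([s.name for s in rejected])
```

In a full system, a rejected step would more plausibly trigger re-planning than be silently dropped; the filter above only illustrates the constraint itself.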
