CHOP: Mobile Operating Assistant with Constrained High-frequency Optimized Subtask Planning

Yuqi Zhou, Shuai Wang, Sunhao Dai, Qinglin Jia, Zhaocheng Du, Zhenhua Dong, Jun Xu

TL;DR

CHOP tackles the subtask planning bottleneck in VLM-driven mobile GUI assistants by introducing basis subtasks extracted from human-performed sequences. It constrains the plan agent to decompose tasks into a fixed, high-frequency basis space and uses a dedicated action agent to execute the steps of each basis subtask, with memory and Aria-UI integration to map actions to precise GUI coordinates. On English (CHOP-En) and Chinese (CHOP-ZH) datasets across 20 apps, CHOP achieves state-of-the-art effectiveness and efficiency, improving both subtask quality and overall task execution speed, while ablations confirm the value of basis subtasks and documentation. The data and code are public, and the approach is presented as a scalable route to GUI task automation that generalizes across apps and languages, with potential impact on mobile automation, AI assistants, and HCI research.

Abstract

The advancement of visual language models (VLMs) has enhanced mobile device operations, allowing simulated human-like actions to address user requirements. Current VLM-based mobile operating assistants can be structured into three levels: task, subtask, and action. The subtask level, which links high-level goals with low-level executable actions, is crucial for task completion but faces two challenges: ineffective subtasks that lower-level agents cannot execute, and inefficient subtasks that fail to contribute to the completion of the higher-level task. These challenges stem from the VLM's lack of experience in decomposing subtasks within GUI scenarios in a multi-agent architecture. To address them, we propose a new mobile assistant architecture with constrained high-frequency optimized planning (CHOP). Our approach overcomes the VLM's deficiency in GUI scenario planning by using human-planned subtasks as the basis vector. We evaluate our architecture in both English and Chinese contexts across 20 apps, demonstrating significant improvements in both effectiveness and efficiency. Our dataset and code are available at https://github.com/Yuqi-Zhou/CHOP
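
The idea of constraining a planner to a fixed basis of subtasks can be illustrated with a minimal sketch. This is not the authors' implementation: the basis entries, the `constrain_plan` function, and the example plan are all hypothetical, chosen only to show how proposed steps outside a predefined basis might be filtered out before they reach an action agent.

```python
from typing import Dict, List, Tuple

# Hypothetical basis subtasks (assumption). In the paper, the basis is
# extracted from human-performed sequences; these entries are illustrative.
BASIS_SUBTASKS: Dict[str, str] = {
    "open_app": "Open the app {app}",
    "search": "Search for {query}",
    "select_item": "Select the item {item}",
}

# A proposed step is a (basis name, arguments) pair.
Step = Tuple[str, Dict[str, str]]


def constrain_plan(proposed: List[Step]) -> Tuple[List[str], List[str]]:
    """Split a proposed plan into accepted instructions and rejected names.

    A step is accepted only if its name is in the basis and its arguments
    fill every slot of the matching template.
    """
    accepted: List[str] = []
    rejected: List[str] = []
    for name, args in proposed:
        template = BASIS_SUBTASKS.get(name)
        if template is None:
            rejected.append(name)  # not a basis subtask
            continue
        try:
            accepted.append(template.format(**args))
        except KeyError:
            rejected.append(name)  # a required slot is missing
    return accepted, rejected


# Hypothetical plan, standing in for a planner's raw output.
proposed_plan: List[Step] = [
    ("open_app", {"app": "Maps"}),
    ("think_about_route", {}),  # outside the basis, so it is dropped
    ("search", {"query": "coffee"}),
]
accepted, rejected = constrain_plan(proposed_plan)
```

In this sketch the rejected list could be fed back to the planner as a signal to re-plan; whether CHOP does anything similar is not stated in the abstract.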

Paper Structure

This paper contains 31 sections, 7 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Execution flowchart for VLM-based assistant.
  • Figure 2: Illustration of the VLM-based GUI assistant framework with basis subtask extraction.
  • Figure 3: Subtask quality comparison with and without basis subtask on matching and LLM-based evaluation.
  • Figure 4: Performances of CHOP with other methods.
  • Figure 5: SR of different methods across tasks of varying complexities, where complexity is defined by task length, with segments based on consecutive echo points.
  • ...and 1 more figure