ReachAgent: Enhancing Mobile Agent via Page Reaching and Operation

Qinzhuo Wu, Wei Liu, Jian Luan, Bin Wang

TL;DR

This work tackles the limitation of mobile AI agents that pursue single-step action accuracy by introducing MobileReach and ReachAgent, which decompose tasks into page reaching and page operation subtasks and optimize GUI flows end-to-end. The model uses a two-stage training regime with an action alignment mechanism and reinforcement learning guided by a four-level reward scheme to produce compact, task-focused GUI flows. Empirical results on MobileReach and Auto-UI show notable improvements in IoU and text accuracy at both step- and task-levels, and ablations confirm the contribution of subtasks and RL to overall task success. The approach provides a practical path toward more robust, flow-aware mobile automation and offers a valuable dataset for subtask-driven GUI understanding.

Abstract

Recently, mobile AI agents have gained increasing attention. Given a task, mobile AI agents can interact with mobile devices in multiple steps and finally form a GUI flow that solves the task. However, existing agents tend to focus on the most task-relevant elements at each step, leading to locally optimal solutions and ignoring the overall GUI flow. To address this issue, we constructed a training dataset called MobileReach, which breaks the task into page reaching and page operation subtasks. Furthermore, we propose ReachAgent, a two-stage framework that focuses on improving its task-completion abilities. It utilizes the page reaching and page operation subtasks, along with reward-based preference GUI flows, to further enhance the agent. Experimental results show that ReachAgent significantly improves the IoU Acc and Text Acc by 7.12% and 7.69% on the step-level and 4.72% and 4.63% on the task-level compared to the SOTA agent. Our data and code will be released upon acceptance.
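
To make the subtask decomposition concrete, the following is a minimal sketch of how a GUI flow could be checked against page reaching and page operation subtasks. It is an illustration only and not the authors' implementation: the `Step` type, the `subtask_completion` and `flow_completes` helpers, and the page and action identifiers are all hypothetical, and the paper's actual reward scheme is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One step of a GUI flow (hypothetical representation)."""
    page: str    # identifier of the GUI page the action was taken on
    action: str  # action performed on that page, e.g. "click:search"


def subtask_completion(flow, pages_to_reach, operations_to_do):
    """Count how many page reaching and page operation subtasks a flow covers.

    pages_to_reach: page identifiers that must appear somewhere in the flow.
    operations_to_do: (page, action) pairs that must appear somewhere in the flow.
    Returns (number of pages reached, number of operations done).
    """
    visited = {step.page for step in flow}
    performed = {(step.page, step.action) for step in flow}
    reached = sum(1 for page in pages_to_reach if page in visited)
    done = sum(1 for op in operations_to_do if op in performed)
    return reached, done


def flow_completes(flow, pages_to_reach, operations_to_do):
    """A flow completes the task (in this sketch) if every subtask is covered."""
    reached, done = subtask_completion(flow, pages_to_reach, operations_to_do)
    return reached == len(pages_to_reach) and done == len(operations_to_do)


# Hypothetical example: a three-step flow that never reaches the "cart" page.
example_flow = [
    Step("home", "click:search"),
    Step("search", "type:shoes"),
    Step("results", "click:item1"),
]
example_pages = ["home", "search", "results", "cart"]
example_ops = [("search", "type:shoes"), ("results", "click:item1")]
```

In this sketch, a flow that picks a plausible action at every step can still fail the task because a required page is never reached, which is the gap between step-level accuracy and task completion that the abstract describes.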

Paper Structure

This paper contains 26 sections, 6 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: An example of a task, its subtasks, and possible GUI flows. Green boxes represent the pages that need to be reached, and orange arrows represent the actions that need to be performed. To complete this task, the agent must reach 5 pages and perform 3 operations.
  • Figure 2: The complete 9-step GUI flow for a task. Green boxes represent the pages that need to be reached, and green circles represent the operations that need to be performed. Orange arrows are the actions in the golden flow. Blue arrows are the actions in other GUI flows. Both the orange and blue flows can complete the task.
  • Figure 3: Actions and tasks for a GUI flow. The step-by-step description provides an action history, where each step corresponds to an action performed on that GUI page. The brief task is a concise task description that aligns with this GUI flow.
  • Figure 4: Overview of the proposed ReachAgent. (a) Extracting the action space from the XML document. (b) In the first stage, the framework generates a GUI flow through multiple interaction steps with the GUI page. (c) In the second stage, it uses the reward function to construct preference data to further reinforce the SFT framework.
  • Figure 5: Two cases of GUI flows generated by ReachAgent and MobileVLM.
  • ...and 2 more figures