ReachAgent: Enhancing Mobile Agent via Page Reaching and Operation
Qinzhuo Wu, Wei Liu, Jian Luan, Bin Wang
TL;DR
This work addresses a limitation of mobile AI agents that optimize for single-step action accuracy: they lose sight of the overall GUI flow. It introduces MobileReach and ReachAgent, which decompose tasks into page reaching and page operation subtasks and optimize the GUI flow as a whole. The model uses a two-stage training regime with an action alignment mechanism and reinforcement learning guided by a four-level reward scheme to produce compact, task-focused GUI flows. Empirical results on MobileReach and Auto-UI show improvements in IoU and text accuracy at both the step and task levels, and ablations confirm that both the subtasks and RL contribute to overall task success. The approach provides a practical path toward more robust, flow-aware mobile automation and offers a dataset for subtask-driven GUI understanding.
Abstract
Recently, mobile AI agents have gained increasing attention. Given a task, a mobile AI agent can interact with a mobile device over multiple steps and finally form a GUI flow that solves the task. However, existing agents tend to focus on the most task-relevant elements at each step, which leads to locally optimal solutions and ignores the overall GUI flow. To address this issue, we construct a training dataset called MobileReach, which breaks each task into page reaching and page operation subtasks. Furthermore, we propose ReachAgent, a two-stage framework that focuses on improving the agent's task-completion abilities. It uses the page reaching and page operation subtasks, along with reward-based preference GUI flows, to further enhance the agent. Experimental results show that ReachAgent improves IoU Acc and Text Acc by 7.12% and 7.69% at the step level and by 4.72% and 4.63% at the task level compared to the SOTA agent. Our data and code will be released upon acceptance.
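The abstract reports IoU Acc without defining it here. As a minimal sketch, one common reading is that a predicted action counts as correct when the intersection over union (IoU) of its predicted bounding box and the reference box reaches a threshold. The box format `(x1, y1, x2, y2)`, the 0.5 threshold, and the function names `iou` and `iou_accuracy` are assumptions made for illustration, not details taken from the paper.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_accuracy(preds: Sequence[Box], golds: Sequence[Box],
                 threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the reference box
    is at least `threshold` (0.5 is an assumed default)."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, golds))
    return hits / len(preds)
```

Under this reading, a step-level score would average over individual actions, while a task-level score would require every step in a GUI flow to pass; the paper's exact definitions may differ.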
