Building Autonomous GUI Navigation via Agentic-Q Estimation and Step-Wise Policy Optimization
Yibo Wang, Guangda Huzhang, Yuwei Hu, Yu Xia, Shiyin Lu, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Lijun Zhang
TL;DR
This work tackles GUI navigation with Multimodal LLMs (MLLMs) under non-stationary real-world conditions, costly data curation, and sparse intermediate supervision. It introduces agentic-Q estimation, which assigns step-wise returns $G_i$ to actions, and Step-Wise Policy Optimization (SWPO), which updates the policy from self-generated, step-level feedback in a critic-free setting, enabling stable RL that is decoupled from the environment. Data-curation strategies (detection-rerunning, data stratification) and a cold-start phase (SFT on grounding data) bootstrap learning. Empirically, Ovis2.5-9B achieves state-of-the-art or competitive performance on GUI grounding and navigation benchmarks, including out-of-distribution online tasks. Overall, the framework improves data efficiency, generalization, and boundary-traversal capabilities, making it practical for real-time GUI agents operating on live websites.
Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have substantially driven the progress of autonomous agents for Graphical User Interfaces (GUIs). Nevertheless, in real-world applications, GUI agents often face non-stationary environments, which lead to high computational costs for data curation and policy optimization. In this report, we introduce a novel MLLM-centered framework for GUI agents that consists of two components: agentic-Q estimation and step-wise policy optimization. The former optimizes a Q-model that generates step-wise values to evaluate the contribution of a given action to task completion. The latter takes step-wise samples from the state-action trajectory as inputs and optimizes the policy via reinforcement learning with our agentic-Q model. Notably, (i) all state-action trajectories are produced by the policy itself, so data collection costs remain manageable; and (ii) the policy update is decoupled from the environment, ensuring stable and efficient optimization. Empirical evaluations show that our framework endows Ovis2.5-9B with strong GUI interaction capabilities, achieving strong performance on GUI navigation and grounding benchmarks and even surpassing larger-scale contenders.
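As a rough illustration of what a step-wise value signal can look like, the sketch below computes a discounted return-to-go for each step of a single trajectory whose only reward arrives at task completion. This is a generic textbook construction, not the report's definition of $G_i$: the function name `stepwise_returns`, the discount factor `gamma`, and the sparse terminal reward of 1.0 are all assumptions made for the example.

```python
from typing import List


def stepwise_returns(rewards: List[float], gamma: float = 0.9) -> List[float]:
    """Discounted return-to-go for every step of one trajectory.

    Hypothetical helper: `gamma` and the reward scheme are illustrative
    assumptions, not values taken from the report.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step accumulates the discounted future reward.
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns


# Assumed sparse supervision: only the final step (task success) is rewarded.
trajectory_rewards = [0.0, 0.0, 0.0, 1.0]
step_returns = stepwise_returns(trajectory_rewards, gamma=0.5)
print(step_returns)  # [0.125, 0.25, 0.5, 1.0]
```

Under these assumptions, earlier actions receive smaller targets than actions closer to task completion, which gives every step in the trajectory a training signal even though the environment only reports success at the end.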
