Building Autonomous GUI Navigation via Agentic-Q Estimation and Step-Wise Policy Optimization

Yibo Wang, Guangda Huzhang, Yuwei Hu, Yu Xia, Shiyin Lu, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Lijun Zhang

TL;DR

This work tackles GUI navigation with Multimodal LLMs under non-stationary real-world conditions, costly data curation, and sparse intermediate supervision. It introduces agentic-Q estimation to assign step-wise returns $G_i$ to actions and Step-Wise Policy Optimization (SWPO) to update policies using self-generated, step-level feedback in a critic-free setting, enabling decoupled, stable RL. Data-curation strategies (detection-rerunning, data stratification) and a cold-start phase (SFT on grounding data) bootstrap learning. Empirically, Ovis2.5-9B achieves state-of-the-art or competitive performance on GUI grounding and navigation benchmarks, including out-of-distribution online tasks. Overall, the framework improves data efficiency, generalization, and boundary traversal, making it practical for real-time GUI agents operating on live websites.

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have substantially driven the progress of autonomous agents for Graphical User Interfaces (GUIs). Nevertheless, in real-world applications, GUI agents often face non-stationary environments, leading to high computational costs for data curation and policy optimization. In this report, we introduce a novel MLLM-centered framework for GUI agents, which consists of two components: agentic-Q estimation and step-wise policy optimization. The former optimizes a Q-model that generates step-wise values to evaluate the contribution of a given action to task completion. The latter takes step-wise samples from the state-action trajectory as inputs and optimizes the policy via reinforcement learning with our agentic-Q model. Notably, (i) all state-action trajectories are produced by the policy itself, so data collection costs are manageable; (ii) the policy update is decoupled from the environment, ensuring stable and efficient optimization. Empirical evaluations show that our framework endows Ovis2.5-9B with powerful GUI interaction capabilities, achieving strong performance on GUI navigation and grounding benchmarks and even surpassing larger-scale contenders.
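
To make the two components more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that every step of a self-generated trajectory inherits the binary outcome of the whole trajectory as its training label for the Q-model, and that Q-scores of several candidate actions sampled at the same step are normalized within the group (a GRPO-style choice). The names `Step`, `label_steps`, and `group_relative_advantages` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Tuple


@dataclass
class Step:
    state: str   # hypothetical stand-in for a screenshot plus task context
    action: str  # hypothetical stand-in for a GUI action such as "click(search)"


def label_steps(trajectory: List[Step], success: bool) -> List[Tuple[Step, int]]:
    """Assumed labelling rule: each step inherits the trajectory outcome (1 or 0)."""
    label = 1 if success else 0
    return [(step, label) for step in trajectory]


def group_relative_advantages(q_values: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize the Q-scores of candidate actions sampled at one step.

    Assumption: a GRPO-style normalization (subtract the group mean, divide by
    the group standard deviation); the exact objective in the paper may differ.
    """
    mu = mean(q_values)
    sigma = pstdev(q_values)
    return [(q - mu) / (sigma + eps) for q in q_values]


# Step-wise samples for training a binary-classification Q-model.
trajectory = [Step("home page", "click(search)"), Step("results", "click(first)")]
samples = label_steps(trajectory, success=True)

# Advantages for three candidate actions scored by a (hypothetical) Q-model.
advantages = group_relative_advantages([0.9, 0.5, 0.1])
```

Because the advantages are computed from Q-model scores alone, a policy update built on them needs no further environment interaction, which is one way to read the claim that the update is decoupled from the environment.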

Paper Structure (13 sections, 11 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Performance comparisons of agentic Ovis2.5-9B series and other contenders.
  • Figure 2: Illustration of our framework with three stages: (i) $\text{Ovis2.5}_\text{SFT}$ is trained on grounding data and a set of expert web navigation trajectories; (ii) we collect state-action trajectories by $\text{Ovis2.5}_\text{SFT}$ itself, and train our agentic-Q model in a binary classification manner; (iii) we optimize the policy with self-generated step-wise trajectories under the guidance of our agentic-Q model.
  • Figure 3: Example of boundary traversal. The GUI agent is prompted to retrieve the price of iPhone 14 Plus at www.apple.com. However, due to recent updates, this product has been removed. For the first $34$ steps, the agent persistently searches within the site but fails to locate target information. At step $35$, it switches strategies and conducts an external search on Google, completing the task by step $39$.
  • Figure 4: Comparisons under different settings: (a) compares the average number of steps required to complete tasks in WebVoyager; (b) shows the policy entropy of $\text{Ovis2.5}_\text{S-GRPO}$ under varying sliding window sizes; (c) compares the overall performance of $\text{Ovis2.5}_\text{S-GRPO}$ across different window sizes.
  • Figure 5: Trajectory for the task on Huggingface.
  • ...and 2 more figures