Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation

Zhiyuan Hu, Shiyun Xiong, Yifan Zhang, See-Kiong Ng, Anh Tuan Luu, Bo An, Shuicheng Yan, Bryan Hooi

TL;DR

GuidNav tackles GUI navigation with visual language models by introducing a process reward model that provides step-level feedback during inference, enabling per-step action optimization instead of relying solely on end-of-trajectory evaluation. The reward model is trained on human demonstrations and VLM self-play, and it guides action selection at every inference step; trajectory-level evaluation, reflection, and retry provide further refinement. Empirical results across AitW, GUI Odyssey, and Mind2Web show consistent improvements in static per-step action accuracy and dynamic task success, with notable gains when combined with trajectory reflection. The work demonstrates strong generalization across mobile and web GUI tasks and discusses efficiency, integration with autonomous refinement, and directions for broader benchmarking and real-world deployment.

Abstract

Recent advancements in visual language models (VLMs) have notably enhanced their capabilities in handling complex Graphical User Interface (GUI) interaction tasks. Despite these improvements, current frameworks often struggle to generate correct actions in challenging GUI environments. State-of-the-art commercial VLMs are black boxes, and fine-tuning open-source VLMs for GUI tasks requires significant resources. Additionally, existing trajectory-level evaluation and refinement techniques frequently fall short due to delayed feedback and local optimization issues. To address these challenges, we propose an approach that guides VLM agents with process supervision by a reward model during GUI navigation and control at inference time. This guidance allows the VLM agent to optimize actions at each inference step, thereby improving performance in both static and dynamic environments. In particular, our method demonstrates significant performance gains in three GUI navigation tasks, achieving a 3.4% improvement in single-step action accuracy for static environments, along with an approximately 33% increase in task success rate in one dynamic environment. With further integration of trajectory reflection and retry mechanisms, we also demonstrate even greater enhancement in task success.
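
The per-step guidance described in the abstract can be illustrated with a minimal sketch. It assumes the agent proposes several candidate actions at each step and that a reward model scores each one against the task goal and the current observation. The names `select_action` and `toy_reward` are hypothetical, and the word-overlap scorer is only a stand-in for a trained process reward model; none of this is taken from the authors' implementation.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical signature: (goal, observation, action) -> scalar reward.
RewardFn = Callable[[str, str, str], float]


def select_action(
    candidates: Sequence[str],
    reward_fn: RewardFn,
    goal: str,
    observation: str,
) -> Tuple[str, float]:
    """Return the candidate action with the highest step-level reward.

    Ties are broken in favor of the earliest candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate action")
    scored = [(reward_fn(goal, observation, a), a) for a in candidates]
    best_score, best_action = max(scored, key=lambda pair: pair[0])
    return best_action, best_score


def toy_reward(goal: str, observation: str, action: str) -> float:
    """Stand-in scorer (assumption): fraction of action words found in the goal.

    A real process reward model would also use the observation (screenshot).
    """
    goal_words = set(goal.lower().split())
    action_words = set(action.lower().split())
    return len(goal_words & action_words) / max(len(action_words), 1)


# Example step: three candidate actions proposed for one screen.
candidates = ["click settings icon", "scroll down", "open accessibility settings"]
action, score = select_action(
    candidates, toy_reward, "open accessibility settings", "home screen"
)
```

Because selection happens before an action is executed, a poor candidate can be rejected at the step where it arises rather than only after the full trajectory has been evaluated.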

Paper Structure

This paper contains 31 sections, 6 equations, 3 figures, 9 tables.

Figures (3)

  • Figure 1: Overview of GuidNav.
  • Figure 2: The performance curve across different trial numbers shows the impact of refinement techniques. 'DP+AR' represents the combination of direct prompting for an action at each step, followed by AR at the end of each trajectory trial. 'GuidNav+AR' integrates TopK action selection guided by a reward model, with AR applied at the end of each trajectory trial. 'TopK+AP' refers to the TopK method integrated with AR.
  • Figure 3: Example of case study. Access the accessibility settings.