GUI-PRA: Process Reward Agent for GUI Tasks

Tao Xiong, Xavier Hu, Yurun Chen, Yuhang Liu, Changqiao Wu, Pengzhi Gao, Wei Liu, Jian Luan, Shengyu Zhang

TL;DR

The paper tackles the instability of GUI agents on long-horizon tasks by introducing GUI-PRA, a training-free supervisor that converts standard Process Reward Models into GUI-aware evaluators. It introduces a Dynamic Memory mechanism to compress long histories and an Adaptive UI Perception loop to ground evaluations in real-time visual evidence, using tools like OmniParser and Point for UI understanding. Through extensive experiments on AndroidWorld and MobileMiniWoB++, GUI-PRA yields significant improvements in success rates over both unguided and standard PRM baselines, especially on medium and hard tasks, and ablation studies confirm the necessity of its components. The work demonstrates that dynamic memory and active UI perception can substantially enhance the reliability and efficiency of automated GUI agents in dynamic environments, with implications for real-world task automation and safe AI supervision.

Abstract

Graphical User Interface (GUI) Agents powered by Multimodal Large Language Models (MLLMs) show significant potential for automating tasks. However, they often struggle with long-horizon tasks, leading to frequent failures. Process Reward Models (PRMs) are a promising solution, as they can guide these agents with crucial process signals during inference. Nevertheless, their application to the GUI domain presents unique challenges. When processing dense artificial inputs with long history data, PRMs suffer from a "lost in the middle" phenomenon, where the overwhelming historical context compromises the evaluation of the current step. Furthermore, standard PRMs lack awareness of GUI changes, providing static evaluations that are disconnected from the dynamic consequences of actions, a critical mismatch with the inherently dynamic nature of GUI tasks. In response to these challenges, we introduce GUI-PRA (Process Reward Agent for GUI Tasks), a judge agent designed to provide better process rewards than a standard PRM by intelligently processing historical context and actively perceiving UI state changes. Specifically, to directly combat the "lost in the middle" phenomenon, we introduce a dynamic memory mechanism consisting of two core components: a Relevance-based Retrieval Module to actively fetch pertinent information from long histories and a Progressive Summarization Module to dynamically condense growing interaction data, ensuring the model focuses on relevant context. Moreover, to address the lack of awareness of UI changes, we introduce an Adaptive UI Perception mechanism. This mechanism enables the agent to reason about UI state changes and dynamically select the most appropriate tool to gather grounded visual evidence, ensuring its evaluation is always informed by the current UI context.
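
The two dynamic memory components named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class `DynamicMemory`, the `window` parameter, the Jaccard word-overlap score, and the string-concatenation "summarizer" are all hypothetical stand-ins chosen so the example runs without a model. In a real system both the relevance score and the summarization step would presumably be produced by an LLM.

```python
from dataclasses import dataclass, field


def _tokens(text: str) -> set[str]:
    """Lowercased word set used by the toy relevance score."""
    return set(text.lower().split())


def relevance(query: str, entry: str) -> float:
    """Jaccard word overlap; an assumed stand-in for a learned relevance scorer."""
    q, e = _tokens(query), _tokens(entry)
    union = q | e
    return len(q & e) / len(union) if union else 0.0


@dataclass
class DynamicMemory:
    window: int = 3                                   # hypothetical: raw steps kept verbatim
    summary: str = ""                                 # running condensed summary
    recent: list[str] = field(default_factory=list)   # most recent raw steps
    archive: list[str] = field(default_factory=list)  # full history for retrieval

    def add_step(self, step: str) -> None:
        """Record a step; fold the oldest raw step into the summary on overflow."""
        self.recent.append(step)
        self.archive.append(step)
        if len(self.recent) > self.window:
            oldest = self.recent.pop(0)
            # Placeholder summarizer: simply appends the evicted step.
            self.summary = (self.summary + " | " + oldest).strip(" |")

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        """Return the k archived steps most relevant to the query."""
        ranked = sorted(self.archive, key=lambda s: relevance(query, s), reverse=True)
        return ranked[:k]


memory = DynamicMemory(window=2)
for step in ["open contacts app", "tap create contact",
             "type name Alice", "type phone 555"]:
    memory.add_step(step)

print(memory.summary)                               # condensed older steps
print(memory.retrieve("type the phone number", 1))  # most relevant past step
```

The design point the sketch tries to convey is that the evaluator never sees the full raw history: it sees a short summary of older steps, a few recent steps, and whatever past steps retrieval judges relevant to the current one.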

Paper Structure

This paper contains 30 sections, 10 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: An overview of GUI-PRA compared with a standard Process Reward Model (PRM). A standard PRM fails a GUI task due to context loss and lack of UI awareness. GUI-PRA overcomes these limitations with its Dynamic Memory and UI Tool Routing mechanisms to ensure success.
  • Figure 2: The overall workflow of GUI-PRA. (a) The Dynamic Memory module first processes the raw interaction history to generate a condensed summary. (b) Concurrently, the Adaptive UI Perception Mechanism actively reasons about the UI state to select the most appropriate tool for gathering grounded visual evidence. (c) For the final Best-of-N Selection, GUI-PRA integrates these two information streams along with the previous action and its score from the last step to evaluate and select the optimal candidate action.
  • Figure 3: A complete case of GUI-PRA guiding a GUI Agent to complete the 'ContactsNewContactDraft' task. The figure illustrates the parallel process flows, showing the agent's action trajectory (top row) and the continuous supervision provided by GUI-PRA (bottom row) across multiple steps until task completion.
  • Figure 4: A case study illustrating GUI-PRA's self-correction from an evaluation loop. The figure shows GUI-PRA assigning conflicting high scores to both the correct answer and a premature termination action (Steps 7-8), before correcting its judgment in Step 9 to successfully guide the agent to task completion.
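
The Best-of-N selection step described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: the function names `best_of_n` and `toy_score`, the argument order, and the scoring rule (word overlap with the UI evidence, minus a penalty for repeating the previous action) are hypothetical. In GUI-PRA the score would come from the judge model, which receives the memory summary, the tool-gathered UI evidence, and the previous action with its score.

```python
from typing import Callable, Sequence

# Hypothetical scorer signature:
# (candidate, summary, ui_evidence, prev_action, prev_score) -> score
Scorer = Callable[[str, str, str, str, float], float]


def best_of_n(candidates: Sequence[str], summary: str, ui_evidence: str,
              prev_action: str, prev_score: float,
              score_fn: Scorer) -> tuple[str, float]:
    """Score every candidate action and return the best one with its score."""
    if not candidates:
        raise ValueError("at least one candidate action is required")
    scores = [score_fn(c, summary, ui_evidence, prev_action, prev_score)
              for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


def toy_score(candidate: str, summary: str, ui_evidence: str,
              prev_action: str, prev_score: float) -> float:
    """Assumed toy rule: reward overlap with UI evidence, penalize repeats."""
    overlap = len(set(candidate.lower().split()) & set(ui_evidence.lower().split()))
    penalty = 1.0 if candidate == prev_action else 0.0
    return overlap - penalty


action, score = best_of_n(
    candidates=["tap Save button", "tap Cancel button", "type name field"],
    summary="opened contacts app | entered name and phone",
    ui_evidence="Save button visible at bottom",
    prev_action="type name field",
    prev_score=0.8,
    score_fn=toy_score,
)
print(action, score)
```

Passing the scorer in as a parameter keeps the selection loop independent of how the reward is produced, so the toy rule could be swapped for a model-backed judge without changing `best_of_n`.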