Test-Time Reinforcement Learning for GUI Grounding via Region Consistency

Yong Du, Yuchen Yan, Fei Tang, Zhengxi Lu, Chang Zong, Weiming Lu, Shengpei Jiang, Yongliang Shen

TL;DR

This work introduces GUI-RC, a test-time scaling method for GUI grounding that leverages region consistency across multiple model predictions to identify consensus regions and improve localization without any additional labeled data. It further extends this idea with GUI-RCPO, a test-time reinforcement learning approach that uses region-consistency signals as self-supervised rewards to iteratively refine predictions on unlabeled data. Across diverse models and GUI benchmarks (ScreenSpot, ScreenSpot-v2, ScreenSpot-Pro), GUI-RC yields consistent 2–3% gains, while GUI-RCPO delivers an additional 3–6% improvement and demonstrates strong out-of-distribution generalization. The results reveal the potential of inference-time optimization for data-efficient GUI agents, with a robust, label-free pathway to progressively enhance grounding performance through self-bootstrapping, even after initial GUI-task training. Together, GUI-RC and GUI-RCPO offer a complementary alternative to traditional train-time supervision for robust, scalable GUI automation.

Abstract

Graphical User Interface (GUI) grounding, the task of mapping natural language instructions to precise screen coordinates, is fundamental to autonomous GUI agents. While existing methods achieve strong performance through extensive supervised training or reinforcement learning with labeled rewards, they remain constrained by the cost and availability of pixel-level annotations. We observe that when models generate multiple predictions for the same GUI element, the spatial overlap patterns reveal implicit confidence signals that can guide more accurate localization. Leveraging this insight, we propose GUI-RC (Region Consistency), a test-time scaling method that constructs spatial voting grids from multiple sampled predictions to identify consensus regions where the predictions show the highest agreement. Without any training, GUI-RC improves accuracy by 2-3% across various architectures on ScreenSpot benchmarks. We further introduce GUI-RCPO (Region Consistency Policy Optimization), which transforms these consistency patterns into rewards for test-time reinforcement learning. By computing how well each prediction aligns with the collective consensus, GUI-RCPO enables models to iteratively refine their outputs on unlabeled data during inference. Extensive experiments demonstrate the generality of our approach: using only 1,272 unlabeled samples, GUI-RCPO achieves 3-6% accuracy improvements across various architectures on ScreenSpot benchmarks. Our approach reveals the untapped potential of test-time scaling and test-time reinforcement learning for GUI grounding, offering a promising path toward more data-efficient GUI agents.
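The spatial voting idea in the abstract can be illustrated with a minimal sketch. It assumes each sampled prediction is an axis-aligned pixel box `(x1, y1, x2, y2)` and that the grid has one cell per pixel. The function names `vote_grid` and `consensus_box` are hypothetical. The abstract does not say how the consensus region is extracted from the grid, so taking the bounding box of the maximum-vote cells is a simplifying assumption, not the authors' procedure.

```python
import numpy as np

def vote_grid(boxes, width, height):
    """Accumulate one vote per sampled box into a (height, width) grid."""
    grid = np.zeros((height, width), dtype=np.int32)
    for x1, y1, x2, y2 in boxes:
        # Clip to the screen; boxes are treated as half-open [x1, x2) x [y1, y2).
        x1, x2 = max(0, x1), min(width, x2)
        y1, y2 = max(0, y1), min(height, y2)
        if x1 < x2 and y1 < y2:
            grid[y1:y2, x1:x2] += 1
    return grid

def consensus_box(grid):
    """Bounding box of all cells holding the maximum vote count.

    Assumption: a simplification for illustration; returns None if no votes.
    """
    top = grid.max()
    if top == 0:
        return None
    ys, xs = np.nonzero(grid == top)
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1

# Three overlapping samples and one outlier on a 300x250 screen.
samples = [(10, 10, 50, 30), (20, 15, 60, 35), (25, 12, 55, 32), (200, 200, 240, 220)]
grid = vote_grid(samples, width=300, height=250)
x1, y1, x2, y2 = consensus_box(grid)
print((x1, y1, x2, y2))              # (25, 15, 50, 30)
print(((x1 + x2) / 2, (y1 + y2) / 2))  # (37.5, 22.5), a candidate click point
```

In this toy case the outlier box receives only its own vote, so it has no influence on the consensus region, which is the behaviour the abstract attributes to agreement across samples.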

Paper Structure

This paper contains 40 sections, 7 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overview of our test-time scaling methods for GUI grounding. Upper: GUI-RC aggregates $K$ sampled predictions through spatial voting to extract a consensus region, achieving more accurate localization than greedy decoding. Lower: GUI-RCPO computes region consistency rewards based on the voting heatmap and uses these self-supervised signals to update model parameters, enabling label-free improvement through test-time reinforcement learning.
  • Figure 2: Ablation study results on ScreenSpot-v2 with varying temperature, sampling number, and hyperparameter $\alpha$.
  • Figure 3: Accuracy (%) across training steps of Qwen2.5-VL-3B-Instruct and Qwen2.5-VL-7B-Instruct throughout GUI-RCPO.
  • Figure 4: Case studies of how GUI-RC mitigates two types of hallucinations in GUI grounding. In each row, the left image shows the Greedy Decoding result, where the blue box denotes the ground truth, and the red box denotes the model's prediction. The right image shows the spatial voting heatmap obtained after applying GUI-RC. The brighter regions reflect higher region consistency, and the green box denotes the extracted consensus region.
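The reward step described for GUI-RCPO in the Figure 1 caption (rewards computed from the voting heatmap) might look roughly like the sketch below. The exact reward formula is not given in this excerpt, so the choice here is an assumption: each sampled box is scored by the mean vote count inside it, divided by the number of samples `K`, which keeps the reward in [0, 1]. The function name `region_consistency_rewards` is hypothetical, and the policy update itself is omitted.

```python
import numpy as np

def region_consistency_rewards(boxes, width, height):
    """Score each sampled box by the average vote density inside it.

    Assumption: reward = mean(votes inside the box) / K. This is one plausible
    instance of a consistency reward, not the formula from the paper.
    """
    k = len(boxes)
    grid = np.zeros((height, width), dtype=np.float64)
    clipped = []
    for x1, y1, x2, y2 in boxes:
        # Clip to the screen; boxes are half-open [x1, x2) x [y1, y2).
        x1, x2 = max(0, x1), min(width, x2)
        y1, y2 = max(0, y1), min(height, y2)
        clipped.append((x1, y1, x2, y2))
        if x1 < x2 and y1 < y2:
            grid[y1:y2, x1:x2] += 1.0
    rewards = []
    for x1, y1, x2, y2 in clipped:
        if x1 < x2 and y1 < y2:
            rewards.append(float(grid[y1:y2, x1:x2].mean()) / k)
        else:
            rewards.append(0.0)  # degenerate or off-screen box gets no reward
    return rewards

# Two agreeing samples and one outlier: the outlier gets the lowest reward.
rewards = region_consistency_rewards(
    [(0, 0, 10, 10), (0, 0, 10, 10), (50, 50, 60, 60)], width=100, height=100
)
print(rewards)
```

Under this assumed definition, samples that fall inside the high-agreement region score higher than isolated ones, so no ground-truth box is needed to rank the samples.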