Test-Time Reinforcement Learning for GUI Grounding via Region Consistency
Yong Du, Yuchen Yan, Fei Tang, Zhengxi Lu, Chang Zong, Weiming Lu, Shengpei Jiang, Yongliang Shen
TL;DR
This work introduces GUI-RC, a test-time scaling method for GUI grounding that leverages region consistency across multiple model predictions to identify consensus regions and improve localization without any additional labeled data. It further extends this idea with GUI-RCPO, a test-time reinforcement learning approach that uses region-consistency signals as self-supervised rewards to iteratively refine predictions on unlabeled data. Across diverse models and GUI benchmarks (ScreenSpot, ScreenSpot-v2, ScreenSpot-Pro), GUI-RC yields consistent 2–3% gains, while GUI-RCPO delivers an additional 3–6% improvement and demonstrates strong out-of-distribution generalization. The results reveal the potential of inference-time optimization for data-efficient GUI agents, with a robust, label-free pathway to progressively enhance grounding performance through self-bootstrapping, even after initial GUI-task training. Together, GUI-RC and GUI-RCPO offer a complementary alternative to traditional train-time supervision for robust, scalable GUI automation.
Abstract
Graphical User Interface (GUI) grounding, the task of mapping natural language instructions to precise screen coordinates, is fundamental to autonomous GUI agents. While existing methods achieve strong performance through extensive supervised training or reinforcement learning with labeled rewards, they remain constrained by the cost and availability of pixel-level annotations. We observe that when models generate multiple predictions for the same GUI element, the spatial overlap patterns reveal implicit confidence signals that can guide more accurate localization. Leveraging this insight, we propose GUI-RC (Region Consistency), a test-time scaling method that constructs spatial voting grids from multiple sampled predictions to identify consensus regions where the model's predictions agree most. Without any training, GUI-RC improves accuracy by 2–3% across various architectures on ScreenSpot benchmarks. We further introduce GUI-RCPO (Region Consistency Policy Optimization), which transforms these consistency patterns into rewards for test-time reinforcement learning. By computing how well each prediction aligns with the collective consensus, GUI-RCPO enables models to iteratively refine their outputs on unlabeled data during inference. Extensive experiments demonstrate the generality of our approach: using only 1,272 unlabeled samples, GUI-RCPO achieves 3–6% accuracy improvements across various architectures on ScreenSpot benchmarks. Our approach reveals the untapped potential of test-time scaling and test-time reinforcement learning for GUI grounding, offering a promising path toward more data-efficient GUI agents.
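The two ideas in the abstract, spatial voting over sampled predictions and a consistency-based reward, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`build_vote_grid`, `consensus_box`, `consistency_reward`), the use of a bounding box around all maximally voted pixels as the consensus region, and the reward defined as mean vote density divided by the peak vote count are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# (x1, y1, x2, y2) in pixels; x2 and y2 are exclusive.
Box = Tuple[int, int, int, int]


def build_vote_grid(boxes: List[Box], width: int, height: int) -> List[List[int]]:
    """Count, for every pixel, how many sampled boxes cover it."""
    grid = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        # Clip each box to the screen before voting.
        for y in range(max(0, y1), min(height, y2)):
            row = grid[y]
            for x in range(max(0, x1), min(width, x2)):
                row[x] += 1
    return grid


def consensus_box(grid: List[List[int]]) -> Tuple[Optional[Box], int]:
    """Return the bounding box of the maximally voted pixels and the peak count.

    Assumption: if the peak pixels form several disconnected blobs, this
    simplification merges them into one enclosing box.
    """
    max_votes = max(max(row) for row in grid)
    if max_votes == 0:
        return None, 0
    ys = [y for y, row in enumerate(grid) if max_votes in row]
    xs = [x for row in grid for x, v in enumerate(row) if v == max_votes]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1), max_votes


def consistency_reward(box: Box, grid: List[List[int]], max_votes: int) -> float:
    """Hypothetical reward: mean votes inside the box, scaled by the peak count."""
    height, width = len(grid), len(grid[0])
    x1, y1 = max(0, box[0]), max(0, box[1])
    x2, y2 = min(width, box[2]), min(height, box[3])
    area = max(0, x2 - x1) * max(0, y2 - y1)
    if area == 0 or max_votes == 0:
        return 0.0
    total = sum(sum(grid[y][x1:x2]) for y in range(y1, y2))
    return total / (area * max_votes)


# Three sampled predictions for one instruction: two agree, one is an outlier.
samples = [(10, 10, 50, 30), (12, 12, 52, 32), (200, 100, 240, 120)]
grid = build_vote_grid(samples, width=300, height=150)
region, peak = consensus_box(grid)
rewards = [consistency_reward(b, grid, peak) for b in samples]
```

In this toy case the consensus region is the overlap of the two agreeing boxes, and the outlier receives a lower reward than either of them. A test-time RL loop in the spirit of GUI-RCPO would feed such per-sample rewards to a policy-gradient update in place of labeled rewards; that training step is not shown here.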
