GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding

Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang

TL;DR

This work tackles GUI grounding by replacing sparse binary rewards with continuous Gaussian rewards that reflect the spatial nature of GUI interactions. It introduces Gaussian point rewards for precise localization and Gaussian coverage rewards for regional targeting, coupled with an adaptive variance mechanism to handle diverse element sizes, all integrated within GRPO. Empirical results on ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro show substantial improvements over state-of-the-art RL-based methods, including large gains on high-resolution interfaces, and ablation studies confirm the necessity of both Gaussian components and adaptive variance. The findings suggest that continuous, geometry-aware reward modeling yields more robust, transferable spatial reasoning for GUI interaction tasks.

Abstract

Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Current reinforcement learning approaches use binary rewards that treat elements as hit-or-miss targets, creating sparse signals that ignore the continuous nature of spatial interactions. Motivated by human clicking behavior that naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards (GUI-G$^2$), a principled reward framework that models GUI elements as continuous Gaussian distributions across the interface plane. GUI-G$^2$ incorporates two synergistic mechanisms: Gaussian point rewards model precise localization through exponentially decaying distributions centered on element centroids, while coverage rewards assess spatial alignment by measuring the overlap between predicted Gaussian distributions and target regions. To handle diverse element scales, we develop an adaptive variance mechanism that calibrates reward distributions based on element dimensions. This framework transforms GUI grounding from sparse binary classification to dense continuous optimization, where Gaussian distributions generate rich gradient signals that guide models toward optimal interaction positions. Extensive experiments across the ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks demonstrate that GUI-G$^2$ substantially outperforms the state-of-the-art method UI-TARS-72B, with the largest improvement, 24.7%, on ScreenSpot-Pro. Our analysis reveals that continuous modeling provides superior robustness to interface variations and enhanced generalization to unseen layouts, establishing a new paradigm for spatial reasoning in GUI interaction tasks.
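
The two reward mechanisms and the adaptive variance described in the abstract can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the function names, the choice of setting each axis's standard deviation to `alpha` times the element's width or height, and the use of the Bhattacharyya coefficient as the overlap measure for the coverage reward. The default `alpha = 0.5` simply mirrors the value reported as best in the hyperparameter sensitivity figure.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical convention


def center_and_sigma(box: Box, alpha: float = 0.5):
    """Return the box centroid and a size-dependent (sigma_x, sigma_y).

    Assumption: sigma scales linearly with element width/height via alpha.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    if w <= 0 or h <= 0:
        raise ValueError("box must have positive width and height")
    return ((x1 + x2) / 2, (y1 + y2) / 2), (alpha * w, alpha * h)


def gaussian_point_reward(pred_xy, target: Box, alpha: float = 0.5) -> float:
    """Reward that decays exponentially with distance from the target centroid."""
    (cx, cy), (sx, sy) = center_and_sigma(target, alpha)
    px, py = pred_xy
    z = (px - cx) ** 2 / (2 * sx ** 2) + (py - cy) ** 2 / (2 * sy ** 2)
    return math.exp(-z)


def gaussian_coverage_reward(pred: Box, target: Box, alpha: float = 0.5) -> float:
    """Overlap of two axis-aligned Gaussians fitted to the predicted and target boxes.

    Assumption: overlap is measured with the Bhattacharyya coefficient,
    which is 1 for identical distributions and approaches 0 as they separate.
    """
    (pm, ps), (tm, ts) = center_and_sigma(pred, alpha), center_and_sigma(target, alpha)
    coeff = 1.0
    for m1, s1, m2, s2 in zip(pm, ps, tm, ts):  # independent x and y axes
        var_sum = s1 ** 2 + s2 ** 2
        coeff *= math.sqrt(2 * s1 * s2 / var_sum)
        coeff *= math.exp(-((m1 - m2) ** 2) / (4 * var_sum))
    return coeff


if __name__ == "__main__":
    target = (100.0, 100.0, 200.0, 140.0)
    print(gaussian_point_reward((150.0, 120.0), target))  # click on the centroid
    print(gaussian_point_reward((210.0, 120.0), target))  # near miss, still rewarded
    print(gaussian_coverage_reward((110.0, 100.0, 210.0, 140.0), target))
```

Under these assumptions a click just outside the element still earns a nonzero reward, unlike a binary hit-or-miss signal, and because sigma grows with element size, the same pixel offset is penalized more heavily on a small icon than on a large panel.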

Paper Structure

This paper contains 30 sections, 8 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: GUI grounding performance and human click behavior. Left: Performance comparison of various models on ScreenSpot-Pro. Right: Human click distribution from AITW (Rawles et al., 2023) reveals natural Gaussian patterns around target centers ($\mu=0.111$, $\sigma=0.429$), validating our design choice of continuous Gaussian rewards over discrete binary feedback.
  • Figure 2: Comparison of reward modeling strategies. (a-c) Existing methods treat GUI elements as abstract points with binary or distance-based rewards, while (d) our Gaussian approach provides continuous point and coverage rewards that naturally align with human clicking behavior.
  • Figure 3: GUI Gaussian Grounding Rewards (GUI-G$^2$). Our framework transforms GUI grounding through continuous Gaussian modeling. Given a task instruction and screenshot, the policy model generates multiple predictions that are evaluated using our dual reward mechanism. Gaussian Point Rewards assess localization precision while Gaussian Coverage Rewards measure spatial overlap, together providing dense learning signals that guide policy optimization.
  • Figure 4: Reward comparison analysis. Left: Training dynamics of sparse reward variants (Point, IoU, Point+IoU) showing reward standard deviation and convergence patterns. Right: Distance to target center over training steps, where Gaussian rewards demonstrate monotonic convergence while sparse rewards exhibit erratic fluctuations.
  • Figure 5: Hyperparameter sensitivity analysis for adaptive sigma ($\sigma$). Performance peaks at $\alpha = 0.5$ with 93.3% accuracy on ScreenSpot-v2.
  • ...and 6 more figures
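
The Figure 3 caption describes sampling multiple predictions per instruction and scoring each one, and the TL;DR notes that the rewards are integrated within GRPO. A minimal sketch of the group-relative step follows, assuming the commonly used formulation in which each reward is standardized against the other samples in its group. The function name, the use of the population standard deviation, and the `eps` constant are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of sampled predictions.

    Assumption: advantage_i = (r_i - mean) / (std + eps), with population std.
    """
    mu = mean(rewards)
    sd = pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


if __name__ == "__main__":
    # Binary rewards when every sampled prediction misses: no signal at all.
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
    # Continuous rewards for the same misses still rank the predictions.
    print(group_relative_advantages([0.05, 0.30, 0.10, 0.60]))
```

When every sample in a group receives the same binary reward, all advantages are zero and the group contributes no learning signal. Continuous rewards keep the samples distinguishable even when none of them lands inside the target.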