ReGUIDE: Data Efficient GUI Grounding via Spatial Reasoning and Search

Hyunseok Lee, Jeonghoon Kim, Beomjun Kim, Jihoon Tack, Chansong Jo, Jaehong Lee, Cheonbok Park, Sookyo In, Jinwoo Shin, Kang Min Yoo

TL;DR

ReGUIDE tackles the data inefficiency of pixel-level GUI grounding for Multimodal LLM agents by combining self-generated reasoning, learned through online reinforcement learning, with spatial priors that enforce equivariance under input transformations. Training has two stages: the model first learns to explain its localization using Group Relative Policy Optimization (GRPO) with a tailored reward, and is then trained for global-local consistency across transformed views. At test time, a spatial search based on kernel density estimation (KDE) crops the image and aggregates multiple predictions to locate GUI elements more robustly. Empirically, ReGUIDE achieves state-of-the-art grounding performance on ScreenSpot benchmarks using only ~0.2% of the data needed by comparable baselines, and these gains carry over to improved agentic task success. The work points to extending planning hierarchies and incorporating safety measures as next steps toward real-world deployment.
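The reward and GRPO step mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the binary point-in-box reward follows the Figure 1 caption, the within-group normalisation is the standard GRPO advantage, and the function names, the use of the population standard deviation, the `eps` constant, and the sample coordinates are all assumptions made for the example.

```python
from statistics import mean, pstdev


def point_in_box_reward(point, box):
    """Return 1.0 if (x, y) lies inside box = (left, top, right, bottom), else 0.0."""
    x, y = point
    left, top, right, bottom = box
    return 1.0 if left <= x <= right and top <= y <= bottom else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise rewards within one rollout group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rollouts for one instruction: four sampled click points.
gt_box = (100, 200, 180, 240)
rollouts = [(120, 210), (300, 400), (175, 239), (90, 220)]
rewards = [point_in_box_reward(p, gt_box) for p in rollouts]
advantages = group_relative_advantages(rewards)
```

When every rollout in a group receives the same reward, all advantages are zero, so that group contributes no learning signal in this formulation.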

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have enabled autonomous agents to interact with computers via Graphical User Interfaces (GUIs), where accurately localizing the coordinates of interface elements (e.g., buttons) is often required for fine-grained actions. However, this remains significantly challenging, leading prior works to rely on large-scale web datasets to improve grounding accuracy. In this work, we propose Reasoning Graphical User Interface Grounding for Data Efficiency (ReGUIDE), a novel and effective framework for web grounding that enables MLLMs to learn in a data-efficient manner through self-generated reasoning and spatial-aware criticism. More specifically, ReGUIDE learns to (i) self-generate a language reasoning process for the localization via online reinforcement learning, and (ii) criticize the prediction using spatial priors that enforce equivariance under input transformations. At inference time, ReGUIDE further boosts performance through a test-time scaling strategy, which combines spatial search with coordinate aggregation. Our experiments demonstrate that ReGUIDE significantly advances web grounding performance across multiple benchmarks, outperforming baselines with substantially fewer training data points (e.g., only 0.2% of the samples used by the best open-source baselines).
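One way to picture the test-time strategy of spatial search plus coordinate aggregation is the sketch below. It is a hedged illustration only: the Gaussian-kernel density vote, the fixed `bandwidth`, the clamped crop, and the plain mean used for aggregation are assumptions chosen for simplicity, and the paper's actual procedure may differ. The mapping from crop coordinates back to full-image coordinates is the translation that an equivariant predictor should respect under cropping.

```python
import math


def kde_mode(points, bandwidth=20.0):
    """Return the sampled point with the highest Gaussian-kernel density."""
    def density(p):
        return sum(
            math.exp(-((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) / (2 * bandwidth ** 2))
            for q in points
        )
    return max(points, key=density)


def crop_around(center, image_size, crop_size):
    """Crop box (left, top, right, bottom) centred on `center`, clamped to the image."""
    w, h = image_size
    cw, ch = min(crop_size[0], w), min(crop_size[1], h)
    left = min(max(center[0] - cw / 2, 0), w - cw)
    top = min(max(center[1] - ch / 2, 0), h - ch)
    return (left, top, left + cw, top + ch)


def crop_to_full(point, crop_box):
    """Translate a point predicted on the crop back to full-image coordinates."""
    return (point[0] + crop_box[0], point[1] + crop_box[1])


def mean_point(points):
    """Coordinate-wise mean; a stand-in for the aggregation step."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


# Hypothetical predictions sampled on the full screenshot (one outlier).
full_preds = [(900, 500), (904, 498), (898, 503), (300, 200)]
center = kde_mode(full_preds)
box = crop_around(center, image_size=(1920, 1080), crop_size=(400, 400))

# Hypothetical predictions sampled again on the cropped view.
crop_preds = [(201, 199), (199, 203)]
final = mean_point([crop_to_full(p, box) for p in crop_preds])
```

Picking the density mode rather than the mean in the first step keeps a stray prediction such as `(300, 200)` from dragging the crop away from the main cluster.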

Paper Structure (27 sections, 6 equations, 7 figures, 13 tables)


Figures (7)

  • Figure 1: Overview of training ReGUIDE. Left: The model rolls out multiple (reasoning, coordinate) pairs and is rewarded when the predicted point falls inside the ground-truth box. Right: The model is trained on paired full-image and cropped views, sharing the reasoning tokens while adjusting the coordinate, which provides multi-view consistency and improves grounding performance.
  • Figure 2: The likelihood of coordinate tokens naturally follows a Gaussian-like spread.
  • Figure 3: Ablation studies on grounding accuracy (%) w.r.t. (a) crop size $W_{\mathtt{RoI}}$, (b) number of sampled predictions $N$, and (c) decoding temperature $T$.
  • Figure 4: Attention map between text tokens. The x-axis shows key tokens and the y-axis shows query tokens. In the image, I denotes the instruction, R the reasoning rationale, and A the answer.
  • Figure 5: Grounding accuracy (%) on ScreenSpot [cheng2024screenspot] as the UGround [gou2024uground] training set grows from 3.2K to 20K samples. Models are trained from Qwen-2.5-VL-3B.
  • ...and 2 more figures