\textsc{GUI-Spotlight}: Adaptive Iterative Focus Refinement for Enhanced GUI Visual Grounding

Bin Lei, Nuo Xu, Ali Payani, Mingyi Hong, Chunhua Liao, Yu Cao, Caiwen Ding

TL;DR

The paper tackles the challenge of robust, pixel-level visual grounding for GUI agents on high-resolution screens. It introduces GUI-Spotlight, a think-with-image model that iteratively narrows attention using crop, extract, and find_color tools within a three-stage GSPO-based reinforcement learning framework, trained on a curated high-resolution dataset. The approach achieves strong data efficiency, delivering 52.8% accuracy on ScreenSpot-Pro with only 18.5K training samples and competitive results across UI-Vision and OS-wide benchmarks, while improving training stability and cross-domain transfer. This work significantly advances reliable pointer-level actions in real-world GUI automation and provides practical guidance for building coordinated, tool-augmented grounding agents.

Abstract

Multimodal large language models (MLLMs) have markedly expanded the competence of graphical user-interface (GUI) systems, propelling them beyond controlled simulations into complex, real-world environments across diverse platforms. However, practical usefulness is still bounded by the reliability of visual grounding, i.e., mapping textual references to exact on-screen elements. This limitation prevents the system from accurately performing pointer-level actions such as clicking or dragging. To address it, we introduce GUI-Spotlight -- a model trained for image-grounded reasoning that dynamically invokes multiple specialized tools to iteratively narrow its focus to the relevant region of the screen, thereby substantially improving visual grounding accuracy. On the ScreenSpot-Pro benchmark, GUI-Spotlight trained with only 18.5K training samples achieves 52.8\% accuracy, surpassing V2P-7B (50.6\% with 9.6M training samples) and GTA-1-7B (50.1\% with 1.56M training samples).
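
The iterative narrowing described above implies that a location found inside a nested crop must eventually be expressed in full-screenshot pixels before a click can be issued. The following is a minimal sketch of that bookkeeping only, not the authors' implementation. The names `Box`, `clamp_box`, and `to_screen_coords` are hypothetical, and the sketch assumes crops are axis-aligned and are not rescaled between steps.

```python
from typing import List, Tuple

# Hypothetical representation: (left, top, right, bottom) in parent-image pixels.
Box = Tuple[int, int, int, int]


def clamp_box(box: Box, width: int, height: int) -> Box:
    """Clip a requested crop box to the bounds of the parent image."""
    left, top, right, bottom = box
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    if left >= right or top >= bottom:
        raise ValueError("crop box does not overlap the parent image")
    return left, top, right, bottom


def to_screen_coords(crops: List[Box], point: Tuple[int, int]) -> Tuple[int, int]:
    """Map a point in the innermost crop back to full-screenshot pixels.

    `crops` is ordered outermost first; each box is expressed in the
    coordinates of the image it was cropped from (assumption: no rescaling).
    """
    x, y = point
    for left, top, _, _ in reversed(crops):
        x, y = x + left, y + top
    return x, y


# Hypothetical two-step refinement on a 3840x2160 screenshot.
first = clamp_box((1900, 1000, 4000, 2160), 3840, 2160)
second = clamp_box((500, 300, 900, 600), first[2] - first[0], first[3] - first[1])
click = to_screen_coords([first, second], (120, 45))
print(click)  # (2520, 1345)
```

If a pipeline resizes each crop before showing it to the model, the offsets would also need to be divided by the per-step scale factor, which this sketch deliberately leaves out.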

Paper Structure

This paper contains 23 sections, 6 equations, 5 figures, 8 tables, 1 algorithm.

Figures (5)

  • Figure 1: GUI-Spotlight pipeline. Orange text denotes the user’s original input; blue text indicates the image provided in each dialogue turn; red text indicates the command generated by the model in that turn. Red boxes highlight the newly cropped images produced by the model’s command.
  • Figure 2: ScreenSpot-Pro accuracy over training.
  • Figure 3: Left: Impact of different RL variants. Right: A comparison of algorithm training dynamics. [symbol] denotes discarded. Items ①–⑦ are described in the first paragraph of Section \ref{sec:RL_selection}.
  • Figure 4: Left: Comparison of dense and sparse Answer rewards. Right: Comparison of different Crop/Extract reward ratios. $b_m$: $\text{bonus}_{\max}$.
  • Figure 5: Comparison of multi-step reasoning strategies. UI-TARS-1.5-7B is used as the initial model: ① multi-turn conversational inference; ② repeated single-turn inference; ③ GUI-Spotlight.