Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding

Zhiyuan Jiang, Shenghao Xie, Wenyi Li, Wenqiang Zu, Peihang Li, Jiahao Qiu, Siqi Pei, Lei Ma, Tiejun Huang, Mengdi Wang, Shilong Liu

TL;DR

This work treats zoom as a training-free prior for GUI grounding that improves fine-grained element localization. The proposed ZoomClick method uses a three-stage process (pre-zoom, iterative shrinking, and early termination) together with a fixed shrink ratio and a minimum crop size to exploit spatial priors consistently across models and datasets. To study when zoom helps and when it hurts, the authors introduce GUIZoom-Bench, a behavior-based benchmark that characterizes zoom behavior along difficulty and reliability dimensions. Experiments show substantial improvements on ScreenSpot-Pro and UI-Vision, including for smaller backbones, and the paper provides extensive ablations and implementation details to guide future zoom-aware GUI grounding research. Together, ZoomClick and GUIZoom-Bench offer a method and a diagnostic framework for zoom-centered GUI grounding.

Abstract

Grounding is a fundamental capability for building graphical user interface (GUI) agents. Although existing approaches rely on large-scale bounding box supervision, they still face various challenges, such as cross-platform generalization, complex layout analysis, and fine-grained element localization. In this paper, we investigate zoom as a strong yet underexplored prior for GUI grounding, and propose a training-free method, ZoomClick. By characterizing four key properties of zoom (i.e., pre-zoom, depth, shrink size, minimal crop size), we unlock its full capabilities for dynamic spatial focusing and adaptive context switching. Experiments demonstrate that our method significantly boosts the performance of both general vision-language and specialized GUI grounding models, achieving state-of-the-art results on several mainstream benchmarks; for example, UI-Venus-72B attains a 73.1% success rate on ScreenSpot-Pro. Furthermore, we present GUIZoom-Bench, a benchmark for evaluating model adaptability to zoom, aiming to inspire future research on improving zoom for further training and test-time scaling in GUI grounding tasks.
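The iterative zoom idea described above can be illustrated with a minimal geometric sketch. This is not the authors' implementation: the function names, the `predict` callable (a stand-in for any grounding model that returns a click position relative to the crop it is shown), and the default values for `shrink_ratio`, `min_crop_size`, and `max_depth` are all assumptions chosen for illustration. The pre-zoom stage is omitted because the abstract does not specify how it works.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in full-image pixels
Point = Tuple[float, float]


def shrink_box(box: Box, center: Point, ratio: float, full: Box) -> Box:
    """Shrink `box` by `ratio` around `center`, shifting it to stay inside `full`."""
    w = (box[2] - box[0]) * ratio
    h = (box[3] - box[1]) * ratio
    left = min(max(center[0] - w / 2, full[0]), full[2] - w)
    top = min(max(center[1] - h / 2, full[1]), full[3] - h)
    return (left, top, left + w, top + h)


def to_full_coords(box: Box, rel: Point) -> Point:
    """Map a crop-relative point in [0, 1]^2 back to full-image pixels."""
    return (box[0] + rel[0] * (box[2] - box[0]),
            box[1] + rel[1] * (box[3] - box[1]))


def zoom_and_click(
    width: int,
    height: int,
    predict: Callable[[Box], Point],  # hypothetical model wrapper (assumption)
    shrink_ratio: float = 0.5,        # illustrative value, not taken from the paper
    min_crop_size: float = 768.0,     # illustrative value, not taken from the paper
    max_depth: int = 4,               # illustrative value, not taken from the paper
) -> Point:
    """Repeatedly re-query the model on a smaller crop centred on its last answer."""
    full = (0.0, 0.0, float(width), float(height))
    box = full
    point = to_full_coords(box, predict(box))  # depth 1: the full screenshot
    for _ in range(max_depth - 1):
        next_w = (box[2] - box[0]) * shrink_ratio
        next_h = (box[3] - box[1]) * shrink_ratio
        # Stop early once the next viewport would fall below the lower bound.
        if min(next_w, next_h) < min_crop_size:
            break
        box = shrink_box(box, point, shrink_ratio, full)
        point = to_full_coords(box, predict(box))
    return point
```

In a real pipeline, `predict` would crop the screenshot to `box`, resize the crop to the model's input resolution, run the grounding model, and return the predicted click as fractions of the crop width and height. Keeping the bookkeeping in full-image pixels means the final point can be clicked directly, whatever depth the loop stops at.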

Paper Structure

This paper contains 38 sections, 7 equations, 10 figures, 12 tables, 1 algorithm.

Figures (10)

  • Figure 1: Left: Performance of existing GUI grounding methods on the ScreenSpot-Pro (Li et al., 2025) benchmark, where GTA1-32B (Yang et al., 2025) serves as the previous state-of-the-art method. Right: Comparison of methods on our proposed GUIZoom-Bench.
  • Figure 2: Model framework of ZoomClick. Min_crop_size represents the lower bound of the viewport during Iterative Narrowing.
  • Figure 3: Data examples of each category in GUIZoom-Bench.
  • Figure 4: Data organization in our proposed GUIZoom-Bench.
  • Figure 5: Qualitative results of ZoomClick on ScreenSpot-Pro. Left: depth-2 zoom resolves the error from depth 1. Center: depth-3 zoom resolves the errors from depths 1–2. Right: depth-4 zoom resolves the errors from depths 1–3.
  • ...and 5 more figures