Improving GUI Grounding with Explicit Position-to-Coordinate Mapping

Suyuchen Wang, Tianyu Zhang, Ahmed Masry, Christopher Pal, Spandana Gella, Bang Liu, Perouz Taslakian

TL;DR

This work tackles GUI grounding by addressing its core problem: implicit position-to-pixel mappings generalize poorly to high-resolution displays. It introduces Ruler tokens, explicit coordinate markers that the model can look up and copy, which turns coordinate prediction into a retrieval task with simple bounded adjustments. It also proposes Interleaved Multidimensional Rotary Positional Embedding (I-MRoPE), which distributes frequency components evenly across spatial dimensions so that width and height receive balanced representations. Empirically, the approach yields consistent gains on ScreenSpot, ScreenSpot-V2, and ScreenSpot-Pro, with the largest improvements on the high-resolution benchmarks, and it incurs minimal overhead, which makes it practical for GUI automation. Overall, explicit spatial guidance and balanced positional encoding improve pixel-precise grounding across diverse resolutions and platforms.
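
To make the "retrieval plus bounded adjustment" idea concrete, here is a minimal sketch of how a pixel coordinate could be split into a nearby marker value and a small offset. It is an illustration under stated assumptions, not the paper's implementation: the function names, the choice of markers at multiples of the interval starting from 0, and the nearest-marker rule are all hypothetical.

```python
def ruler_positions(length: int, interval: int) -> list[int]:
    """Pixel values labelled by hypothetical ruler markers along one axis.

    Assumption: markers sit at 0, interval, 2*interval, ... below `length`.
    """
    return list(range(0, length, interval))


def encode(target: int, length: int, interval: int) -> tuple[int, int]:
    """Split a pixel coordinate into (nearest marker, offset from that marker)."""
    markers = ruler_positions(length, interval)
    # "Retrieval" step: pick the marker closest to the target.
    nearest = min(markers, key=lambda m: abs(m - target))
    # "Adjustment" step: the residual is always smaller than one interval.
    return nearest, target - nearest


def decode(marker: int, offset: int) -> int:
    """Recover the pixel coordinate from a marker and its offset."""
    return marker + offset


if __name__ == "__main__":
    marker, offset = encode(1234, length=3840, interval=100)
    print(marker, offset, decode(marker, offset))
```

In this sketch the size of the adjustment depends only on the marker interval and not on the screen resolution, which is one way to read the claim that the adjustment stays bounded on larger displays.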

Abstract

GUI grounding, the task of mapping natural-language instructions to pixel coordinates, is crucial for autonomous agents, yet remains difficult for current VLMs. The core bottleneck is reliable patch-to-pixel mapping, which breaks when extrapolating to high-resolution displays unseen during training. Current approaches generate coordinates as text tokens directly from visual features, forcing the model to infer complex position-to-pixel mappings implicitly; as a result, accuracy degrades and failures proliferate on new resolutions. We address this with two complementary innovations. First, RULER tokens serve as explicit coordinate markers, letting the model reference positions similar to gridlines on a map and adjust rather than generate coordinates from scratch. Second, Interleaved MRoPE (I-MRoPE) improves spatial encoding by ensuring that width and height dimensions are represented equally, addressing the asymmetry of standard positional schemes. Experiments on ScreenSpot, ScreenSpot-V2, and ScreenSpot-Pro show consistent gains in grounding accuracy, with the largest improvements on high-resolution interfaces. By providing explicit spatial guidance rather than relying on implicit learning, our approach enables more reliable GUI automation across diverse resolutions and platforms.
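
The asymmetry that I-MRoPE targets can be illustrated with a small sketch. It assumes a standard RoPE frequency schedule and compares two hypothetical ways of assigning frequency indices to the temporal, height, and width axes: equal contiguous chunks versus round-robin interleaving. The chunk sizes, the axis order, the base of 10000, and all function names are assumptions for illustration; the paper's exact layout may differ.

```python
import math

AXES = ("t", "h", "w")  # temporal, height, width


def rope_frequencies(n_pairs: int, base: float = 10000.0) -> list[float]:
    """Standard RoPE schedule: frequency i is base ** (-i / n_pairs)."""
    return [base ** (-i / n_pairs) for i in range(n_pairs)]


def contiguous_assignment(n_pairs: int) -> list[str]:
    """Assumed baseline: each axis owns one contiguous block of frequencies."""
    chunk = n_pairs // len(AXES)
    return [AXES[min(i // chunk, len(AXES) - 1)] for i in range(n_pairs)]


def interleaved_assignment(n_pairs: int) -> list[str]:
    """Assumed interleaving: frequency i goes to axis i mod 3."""
    return [AXES[i % len(AXES)] for i in range(n_pairs)]


def mean_log_freq(freqs: list[float], assignment: list[str]) -> dict[str, float]:
    """Average log10 frequency per axis, a rough summary of the band it sees."""
    result = {}
    for axis in AXES:
        logs = [math.log10(f) for f, a in zip(freqs, assignment) if a == axis]
        result[axis] = sum(logs) / len(logs)
    return result


if __name__ == "__main__":
    freqs = rope_frequencies(12)
    print("contiguous :", mean_log_freq(freqs, contiguous_assignment(12)))
    print("interleaved:", mean_log_freq(freqs, interleaved_assignment(12)))
```

Under the contiguous layout, height and width land in different frequency bands, so one axis gets mostly higher frequencies than the other. Under interleaving, both axes sample the whole spectrum, and the gap between their average frequencies shrinks.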

Paper Structure

This paper contains 23 sections, 8 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: A comparison between traditional direct positional embedding-to-pixel coordinate mapping and Ruler's explicit coordinate mapping.
  • Figure 2: Model architecture. Our framework augments vision-language models with two key innovations: (1) Ruler tokens that provide explicit position-to-coordinate mappings, transforming coordinate prediction from regression to retrieval, and (2) I-MRoPE that rebalances positional embeddings by interleaving frequency components across spatial dimensions, ensuring equal representational capacity for width and height, and
  • Figure 3: Ablation study on Ruler token intervals $s$ across different benchmarks and training settings.
  • Figure 4: Analysis of the ratio of the number of Ruler tokens to the number of image tokens under common mobile phone and computer screen resolutions for different Ruler intervals. All numbers are in percentages (%).
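
The kind of ratio reported in Figure 4 can be approximated with a back-of-the-envelope calculation. The sketch below is not the paper's accounting. It assumes that each image token covers a square of `pixels_per_token` pixels (28 is a placeholder value) and that a single sequence of Ruler tokens spans the longer screen side at interval `interval`. Both assumptions, and the function name, are hypothetical.

```python
import math


def ruler_overhead_percent(width: int, height: int, interval: int,
                           pixels_per_token: int = 28) -> float:
    """Ruler tokens as a percentage of image tokens, under the stated assumptions."""
    # Assumption: the image is tiled into square regions, one token each.
    image_tokens = (math.ceil(width / pixels_per_token)
                    * math.ceil(height / pixels_per_token))
    # Assumption: one marker per `interval` pixels along the longer side.
    ruler_tokens = math.ceil(max(width, height) / interval)
    return 100.0 * ruler_tokens / image_tokens


if __name__ == "__main__":
    for size in [(1080, 2400), (1920, 1080), (3840, 2160)]:
        for interval in (50, 100, 200):
            pct = ruler_overhead_percent(*size, interval)
            print(f"{size[0]}x{size[1]} interval={interval}: {pct:.2f}%")
```

Under these assumptions the image token count grows with screen area while the Ruler token count grows only with side length, so the relative overhead falls as resolution increases.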