POINTS-GUI-G: GUI-Grounding Journey

Zhongyin Zhao, Yuan Liu, Yikun Liu, Haicheng Wang, Le Tian, Xiao Zhou, Yangxiu You, Zilin Yu, Yang Yu, Jie Zhou

TL;DR

This work addresses GUI grounding for vision-language GUI agents by building POINTS-GUI-G-8B from a base model with minimal grounding ability (POINTS-1.5) and optimizing the full GUI grounding pipeline. It introduces a three-pillar approach (Refined Data Engineering, Improved Training Strategies, and Reinforcement Learning with Verifiable Rewards, RLVR) to push state-of-the-art performance across desktop, mobile, and web benchmarks. Key contributions include a unified GUI data pipeline with noise filtering and complexity control, resolution-aware training that closes the resolution gap between training and inference, and a GRPO-based RL framework with verifiable rewards for precise element localization. The results demonstrate robust gains across five benchmarks and offer a compact, open-source 8B model that serves as a strong foundation for end-to-end GUI agent development.

Abstract

The rapid advancement of vision-language models has catalyzed the emergence of GUI agents, which hold immense potential for automating complex tasks, from online shopping to flight booking, thereby alleviating the burden of repetitive digital workflows. As a foundational capability, GUI grounding is typically established as a prerequisite for end-to-end task execution. It enables models to precisely locate interface elements, such as text and icons, to perform accurate operations like clicking and typing. Unlike prior works that fine-tune models already possessing strong spatial awareness (e.g., Qwen3-VL), we aim to master the full technical pipeline by starting from a base model with minimal grounding ability, such as POINTS-1.5. We introduce POINTS-GUI-G-8B, which achieves state-of-the-art performance with scores of 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision. Our model's success is driven by three key factors: (1) Refined Data Engineering, involving the unification of diverse open-source dataset formats alongside sophisticated strategies for augmentation, filtering, and difficulty grading; (2) Improved Training Strategies, including continuous fine-tuning of the vision encoder to enhance perceptual accuracy and maintaining resolution consistency between training and inference; and (3) Reinforcement Learning (RL) with Verifiable Rewards. While RL is traditionally used to bolster reasoning, we demonstrate that it significantly improves precision in the perception-intensive GUI grounding task. Furthermore, GUI grounding provides a natural advantage for RL, as rewards are easily verifiable and highly accurate.
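
To make the "easily verifiable" reward concrete, the following is a minimal sketch and not the authors' implementation. It assumes the model predicts a click point in normalized coordinates and that the reward is 1 when the point falls inside the ground-truth bounding box and 0 otherwise; the exact reward used in the paper is not stated in this section. The second function shows the standard GRPO-style group-relative normalization of rewards across several sampled responses to the same prompt. The names `point_in_box_reward` and `group_relative_advantages` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), normalized to [0, 1]


def point_in_box_reward(pred: Point, gt_box: Box) -> float:
    """Assumed binary reward: 1.0 if the predicted click lies inside the target box."""
    x, y = pred
    x_min, y_min, x_max, y_max = gt_box
    return 1.0 if (x_min <= x <= x_max and y_min <= y <= y_max) else 0.0


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: standardize each reward within its sampled group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    # eps keeps the division finite when every sample gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    gt = (0.40, 0.40, 0.60, 0.60)
    samples = [(0.50, 0.50), (0.10, 0.10), (0.45, 0.58), (0.90, 0.50)]
    rewards = [point_in_box_reward(p, gt) for p in samples]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Because the reward is computed directly from the annotation, no learned reward model is needed, which is the property the abstract highlights. When all samples in a group receive the same reward, the advantages collapse to zero and that group contributes no learning signal.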

Paper Structure (29 sections, 4 equations, 10 figures, 5 tables)


Figures (10)

  • Figure 1: Comparison with existing models of comparable size on ScreenSpot-v2 [wu2024atlas], ScreenSpot-Pro [li2025screenspotpro], MMBench-GUI-L2 [Wang2025MMBenchGUIHM], and OSWorld-G [xie2025scalingcomputerusegroundinguser].
  • Figure 2: The three-stage data engineering pipeline: Data preprocessing, data filtering, and complexity enhancement.
  • Figure 3: Reformatted samples from existing open-source datasets. All spatial annotations are unified to a $[0, 1]$ scale.
  • Figure 4: Some samples filtered by our proposed filtering strategy. We place the original annotation onto the image.
  • Figure 5: Easy data samples from existing datasets filtered out by our complexity enhancement strategy.
  • ...and 5 more figures
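
The coordinate unification described in the Figure 3 caption can be sketched as follows. This is an illustrative assumption, not the paper's pipeline: it takes a pixel-space box in `(x_min, y_min, x_max, y_max)` form together with the screenshot size and maps it to the $[0, 1]$ scale. The function name `normalize_box` and the clamping and corner-ordering behavior are hypothetical choices.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def normalize_box(box_px: Box, width: int, height: int) -> Box:
    """Map a pixel-space box to [0, 1] coordinates relative to the image size."""
    if width <= 0 or height <= 0:
        raise ValueError("image width and height must be positive")
    x0, y0, x1, y1 = box_px
    # Reorder corners in case the source dataset stores them swapped.
    x_lo, x_hi = sorted((x0 / width, x1 / width))
    y_lo, y_hi = sorted((y0 / height, y1 / height))

    def clamp(v: float) -> float:
        # Clamp annotations that spill slightly past the image border.
        return min(1.0, max(0.0, v))

    return (clamp(x_lo), clamp(y_lo), clamp(x_hi), clamp(y_hi))


if __name__ == "__main__":
    print(normalize_box((192, 108, 960, 540), width=1920, height=1080))
```

Storing annotations on a resolution-independent scale lets samples from datasets with different screenshot sizes share one target format.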