GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents

Yuqi Zhou; Sunhao Dai; Shuai Wang; Kaiwen Zhou; Qinglin Jia; Jun Xu

TL;DR

This work critically reexamines R1-Zero-like training for GUI grounding and identifies three core challenges: excessive internal reasoning encouraged by templates, reward hacking between hit and IoU objectives, and GRPO-induced biases toward easier samples. It proposes three targeted remedies (a Fast Thinking Template, a box-size reward constraint, and a difficulty-weighted GRPO objective with no length normalization) and demonstrates their effectiveness in GUI-G1-3B, a model trained on only 17K public samples. GUI-G1-3B achieves state-of-the-art grounding on ScreenSpot and ScreenSpot-Pro, outperforming larger models and prior R1-style agents while using far fewer tokens and training stages. The findings highlight the importance of task-specific RL design for GUI grounding and point toward efficient, data-sparing multimodal grounding systems for real-world GUI interaction.

Abstract

Recent Graphical User Interface (GUI) agents replicate the R1-Zero paradigm, coupling online Reinforcement Learning (RL) with explicit chain-of-thought reasoning prior to object grounding and thereby achieving substantial performance gains. In this paper, we first conduct an extensive analysis of three key components of that training pipeline: input design, output evaluation, and policy update. Each reveals distinct challenges that arise from blindly applying general-purpose RL without adapting it to GUI grounding tasks. Input design: Current templates encourage the model to generate chain-of-thought reasoning, but longer chains unexpectedly lead to worse grounding performance. Output evaluation: Reward functions based on hit signals or box area allow models to exploit box size, leading to reward hacking and poor localization quality. Policy update: Online RL tends to overfit easy examples due to biases in length and sample difficulty, leading to under-optimization on harder cases. To address these issues, we propose three targeted solutions. First, we adopt a Fast Thinking Template that encourages direct answer generation, reducing excessive reasoning during training. Second, we incorporate a box size constraint into the reward function to mitigate reward hacking. Third, we revise the RL objective by adjusting length normalization and adding a difficulty-aware scaling factor, enabling better optimization on hard samples. Our GUI-G1-3B, trained on 17K public samples with Qwen2.5-VL-3B-Instruct, achieves 90.3% accuracy on ScreenSpot and 37.1% on ScreenSpot-Pro. This surpasses all prior models of similar size and even outperforms the larger UI-TARS-7B, establishing a new state-of-the-art in GUI agent grounding. The project repository is available at https://github.com/Yuqi-Zhou/GUI-G1.
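
The reward-hacking issue described under output evaluation can be illustrated with a small sketch. The code below is not the paper's implementation: the function names, the (x1, y1, x2, y2) box format, the center-in-box definition of a hit, the area-ratio form of the size term, and the unweighted sum are all assumptions chosen for illustration. It shows that a hit-style reward alone cannot distinguish a tiny box from a huge one as long as the predicted center lands inside the target, whereas adding IoU and a size term separates them.

```python
from typing import Tuple

# Boxes are (x1, y1, x2, y2) in pixels; this format is an assumption.
Box = Tuple[float, float, float, float]


def area(box: Box) -> float:
    """Area of a box, clamped at zero for degenerate boxes."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def hit_reward(pred: Box, gt: Box) -> float:
    """1.0 if the center of the predicted box lies inside the ground-truth box."""
    cx = (pred[0] + pred[2]) / 2
    cy = (pred[1] + pred[3]) / 2
    inside = gt[0] <= cx <= gt[2] and gt[1] <= cy <= gt[3]
    return 1.0 if inside else 0.0


def iou_reward(pred: Box, gt: Box) -> float:
    """Intersection over union of the two boxes."""
    inter_box = (max(pred[0], gt[0]), max(pred[1], gt[1]),
                 min(pred[2], gt[2]), min(pred[3], gt[3]))
    inter = area(inter_box)
    union = area(pred) + area(gt) - inter
    return inter / union if union > 0 else 0.0


def size_reward(pred: Box, gt: Box) -> float:
    """Hypothetical size term: smaller area over larger area (1.0 when equal)."""
    a_pred, a_gt = area(pred), area(gt)
    if a_pred <= 0 or a_gt <= 0:
        return 0.0
    return min(a_pred, a_gt) / max(a_pred, a_gt)


def combined_reward(pred: Box, gt: Box) -> float:
    """Unweighted sum of the three terms; the weighting is an assumption."""
    return hit_reward(pred, gt) + iou_reward(pred, gt) + size_reward(pred, gt)


gt = (100.0, 100.0, 200.0, 200.0)
tiny = (149.0, 149.0, 151.0, 151.0)   # very small box centered on the target
huge = (0.0, 0.0, 300.0, 300.0)       # very large box covering the target
exact = gt

for name, box in [("tiny", tiny), ("huge", huge), ("exact", exact)]:
    print(name, hit_reward(box, gt), round(combined_reward(box, gt), 4))
```

In this toy setup all three boxes receive the same hit reward, so a policy optimized on hits alone has no incentive to match the target's extent; the combined score is highest only for the box whose size and position both agree with the ground truth.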
