GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents

Yuqi Zhou, Sunhao Dai, Shuai Wang, Kaiwen Zhou, Qinglin Jia, Jun Xu

TL;DR

This work critically reexamines R1-Zero-Like training for GUI grounding and identifies three core problems: templates that elicit excessive chain-of-thought reasoning, reward hacking through box size under hit-based and IoU-based rewards, and GRPO-induced biases toward easier samples. It proposes three targeted remedies (a Fast Thinking Template, a box-size constraint in the reward, and a difficulty-weighted GRPO objective with no length normalization) and demonstrates their effectiveness with GUI-G1-3B, a model trained on only 17K public samples. GUI-G1-3B achieves state-of-the-art grounding on ScreenSpot and ScreenSpot-Pro, outperforming larger and prior R1-style agents while using far fewer tokens and training stages. The findings highlight the importance of task-specific RL design for GUI grounding and point toward efficient, data-sparing multimodal grounding systems for real-world GUI interaction.

Abstract

Recent Graphical User Interface (GUI) agents replicate the R1-Zero paradigm, coupling online Reinforcement Learning (RL) with explicit chain-of-thought reasoning prior to object grounding and thereby achieving substantial performance gains. In this paper, we first conduct an extensive experimental analysis of three key components of that training pipeline: input design, output evaluation, and policy update. Each reveals distinct challenges that arise from blindly applying general-purpose RL without adapting it to GUI grounding tasks. Input design: Current templates encourage the model to generate chain-of-thought reasoning, but longer chains unexpectedly lead to worse grounding performance. Output evaluation: Reward functions based on hit signals or box area allow models to exploit box size, leading to reward hacking and poor localization quality. Policy update: Online RL tends to overfit easy examples due to biases in length and sample difficulty, leading to under-optimization on harder cases. To address these issues, we propose three targeted solutions. First, we adopt a Fast Thinking Template that encourages direct answer generation, reducing excessive reasoning during training. Second, we incorporate a box size constraint into the reward function to mitigate reward hacking. Third, we revise the RL objective by adjusting length normalization and adding a difficulty-aware scaling factor, enabling better optimization on hard samples. Our GUI-G1-3B, trained on 17K public samples with Qwen2.5-VL-3B-Instruct, achieves 90.3% accuracy on ScreenSpot and 37.1% on ScreenSpot-Pro. This surpasses all prior models of similar size and even outperforms the larger UI-TARS-7B, establishing a new state of the art in GUI agent grounding. The project repository is available at https://github.com/Yuqi-Zhou/GUI-G1.
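To make the policy-update issue more concrete, the following is a minimal sketch of how length normalization and a difficulty-aware scale could enter a GRPO-style update. It is not the authors' implementation. The function names, the `1 - mean reward` difficulty proxy, and the toy rewards and lengths are all assumptions chosen for illustration; the abstract does not give the exact form of the scaling factor.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def difficulty_weight(rewards):
    """Hypothetical difficulty proxy (an assumption, not the paper's formula):
    the lower the mean reward of the group, the harder the query."""
    return 1.0 - mean(rewards)


def per_token_coefficients(advantages, lengths, length_normalize=True, scale=1.0):
    """Coefficient multiplying each token's log-probability gradient.

    With length normalization, every token of response i is weighted by
    advantage_i / length_i, so tokens of long responses receive a smaller
    signal than tokens of short responses with the same advantage.
    """
    coeffs = []
    for adv, n in zip(advantages, lengths):
        denom = n if length_normalize else 1
        coeffs.append(scale * adv / denom)
    return coeffs


# Toy group of four sampled responses for one query: one correct, three wrong.
rewards = [1.0, 0.0, 0.0, 0.0]
lengths = [10, 10, 50, 10]  # the third (wrong) response is five times longer

adv = group_advantages(rewards)
normalized = per_token_coefficients(adv, lengths, length_normalize=True)
unnormalized = per_token_coefficients(adv, lengths, length_normalize=False)
hard_scale = difficulty_weight(rewards)                # mostly wrong group
easy_scale = difficulty_weight([1.0, 1.0, 1.0, 0.0])   # mostly correct group
```

In this toy group, the long wrong response gets a weaker per-token penalty than the short wrong ones when length normalization is on, and the same penalty when it is off. The difficulty proxy gives the mostly wrong group a larger scale than the mostly correct one, which is the direction the abstract describes for emphasizing hard samples.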

Paper Structure

This paper contains 22 sections, 4 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: This framework employs the GRPO algorithm for optimization, emphasizing three critical components: input design, output evaluation, and policy update.
  • Figure 2: (Left) shows the grounding accuracy under varying numbers of output tokens and image tokens. "Text" refers to cases where the target is a textual element, while "Icon" refers to image targets. (Right) presents the grounding accuracy on the Text and Icon subsets across different image sizes. Within each group, samples are evenly divided based on their text ratio.
  • Figure 3: Changes in accuracy (left), IoU (middle), and relative box size (right) across policy iterations during model training on the ScreenSpot dataset.
  • Figure 4: (Left) Two cases with predicted bounding boxes and ground-truth boxes. (Right) Two examples illustrating why $R_{\text{IoU}}$ favors larger boxes, while $R_{\text{Hit}}$ prefers smaller ones.
  • Figure 5: Illustration of the response-level length biases and query-level difficulty biases in GRPO.
  • ...and 1 more figure
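The quantities named in the captions of Figures 3 and 4 (hit, IoU, and relative box size) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's reward code: boxes are assumed to be `(x1, y1, x2, y2)` tuples, a hit is assumed to mean that the center of the predicted box falls inside the ground-truth box, and `size_term` is a hypothetical stand-in for a box size constraint whose exact form is not given here.

```python
def area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou_reward(pred, gt):
    """Intersection over union of the predicted and ground-truth boxes."""
    inter_box = (max(pred[0], gt[0]), max(pred[1], gt[1]),
                 min(pred[2], gt[2]), min(pred[3], gt[3]))
    inter = area(inter_box)
    union = area(pred) + area(gt) - inter
    return inter / union if union > 0 else 0.0


def hit_reward(pred, gt):
    """1.0 if the center of the predicted box lies inside the ground-truth box
    (assumed definition of a hit)."""
    cx = (pred[0] + pred[2]) / 2
    cy = (pred[1] + pred[3]) / 2
    inside = gt[0] <= cx <= gt[2] and gt[1] <= cy <= gt[3]
    return 1.0 if inside else 0.0


def relative_box_size(pred, gt):
    """Predicted area divided by ground-truth area."""
    gt_area = area(gt)
    return area(pred) / gt_area if gt_area > 0 else 0.0


def size_term(pred, gt):
    """Hypothetical size constraint (an assumption): equals 1 when the areas
    match and shrinks as the prediction becomes much smaller or much larger."""
    ratio = relative_box_size(pred, gt)
    return min(ratio, 1.0 / ratio) if ratio > 0 else 0.0


gt = (100, 100, 200, 200)
tiny = (149, 149, 151, 151)   # 2x2 box at the target center
huge = (0, 0, 300, 300)       # box covering far more than the target
tight = (105, 105, 195, 195)  # box close to the target extent

hits = [hit_reward(b, gt) for b in (tiny, huge, tight)]
ious = [iou_reward(b, gt) for b in (tiny, huge, tight)]
sizes = [size_term(b, gt) for b in (tiny, huge, tight)]
```

All three predictions count as hits, so the hit signal alone cannot separate a well-localized box from a degenerate one. The IoU and the size term do separate them, which is why a reward that constrains box size leaves less room for the kind of exploitation the abstract describes.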