ZeroGUI: Automating Online GUI Learning at Zero Human Cost

Chenyu Yang, Shiqian Su, Shi Liu, Xuan Dong, Yue Yu, Weijie Su, Xuehui Wang, Zhaoyang Liu, Jinguo Zhu, Hao Li, Wenhai Wang, Yu Qiao, Xizhou Zhu, Jifeng Dai

TL;DR

The paper addresses the high annotation cost and limited adaptability of offline GUI agent training by introducing ZeroGUI, a fully automated online learning framework in which vision-language models (VLMs) generate training tasks and estimate rewards. Training proceeds in two reinforcement learning stages: first on VLM-generated tasks, then test-time adaptation on tasks drawn from the test set. Optimization is based on GRPO, with its k3-KL term replaced by a more stable per-token KL loss (k2-KL). Key contributions include automatic task generation strategies, a voting-based VLM reward estimator, and experiments showing improvements on OSWorld and AndroidLab over strong baselines. The work reduces human supervision in GUI agent training and improves generalization to dynamic, interactive environments on desktop and mobile platforms.

Abstract

The rapid advancement of large Vision-Language Models (VLMs) has propelled the development of pure-vision-based GUI Agents, capable of perceiving and operating Graphical User Interfaces (GUI) to autonomously fulfill user instructions. However, existing approaches usually adopt an offline learning framework, which faces two core limitations: (1) heavy reliance on high-quality manual annotations for element grounding and action supervision, and (2) limited adaptability to dynamic and interactive environments. To address these limitations, we propose ZeroGUI, a scalable, online learning framework for automating GUI Agent training at Zero human cost. Specifically, ZeroGUI integrates (i) VLM-based automatic task generation to produce diverse training goals from the current environment state, (ii) VLM-based automatic reward estimation to assess task success without hand-crafted evaluation functions, and (iii) two-stage online reinforcement learning to continuously interact with and learn from GUI environments. Experiments on two advanced GUI Agents (UI-TARS and Aguvis) demonstrate that ZeroGUI significantly boosts performance across OSWorld and AndroidLab environments. The code is available at https://github.com/OpenGVLab/ZeroGUI.
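The voting-based reward estimation described above can be illustrated with a minimal sketch. It assumes a hypothetical `judge` callable that wraps one VLM query and returns a success/failure verdict for a trajectory; the function name, the number of votes, and the rule that ties count as failure are illustrative assumptions, not details taken from the paper.

```python
from typing import Callable, Sequence


def vote_reward(
    instruction: str,
    screenshots: Sequence[bytes],
    judge: Callable[[str, Sequence[bytes]], bool],
    num_votes: int = 5,
) -> float:
    """Binary reward from repeated VLM judgments of one trajectory.

    `judge` is a hypothetical stand-in for a single VLM evaluation that
    sees the task instruction and all screenshots of the trajectory.
    Assumption: a strict majority is required, so ties yield 0.0.
    """
    if num_votes <= 0:
        raise ValueError("num_votes must be positive")
    # Query the judge independently num_votes times and count successes.
    successes = sum(bool(judge(instruction, screenshots)) for _ in range(num_votes))
    return 1.0 if 2 * successes > num_votes else 0.0
```

Requiring a strict majority is the conservative choice in this sketch: an ambiguous trajectory is treated as a failure rather than rewarded.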

Paper Structure

This paper contains 25 sections, 14 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: Left: Existing Offline Training Framework for GUI Agents incurs high human costs, relying on manually collected and annotated interaction trajectories, typically under a supervised fine-tuning (SFT) paradigm. Right: Our ZeroGUI is a scalable online learning framework with automated task generation and reward estimation at zero human cost. A VLM proposes diverse tasks, which are executed by the agent; the agent then receives VLM-based rewards and updates its policy via reinforcement learning (RL).
  • Figure 2: Top: Overview of ZeroGUI. It adopts a Two-stage Online Reinforcement Learning paradigm. In the first stage, tasks are automatically generated by a VLM, while in the second stage, tasks are drawn from the test set. These tasks are executed by the GUI agent. After each interaction, a reward is assigned automatically by the VLM based on the agent's trajectory, and the policy network is updated via reinforcement learning. Bottom left: Automatic Task Generation. The VLM receives a random initial screenshot and a set of task exemplars to generate diverse novel tasks. Bottom right: Automatic Reward Estimation. The final reward is obtained via majority voting of multiple VLM evaluations based on all screenshots of the trajectory.
  • Figure 3: Comparison of training accuracies with k3-KL (GRPO) and k2-KL (ours).
  • Figure 4: Gradient coefficient of KL loss.
  • Figure 5: KL loss curve and token-wise maximum and minimum of $\log\pi_\theta-\log\pi_{\text{ref}}$ during training.
  • ...and 8 more figures
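
The captions of Figures 3 to 5 contrast a k3-KL loss (as in GRPO) with a k2-KL loss. The exact formulas are not given in this excerpt, so the sketch below assumes the commonly used estimator definitions in terms of the per-token log-ratio $\delta = \log\pi_\theta - \log\pi_{\text{ref}}$: $k_3 = e^{-\delta} + \delta - 1$ and $k_2 = \tfrac{1}{2}\delta^2$. Under that assumption, the derivative of each loss with respect to $\log\pi_\theta$ gives a per-token gradient coefficient, which is one way to see why k2 might be the more stable choice.

```python
import math


def k2(delta: float) -> float:
    """Assumed k2 estimator: half the squared log-ratio."""
    return 0.5 * delta * delta


def k3(delta: float) -> float:
    """Assumed k3 estimator (GRPO-style): exp(-delta) + delta - 1."""
    return math.exp(-delta) + delta - 1.0


def k2_grad_coeff(delta: float) -> float:
    """d k2 / d delta: grows linearly with the log-ratio."""
    return delta


def k3_grad_coeff(delta: float) -> float:
    """d k3 / d delta: bounded above by 1, exponential for negative delta."""
    return 1.0 - math.exp(-delta)


# Compare the two coefficients over a range of log-ratios.
for delta in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"delta={delta:+.1f}  k2'={k2_grad_coeff(delta):+.3f}  k3'={k3_grad_coeff(delta):+.3f}")
```

With these definitions, the k3 coefficient saturates near 1 when $\pi_\theta$ assigns a token much more probability than $\pi_{\text{ref}}$, and grows exponentially in magnitude when it assigns much less, whereas the k2 coefficient is linear and symmetric in $\delta$.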