ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data

Zhaoyang Liu, Jingjing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, Shenglong Ye, Qingyun Li, Xuan Dong, Yue Yu, Chenyu Lu, YunXiang Mo, Yao Yan, Zeyue Tian, Xiao Zhang, Yuan Huang, Yiqian Liu, Weijie Su, Gen Luo, Xiangyu Yue, Biqing Qi, Kai Chen, Bowen Zhou, Yu Qiao, Qifeng Chen, Wenhai Wang

TL;DR

ScaleCUA addresses the data bottleneck in open-source computer-use agents by building ScaleCUA-Data, a large cross-platform GUI corpus spanning Windows, macOS, Linux, Android, iOS, and Web. It introduces ScaleCUA, a family of base agents with three inference paradigms (Grounding, Direct Action, Reasoned Action) and a unified action space, trained on a mix of GUI-specific and general multimodal data. The dual-loop data pipeline combines automated agent exploration with human annotations to yield richly labeled datasets for GUI understanding, grounding, and task completion. Empirical results across GUI benchmarks (MMBench-GUI, ScreenSpot, OSWorld) show state-of-the-art or competitive performance, underscoring the value of data-driven scaling for general-purpose CUAs and enabling open research through released data, models, and code.

Abstract

Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously, showing great potential, yet progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built via a closed-loop pipeline uniting automated agents with human experts. Trained on this scaled-up data, ScaleCUA can operate seamlessly across platforms. Specifically, it delivers strong gains over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on WebArena-Lite-v2). These findings underscore the power of data-driven scaling for general-purpose computer use agents. We will release data, models, and code to advance future research: https://github.com/OpenGVLab/ScaleCUA.

Paper Structure

This paper contains 50 sections, 1 equation, 33 figures, 23 tables.

Figures (33)

  • Figure 1: Performance comparison. The top row showcases performance overview on GUI-centric benchmarks. The bottom row demonstrates the consistent improvements from our collected data.
  • Figure 2: Cross-Platform Interactive Data Pipeline. Our pipeline consists of two synergistic loops: (1) the Agent-Environment Interaction Loop, where agents interact with multi-platform GUI environments (desktop, mobile, and web) via observation and action; and (2) the Agent-Human Hybrid Data Acquisition Loop, where both autonomous agents and human experts collect raw trajectories, including screenshots and structural metadata. The resulting trajectories are then annotated and transformed into training corpora for tasks such as GUI understanding, GUI grounding, and sequential action modeling.
  • Figure 3: Data distribution of our dataset.
  • Figure 4: Three Inference Paradigms of Our Computer Use Agents: (1) Grounding Mode, which focuses on identifying target UI elements with their spatial coordinates and bounding boxes; (2) Direct Action Mode, where the agent solely generates executable actions based on current observations and instructions; and (3) Reasoned Action Mode, where the agent first generates a chain-of-thought rationale before producing structured actions. These modes enable varying levels of functionality for computer use agents to complete tasks.
  • Figure 5: Evaluations across diverse conditions. (a) Accuracy of GUI grounding under different screenshot resolutions. (b) Success rates of Direct Action vs. Reasoned Action Modes, where reasoning consistently improves performance. (c) Training data scaling. (d) Effect of using general data, showing distinct trends between GUI and multimodal benchmarks.
  • ...and 28 more figures
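
The three inference paradigms in the Figure 4 caption differ mainly in what the agent emits: a location only, an action only, or a rationale followed by an action. The sketch below illustrates that distinction in Python. It is not ScaleCUA's actual output format: the `AgentOutput` container, the `parse_output` helper, the mode names, the `<think>`/`<action>` tags, and the `click(x=..., y=...)` action string are all hypothetical, chosen only for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class AgentOutput:
    """Hypothetical container for one agent response."""
    mode: str
    point: Optional[Tuple[int, int]] = None   # set in grounding mode
    action: Optional[str] = None              # set in both action modes
    rationale: Optional[str] = None           # set in reasoned-action mode only


def parse_output(mode: str, text: str) -> AgentOutput:
    """Parse raw model text according to an assumed per-mode format."""
    if mode == "grounding":
        # Assumed format: a single "(x, y)" pixel coordinate.
        m = re.search(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)", text)
        if m is None:
            raise ValueError("no coordinate pair found")
        return AgentOutput(mode, point=(int(m.group(1)), int(m.group(2))))
    if mode == "direct_action":
        # Assumed format: the whole response is one executable action string.
        return AgentOutput(mode, action=text.strip())
    if mode == "reasoned_action":
        # Assumed format: optional <think> rationale, then a required <action>.
        think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
        act = re.search(r"<action>(.*?)</action>", text, re.DOTALL)
        if act is None:
            raise ValueError("no action found")
        return AgentOutput(
            mode,
            action=act.group(1).strip(),
            rationale=think.group(1).strip() if think else None,
        )
    raise ValueError(f"unknown mode: {mode}")


grounded = parse_output("grounding", "(412, 96)")
direct = parse_output("direct_action", "click(x=412, y=96)")
reasoned = parse_output(
    "reasoned_action",
    "<think>The search box is at the top.</think>"
    "<action>click(x=412, y=96)</action>",
)
```

Under these assumptions, only the reasoned-action result carries a rationale, which mirrors the caption's description of a chain of thought generated before the structured action.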