Grounding Computer Use Agents on Human Demonstrations

Aarash Feizi, Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Kaixin Li, Rabiul Awal, Xing Han Lù, Johan Obando-Ceron, Juan A. Rodriguez, Nicolas Chapados, David Vazquez, Adriana Romero-Soriano, Reihaneh Rabbany, Perouz Taslakian, Christopher Pal, Spandana Gella, Sai Rajeswar

TL;DR

This work introduces GroundCUA, a large-scale, expert-annotated desktop grounding dataset spanning 87 applications, 56K screenshots, and over 3.56 million UI element annotations to enable robust grounding of natural language to on-screen elements. It also presents GroundNext, a two-stage model (SFT on 700K curated samples followed by RL post-training) that achieves state-of-the-art grounding on multiple desktop benchmarks with far less data than prior methods. GroundNext demonstrates strong agentic performance in realistic task settings and exhibits cross-domain generalization to mobile and web interfaces, benefiting from high-quality desktop-focused data. The study emphasizes data quality and dense, context-rich annotations as key factors driving grounding performance, and releases both GroundCUA and GroundNext to advance research in end-to-end computer-use agents across platforms.

Abstract

Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen elements. While large datasets exist for web and mobile interactions, high-quality resources for desktop environments are limited. To address this gap, we introduce GroundCUA, a large-scale desktop grounding dataset built from expert human demonstrations. It covers 87 applications across 12 categories and includes 56K screenshots, with every on-screen element carefully annotated for a total of over 3.56M human-verified annotations. From these demonstrations, we generate diverse instructions that capture a wide range of real-world tasks, providing high-quality data for model training. Using GroundCUA, we develop the GroundNext family of models that map instructions to their target UI elements. At both 3B and 7B scales, GroundNext achieves state-of-the-art results across five benchmarks using supervised fine-tuning, while requiring less than one-tenth the training data of prior work. Reinforcement learning post-training further improves performance, and when evaluated in an agentic setting on the OSWorld benchmark using o3 as the planner, GroundNext attains comparable or superior results to models trained with substantially more data. These results demonstrate the critical role of high-quality, expert-driven datasets in advancing general-purpose computer-use agents.
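
To make the grounding task concrete, the following is a minimal sketch of how a single grounding example and a correctness check might be represented. It is an illustration only, not the paper's data format or evaluation code: the `BoundingBox` and `GroundingSample` types, their field names, and the rule that a predicted click counts as correct when it falls inside the target element's box are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box in screenshot pixel coordinates (hypothetical layout)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # Edges are treated as inside the box.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass(frozen=True)
class GroundingSample:
    """One instruction paired with the UI element it refers to (hypothetical)."""
    instruction: str
    target: BoundingBox


def is_hit(sample: GroundingSample, x: float, y: float) -> bool:
    """Assumed criterion: the predicted point lies inside the target box."""
    return sample.target.contains(x, y)


def accuracy(
    samples: Sequence[GroundingSample],
    predictions: Sequence[Tuple[float, float]],
) -> float:
    """Fraction of predicted click points that land inside their target box."""
    if len(samples) != len(predictions):
        raise ValueError("samples and predictions must have the same length")
    if not samples:
        return 0.0
    hits = sum(is_hit(s, x, y) for s, (x, y) in zip(samples, predictions))
    return hits / len(samples)


if __name__ == "__main__":
    samples = [
        GroundingSample("Click the Save button", BoundingBox(10, 10, 50, 30)),
        GroundingSample("Open the File menu", BoundingBox(0, 0, 40, 20)),
    ]
    # First prediction lands inside its box; the second misses.
    print(accuracy(samples, [(20, 20), (100, 100)]))
```

A point-in-box check is used here because it is simple to state; the benchmarks referenced in the abstract may define correctness differently.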

Paper Structure

This paper contains 58 sections, 7 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Overview of the GroundCUA dataset and GroundNext models. Human demonstrations of computer-use tasks are recorded as screenshots (example from FreeCAD) with UI metadata, which are processed into high-quality natural language instruction tasks for UI grounding. GroundNext is trained in two stages: SFT (700K samples) followed by RL (10K samples), achieving state-of-the-art grounding performance with efficient training.
  • Figure 2: Examples of screenshots from different applications in GroundCUA. Red bounding boxes indicate the annotated UI elements within each screenshot.
  • Figure 3: Mean SFT scores (orange) across benchmarks, with RL gains from 10K GroundCUA samples shown in blue.
  • Figure 4: Dataset Statistics
  • Figure 5: Comparison across different datasets. (Left) Pixel distribution for different datasets. (Right) Relative bounding box area in log scale.
  • ...and 3 more figures