Watch and Learn: Learning to Use Computers from Online Videos

Chan Hee Song, Yiwen Song, Palash Goyal, Yu Su, Oriana Riva, Hamid Palangi, Tomas Pfister

TL;DR

Computer-using agents (CUAs) require diverse, scalable training data to plan across varied interfaces. Watch & Learn (W&L) addresses this by learning an inverse dynamics model from screen transitions and using it to turn web tutorial videos into executable UI trajectories, which then serve as in-context exemplars and as supervised training data. The approach yields over 53K trajectories across multiple operating systems and shows consistent gains on OSWorld and WindowsAgentArena, outperforming prior labeling pipelines. These results indicate that web-scale human demonstrations can advance real-world computer-using agents by supporting both inference-time planning and offline fine-tuning.

Abstract

Computer-using agents (CUAs) must plan task workflows across diverse and evolving applications, yet progress is limited by the lack of large-scale, high-quality training data. Existing datasets are narrow, static, and costly to annotate, while synthetic data often yields oversimplified or misaligned behaviors. We present Watch & Learn (W&L), a framework that converts readily available Internet videos of human computer use into executable UI trajectories at scale. Instead of directly generating actions or relying on handcrafted heuristics, we cast trajectory annotation as an inverse dynamics problem that predicts user actions from consecutive screen states, which simplifies learning and generalizes across domains. Through a task-aware retrieval and labeling pipeline, W&L yields over 53K high-quality trajectories that enhance CUAs both as in-context exemplars and as supervised training data. On OSWorld, it consistently improves general-purpose and specialized CUAs, while on WindowsAgentArena it achieves state-of-the-art performance among 7B-scale models under the 15-step limit. These results show that web-scale human demonstration videos can serve as a practical and scalable foundation for advancing real-world CUAs.
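The inverse dynamics framing above can be illustrated with a minimal sketch: a labeler walks over consecutive screen states and asks a model which action produced each transition. Everything named here is an assumption for illustration and not the paper's implementation: the `Action` type, the `label_trajectory` helper, and the `toy_idm` rule, which stands in for a learned model and treats a screen as a plain string rather than a screenshot.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Action:
    kind: str            # hypothetical action vocabulary, e.g. "click" or "type"
    argument: str = ""   # e.g. the typed text


# An inverse dynamics model maps (screen before, screen after) to an action.
InverseDynamicsModel = Callable[[str, str], Action]


def label_trajectory(frames: Sequence[str], idm: InverseDynamicsModel) -> List[Action]:
    """Label every consecutive pair of frames with the action the IDM infers."""
    return [idm(before, after) for before, after in zip(frames, frames[1:])]


def toy_idm(before: str, after: str) -> Action:
    """Toy stand-in for a learned IDM (assumption): screens are text buffers.

    If the new screen extends the old one, infer a "type" action carrying
    the appended text; otherwise fall back to a generic "click".
    """
    if after.startswith(before) and len(after) > len(before):
        return Action("type", after[len(before):])
    return Action("click")


if __name__ == "__main__":
    frames = ["", "hel", "hello", "menu"]
    for step in label_trajectory(frames, toy_idm):
        print(step)
```

A sequence of N frames produces N - 1 labeled steps, because each action is inferred from a pair of adjacent states rather than from a single frame.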

Paper Structure

This paper contains 42 sections, 7 equations, 3 figures, 27 tables.

Figures (3)

  • Figure 1: W&L converts web-scale human demonstration videos into executable UI trajectories, providing scalable supervision and in-context exemplars for computer-using agents.
  • Figure 2: Method overview. Our framework converts web-scale human demonstration videos into executable trajectories for CUAs. We first collect a large-scale state-transition dataset of screen observations and user actions, and train an inverse dynamics model (IDM) to recover actions from consecutive screenshots. This IDM is then applied to tutorial videos to extract step-by-step trajectories. A retrieval module selects task-relevant or general demonstrations, which are used in two ways: (i) as in-context exemplars that provide application-specific knowledge at inference time, and (ii) as supervised training data to improve open-source CUAs.
  • Figure 3: Qualitative examples on OSWorld. The left side shows the video-derived trajectory that W&L generates for the task. The right side shows three outcomes: (i) the o3 agent makes a grounding error by selecting the wrong UI element; (ii) the Jedi (o3) agent makes a planning error by entering the wrong submenu and does not recover; (iii) using the video-derived trajectory, the W&L agent completes the task successfully. Images are cropped for visibility, and the action coordinates correspond to the original full-resolution screenshots. More trajectory examples are in the Appendix.
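The retrieval step described in the Figure 2 caption, where task-relevant demonstrations are selected and used as in-context exemplars, could be sketched as follows. The source does not specify how retrieval is scored, so the token-overlap (Jaccard) similarity used here is a swapped-in assumption, and `Demonstration`, `retrieve`, and `format_exemplars` are hypothetical names for illustration only.

```python
from dataclasses import dataclass
from typing import List, Sequence, Set, Tuple


@dataclass(frozen=True)
class Demonstration:
    task: str                # natural-language description of the demonstrated task
    steps: Tuple[str, ...]   # labeled actions, rendered as text


def _tokens(text: str) -> Set[str]:
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity (assumed scoring rule, not from the paper)."""
    ta, tb = _tokens(a), _tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(query: str, demos: Sequence[Demonstration], k: int = 1) -> List[Demonstration]:
    """Return the k demonstrations whose task text best matches the query."""
    ranked = sorted(demos, key=lambda d: jaccard(query, d.task), reverse=True)
    return ranked[:k]


def format_exemplars(demos: Sequence[Demonstration]) -> str:
    """Render retrieved demonstrations as a text block for an agent prompt."""
    lines: List[str] = []
    for i, demo in enumerate(demos, start=1):
        lines.append(f"Example {i}: {demo.task}")
        lines.extend(f"  {j}. {step}" for j, step in enumerate(demo.steps, start=1))
    return "\n".join(lines)


if __name__ == "__main__":
    library = [
        Demonstration("insert a chart in a spreadsheet", ("click Insert", "click Chart")),
        Demonstration("change desktop wallpaper", ("open Settings",)),
    ]
    print(format_exemplars(retrieve("add a chart to the spreadsheet", library, k=1)))
```

The same retrieved set could instead be written out as supervised training examples, which matches the second use the caption describes. A real system would likely replace the lexical score with a learned similarity measure.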