Watch and Learn: Learning to Use Computers from Online Videos

Chan Hee Song; Yiwen Song; Palash Goyal; Yu Su; Oriana Riva; Hamid Palangi; Tomas Pfister

TL;DR

Computer-using agents (CUAs) need diverse, scalable training data to plan across varied interfaces. Watch & Learn (W&L) addresses this by learning an inverse dynamics model from screen transitions and using it to turn web tutorial videos into executable UI trajectories, which then serve as in-context exemplars and as supervised training data. The approach yields over 53K trajectories across multiple OSs and shows consistent gains on OSWorld and WindowsAgentArena, outperforming prior labeling pipelines. These results indicate that web-scale human demonstrations can meaningfully advance real-world CUAs, supporting both inference-time planning and offline fine-tuning.

Abstract

Computer-using agents (CUAs) must plan task workflows across diverse and evolving applications, yet progress is limited by the lack of large-scale, high-quality training data. Existing datasets are narrow, static, and costly to annotate, while synthetic data often yields oversimplified or misaligned behaviors. We present Watch & Learn (W&L), a framework that converts readily available Internet videos of human computer use into executable UI trajectories at scale. Instead of directly generating actions or relying on handcrafted heuristics, we cast trajectory annotation as an inverse dynamics problem that predicts user actions from consecutive screen states, which simplifies learning and generalizes across domains. Through a task-aware retrieval and labeling pipeline, W&L yields over 53K high-quality trajectories that enhance CUAs both as in-context exemplars and as supervised training data. On OSWorld, it consistently improves general-purpose and specialized CUAs, while on WindowsAgentArena it achieves state-of-the-art performance among 7B-scale models under the 15-step limit. These results show that web-scale human demonstration videos can serve as a practical and scalable foundation for advancing real-world CUAs.
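
To make the inverse dynamics framing concrete, the following is a minimal sketch of labeling a sequence of screen states with the actions that connect them. It is not the authors' implementation: `ScreenState`, `Action`, `toy_idm`, and `label_trajectory` are hypothetical names introduced here for illustration, and the rule-based `toy_idm` merely stands in for a learned model that would operate on real screenshots.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


# Hypothetical, heavily simplified stand-in for a screenshot.
# A real system would work on pixels, not on structured fields.
@dataclass(frozen=True)
class ScreenState:
    focused: str  # id of the currently focused UI element
    text: str     # contents of the focused text field, if any


@dataclass(frozen=True)
class Action:
    kind: str     # "click", "type", or "noop"
    arg: str = ""


# An inverse dynamics model maps (state_t, state_t+1) to the action taken at t.
InverseDynamics = Callable[[ScreenState, ScreenState], Action]


def toy_idm(before: ScreenState, after: ScreenState) -> Action:
    """Rule-based placeholder for a learned inverse dynamics model (assumption)."""
    if before.focused != after.focused:
        return Action("click", after.focused)
    if after.text != before.text and after.text.startswith(before.text):
        # Text was appended: infer the typed suffix.
        return Action("type", after.text[len(before.text):])
    return Action("noop")


def label_trajectory(
    frames: List[ScreenState], idm: InverseDynamics
) -> List[Tuple[ScreenState, Action]]:
    """Pair each state with the action inferred from it and its successor."""
    return [(s, idm(s, s_next)) for s, s_next in zip(frames, frames[1:])]


# Illustrative frames, as if sampled from a tutorial video.
frames = [
    ScreenState(focused="desktop", text=""),
    ScreenState(focused="search_box", text=""),
    ScreenState(focused="search_box", text="cats"),
]
trajectory = label_trajectory(frames, toy_idm)
```

Here `trajectory` pairs the first frame with a click on `search_box` and the second with typing `cats`. The point of the sketch is the shape of the problem: the labeler only has to explain an observed transition, rather than decide what to do next, which is the simplification the abstract attributes to the inverse dynamics formulation.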
