PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World

Yanheng He, Jiahe Jin, Shijie Xia, Jiadi Su, Runze Fan, Haoyang Zou, Xiangkun Hu, Pengfei Liu

TL;DR

PC Agent works toward capable digital agents through cognition transfer: it captures human-computer interaction trajectories with PC Tracker, converts them into cognitive trajectories through a two-stage cognition completion pipeline, and trains a multi-agent system that pairs a planning agent with a grounding agent for robust visual grounding. In preliminary experiments on PowerPoint presentation creation, PC Agent trained on just 133 cognitive trajectories handles work scenarios of up to 50 steps across multiple applications, which suggests that learning from human cognition, rather than from behavior alone, is data-efficient. The authors open-source the data collection infrastructure and the cognition completion methods to lower the barrier for research on agents that can carry out complex, cross-application work.

Abstract

Imagine a world where AI can handle your work while you sleep - organizing your research materials, drafting a report, or creating a presentation you need for tomorrow. However, while current digital agents can perform simple tasks, they are far from capable of handling the complex real-world work that humans routinely perform. We present PC Agent, an AI system that demonstrates a crucial step toward this vision through human cognition transfer. Our key insight is that the path from executing simple "tasks" to handling complex "work" lies in efficiently capturing and learning from human cognitive processes during computer use. To validate this hypothesis, we introduce three key innovations: (1) PC Tracker, a lightweight infrastructure that efficiently collects high-quality human-computer interaction trajectories with complete cognitive context; (2) a two-stage cognition completion pipeline that transforms raw interaction data into rich cognitive trajectories by completing action semantics and thought processes; and (3) a multi-agent system combining a planning agent for decision-making with a grounding agent for robust visual grounding. Our preliminary experiments in PowerPoint presentation creation reveal that complex digital work capabilities can be achieved with a small amount of high-quality cognitive data - PC Agent, trained on just 133 cognitive trajectories, can handle sophisticated work scenarios involving up to 50 steps across multiple applications. This demonstrates the data efficiency of our approach, highlighting that the key to training capable digital agents lies in collecting human cognitive data. By open-sourcing our complete framework, including the data collection infrastructure and cognition completion methods, we aim to lower the barriers for the research community to develop truly capable digital agents.
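
To make the two-stage cognition completion idea in the abstract concrete, here is a minimal sketch of the data flow: stage one attaches action semantics to each raw recorded step, and stage two attaches a thought conditioned on the task and the steps so far. The abstract does not specify a schema or an API, so every name here (`RawStep`, `CognitiveStep`, `complete_cognition`, and the two stub functions) is a hypothetical illustration. In a real pipeline both stages would presumably be produced by a model; the stubs below only stand in for those calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class RawStep:
    """One recorded interaction (hypothetical schema, not PC Tracker's format)."""
    action: str                                  # e.g. "click" or "type"
    position: Optional[Tuple[int, int]] = None   # screen coordinates, if any
    text: Optional[str] = None                   # typed text, if any


@dataclass
class CognitiveStep:
    """A raw step enriched with the two completed fields."""
    raw: RawStep
    semantics: str = ""   # stage 1: what the action means
    thought: str = ""     # stage 2: why the action was taken


def complete_cognition(
    task: str,
    steps: List[RawStep],
    describe_action: Callable[[RawStep], str],
    infer_thought: Callable[[str, List[CognitiveStep], CognitiveStep], str],
) -> List[CognitiveStep]:
    """Run both completion stages over a raw trajectory, step by step."""
    completed: List[CognitiveStep] = []
    for raw in steps:
        step = CognitiveStep(raw=raw)
        # Stage 1: complete the action semantics from the raw record.
        step.semantics = describe_action(raw)
        # Stage 2: complete the thought using the task and the history so far.
        step.thought = infer_thought(task, completed, step)
        completed.append(step)
    return completed


def describe_action_stub(raw: RawStep) -> str:
    """Placeholder for stage 1; assumed to be a model call in practice."""
    if raw.action == "type":
        return f"type {raw.text!r}"
    return f"{raw.action} at {raw.position}"


def infer_thought_stub(
    task: str, history: List[CognitiveStep], step: CognitiveStep
) -> str:
    """Placeholder for stage 2; assumed to be a model call in practice."""
    return f"Step {len(history) + 1} toward '{task}': {step.semantics}"


trajectory = complete_cognition(
    "Create a title slide",
    [RawStep("click", position=(120, 80)), RawStep("type", text="PC Agent")],
    describe_action_stub,
    infer_thought_stub,
)
for step in trajectory:
    print(step.thought)
```

Passing the two stages in as callables keeps the sketch independent of any particular model or prompt, which is the part the abstract leaves unspecified.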

Paper Structure

This paper contains 66 sections, 12 figures, 4 tables.

Figures (12)

  • Figure 2: Key features of PC Tracker
  • Figure 3: An example trajectory collected by PC Tracker. Red marks on the screenshots indicate the positions of click-related actions.
  • Figure 4: Action space $\mathcal{A}$ of PC Tracker.
  • Figure 5: Example of type encapsulation.
  • Figure 8: An overview of the dual-mode collection design
  • ...and 7 more figures