Efficient Agent Training for Computer Use

Yanheng He, Jiahe Jin, Pengfei Liu

TL;DR

PC Agent-E tackles the data bottleneck in building human-like computer-use agents by starting from a small set of authentic Windows trajectories and enriching them with AI-driven, diverse action decisions via Trajectory Boost. The approach combines Thought Completion with a simple end-to-end ReAct-style training scaffold to produce a data-efficient agent trained on 27k augmented instances. The resulting model achieves a 141% relative improvement over a strong baseline and surpasses Claude 3.7 Sonnet with extended thinking on WindowsAgentArena-V2, while also generalizing across platforms to OSWorld. WindowsAgentArena-V2 itself addresses evaluation pitfalls and infeasible tasks, enabling fairer comparisons. Overall, the work demonstrates that strong computer-use capabilities can be elicited from a compact, high-quality trajectory dataset, highlighting the promise of data-efficient native agent training and paving the way for combining RL and SFT in long-horizon GUI tasks.

Abstract

Scaling up high-quality trajectory data has long been a critical bottleneck for developing human-like computer use agents. We introduce PC Agent-E, an efficient agent training framework that significantly reduces reliance on large-scale human demonstrations. Starting with just 312 human-annotated computer use trajectories, we further improved data quality by synthesizing diverse action decisions with Claude 3.7 Sonnet. Trained on these enriched trajectories, our PC Agent-E model achieved a remarkable 141% relative improvement, surpassing the strong Claude 3.7 Sonnet with extended thinking on WindowsAgentArena-V2, an improved benchmark we also released. Furthermore, PC Agent-E demonstrates strong generalizability to different operating systems on OSWorld. Our findings suggest that strong computer use capabilities can be stimulated from a small amount of high-quality trajectory data.
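
To make the "141% relative improvement" figure concrete, the short sketch below shows how a relative improvement is usually computed from two benchmark scores. The function name and the two scores are hypothetical placeholders chosen only to illustrate the arithmetic; they are not results reported in the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the relative improvement of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (improved - baseline) / baseline * 100.0


# Hypothetical placeholder scores (NOT numbers from the paper).
baseline_score = 10.0
improved_score = 24.1

print(f"{relative_improvement(baseline_score, improved_score):.0f}%")  # prints "141%"
```

Note that a relative improvement above 100% means the score more than doubled, which is different from an absolute gain of the same number of percentage points.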

Paper Structure

This paper contains 39 sections, 8 figures, 6 tables.

Figures (8)

  • Figure 1: PC Agent-E achieves state-of-the-art open-source performance in Windows computer use with just 312 augmented trajectories.
  • Figure 2: Overview of our framework, consisting of four key components: (1) Trajectory Collection, gathering a small set of human trajectories by recording user actions and state observations at each step; (2) Thought Completion, reconstructing the implicit thought process missing in raw human trajectories; (3) Trajectory Boost, diversifying action decisions to further enhance trajectory quality; and (4) Agent Training, developing a strong computer use agent with remarkable data efficiency.
  • Figure 3: An example trajectory collected by PC Tracker.
  • Figure 4: Distribution of the 312 task trajectories across different applications.
  • Figure 5: Visualization of our Trajectory Boost method. (Left) Raw human trajectory recorded by PC Tracker. (Center) Human trajectory with reconstructed thoughts after Thought Completion, where red nodes indicate human action decisions. (Right) The final Traj Tree, where blue nodes indicate augmented diverse action decisions synthesized by Claude 3.7 Sonnet.
  • ...and 3 more figures
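
The Traj Tree described in the Figure 5 caption can be pictured as a human trajectory (the trunk) with extra synthesized action decisions branching off each step. The sketch below is a minimal illustration of that idea, not the authors' implementation: the `Step` class, its field names, the `build_training_instances` helper, and the choice to build each instance's history from the human path only are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Step:
    """One step of a human trajectory (hypothetical structure)."""
    observation: str      # e.g. a screenshot identifier
    human_thought: str    # thought reconstructed by Thought Completion
    human_action: str     # action actually taken by the human
    # Alternative (thought, action) decisions synthesized for this step.
    boosted: List[Tuple[str, str]] = field(default_factory=list)


def build_training_instances(task: str, steps: List[Step]) -> List[Dict]:
    """Flatten a trajectory tree into single-step training instances.

    Assumption: every instance conditions on the human path taken so far,
    and each human or synthesized decision becomes one training target.
    """
    instances: List[Dict] = []
    history: List[Tuple[str, str]] = []
    for step in steps:
        candidates = [(step.human_thought, step.human_action)] + step.boosted
        for thought, action in candidates:
            instances.append({
                "task": task,
                "history": list(history),  # copy so later appends do not leak in
                "observation": step.observation,
                "thought": thought,
                "action": action,
            })
        # Only the human decision advances the trunk of the tree.
        history.append((step.human_thought, step.human_action))
    return instances


steps = [
    Step("screen_0", "Open the file menu.", "click File",
         boosted=[("Use the shortcut instead.", "press Alt+F")]),
    Step("screen_1", "Choose Save As.", "click Save As"),
]
instances = build_training_instances("Save the document under a new name", steps)
print(len(instances))  # 3: two decisions at step 0, one at step 1
```

Under these assumptions, a trajectory with k extra decisions per step yields (k + 1) training instances per step, which is how a few hundred trajectories could expand into a much larger training set.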