The Unreasonable Effectiveness of Scaling Agents for Computer Use

Gonzalo Gonzalez-Pumariega, Vincent Tu, Chih-Lun Lee, Jiachen Yang, Ang Li, Xin Eric Wang

TL;DR

Computer-use agents (CUAs) struggle with reliability on long-horizon tasks. The authors introduce Behavior Best-of-N (bBoN), a wide-scaling framework that converts trajectories into concise behavior narratives and uses a comparative judge to select the best outcome across multiple rollouts from diverse base agents. They also present an improved baseline, Agent S3, which adds a coding agent and a flat policy to improve trajectory quality before selection. On OSWorld, bBoN achieves 69.9% at 100 steps, approaching human performance, and it generalizes to WindowsAgentArena and AndroidWorld. Extensive ablations support these results.

Abstract

Computer-use agents (CUAs) hold promise for automating everyday digital tasks, but their unreliability and high variance hinder their application to long-horizon, complex tasks. We introduce Behavior Best-of-N (bBoN), a method that scales over agents by generating multiple rollouts and selecting among them using behavior narratives that describe the agents' rollouts. It enables both wide exploration and principled trajectory selection, substantially improving robustness and success rates. On OSWorld, our bBoN scaling method establishes a new state of the art (SoTA) at 69.9%, significantly outperforming prior methods and approaching human-level performance at 72%, with comprehensive ablations validating key design choices. We further demonstrate strong generalization results to different operating systems on WindowsAgentArena and AndroidWorld. Crucially, our results highlight the unreasonable effectiveness of scaling CUAs, when you do it right: effective scaling requires structured trajectory understanding and selection, and bBoN provides a practical framework to achieve this.
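
The select-among-rollouts idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `Step`, `narrate`, `behavior_best_of_n`, `toy_describe`, and `toy_judge` are hypothetical names, screenshots are replaced by short strings, and the narrative generator and judge, which in the paper are model-based, are stubbed out with plain functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Step:
    """One step of a rollout (hypothetical structure, not from the paper)."""
    action: str
    before: str  # stand-in for the screenshot taken before the action
    after: str   # stand-in for the screenshot taken after the action


Trajectory = List[Step]


def narrate(trajectory: Trajectory, describe: Callable[[Step], str]) -> str:
    """Turn one rollout into a behavior narrative, one numbered line per step."""
    return "\n".join(
        f"{i + 1}. {describe(step)}" for i, step in enumerate(trajectory)
    )


def behavior_best_of_n(
    rollouts: Sequence[Trajectory],
    describe: Callable[[Step], str],
    judge: Callable[[List[str]], int],
) -> Trajectory:
    """Narrate every rollout, then let a comparative judge pick one of them."""
    if not rollouts:
        raise ValueError("at least one rollout is required")
    narratives = [narrate(t, describe) for t in rollouts]
    best = judge(narratives)  # the judge sees all narratives side by side
    if not 0 <= best < len(rollouts):
        raise ValueError("judge returned an out-of-range index")
    return rollouts[best]


# Toy stand-ins (assumptions): a real system would call a model here.
def toy_describe(step: Step) -> str:
    return f"{step.action}: {step.before} -> {step.after}"


def toy_judge(narratives: List[str]) -> int:
    # Prefer the first narrative that reports a saved file; default to index 0.
    for i, text in enumerate(narratives):
        if "file saved" in text:
            return i
    return 0


rollouts = [
    [Step("click Save", "editor open", "error dialog")],
    [Step("press Ctrl+S", "editor open", "file saved")],
]
chosen = behavior_best_of_n(rollouts, toy_describe, toy_judge)
```

Here the judge compares all narratives in a single call rather than scoring each rollout in isolation, which mirrors the comparative selection the abstract describes. The toy judge is only a placeholder for that comparison.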

Paper Structure

This paper contains 32 sections, 2 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Performance on OSWorld at 100 steps. Our method improves on the previous SoTA by 10% absolute, nearly reaching human-level performance.
  • Figure 2: Disjoint task success across rollouts by three agent instances. Behavior Best-of-N (bBoN) leverages this complementarity by selecting the best trajectory among multiple rollouts.
  • Figure 3: Behavior Best-of-N generates multiple rollouts consisting of screenshots and actions. These trajectories are converted into behavior narratives via the behavior narrative generator, using the executed action and before/after screenshots to describe what was changed. Finally, the behavior narratives are provided to the judge which selects the best trajectory through comparison.
  • Figure 4: Performance of bBoN on OSWorld with increasing number of rollouts.
  • Figure 5: Comparison of bBoN against WebJudge on OSWorld using GPT-5 Mini's rollouts. Average represents the average performance of the rollouts.
  • ...and 2 more figures