
ANCHOR: Branch-Point Data Generation for GUI Agents

Jinbiao Wei, Yilun Zhao, Kangqi Ni, Arman Cohan

TL;DR

Anchor is a scalable trajectory expansion framework for desktop GUI agents. It branches from verified seed demonstrations at meaningful GUI state changes to generate diverse, long-horizon, state-grounded supervision. The pipeline consists of seed collection, branch-point discovery, task proposal, and rollout execution with verification, followed by step-level filtering and denoising to maintain coherence. Empirical results on OSWorld and WindowsAgentArena show consistent improvements across multiple backbones (GLM-4.1V-9B, Qwen2.5-VL-7B, Qwen3-VL-8B) and cross-platform settings, outperforming zero-shot and task-driven baselines and demonstrating cross-domain generalization. The work indicates that branching-based data generation can provide high-quality, high-signal supervision while reducing reliance on perfect exploration or extensive human demonstrations, supporting scalable GUI automation across applications and operating systems.

Abstract

End-to-end GUI agents for real desktop environments require large amounts of high-quality interaction data, yet collecting human demonstrations is expensive and existing synthetic pipelines often suffer from limited task diversity or noisy, goal-drifting trajectories. We present Anchor, a trajectory expansion framework that bootstraps scalable desktop supervision from a small set of verified seed demonstrations. Starting from each seed, we identify branch points that correspond to meaningful state changes and propose new, state-grounded task variants conditioned on the current GUI context. An executing agent then follows the proposed instructions to generate new trajectories, while a verifier enforces task completion via state-aware checks and trajectory-level consistency. To improve supervision quality, we further apply task-conditioned step-level filtering to remove ungrounded actions and denoise post-branch segments to maintain coherent intent. Experiments on two standard desktop benchmarks, OSWorld and WindowsAgentArena, show that models fine-tuned on our expanded corpus achieve consistent improvements over zero-shot agents and representative synthesis baselines, and generalize across applications and operating systems.
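The expansion loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Step` and `Trajectory` types, the function names, and all five callables (`is_meaningful_change`, `propose_tasks`, `rollout`, `verify`, `is_grounded`) are hypothetical stand-ins for the branch-point detector, task proposer, executing agent, verifier, and step-level filter. GUI states are represented as plain strings, and the filter is assumed to apply only to the post-branch steps.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    state: str   # hypothetical stand-in for the GUI observation before the action
    action: str  # hypothetical stand-in for the GUI action taken in that state


@dataclass
class Trajectory:
    task: str
    steps: List[Step] = field(default_factory=list)


def find_branch_points(seed: Trajectory,
                       is_meaningful_change: Callable[[str, str], bool]) -> List[int]:
    """Return indices i whose state differs meaningfully from the previous step's state."""
    return [i for i in range(1, len(seed.steps))
            if is_meaningful_change(seed.steps[i - 1].state, seed.steps[i].state)]


def expand_seed(seed: Trajectory,
                is_meaningful_change: Callable[[str, str], bool],
                propose_tasks: Callable[[str], List[str]],
                rollout: Callable[[str, str], List[Step]],
                verify: Callable[[Trajectory], bool],
                is_grounded: Callable[[str, Step], bool]) -> List[Trajectory]:
    """Branch from a verified seed, roll out new tasks, and keep verified, filtered results."""
    expanded = []
    for i in find_branch_points(seed, is_meaningful_change):
        prefix = seed.steps[:i]            # seed steps that lead to the branch state
        branch_state = seed.steps[i].state
        for task in propose_tasks(branch_state):
            new_steps = rollout(branch_state, task)
            if not verify(Trajectory(task, prefix + new_steps)):
                continue                   # drop rollouts the verifier rejects
            # Assumption: step-level filtering is applied to post-branch steps only.
            kept = [s for s in new_steps if is_grounded(task, s)]
            expanded.append(Trajectory(task, prefix + kept))
    return expanded


# Toy usage with trivial stand-in components.
seed = Trajectory("open settings", [
    Step("desktop", "click menu"),
    Step("menu open", "click settings"),
    Step("settings window", "close"),
])
result = expand_seed(
    seed,
    is_meaningful_change=lambda prev, cur: prev != cur,
    propose_tasks=lambda state: [f"explore from {state}"],
    rollout=lambda state, task: [Step(state, "click"), Step(state, "noop")],
    verify=lambda traj: len(traj.steps) > 0,
    is_grounded=lambda task, step: step.action != "noop",
)
for traj in result:
    print(traj.task, [s.action for s in traj.steps])
```

In this toy run each seed step that changes the state becomes a branch point, so a three-step seed yields two new trajectories that reuse the seed prefix up to the branch. In a real pipeline each callable would presumably be backed by a model or an environment check rather than a lambda.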

Paper Structure (49 sections, 2 equations, 11 figures, 6 tables)


Figures (11)

  • Figure 1: Trajectory generation pipeline. (a) Human annotators collect seed trajectories and retain only high-quality demonstrations. (b) We identify branch points along each seed where meaningful UI state changes reveal new affordances. (c) We then expand each seed by branching into diverse task instructions and executing the resulting trajectories. (d) Finally, a task summarizer produces a task description, and a verifier checks task completion.
  • Figure 2: Scaling curve on OSWorld for In-domain data.
  • Figure 3: Scaling curve on OSWorld for Cross-domain data.
  • Figure 4: Prompts used to identify branch states for task diversification.
  • Figure 5: Prompts used for branch trajectory generation: summarizing progress at a branch state (top) and generating diverse follow-up tasks conditioned on the current UI state (bottom).
  • ...and 6 more figures