ANCHOR: Branch-Point Data Generation for GUI Agents
Jinbiao Wei, Yilun Zhao, Kangqi Ni, Arman Cohan
TL;DR
Anchor introduces a scalable trajectory expansion framework for desktop GUI agents by branching from verified seed demonstrations at meaningful GUI state changes to generate diverse, long-horizon, state-grounded supervision. The pipeline includes seed collection, branch-point discovery, task proposal, and rollout execution with verification, followed by step-level filtering and denoising to maintain coherence. Empirical results on OSWorld and WindowsAgentArena show consistent improvements across multiple backbones (GLM-4.1V-9B, Qwen2.5-VL-7B, Qwen3-VL-8B) and cross-platform settings, outperforming zero-shot and task-driven baselines and demonstrating cross-domain generalization. The work shows that branching-based data generation can provide high-quality, high-signal supervision while reducing reliance on perfect exploration or extensive human demonstrations, enabling scalable GUI automation across apps and OSs.
Abstract
End-to-end GUI agents for real desktop environments require large amounts of high-quality interaction data, yet collecting human demonstrations is expensive, and existing synthetic pipelines often suffer from limited task diversity or noisy, goal-drifting trajectories. We present Anchor, a trajectory expansion framework that bootstraps scalable desktop supervision from a small set of verified seed demonstrations. Starting from each seed, we identify branch points that correspond to meaningful state changes and propose new, state-grounded task variants conditioned on the current GUI context. An executing agent then follows the proposed instructions to generate new trajectories, while a verifier enforces task completion via state-aware checks and trajectory-level consistency. To improve supervision quality, we further apply task-conditioned step-level filtering to remove ungrounded actions and denoise post-branch segments to maintain coherent intent. Experiments on two standard desktop benchmarks, OSWorld and WindowsAgentArena, show that models fine-tuned on our expanded corpus achieve consistent improvements over zero-shot agents and representative synthesis baselines, and generalize across applications and operating systems.
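The expansion loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Everything below is a hypothetical stand-in: the representation of a GUI state as a set of element labels, the use of Jaccard distance with a fixed threshold to decide what counts as a "meaningful state change", and the `propose`, `execute`, and `verify` callables, which stand in for the task proposer, executing agent, and verifier. The step-level filtering and denoising stage is omitted.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional

# Assumption: a GUI state is modelled as the set of visible element labels.
State = FrozenSet[str]


@dataclass
class Step:
    action: str
    state_after: State


@dataclass
class Trajectory:
    task: str
    initial_state: State
    steps: List[Step]


def state_change(a: State, b: State) -> float:
    """Jaccard distance between two states (0 = identical, 1 = disjoint)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def find_branch_points(seed: Trajectory, threshold: float = 0.5) -> List[int]:
    """Indices of seed steps whose resulting state differs enough from the previous one."""
    points: List[int] = []
    prev = seed.initial_state
    for i, step in enumerate(seed.steps):
        if state_change(prev, step.state_after) >= threshold:
            points.append(i)
        prev = step.state_after
    return points


def expand(
    seed: Trajectory,
    propose: Callable[[State], List[str]],
    execute: Callable[[str, State], Optional[List[Step]]],
    verify: Callable[[Trajectory], bool],
    threshold: float = 0.5,
) -> List[Trajectory]:
    """Branch from the seed at each branch point and keep only verified rollouts."""
    expanded: List[Trajectory] = []
    for i in find_branch_points(seed, threshold):
        prefix = seed.steps[: i + 1]           # verified seed steps up to the branch
        branch_state = prefix[-1].state_after  # GUI context the proposer conditions on
        for task in propose(branch_state):
            rollout = execute(task, branch_state)
            if rollout is None:                # executor failed to produce a rollout
                continue
            candidate = Trajectory(task, seed.initial_state, prefix + rollout)
            if verify(candidate):
                expanded.append(candidate)
    return expanded
```

In this sketch each expanded trajectory reuses the verified seed prefix up to the branch point, so new tasks start from a state that is known to be reachable rather than from open-ended exploration.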
