TreeCUA: Efficiently Scaling GUI Automation with Tree-Structured Verifiable Evolution

Deyang Jiang, Jing Huang, Xuanle Zhao, Lei Chen, Liming Zheng, Fanfan Liu, Haibo Qiu, Peng Shi, Zhixiong Zeng

TL;DR

TreeCUA tackles the challenge of scaling GUI automation by organizing long-horizon planning trajectories into a tree and employing verifiable evolution. It introduces a multi-agent framework, world knowledge initialization, adaptive tree exploration, step verification, and global memory to produce high-quality, diverse trajectories; it extends TreeCUA with TreeCUA-DPO to leverage branch information for improved planning. The authors construct a two-stage SFT training pipeline and demonstrate state-of-the-art performance on OSWorld and strong generalization to out-of-domain tasks, with a public dataset release. Overall, the approach reduces data costs while enhancing GUI planning capabilities for CUAs across desktop, web, and OS environments.

Abstract

Effectively scaling GUI automation is essential for computer-use agents (CUAs); however, existing work primarily focuses on scaling GUI grounding rather than the more crucial GUI planning, which requires more sophisticated data collection. In reality, the exploration process of a CUA across apps/desktops/web pages typically follows a tree structure, with earlier functional entry points often being explored more frequently. Thus, organizing large-scale trajectories into tree structures can reduce data cost and streamline the data scaling of GUI planning. In this work, we propose TreeCUA to efficiently scale GUI automation with tree-structured verifiable evolution. We introduce a multi-agent collaborative framework that explores the environment, verifies actions, summarizes trajectories, and evaluates quality to generate high-quality and scalable GUI trajectories. To improve efficiency, we devise a novel tree-based topology to store and replay duplicate exploration nodes, and design an adaptive exploration algorithm to balance the depth (i.e., trajectory difficulty) and breadth (i.e., trajectory diversity). Moreover, we develop world knowledge guidance and global memory backtracking to avoid low-quality generation. Finally, we build on the abundant tree node information to propose the TreeCUA-DPO method, which improves GUI planning capability by referring to the branch information of adjacent trajectories. Experimental results show that TreeCUA and TreeCUA-DPO offer significant improvements, and out-of-domain (OOD) studies further demonstrate strong generalization. All trajectory node information and code will be available at https://github.com/UITron-hub/TreeCUA.
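
To make the tree-structured storage idea concrete, here is a minimal sketch of how trajectories that share early steps could be stored once and how sibling branches could be turned into preference pairs. It is an illustration under stated assumptions, not the authors' implementation: the `Node` class, the `verified` flag, the `insert` and `sibling_pairs` helpers, and the rule "a verified sibling is preferred over an unverified one" are all hypothetical, since the abstract does not specify how TreeCUA-DPO selects chosen and rejected branches.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """One explored GUI step; children are keyed by the action taken."""
    action: Optional[str] = None   # None marks the root (initial state)
    verified: bool = True          # hypothetical result of step verification
    children: Dict[str, "Node"] = field(default_factory=dict)


def insert(root: Node, steps: List[Tuple[str, bool]]) -> int:
    """Insert a trajectory of (action, verified) steps.

    Returns the number of steps that were already in the tree, i.e. the
    shared prefix that could be replayed instead of explored again.
    """
    node, reused = root, 0
    for action, verified in steps:
        child = node.children.get(action)
        if child is None:
            child = Node(action=action, verified=verified)
            node.children[action] = child
        else:
            reused += 1
        node = child
    return reused


def sibling_pairs(root: Node) -> List[Tuple[Tuple[str, ...], str, str]]:
    """Collect (shared prefix, preferred action, rejected action) triples.

    Assumption: among siblings under the same parent, a verified action is
    preferred over an unverified one.
    """
    pairs = []
    stack = [(root, ())]
    while stack:
        node, prefix = stack.pop()
        good = [c.action for c in node.children.values() if c.verified]
        bad = [c.action for c in node.children.values() if not c.verified]
        pairs.extend((prefix, g, b) for g in good for b in bad)
        for child in node.children.values():
            stack.append((child, prefix + (child.action,)))
    return pairs
```

In this sketch, inserting a second trajectory that starts with the same action as an earlier one reports one reused step, and the two diverging children of that shared node yield a single preference pair conditioned on the shared prefix.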

Paper Structure

This paper contains 44 sections, 3 equations, 7 figures, and 7 tables.

Figures (7)

  • Figure 1: Overview of the tree-structured verifiable evolution for scalable GUI trajectory synthesis. The strategy is divided into two phases: online concurrent exploration, followed by offline post-processing and improvement.
  • Figure 2: Analysis of exploration strategy. (a) The distribution of average branching factor across depths. A node with $n$ branches has a branching factor of $n-1$. (b) Efficiency comparison benchmarking tree-structured exploration (with and without node reuse) against linear baselines.
  • Figure 3: Impact of World Knowledge (WK) on Exploration Diversity in VS Code. (a) Semantic Task Discovery: Cumulative count of unique tasks, defined by a TF-IDF cosine similarity threshold of $< 0.65$. (b) Lexical Diversity: Type-Token Ratio (TTR) averaged over 20 random samples of 500 step-goals.
  • Figure 4: Analysis on Inter-Tree Action Redundancy. Analysis of action overlap between trees within the same setting. Redundancy is quantified via pairwise Jaccard similarity, matching actions by type and grid-quantized coordinates to mitigate pixel noise.
  • Figure 5: Comparison of reasoning quality between TreeCUA and Claude based on ROSCOE metrics.
  • ...and 2 more figures
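
Two of the metrics named in the captions above, the grid-quantized Jaccard similarity (Figure 4) and the Type-Token Ratio (Figure 3b), can be sketched as follows. This is a rough illustration only: the 50-pixel cell size, the `(type, x, y)` action tuple, whitespace tokenization, and averaging the pairwise scores into one number are assumptions, not details given in the captions.

```python
from itertools import combinations
from typing import Iterable, List, Set, Tuple

Action = Tuple[str, int, int]  # assumed shape: (action type, x, y) in pixels


def quantize(actions: Iterable[Action], cell: int = 50) -> Set[Action]:
    """Snap coordinates to a grid so nearby clicks count as the same action."""
    return {(kind, x // cell, y // cell) for kind, x, y in actions}


def jaccard(a: Set[Action], b: Set[Action]) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pairwise_jaccard(trees: List[List[Action]], cell: int = 50) -> float:
    """Average Jaccard similarity over all unordered pairs of trees."""
    keys = [quantize(tree, cell) for tree in trees]
    scores = [jaccard(a, b) for a, b in combinations(keys, 2)]
    return sum(scores) / len(scores) if scores else 0.0


def type_token_ratio(texts: Iterable[str]) -> float:
    """Unique tokens divided by total tokens, using lowercase whitespace tokens."""
    tokens = [tok for text in texts for tok in text.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0
```

With a 50-pixel cell, clicks at (101, 203) and (120, 240) fall in the same cell and match, which is the kind of pixel-noise tolerance the Figure 4 caption describes. The Figure 3b procedure would additionally average `type_token_ratio` over repeated random samples of step-goals.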