IntentCUA: Learning Intent-level Representations for Skill Abstraction and Multi-Agent Planning in Computer-Use Agents
Seoyoung Lee, Seobin Yoon, Seongbeen Lee, Yoojung Chun, Dayoung Park, Doyeon Kim, Joo Yong Sim
TL;DR
IntentCUA tackles drift and inefficiency in long-horizon desktop automation by learning multi-view intent representations from interaction traces and organizing them into reusable skills stored in a memory-augmented planner. It combines a trace-to-intent pipeline, hierarchical IG/SG skill prototypes, and a Planner–Plan-Optimizer–Critic loop to stabilize execution and reduce re-planning. End-to-end evaluations show a 74.83% task success rate and a Step Efficiency Ratio (SER) of 0.91. Ablations highlight the additive value of multi-view abstraction and memory-grounded coordination, and the system generalizes well across web, local, and cross-application tasks. The work demonstrates that intent-level abstraction coupled with memory-guided cooperation can enable reliable, scalable desktop automation in complex, dynamic environments.
Abstract
Computer-use agents operate over long horizons under noisy perception, multi-window contexts, and evolving environment states. Existing approaches, from RL-based planners to trajectory retrieval, often drift from user intent and repeatedly solve routine subproblems, leading to error accumulation and inefficiency. We present IntentCUA, a multi-agent computer-use framework designed to stabilize long-horizon execution through intent-aligned plan memory. A Planner, Plan-Optimizer, and Critic coordinate over shared memory that abstracts raw interaction traces into multi-view intent representations and reusable skills. At runtime, intent prototypes retrieve subgroup-aligned skills and inject them into partial plans, reducing redundant re-planning and mitigating error propagation across desktop applications. In end-to-end evaluations, IntentCUA achieved a 74.83% task success rate with a Step Efficiency Ratio of 0.91, outperforming RL-based and trajectory-centric baselines. Ablations show that multi-view intent abstraction and shared plan memory jointly improve execution stability, with the cooperative multi-agent loop providing the largest gains on long-horizon tasks. These results highlight that system-level intent abstraction and memory-grounded coordination are key to reliable and efficient desktop automation in large, dynamic environments.
