IntentCUA: Learning Intent-level Representations for Skill Abstraction and Multi-Agent Planning in Computer-Use Agents

Seoyoung Lee, Seobin Yoon, Seongbeen Lee, Yoojung Chun, Dayoung Park, Doyeon Kim, Joo Yong Sim

TL;DR

IntentCUA tackles drift and inefficiency in long-horizon desktop automation by learning multi-view intent representations from interaction traces and organizing them into reusable skills stored in a memory-augmented planner. It combines a trace-to-intent pipeline, hierarchical IG/SG skill prototypes, and a Planner–Plan-Optimizer–Critic loop to stabilize execution and reduce re-planning. End-to-end evaluations show 74.83% task success and SER 0.91, with ablations highlighting the additive value of multi-view abstraction and memory-grounded coordination, and strong generalization across web, local, and cross-application tasks. The work demonstrates that intent-level abstraction coupled with memory-guided cooperation can enable reliable, scalable desktop automation in complex, dynamic environments.

Abstract

Computer-use agents operate over long horizons under noisy perception, multi-window contexts, and evolving environment states. Existing approaches, from RL-based planners to trajectory retrieval, often drift from user intent and repeatedly solve routine subproblems, leading to error accumulation and inefficiency. We present IntentCUA, a multi-agent computer-use framework designed to stabilize long-horizon execution through intent-aligned plan memory. A Planner, Plan-Optimizer, and Critic coordinate over shared memory that abstracts raw interaction traces into multi-view intent representations and reusable skills. At runtime, intent prototypes retrieve subgroup-aligned skills and inject them into partial plans, reducing redundant re-planning and mitigating error propagation across desktop applications. In end-to-end evaluations, IntentCUA achieved a 74.83% task success rate with a Step Efficiency Ratio of 0.91, outperforming RL-based and trajectory-centric baselines. Ablations show that multi-view intent abstraction and shared plan memory jointly improve execution stability, with the cooperative multi-agent loop providing the largest gains on long-horizon tasks. These results highlight that system-level intent abstraction and memory-grounded coordination are key to reliable and efficient desktop automation in large, dynamic environments.

Paper Structure

This paper contains 29 sections, 10 equations, 9 figures, 5 tables, 1 algorithm.

Figures (9)

  • Figure 1: Overview of IntentCUA. Offline: raw user traces are multi-view labeled, embedded into a shared intent space, and clustered into intent groups (IG) and subgroups (SG); SG action patterns are converted into parameterized skill schemas (“skill hints”) and stored with their SG in the IG/SG index, while plan memory stores only user-approved global plans (G). Online: the Planner/Plan-Optimizer/Critic query and reuse skills; cache-first reuse and template-based gap filling reduce re-planning on long-horizon desktop tasks.
  • Figure 2: Multi-view intent representation. Control traces use [E,A,D], browsing traces [E,K,D]. A multi-view encoder aligns views into a shared space, inducing environment-centric IG and finer SG. SG centroids enable retrieval, and SG action patterns are converted into parameterized skill schemas (“skill hints”) with verb–argument structure for planning.
  • Figure 3: Cache-first planning with plan memory. A query intent is gated by IG and ranked over SG. Case 1 (miss): synthesize a plan from retrieved skill templates. Case 2 (hit): reuse the stored plan. Case 3 (partial): align to the nearest plan and fill gaps with SG-derived skill hints, reducing retries.
  • Figure 4: IntentCUA in action: the system recalls intent units from memory and decomposes a multi-application command into intent-level plan units, each executed through learned skills and recomposed into an end-to-end automation plan.
  • Figure 5: Success rate by step length (bin size = 5 steps). The x-axis shows step-length bins and the y-axis shows task success rate (%).
  • ...and 4 more figures
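
The cache-first flow described in the Figure 3 caption (gate by IG, rank over SG, then branch on miss, hit, or partial match) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the cosine-similarity scoring, the two thresholds, the data classes, and the "append missing hints" gap-filling rule are all assumptions made only for illustration.

```python
from __future__ import annotations

import math
from dataclasses import dataclass

# Hypothetical thresholds; the captions above do not state any values.
HIT_THRESHOLD = 0.95
PARTIAL_THRESHOLD = 0.70


@dataclass
class Subgroup:
    name: str
    centroid: list[float]
    skill_hints: list[str]  # parameterized skill schemas, shown here as plain strings


@dataclass
class IntentGroup:
    name: str
    centroid: list[float]
    subgroups: list[Subgroup]


@dataclass
class StoredPlan:
    intent: list[float]  # embedding of the intent this plan was approved for
    steps: list[str]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cache_first_plan(query, groups, plan_memory):
    """Return (case, steps), where case is 'hit', 'partial', or 'miss'."""
    # Gate: pick the intent group (IG) closest to the query intent.
    ig = max(groups, key=lambda g: cosine(query, g.centroid))
    # Rank: pick the best subgroup (SG) inside that IG.
    sg = max(ig.subgroups, key=lambda s: cosine(query, s.centroid))

    # Find the nearest stored plan, if plan memory is non-empty.
    best_plan, best_sim = None, -1.0
    for plan in plan_memory:
        sim = cosine(query, plan.intent)
        if sim > best_sim:
            best_plan, best_sim = plan, sim

    if best_plan is not None and best_sim >= HIT_THRESHOLD:
        # Hit: reuse the stored plan unchanged.
        return "hit", list(best_plan.steps)
    if best_plan is not None and best_sim >= PARTIAL_THRESHOLD:
        # Partial: start from the nearest plan and fill gaps with SG skill hints.
        # Appending the missing hints is a deliberately naive gap-filling rule.
        steps = list(best_plan.steps)
        steps += [hint for hint in sg.skill_hints if hint not in steps]
        return "partial", steps
    # Miss: synthesize a plan directly from the SG's skill templates.
    return "miss", list(sg.skill_hints)
```

In this sketch the plan memory is a flat list that is scanned linearly; a real index over IG/SG centroids would avoid the scan, and how the actual system scores matches or fills gaps is not stated in the captions above.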