COLA: A Scalable Multi-Agent Framework For Windows UI Task Automation

Di Zhao, Longhui Ma, Siwei Wang, Miao Wang, Zhao Lv

TL;DR

COLA presents a scalable, collaborative multi-agent framework for Windows UI task automation that dynamically assigns subtasks to a pool of specialized decision agents through a planner and task scheduler, with memory modules for self-evolution and an interactive backtracking mechanism for non-destructive repairs. By decomposing complex tasks into scenario-specific subtasks and leveraging a plug-and-play agent pool, COLA achieves state-of-the-art performance on the GAIA benchmark, significantly outperforming baselines, particularly on more challenging levels. Key contributions include a hierarchical five-role architecture, distinct long- and short-term memory for agents, and an interactive rollback capability that enables human intervention without restarting workflows. The framework demonstrates strong potential for scalable, flexible automation of Windows tasks, with practical implications for AI-assisted desktop workflows and future extensions to more complex or safety-critical environments.

Abstract

With the rapid advancement of Large Language Models (LLMs), a growing number of studies have used LLMs as the cognitive core of agents to address complex decision-making tasks. In particular, recent research has demonstrated the potential of LLM-based agents for automating Windows GUI operations. However, existing methods face two critical challenges: (1) static agent architectures fail to adapt dynamically to the heterogeneous requirements of OS-level tasks, leading to inadequate scenario generalization; (2) agent workflows lack a fault-tolerance mechanism, so a single UI agent decision error requires re-executing the entire process. To address these limitations, we introduce COLA, a collaborative multi-agent framework for automating Windows UI operations. In this framework, a scenario-aware agent, the Task Scheduler, decomposes task requirements into atomic capability units and dynamically selects the optimal agent from a decision agent pool, effectively responding to the capability requirements of diverse scenarios. The decision agent pool supports plug-and-play expansion for enhanced flexibility. In addition, we equip all agents with a memory unit for self-evolution. Furthermore, we develop an interactive backtracking mechanism that enables humans to intervene and trigger state rollbacks for non-destructive process repair. Our experimental results on the GAIA benchmark demonstrate that the COLA framework achieves state-of-the-art performance with an average score of 31.89%, significantly outperforming baseline approaches without web API integration. Ablation studies further validate the individual contributions of our dynamic scheduling. The code is available at https://github.com/Alokia/COLA-demo.
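
The plug-and-play decision agent pool and the scheduler's agent selection can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the class names, the agent names, and the capability labels are hypothetical, and a simple capability-overlap count stands in for the scenario-aware matching that the abstract attributes to the Task Scheduler. It is not taken from the COLA code base.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionAgent:
    """Hypothetical stand-in for one specialised decision agent."""
    name: str
    capabilities: set[str]


@dataclass
class AgentPool:
    """Registry that lets new agents be added without changing the scheduler."""
    agents: dict[str, DecisionAgent] = field(default_factory=dict)

    def register(self, agent: DecisionAgent) -> None:
        self.agents[agent.name] = agent


def schedule(pool: AgentPool, required: set[str]) -> DecisionAgent:
    """Return the agent that covers the most required capability units.

    Assumption: set overlap replaces the model-driven matching described
    in the abstract; ties go to the agent registered first.
    """
    if not pool.agents:
        raise ValueError("agent pool is empty")
    return max(pool.agents.values(),
               key=lambda agent: len(agent.capabilities & required))


pool = AgentPool()
pool.register(DecisionAgent("browser_agent", {"web_search", "open_url"}))
pool.register(DecisionAgent("file_agent", {"open_file", "move_file"}))

first_choice = schedule(pool, {"web_search", "open_url"})

# Plug-and-play expansion: a new agent is registered at runtime and is
# immediately eligible for selection.
pool.register(DecisionAgent("code_agent", {"run_code", "open_file"}))
second_choice = schedule(pool, {"run_code"})

print(first_choice.name, second_choice.name)
```

The point of the sketch is the separation of concerns: the scheduler only sees the pool's interface, so extending the pool does not require touching the selection logic.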

Paper Structure

This paper contains 30 sections, 6 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: An illustration of the COLA multi-agent framework. First, the Planner takes a request $q$ from the user and decomposes it into a sequence of coarse-grained subtasks ($\mathcal{T}_{cg}$). The Task Scheduler then dynamically selects the optimal decision agents through scenario-aware matching. The selected Decision Agents perform hierarchical task refinement, using their domain-specific expertise to decompose the assigned subtasks into fine-grained subtasks ($\mathcal{T}_{fg}$), and produce an atomic action $O$ together with the intention $I$ behind that action. The Executor executes the action and obtains the environmental feedback result $R$. Finally, the Reviewer evaluates whether the action succeeded based on the environments $E_t$ and $E_{t+1}$ before and after execution, the intention $I$, and the result $R$. The judgment $J$ is then sent back to the selected Decision Agent. This cyclic refinement continues until all subtask requirements are satisfied, with the Task Scheduler orchestrating inter-subtask transitions. Throughout the process, humans can intervene in the workflow at any time, providing guidance to correct the agent's response.
  • Figure 2: A visual perception example for Microsoft Edge with information provided by pywinauto. The raw screenshot, annotated screenshot and interactive controls information make up the visual perception component $P_t$.
  • Figure 3: Number of questions covered for each skill. Each value is as reported in GAIA (Mialon et al., 2024).
  • Figure 4: A comparison between the traditional agent framework and COLA reveals key differences.
  • Figure 5: An abbreviated description of the workflow when COLA performs the task "The article ‘Technology in the Dystopian Novel’ by Gorman Beauchamp begins with a block quote attributed to a novelist from the Victorian era. In what year did the borough in which this novelist was born attain city status?"
  • ...and 1 more figure
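
The decide, execute, review cycle described in the Figure 1 caption, together with the state rollback mentioned in the abstract, can be sketched in Python as follows. This is a toy under stated assumptions: the environment is a plain dict standing in for the UI state $E_t$, the three role functions are hypothetical placeholders for the Decision Agent, Executor, and Reviewer, and rollback is modelled as restoring a deep-copied snapshot, which a real desktop session would not allow so directly.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    action: str     # atomic action O
    intention: str  # intention I behind the action


def propose(env: dict, judgment: Optional[bool]) -> Optional[Step]:
    """Toy decision agent: type one character at a time until the box holds 'ok'.

    A real agent would also use the previous judgment J; this one ignores it.
    """
    target = "ok"
    typed = env["text"]
    if typed == target:
        return None  # subtask requirements satisfied
    next_char = target[len(typed)]
    return Step(f"type:{next_char}", f"text becomes {typed + next_char!r}")


def execute(env: dict, action: str) -> str:
    """Toy executor: mutates env (E_t -> E_{t+1}) and returns the result R."""
    env["text"] += action.split(":", 1)[1]
    return "typed"


def review(before: dict, after: dict, step: Step, result: str) -> bool:
    """Toy reviewer: judgment J is True when the action changed the environment."""
    return result == "typed" and before != after


def run_subtask(env: dict, max_steps: int = 10) -> list:
    """Run the cycle and return the snapshot taken before each executed step."""
    snapshots = []
    judgment = None
    for _ in range(max_steps):
        step = propose(env, judgment)
        if step is None:
            break
        before = copy.deepcopy(env)          # E_t
        snapshots.append(before)
        result = execute(env, step.action)   # env is now E_{t+1}
        judgment = review(before, env, step, result)
    return snapshots


def rollback(env: dict, snapshots: list, index: int) -> None:
    """Restore env to the state before step `index` and discard later snapshots."""
    env.clear()
    env.update(copy.deepcopy(snapshots[index]))
    del snapshots[index:]


env = {"text": ""}
snaps = run_subtask(env)
print(env["text"])       # ok
rollback(env, snaps, 1)  # a human asks to return to the state before step 1
print(env["text"])       # o
```

After the rollback, calling `run_subtask(env)` again resumes from the restored state rather than from the beginning, which is the sense in which the repair avoids re-executing the whole process.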