UFO2: The Desktop AgentOS

Chaoyun Zhang, He Huang, Chiming Ni, Jian Mu, Si Qin, Shilin He, Lu Wang, Fangkai Yang, Pu Zhao, Chao Du, Liqun Li, Yu Kang, Zhao Jiang, Suzhen Zheng, Rujia Wang, Jiaxu Qian, Minghua Ma, Jian-Guang Lou, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang

TL;DR

UFO2 tackles the fragility and inefficiency of traditional desktop automation by introducing a deeply integrated Windows AgentOS. It centralizes task planning in a HostAgent and delegates application-specific control to AppAgents, leveraging a hybrid GUI–API action layer, a hybrid control detector, and a continuous knowledge substrate to learn from experience. The PiP interface isolates automation from the main user session, enabling concurrent work without disruption. Empirical evaluation across 20+ real-world Windows apps shows UFO2 outperforms state-of-the-art CUAs in robustness and efficiency, with notable gains from speculative multi-action execution and API-enabled actions. Overall, the work demonstrates the practical viability and scalability of OS-level automation as a programmable, modular subsystem of the desktop environment.

Abstract

Recent Computer-Using Agents (CUAs), powered by multimodal large language models (LLMs), offer a promising direction for automating complex desktop workflows through natural language. However, most existing CUAs remain conceptual prototypes, hindered by shallow OS integration, fragile screenshot-based interaction, and disruptive execution. We present UFO2, a multiagent AgentOS for Windows desktops that elevates CUAs into practical, system-level automation. UFO2 features a centralized HostAgent for task decomposition and coordination, alongside a collection of application-specialized AppAgents equipped with native APIs, domain-specific knowledge, and a unified GUI–API action layer. This architecture enables robust task execution while preserving modularity and extensibility. A hybrid control detection pipeline fuses Windows UI Automation (UIA) with vision-based parsing to support diverse interface styles. Runtime efficiency is further enhanced through speculative multi-action planning, which reduces per-step LLM overhead. Finally, a Picture-in-Picture (PiP) interface enables automation within an isolated virtual desktop, allowing agents and users to operate concurrently without interference. We evaluate UFO2 across more than 20 real-world Windows applications, demonstrating substantial improvements in robustness and execution accuracy over prior CUAs. Our results show that deep OS integration unlocks a scalable path toward reliable, user-aligned desktop automation.
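
As an illustration of the hybrid control detection idea mentioned in the abstract, the following minimal Python sketch fuses two lists of detected controls: one standing in for UIA output and one for a vision-based parser. It is not the UFO2 implementation. The `Control` type, the `fuse_controls` function, the overlap test based on intersection-over-union (IoU), and the 0.5 threshold are all assumptions made for the example; the abstract only states that the two sources are fused.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (left, top, right, bottom) in screen pixels; a hypothetical representation.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Control:
    """A detected UI control. All field names here are illustrative."""
    name: str
    box: Box
    source: str  # "uia" or "vision"


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    left, top = max(a[0], b[0]), max(a[1], b[1])
    right, bottom = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, right - left) * max(0.0, bottom - top)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fuse_controls(
    uia: List[Control],
    vision: List[Control],
    iou_threshold: float = 0.5,  # assumed value, not taken from the paper
) -> List[Control]:
    """Keep every UIA control, then add vision detections that do not
    substantially overlap any UIA control (treated as duplicates otherwise)."""
    merged = list(uia)
    for candidate in vision:
        if all(iou(candidate.box, kept.box) < iou_threshold for kept in uia):
            merged.append(candidate)
    return merged


uia_controls = [Control("Save", (0, 0, 100, 40), "uia")]
vision_controls = [
    Control("save_icon", (2, 2, 98, 38), "vision"),             # overlaps "Save"
    Control("custom_canvas_btn", (200, 200, 260, 240), "vision"),  # no UIA match
]
fused = fuse_controls(uia_controls, vision_controls)
print([c.name for c in fused])
```

In this sketch the structured UIA entries take priority, and the vision results only fill gaps, such as custom-drawn widgets that expose no accessibility metadata. A real pipeline would likely need additional handling (for example, nested controls or differing coordinate scales) that is out of scope here.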

Paper Structure

This paper contains 75 sections, 25 figures, 8 tables, 1 algorithm.

Figures (25)

  • Figure 1: A comparison of (a) existing CUAs and (b) desktop AgentOS UFO2.
  • Figure 2: An overview of the architecture of UFO2.
  • Figure 3: High-level architecture of HostAgent as a control-plane orchestrator.
  • Figure 4: Control-state transitions managed by HostAgent.
  • Figure 5: Architecture of an AppAgent, the per-application execution runtime in UFO2.
  • ...and 20 more figures