OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use

Xueyu Hu, Tao Xiong, Biao Yi, Zishu Wei, Ruixuan Xiao, Yurun Chen, Jiasheng Ye, Meiling Tao, Xiangxin Zhou, Ziyu Zhao, Yuhuai Li, Shengze Xu, Shenzhi Wang, Xinchen Xu, Shuofei Qiao, Zhaokai Wang, Kun Kuang, Tieyong Zeng, Liang Wang, Jiwei Li, Yuchen Eleanor Jiang, Wangchunshu Zhou, Guoyin Wang, Keting Yin, Zhou Zhao, Hongxia Yang, Fan Wu, Shengyu Zhang, Fei Wu

TL;DR

The paper surveys OS Agents, defined as (M)LLM-powered agents that operate within operating systems to automate GUI-driven tasks. It analyzes their fundamentals (the environment, observation space, and action space), their essential capabilities, and their construction through architecture choices, domain-specific training, and RL-based approaches. It reviews agent frameworks in terms of perception, planning, memory, and action, and details evaluation protocols and benchmarks spanning mobile, desktop, and web platforms. It also discusses open challenges such as safety and privacy, as well as personalization and self-evolution, and offers research directions and open-source resources for academia and industry.

Abstract

The dream of creating AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations. With the evolution of (multi-modal) large language models ((M)LLMs), this dream is closer to reality: (M)LLM-based agents that use computing devices (e.g., computers and mobile phones) to automate tasks, operating within the environments and interfaces (e.g., Graphical User Interface (GUI)) provided by operating systems (OS), have advanced significantly. This paper presents a comprehensive survey of these advanced agents, designated as OS Agents. We begin by elucidating the fundamentals of OS Agents, exploring their key components including the environment, observation space, and action space, and outlining essential capabilities such as understanding, planning, and grounding. We then examine methodologies for constructing OS Agents, focusing on domain-specific foundation models and agent frameworks. A detailed review of evaluation protocols and benchmarks highlights how OS Agents are assessed across diverse tasks. Finally, we discuss current challenges and identify promising directions for future research, including safety and privacy, personalization and self-evolution. This survey aims to consolidate the state of OS Agents research, providing insights to guide both academic inquiry and industrial development. An open-source GitHub repository is maintained as a dynamic resource to foster further innovation in this field. We present a 9-page version of our work, accepted by ACL 2025, to provide a concise overview of the domain.

Paper Structure

This paper contains 31 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Fundamentals of OS Agents.
  • Figure 2: Summary of the content about foundation models for OS Agents (foundation models section).
  • Figure 3: Summary of the content about agent frameworks for OS Agents (agent frameworks section).