A Survey on (M)LLM-Based GUI Agents

Fei Tang, Haolei Xu, Hang Zhang, Siqi Chen, Xingyu Wu, Yongliang Shen, Wenqi Zhang, Guiyang Hou, Zeqi Tan, Yuchen Yan, Kaitao Song, Jian Shao, Weiming Lu, Jun Xiao, Yueting Zhuang

TL;DR

This survey addresses the rapid emergence of MLLM-based GUI Agents by decomposing their architecture into four components (perception, exploration, planning, and interaction) and by classifying approaches into four paradigms: general MLLM, trained MLLM, multi-agent, and LLM-based. It synthesizes methods for multimodal perception, knowledge acquisition and memory, advanced reasoning (CoT, ToT, ReAct, and graph or Monte Carlo tree search), and action grounding across diverse grounding strategies and action spaces. The paper critically analyzes existing datasets and benchmarks across static, simulated, and real-world environments, highlights cost, diversity, and security challenges, and proposes directions for standardized evaluation and higher-fidelity perception and planning. It also discusses domain-specific applications (mobile, desktop, web, and games) and outlines future RL-based improvements aimed at safer, more robust long-horizon GUI automation with improved multimodal grounding and planning. Overall, the work offers a roadmap for researchers and practitioners working toward end-to-end GUI automation across platforms.

Abstract

Graphical User Interface (GUI) Agents have emerged as a transformative paradigm in human-computer interaction, evolving from rule-based automation scripts to sophisticated AI-driven systems capable of understanding and executing complex interface operations. This survey provides a comprehensive examination of the rapidly advancing field of LLM-based GUI Agents, systematically analyzing their architectural foundations, technical components, and evaluation methodologies. We identify and analyze four fundamental components that constitute modern GUI Agents: (1) perception systems that integrate text-based parsing with multimodal understanding for comprehensive interface comprehension; (2) exploration mechanisms that construct and maintain knowledge bases through internal modeling, historical experience, and external information retrieval; (3) planning frameworks that leverage advanced reasoning methodologies for task decomposition and execution; and (4) interaction systems that manage action generation with robust safety controls. Through rigorous analysis of these components, we reveal how recent advances in large language models and multimodal learning have revolutionized GUI automation across desktop, mobile, and web platforms. We critically examine current evaluation frameworks, highlighting methodological limitations in existing benchmarks while proposing directions for standardization. This survey also identifies key technical challenges, including accurate element localization, effective knowledge retrieval, long-horizon planning, and safety-aware execution control, while outlining promising research directions for enhancing GUI Agents' capabilities. Our systematic review provides researchers and practitioners with a thorough understanding of the field's current state and offers insights into future developments in intelligent interface automation.
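
To make the four-component decomposition concrete, the following is a minimal sketch of how perception, exploration, planning, and interaction might be wired into a single agent step. It is an illustration only: every name here (`GUIAgent`, `Observation`, `Action`, `toy_perceive`, `toy_plan`, `toy_is_safe`) is hypothetical and not taken from the survey, and the toy functions stand in for what would be an (M)LLM and a real GUI environment.

```python
from dataclasses import dataclass, field
from typing import Callable, List


# All names below are hypothetical illustrations, not an API from the survey.
@dataclass
class Observation:
    elements: List[str]  # labels of the UI elements found on screen


@dataclass
class Action:
    kind: str    # e.g. "click"
    target: str  # label of the element the action applies to


@dataclass
class GUIAgent:
    perceive: Callable[[str], Observation]
    plan: Callable[[str, Observation, List[str]], List[Action]]
    is_safe: Callable[[Action], bool]
    knowledge: List[str] = field(default_factory=list)

    def step(self, task: str, raw_screen: str) -> List[Action]:
        # (1) Perception: turn the raw interface into a structured observation.
        obs = self.perceive(raw_screen)
        # (2) Exploration: record experience in a (very simple) knowledge base.
        self.knowledge.append(f"saw {len(obs.elements)} elements")
        # (3) Planning: propose candidate actions for the task.
        candidates = self.plan(task, obs, self.knowledge)
        # (4) Interaction: only pass on actions that clear the safety check.
        return [a for a in candidates if self.is_safe(a)]


def toy_perceive(raw_screen: str) -> Observation:
    # Assumes the "screen" is a comma-separated list of element labels.
    return Observation([e.strip() for e in raw_screen.split(",") if e.strip()])


def toy_plan(task: str, obs: Observation, knowledge: List[str]) -> List[Action]:
    # Clicks every element whose label is mentioned in the task text.
    return [Action("click", e) for e in obs.elements if e.lower() in task.lower()]


def toy_is_safe(action: Action) -> bool:
    # Blocks a destructive action as a stand-in for real safety controls.
    return action.target.lower() != "delete"


agent = GUIAgent(toy_perceive, toy_plan, toy_is_safe)
actions = agent.step("Open settings then delete account", "Settings, Delete, Help")
print(actions)
```

In this sketch the planner proposes clicking both "Settings" and "Delete", and the safety check in the interaction stage filters out the latter. Keeping the safety check separate from planning mirrors the abstract's framing of interaction systems as the place where execution is controlled.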

Paper Structure

This paper contains 57 sections, 9 figures, 1 table.

Figures (9)

  • Figure 1: Overview of GUI Agents
  • Figure 2: A Comprehensive Taxonomy of GUI Agents Components
  • Figure 3: Comparison between Text-based and MLLM-based GUI Agents. Text-based agents process textual information such as XML or DOM trees, while MLLM-based agents operate on screenshots. Some MLLM-based GUI Agents additionally employ specialized tools for element localization, including IconNet, YOLO, Grounding DINO, and OCR.
  • Figure 4: Recent work summary and classification of GUI Agents (2021-2025). The visualization presents the evolution of approaches across four primary categories: General MLLM, Training MLLM, Multi-Agent, and LLM-Based methodologies. The horizontal axis tracks chronological development (Year-Month), while the vertical axis indicates publication volume.
  • Figure 5: Exploration of GUI Agents. The diagram illustrates the knowledge acquisition process through three exploration types (internal, external, and historical), showing how conceptual knowledge is constructed and stored in logical knowledge bases before being retrieved and applied by agents.
  • ...and 4 more figures
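
The text-based perception path described in the Figure 3 caption can be sketched as follows. This is a minimal example under stated assumptions: the XML sample is hand-written in the style of an Android UI hierarchy dump (with `text`, `clickable`, and `bounds` attributes), and the helper `clickable_elements` is a hypothetical name, not something defined in the survey.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written sample in the style of an Android UI hierarchy dump (assumption).
SAMPLE_XML = """
<hierarchy>
  <node class="android.widget.FrameLayout" text="" clickable="false" bounds="[0,0][1080,1920]">
    <node class="android.widget.Button" text="Sign in" clickable="true" bounds="[100,200][500,300]" />
    <node class="android.widget.TextView" text="Welcome" clickable="false" bounds="[100,50][500,150]" />
  </node>
</hierarchy>
"""


def clickable_elements(xml_text):
    """Return the text and center point of every clickable node."""
    root = ET.fromstring(xml_text.strip())
    found = []
    for node in root.iter("node"):
        if node.get("clickable") != "true":
            continue
        # Bounds look like "[x1,y1][x2,y2]"; skip nodes without four numbers.
        numbers = re.findall(r"-?\d+", node.get("bounds", ""))
        if len(numbers) != 4:
            continue
        x1, y1, x2, y2 = map(int, numbers)
        found.append({
            "text": node.get("text", ""),
            "center": ((x1 + x2) // 2, (y1 + y2) // 2),
        })
    return found


elements = clickable_elements(SAMPLE_XML)
print(elements)
```

A text-based agent could pass a list like this to an LLM as a compact description of the screen. An MLLM-based agent would instead have to localize the same button from pixels, which is where detection and OCR tools come in.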