Systematic Categorization, Construction and Evaluation of New Attacks against Multi-modal Mobile GUI Agents

Yulong Yang; Xinshan Yang; Shuaidong Li; Chenhao Lin; Zhengyu Zhao; Chao Shen; Tianwei Zhang

TL;DR

This paper systematically investigates security vulnerabilities in multi-modal mobile GUI agents that integrate LLMs/MLLMs. It introduces a five-step threat-modeling methodology and the SecMoba framework to construct and evaluate 34 previously unreported attacks, validated through real-world case studies and large-scale AITW-based evaluations. The work reveals notable weaknesses across perception, reasoning, memory, and collaboration modules, with nuanced effects from long-term memory and multi-agent collaboration on attack success. The findings highlight substantial security gaps and provide concrete defense directions, stressing the need for defense-aware mobile GUI designs and responsible disclosure practices to protect users in real-world deployments.

Abstract

The integration of Large Language Models (LLMs) and Multi-modal Large Language Models (MLLMs) into mobile GUI agents has significantly enhanced user efficiency and experience. However, this advancement also introduces potential security vulnerabilities that have yet to be thoroughly explored. In this paper, we present a systematic security investigation of multi-modal mobile GUI agents, addressing this critical gap in the existing literature. Our contributions are twofold: (1) we propose a novel threat modeling methodology, leading to the discovery and feasibility analysis of 34 previously unreported attacks, and (2) we design an attack framework to systematically construct and evaluate these threats. Through a combination of real-world case studies and extensive dataset-driven experiments, we validate the severity and practicality of those attacks, highlighting the pressing need for robust security measures in mobile GUI systems.

Paper Structure (28 sections, 6 equations, 9 figures, 7 tables, 2 algorithms)

Introduction
Preliminary of Mobile GUI Agents
Threat Modeling of Mobile GUI Agents
Attack Vector Categorization
Vulnerable Asset Identification
Attack Consequence Recognition
Attack Path Construction
Attack Feasibility Analysis
$\mathtt{SecMoba}$ Framework
Attack Payload Generation
Attack Evaluation
Case Study on Real-World Agents
Manipulating User's App Preference
Hijacking User's Purchasing Decision
DoS via Injecting Distracting Information
...and 13 more sections

Figures (9)

Figure 1: An illustration of possible attack paths in multi-modal mobile GUI agents.
Figure 2: Overview of $\mathtt{SecMoba}$ for constructing and evaluating different attacks against multi-modal mobile GUI agents.
Figure 3: Attack case 1: attacking the perception module of the agent to manipulate the user's app preference.
Figure 4: Attack case 2: attacking the reasoning module of the agent to hijack the user's purchasing decision.
Figure 5: Attack case 3: injecting distracting information on the wallpaper to achieve DoS against the agent.
...and 4 more figures

Theorems & Definitions (2)

Definition 1: Multi-modal mobile GUI agent
Definition 2: Adversary from the untrusted environment
