Systematic Categorization, Construction and Evaluation of New Attacks against Multi-modal Mobile GUI Agents

Yulong Yang, Xinshan Yang, Shuaidong Li, Chenhao Lin, Zhengyu Zhao, Chao Shen, Tianwei Zhang

TL;DR

This paper systematically investigates security vulnerabilities in multi-modal mobile GUI agents that integrate LLMs/MLLMs. It introduces a five-step threat-modeling methodology and the SecMoba framework to construct and evaluate 34 previously unreported attacks, validated through real-world case studies and large-scale AITW-based evaluations. The work reveals notable weaknesses across perception, reasoning, memory, and collaboration modules, with nuanced effects from long-term memory and multi-agent collaboration on attack success. The findings highlight substantial security gaps and provide concrete defense directions, stressing the need for defense-aware mobile GUI designs and responsible disclosure practices to protect users in real-world deployments.

Abstract

The integration of Large Language Models (LLMs) and Multi-modal Large Language Models (MLLMs) into mobile GUI agents has significantly enhanced user efficiency and experience. However, this advancement also introduces potential security vulnerabilities that have yet to be thoroughly explored. In this paper, we present a systematic security investigation of multi-modal mobile GUI agents, addressing this critical gap in the existing literature. Our contributions are twofold: (1) we propose a novel threat modeling methodology, leading to the discovery and feasibility analysis of 34 previously unreported attacks, and (2) we design an attack framework to systematically construct and evaluate these threats. Through a combination of real-world case studies and extensive dataset-driven experiments, we validate the severity and practicality of these attacks, highlighting the pressing need for robust security measures in mobile GUI systems.

Paper Structure (28 sections, 6 equations, 9 figures, 7 tables, 2 algorithms)

Figures (9)

  • Figure 1: An illustration of possible attack paths in multi-modal mobile GUI agents.
  • Figure 2: Overview of $\mathtt{SecMoba}$ for constructing and evaluating different attacks against multi-modal mobile GUI agents.
  • Figure 3: Attack case 1: attacking the perception module of the agent to manipulate the user's app preference.
  • Figure 4: Attack case 2: attacking the reasoning module of the agent to hijack the user's purchasing decision.
  • Figure 5: Attack case 3: injecting distracting information on the wallpaper to achieve DoS against the agent.
  • ...and 4 more figures

Theorems & Definitions (2)

  • Definition 1: Multi-modal mobile GUI agent
  • Definition 2: Adversary from the untrusted environment