Systematic Categorization, Construction and Evaluation of New Attacks against Multi-modal Mobile GUI Agents
Yulong Yang, Xinshan Yang, Shuaidong Li, Chenhao Lin, Zhengyu Zhao, Chao Shen, Tianwei Zhang
TL;DR
This paper systematically investigates security vulnerabilities in multi-modal mobile GUI agents that integrate LLMs/MLLMs. It introduces a five-step threat-modeling methodology and the SecMoba framework to construct and evaluate 34 previously unreported attacks, validated through real-world case studies and large-scale AITW-based evaluations. The work reveals weaknesses across the perception, reasoning, memory, and collaboration modules, and shows that long-term memory and multi-agent collaboration affect attack success in nuanced ways. The findings highlight substantial security gaps and provide concrete defense directions, stressing the need for defense-aware mobile GUI agent designs and responsible disclosure practices to protect users in real-world deployments.
Abstract
The integration of Large Language Models (LLMs) and Multi-modal Large Language Models (MLLMs) into mobile GUI agents has significantly enhanced user efficiency and experience. However, this advancement also introduces potential security vulnerabilities that have yet to be thoroughly explored. In this paper, we present a systematic security investigation of multi-modal mobile GUI agents, addressing this gap in the existing literature. Our contributions are twofold: (1) we propose a novel threat modeling methodology, leading to the discovery and feasibility analysis of 34 previously unreported attacks, and (2) we design an attack framework to systematically construct and evaluate these attacks. Through a combination of real-world case studies and extensive dataset-driven experiments, we validate the severity and practicality of these attacks, highlighting the pressing need for robust security measures in mobile GUI agents.
