On the Feasibility of Using MultiModal LLMs to Execute AR Social Engineering Attacks
Ting Bi, Chenghang Ye, Zheyu Yang, Ziyi Zhou, Cui Tang, Jun Zhang, Zui Tao, Kailong Wang, Liting Zhou, Yang Yang, Tianlong Yu
TL;DR
This work investigates the feasibility of AR-driven social engineering using Multimodal LLMs through the SEAR framework, which combines AR-based social context synthesis, role-based multimodal retrieval-augmented generation (RAG), and a ReInteract social engineering (SE) agent for adaptive attack strategies. The authors validate SEAR via an IRB-approved study with 60 participants across three configurations, collecting 180 annotated conversations in simulated social scenarios. Results show that SEAR markedly increases trust and susceptibility to high-risk behaviors (e.g., 93.3% of participants likely to click phishing links and 93% likely to accept social app requests), underscoring a significant security risk and the need for defensive measures. The paper contributes a practical AR-LLM attack prototype, an open dataset from the IRB-approved study, and a foundation for developing AR/LLM defense mechanisms against advanced social engineering threats.
Abstract
Augmented Reality (AR) and Multimodal Large Language Models (LLMs) are rapidly evolving, providing unprecedented capabilities for human-computer interaction. However, their integration introduces a new attack surface for social engineering. In this paper, we systematically investigate, for the first time, the feasibility of orchestrating AR-driven social engineering attacks using Multimodal LLMs. We do so via our proposed SEAR framework, which operates through three key phases: (1) AR-based social context synthesis, which fuses multimodal inputs (visual, auditory, and environmental cues); (2) role-based multimodal RAG (Retrieval-Augmented Generation), which dynamically retrieves and integrates contextual data while preserving character differentiation; and (3) ReInteract social engineering agents, which execute adaptive multiphase attack strategies through inference interaction loops. To evaluate SEAR, we conducted an IRB-approved study with 60 participants across three experimental configurations (unassisted, AR+LLM, and full SEAR pipeline), compiling a new dataset of 180 annotated conversations in simulated social scenarios. Our results show that SEAR is highly effective at eliciting high-risk behaviors (e.g., 93.3% of participants were susceptible to email phishing). The framework was particularly effective in building trust, with 85% of targets willing to accept an attacker's call after an interaction. We also identified notable limitations, such as interactions being perceived as "occasionally artificial" due to authenticity gaps. This work provides a proof of concept for AR-LLM-driven social engineering attacks and insights for developing defensive countermeasures against next-generation augmented reality threats.
