Environmental Injection Attacks against GUI Agents in Realistic Dynamic Environments
Yitong Zhang, Ximo Li, Liyi Cai, Jia Li
TL;DR
This work studies Environmental Injection Attacks (EIAs) against LVLM-powered GUI agents operating in open-world web environments. It formalizes a dynamic-environment threat model and shows that prior attacks largely fail under these realistic conditions. It introduces Chameleon, a framework with two components: LLM-Driven Environment Simulation (LES), which generates diverse, high-fidelity training contexts, and Attention Black Hole (ABH), which explicitly steers model attention toward the trigger region. The trigger is trained with a loss that combines $\mathcal{L}_{\text{CE}}$ and $\mathcal{L}_{\text{attn}}$. Across six websites and four LVLMs, Chameleon substantially outperforms baselines in attack success rate (ASR) and shows partial cross-model transferability among related models. Ablation and closed-loop evaluations confirm that both LES and ABH are necessary, while defenses such as safety prompts, verifiers, and random noise either offer limited protection or harm user experience. The findings highlight the need for robust, practical defenses for open-world GUI agents that mitigate realistic environmental attacks without sacrificing usability, and point to future work on automated trigger detection and resilient agent design.
Abstract
Graphical User Interface (GUI) agents are increasingly deployed to interact with online web services, yet their exposure to open-world content renders them vulnerable to Environmental Injection Attacks (EIAs). In these attacks, an attacker injects crafted triggers into a website to manipulate the behavior of GUI agents used by other users. In this paper, we find that most existing EIA studies fall short of realism. In particular, they fail to capture the dynamic nature of real-world web content, often assuming that a trigger's on-screen position and surrounding visual context remain largely consistent between training and testing. To better reflect practice, we introduce a realistic dynamic-environment threat model in which the attacker is a regular user and the trigger is embedded within a dynamically changing environment. Under this threat model, existing approaches largely fail, suggesting that their effectiveness in exposing GUI agent vulnerabilities has been substantially overestimated. To expose the hidden vulnerabilities of existing GUI agents effectively, we propose Chameleon, an attack framework with two key novelties designed for dynamic environments. (1) To synthesize more realistic training data, we introduce LLM-Driven Environment Simulation, which automatically generates diverse, high-fidelity webpage simulations that mimic the variability of real-world dynamic environments. (2) To optimize the trigger more effectively, we introduce Attention Black Hole, which converts attention weights into explicit supervisory signals. This mechanism encourages the agent to remain insensitive to irrelevant surrounding content, thereby improving robustness in dynamic environments. We evaluate Chameleon on six realistic websites and four representative LVLM-powered GUI agents, where it significantly outperforms existing methods.
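To make the idea of using attention weights as a supervisory signal concrete, the following is a minimal illustrative sketch of how a cross-entropy term and an attention term might be combined into one objective. It is not the paper's implementation: the negative-log form of the attention term, the weighting factor `lam`, and all function names are assumptions introduced here for illustration only.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target class under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


def attention_loss(attn_weights, trigger_mask, eps=1e-12):
    """Assumed form: negative log of the share of attention on trigger patches.

    attn_weights: non-negative attention weight per image patch.
    trigger_mask: True where the patch belongs to the trigger region.
    The loss approaches 0 as all attention concentrates on the trigger.
    """
    total = sum(attn_weights)
    on_trigger = sum(w for w, m in zip(attn_weights, trigger_mask) if m)
    return -math.log(on_trigger / total + eps)


def combined_loss(logits, target_index, attn_weights, trigger_mask, lam=1.0):
    """Hypothetical combination: L_CE + lam * L_attn (lam is an assumption)."""
    ce = cross_entropy(logits, target_index)
    attn = attention_loss(attn_weights, trigger_mask)
    return ce + lam * attn


# Toy example: three candidate actions, four image patches, trigger on patch 0.
loss = combined_loss(
    logits=[2.0, 0.5, -1.0],
    target_index=0,
    attn_weights=[0.7, 0.1, 0.1, 0.1],
    trigger_mask=[True, False, False, False],
)
```

In this sketch, minimizing the attention term pushes attention mass toward the trigger patches, which is one simple way to express the goal of making the agent less sensitive to the surrounding page content.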
