Environmental Injection Attacks against GUI Agents in Realistic Dynamic Environments

Yitong Zhang, Ximo Li, Liyi Cai, Jia Li

TL;DR

This work studies Environmental Injection Attacks (EIAs) against GUI agents powered by large vision-language models (LVLMs) in open-world web environments. It formalizes a dynamic-environment threat model and shows that prior attacks fail under these realistic conditions. It introduces Chameleon, a framework with two components: LLM-Driven Environment Simulation (LES), which generates diverse, high-fidelity training contexts, and Attention Black Hole (ABH), which explicitly steers model attention to the trigger region. Training combines the losses $\mathcal{L}_{\text{CE}}$ and $\mathcal{L}_{\text{attn}}$. Across six websites and four LVLMs, Chameleon substantially outperforms baselines in attack success rate (ASR) and shows partial cross-model transferability among related models, underscoring hidden vulnerabilities. Ablation and closed-loop evaluations confirm that both LES and ABH are necessary, while defenses such as safety prompts, verifiers, and random noise either offer limited protection or harm user experience. The findings highlight the need for robust, practical defenses tailored to open-world GUI agents that mitigate realistic environmental attacks without sacrificing usability, and point to future work on automated trigger detection and resilient agent design.

Abstract

Graphical User Interface (GUI) agents are increasingly deployed to interact with online web services, yet their exposure to open-world content renders them vulnerable to Environmental Injection Attacks (EIAs). In these attacks, an attacker can inject crafted triggers into websites to manipulate the behavior of GUI agents used by other users. In this paper, we find that most existing EIA studies fall short of realism. In particular, they fail to capture the dynamic nature of real-world web content, often assuming that a trigger's on-screen position and surrounding visual context remain largely consistent between training and testing. To better reflect practice, we introduce a realistic dynamic-environment threat model in which the attacker is a regular user and the trigger is embedded within a dynamically changing environment. Under this threat model, existing approaches largely fail, suggesting that their effectiveness in exposing GUI agent vulnerabilities has been substantially overestimated. To expose the hidden vulnerabilities of existing GUI agents effectively, we propose Chameleon, an attack framework with two key novelties designed for dynamic environments. (1) To synthesize more realistic training data, we introduce LLM-Driven Environment Simulation, which automatically generates diverse, high-fidelity webpage simulations that mimic the variability of real-world dynamic environments. (2) To optimize the trigger more effectively, we introduce Attention Black Hole, which converts attention weights into explicit supervisory signals. This mechanism encourages the agent to remain insensitive to irrelevant surrounding content, thereby improving robustness in dynamic environments. We evaluate Chameleon on six realistic websites and four representative LVLM-powered GUI agents, where it significantly outperforms existing methods.
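
To make the Attention Black Hole idea concrete, the following is a minimal sketch of how attention weights could be turned into a supervisory signal. It is not the paper's definition of $\mathcal{L}_{\text{attn}}$, which this summary does not give. The function names, the negative-log form of the loss, and the weighting factor `lam` are all assumptions made for illustration.

```python
import math


def attention_concentration_loss(attn, trigger_mask, eps=1e-12):
    """Hypothetical attention loss: negative log of the share of attention
    mass that falls on trigger patches.

    attn: non-negative attention weights, one per image patch.
    trigger_mask: booleans of the same length; True marks a trigger patch.
    """
    if len(attn) != len(trigger_mask):
        raise ValueError("attn and trigger_mask must have the same length")
    total = sum(attn)
    if total <= 0:
        raise ValueError("attention weights must have positive total mass")
    on_trigger = sum(a for a, m in zip(attn, trigger_mask) if m)
    # The loss is near zero when all attention sits on the trigger and
    # grows as attention disperses to the surrounding content.
    return -math.log(on_trigger / total + eps)


def combined_loss(ce_loss, attn_loss, lam=1.0):
    """Assumed combination: cross-entropy plus a weighted attention term."""
    return ce_loss + lam * attn_loss


# Attention concentrated on the trigger patch versus spread evenly.
mask = [True, False, False, False]
focused = attention_concentration_loss([1.0, 0.0, 0.0, 0.0], mask)
dispersed = attention_concentration_loss([0.25, 0.25, 0.25, 0.25], mask)
```

In this sketch, `focused` is close to zero and `dispersed` is close to $\ln 4$, so minimizing the term pushes attention toward the trigger region and away from the changing surroundings.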

Paper Structure

This paper contains 24 sections, 11 equations, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: An illustration of the dynamic environment in realistic GUI agent application scenarios. The two screenshots show consecutive searches for "Apple" on the same site, https://search.jd.com/. We assume the image highlighted by the red box is a trigger uploaded by an attacker.
  • Figure 2: Attention maps for the two cases. The red box marks the trigger image; warmer colors indicate higher attention. In the successful case, attention is concentrated on the trigger image region, whereas in the unsuccessful case, attention is dispersed across the screenshot.
  • Figure 3: Overview of our proposed Chameleon.
  • Figure 4: Illustrative screenshots for each website. Trigger images are outlined in red.
  • Figure 5: Transferability of Chameleon across models. Each cell represents the ASR (%) when the trigger image is trained on the surrogate model (row) and tested on the target model (column).
  • ...and 3 more figures