LaSM: Layer-wise Scaling Mechanism for Defending Pop-up Attack on GUI Agents

Zihe Yan, Zhuosheng Zhang, Jiaping Gui, Gongshen Liu

Abstract

Graphical user interface (GUI) agents built on multimodal large language models (MLLMs) have recently demonstrated strong decision-making abilities in screen-based interaction tasks. However, they remain highly vulnerable to pop-up-based environmental injection attacks, where malicious visual elements divert model attention and lead to unsafe or incorrect actions. Existing defense methods either require costly retraining or perform poorly under inductive interference. In this work, we systematically study how such attacks alter the attention behavior of GUI agents and uncover a layer-wise attention divergence pattern between correct and incorrect outputs. Based on this insight, we propose LaSM, a Layer-wise Scaling Mechanism that selectively amplifies attention and MLP modules in critical layers. LaSM improves the alignment between model saliency and task-relevant regions without additional training. Extensive experiments across multiple datasets demonstrate that our method significantly improves the defense success rate and exhibits strong robustness, while having negligible impact on the model's general capabilities. Our findings reveal that attention misalignment is a core vulnerability in MLLM agents and can be effectively addressed through selective layer-wise modulation. Our code is available at https://github.com/YANGTUOMAO/LaSM.

Paper Structure

This paper contains 43 sections, 5 equations, 18 figures, 6 tables.

Figures (18)

  • Figure 1: Attention heatmaps from different layers over the same input image. Brighter regions indicate stronger attention on relevant areas. Heatmaps are generated with the Qwen2-VL-7B model.
  • Figure 2: Each subfigure shows attention heatmaps (left) and layer-wise cosine similarities (right) over the target region (red box). (a) corresponds to the <button-confirm> region, and (b) to the <icon-cross> region.
  • Figure 3: Illustration of direct scaling applied to layers (highlighted in red) with highest cosine similarity variance, targeting both attention and MLP weights.
  • Figure 4: Illustration of progressive layer range narrowing, where the final narrowed range is marked by the layers highlighted in red.
  • Figure 5: Defense success rate (DSR) comparison under different layer scaling strategies.
  • ...and 13 more figures
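
The caption of Figure 3 describes scaling the attention and MLP weights of the layers whose cosine similarity varies most. The sketch below illustrates that idea on plain Python lists and is not the authors' implementation. Several things are assumptions made for illustration: the function names `select_layers_by_variance` and `scale_layers`, the input layout (one list of similarity values per layer), taking the variance across samples, and the scaling factor `alpha`. For the actual procedure and hyperparameters, see the repository linked in the abstract.

```python
from statistics import pvariance


def select_layers_by_variance(sims_per_layer, k):
    """Return the k layer indices whose similarity scores vary the most.

    sims_per_layer[l] holds one cosine-similarity value per sample for
    layer l (hypothetical input layout, not taken from the paper).
    """
    variances = [pvariance(s) for s in sims_per_layer]
    ranked = sorted(range(len(variances)), key=lambda l: variances[l], reverse=True)
    return sorted(ranked[:k])


def scale_layers(weights, layers, alpha):
    """Return a copy of `weights` with the chosen layers multiplied by alpha.

    weights[l] is a dict {"attn": matrix, "mlp": matrix}, where a matrix is
    a list of rows (a stand-in for real model tensors). Layers outside
    `layers` are copied with a factor of 1.0, so the input is not mutated.
    """
    chosen = set(layers)
    scaled = []
    for idx, layer in enumerate(weights):
        factor = alpha if idx in chosen else 1.0
        scaled.append({
            name: [[factor * w for w in row] for row in matrix]
            for name, matrix in layer.items()
        })
    return scaled
```

With a real model, the same two steps would operate on the attention and MLP parameter tensors of each transformer block rather than on nested lists; the choice of `k` and `alpha` is left open here because this section does not state them.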