Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks

Haoyu Liu, Dingcheng Li, Lukas Rutishauser, Zeyu Zheng

TL;DR

Dual-Modality Multi-Stage Adversarial Safety Training (DMAST), a framework that formalizes the agent-attacker interaction as a two-player zero-sum Markov game and co-trains both players through a three-stage pipeline, significantly outperforms established training-based and prompt-based defenses.

Abstract

Multimodal web agents that process both screenshots and accessibility trees are increasingly deployed to interact with web interfaces, yet their dual-stream architecture opens an underexplored attack surface: an adversary who injects content into the webpage DOM simultaneously corrupts both observation channels with a consistent deceptive narrative. Our vulnerability analysis on MiniWob++ reveals that attacks including a visual component far outperform text-only injections, exposing critical gaps in text-centric VLM safety training. Motivated by this finding, we propose Dual-Modality Multi-Stage Adversarial Safety Training (DMAST), a framework that formalizes the agent-attacker interaction as a two-player zero-sum Markov game and co-trains both players through a three-stage pipeline: (1) imitation learning from a strong teacher model, (2) oracle-guided supervised fine-tuning that uses a novel zero-acknowledgment strategy to instill task-focused reasoning under adversarial noise, and (3) adversarial reinforcement learning via Group Relative Policy Optimization (GRPO) self-play. On out-of-distribution tasks, DMAST substantially mitigates adversarial risks while simultaneously doubling task completion efficiency. Our approach significantly outperforms established training-based and prompt-based defenses, demonstrating genuine co-evolutionary progress and robust generalization to complex, unseen environments.
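
To make stage (3) more concrete, the following is a minimal sketch of the group-relative advantage at the core of GRPO, combined with the zero-sum coupling between agent and attacker. It assumes scalar episode rewards, population standard deviation for normalization, and an attacker reward equal to the negated agent reward. The function name `group_relative_advantages` and the `eps` parameter are hypothetical, and the reward design actually used by DMAST is not specified in this summary.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own sampling group (GRPO-style).

    Each rollout in a group is scored relative to the group mean, scaled by
    the group standard deviation. `eps` (an assumption) avoids division by
    zero when all rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards for one group of four agent rollouts on the same task.
agent_rewards = [1.0, 0.0, 1.0, 0.0]

# Zero-sum assumption: whatever the agent gains, the attacker loses.
attacker_rewards = [-r for r in agent_rewards]

agent_adv = group_relative_advantages(agent_rewards)
attacker_adv = group_relative_advantages(attacker_rewards)
```

Under these assumptions the attacker's advantages are the exact negation of the agent's, so a rollout that helps one player's policy update hurts the other's by the same amount.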

Paper Structure (90 sections, 4 equations, 20 figures, 5 tables, 1 algorithm)

Figures (20)

  • Figure 1: Overview of the agent-attacker interaction process. At each timestep, the attacker observes the clean webpage state and injects malicious HTML/CSS into the DOM, simultaneously corrupting both the screenshot and the accessibility tree. The agent then acts on the modified observations, and the environment transitions accordingly.
  • Figure 2: Illustration of the HTML injection mechanism. The attacker VLM processes the clean state (b) to generate a structured action $\alpha_t$ (a) with color-coded components. This payload is injected into the DOM to produce the malicious state (c).
  • Figure 3: Cross-evaluation heatmaps of agent and attacker checkpoints across different training stages. Each cell reports the success rate when pairing a specific agent checkpoint (column) against a specific attacker checkpoint (row).
  • Figure 4: Diversity of attacker-generated HTML injections across self-play iterations. (a) Distinct-$n$ measures lexical diversity (unique $n$-grams / total $n$-grams); consistent growth indicates increasingly varied attack text. (b) Self-BLEU measures $n$-gram overlap between samples; decreasing scores confirm attacks become less repetitive over training.
  • Figure 5: Original (clean) webpage
  • ...and 15 more figures
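
The Distinct-$n$ metric from the Figure 4 caption (unique $n$-grams / total $n$-grams) can be sketched as follows. This is an illustrative implementation, not the paper's evaluation code: it assumes whitespace tokenization and pools $n$-grams across all samples, and the function name `distinct_n` is hypothetical. Self-BLEU is not implemented here.

```python
from collections import Counter
from typing import Iterable


def distinct_n(samples: Iterable[str], n: int) -> float:
    """Return unique n-grams divided by total n-grams, pooled over all samples.

    Tokenization is a plain whitespace split (an assumption of this sketch).
    Returns 0.0 when no n-grams can be formed.
    """
    counts: Counter = Counter()
    for text in samples:
        tokens = text.split()
        # Slide a window of length n over the token sequence.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    total = sum(counts.values())
    return len(counts) / total if total else 0.0


# Two toy injection strings; the token "click" repeats across samples.
score = distinct_n(["click here", "click there"], 1)  # 3 unique / 4 total = 0.75
```

A higher score means fewer repeated $n$-grams, which is the sense in which the caption reads growth in Distinct-$n$ as increasingly varied attack text.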