Dissecting Adversarial Robustness of Multimodal LM Agents

Chen Henry Wu; Rishi Shah; Jing Yu Koh; Ruslan Salakhutdinov; Daniel Fried; Aditi Raghunathan

TL;DR

The paper studies the adversarial robustness of multimodal LM agents deployed in realistic web environments. It introduces ARE, a graph-based framework that measures how adversarial information propagates through an agent's components and defines an edge-wise robustness measure, $\lambda(e)$, to attribute attack impact to individual edges. By extending VisualWebArena to VWA-Adv with 200 curated adversarial tasks, the authors show that even leading agents built on black-box LMs with reflection or tree search can be hijacked with imperceptible perturbations, with adversarial success rates of up to $67\%$. New components can improve robustness, but they can also introduce new vulnerabilities. The work also explores defenses, many of which have limited effectiveness, and argues for principled, component-level robustness strategies as agent architectures grow more complex. All tasks and code are released to support ongoing research.

Abstract

As language models (LMs) are used to build autonomous agents in real environments, ensuring their adversarial robustness becomes a critical challenge. Unlike chatbots, agents are compound systems with multiple components taking actions, which existing LM safety evaluations do not adequately address. To bridge this gap, we manually create 200 targeted adversarial tasks and evaluation scripts in a realistic threat model on top of VisualWebArena, a real environment for web agents. To systematically examine the robustness of agents, we propose the Agent Robustness Evaluation (ARE) framework. ARE views the agent as a graph showing the flow of intermediate outputs between components and decomposes robustness as the flow of adversarial information on the graph. We find that we can successfully break the latest agents that use black-box frontier LMs, including those that perform reflection and tree search. With imperceptible perturbations to a single image (less than 5% of total web page pixels), an attacker can hijack these agents to execute targeted adversarial goals with success rates up to 67%. We also use ARE to rigorously evaluate how robustness changes as new components are added. We find that inference-time compute that typically improves benign performance can open up new vulnerabilities and harm robustness. An attacker can compromise the evaluator used by the reflexion agent and the value function of the tree search agent, which increases the attack success rate by 15% and 20% in relative terms, respectively. Our data and code for attacks, defenses, and evaluation are at https://github.com/ChenWu98/agent-attack
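
To make the "relative" increase in the last result concrete, here is a short worked calculation. The baseline value of $40\%$ is a hypothetical number chosen only for illustration; it is not a result from the paper. A relative increase of $r$ scales the baseline attack success rate (ASR) multiplicatively rather than adding percentage points:

$$
\mathrm{ASR}_{\text{new}} = \mathrm{ASR}_{\text{base}} \cdot (1 + r), \qquad
\text{e.g. } 40\% \cdot (1 + 0.15) = 46\%, \quad 40\% \cdot (1 + 0.20) = 48\%.
$$

Under this assumed baseline, a 15% relative increase corresponds to 6 percentage points, not 15.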

Paper Structure (29 sections, 3 equations, 13 figures, 7 tables)

Introduction
Related Work
Agent Robustness Evaluation
Threat model
Agent Graph
Propagation of Attacks along Edges
Adversarial Robustness of Agents in VisualWebArena
Curation of Adversarial Tasks
Attacker Access
Attack Methods
Evaluating the robustness of agents on VWA-Adv
Robustness of Policy Models
Robustness of Reflexion Agents with Evaluators
Robustness of Tree Search Agents with Value Functions
Defenses
...and 14 more sections

Figures (13)

Figure 1: We study the robustness of agents under targeted adversarial attacks. The attack is injected in the environment (as text or image), and we evaluate if the agent achieves the adversarial goal.
Figure 2: An agent graph shows how information flows when the agent interacts with the environment. Arrows denote the flow of intermediate outputs between components.
Figure 3: Adding a new component to an agent can either improve or harm robustness. If the new component $B$ only receives input (if any) from the trusted environment, $B$ would lower $\lambda$. However, an attacker can also attack this new component directly (introducing an edge of weight $1$), which could increase $\lambda$.
Figure 4: Robustness of policy models. Left: robustness decomposition of a GPT-4o policy model. Right: robustness-utility trade-off. $^*$Benign tasks are selected based on GPT-4V's performance.
Figure 5: Contribution of evaluators to agent robustness. $^*$Captioners are omitted. The numbers on the edges are edge weights $\lambda(e)$ defined in the Agent Robustness Evaluation section.
...and 8 more figures
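
The agent-graph view in Figures 2 and 3 can be illustrated with a minimal Python sketch. It only captures the topology: it counts the directed paths along which adversarial input could reach the agent's action, before and after a component is added. The component names (`attacked_image`, `captioner`, `evaluator`, and so on) and the wiring are assumptions made for illustration, and the sketch does not compute the paper's edge weights $\lambda(e)$.

```python
def count_paths(graph, source, target):
    """Count distinct directed paths from source to target in an acyclic graph."""
    if source == target:
        return 1
    return sum(count_paths(graph, nxt, target) for nxt in graph.get(source, ()))


# Hypothetical base agent: the attacked image reaches the policy directly
# and through a captioner. Names and wiring are illustrative assumptions.
base_agent = {
    "attacked_image": ["captioner", "policy"],
    "captioner": ["policy"],
    "policy": ["action"],
}

# Variant 1: a new evaluator that only reads trusted input.
# It adds no new path from the attacked image to the action.
trusted_evaluator = {
    "attacked_image": ["captioner", "policy"],
    "captioner": ["policy"],
    "trusted_input": ["evaluator"],
    "evaluator": ["policy"],
    "policy": ["action"],
}

# Variant 2: the evaluator also sees the attacked image,
# which opens one more route for adversarial information.
exposed_evaluator = {
    "attacked_image": ["captioner", "policy", "evaluator"],
    "captioner": ["policy"],
    "evaluator": ["policy"],
    "policy": ["action"],
}

paths_base = count_paths(base_agent, "attacked_image", "action")
paths_trusted = count_paths(trusted_evaluator, "attacked_image", "action")
paths_exposed = count_paths(exposed_evaluator, "attacked_image", "action")

print(paths_base, paths_trusted, paths_exposed)
```

The path count is a coarse stand-in: it shows where a new component widens the attack surface, while the framework itself assigns a weight to each edge to quantify how much adversarial influence actually passes through.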
