WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks

Ivan Evtimov, Arman Zharmagambetov, Aaron Grattafiori, Chuan Guo, Kamalika Chaudhuri

TL;DR

WASP introduces an end-to-end benchmark for evaluating the security of autonomous web navigation agents against prompt injection attacks in sandboxed web environments. Built on VisualWebArena with GitLab and Reddit clones, the benchmark defines realistic attacker goals, two prompt-injection templates (URL-based and plain-text), and 84 tasks per environment to assess both attacker success and agent performance. Experiments across multiple backbones and scaffoldings reveal high intermediate attack susceptibility (up to 86%) but relatively low end-to-end attacker success (up to 16%): agents are often hijacked yet fail to finish the attacker's task, which the authors call "security by incompetence". The study provides public benchmarks and mitigations while outlining limitations and future work to broaden domains and defenses.

Abstract

Autonomous UI agents powered by AI have tremendous potential to boost human productivity by automating routine tasks such as filing taxes and paying bills. However, a major challenge in unlocking their full potential is security, which is exacerbated by the agent's ability to take actions on its user's behalf. Existing tests for prompt injections in web agents either over-simplify the threat by testing unrealistic scenarios or giving the attacker too much power, or look at single-step isolated tasks. To more accurately measure progress for secure web agents, we introduce WASP -- a new publicly available benchmark for end-to-end evaluation of Web Agent Security against Prompt injection attacks. Evaluating with WASP shows that even top-tier AI models, including those with advanced reasoning capabilities, can be deceived by simple, low-effort human-written injections in very realistic scenarios. Our end-to-end evaluation reveals a previously unobserved insight: while attacks partially succeed in up to 86% of cases, even state-of-the-art agents often struggle to fully complete the attacker goals -- highlighting the current state of security by incompetence.

Paper Structure

This paper contains 66 sections, 3 figures, 7 tables.

Figures (3)

  • Figure 1: (a) Snapshot of the results on our benchmark. ASR-intermediate checks whether the agent backed by a given model was hijacked and diverted from the original user objective, whereas ASR-end-to-end checks whether the attacker's goal was achieved. (b) Screenshots of the websites after malicious prompts were injected. The attacker creates an issue on GitLab encouraging the agent to follow a new instruction. We assume the attacker can only control specific webpage elements (highlighted in red).
  • Figure 2: A primer with end-to-end attack success. User task: "Upvote the given reddit post". Agent: GPT-4o with VisualWebArena scaffolding (axtree + SOM) without defensive system prompt. We show observations (screenshots) at several time steps $t$ (bottom) and the model reasoning (middle).
  • Figure 3: Flow of each of the 21 attacker goals from the table of all attacker goals (tab:all-goals) across three main steps during agent execution: first action, intermediate steps, and final outcome. This plot pertains to a single user instruction on GPT-4o with VisualWebArena scaffolding (axtree + SOM) without a defensive system prompt.
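
The two metrics named in the Figure 1 caption, ASR-intermediate and ASR-end-to-end, can be illustrated with a minimal sketch. This is not the benchmark's own evaluation code: the `Episode` record, its field names, the `attack_success_rates` helper, and the toy data are all hypothetical, chosen only to show how a high intermediate rate can coexist with a low end-to-end rate.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Episode:
    """One agent run against an injected page (hypothetical record layout)."""
    hijacked: bool           # agent was diverted from the user's objective
    attacker_goal_met: bool  # attacker's goal was fully achieved


def attack_success_rates(episodes: Iterable[Episode]) -> Tuple[float, float]:
    """Return (ASR-intermediate, ASR-end-to-end) as fractions of all episodes."""
    runs = list(episodes)
    if not runs:
        raise ValueError("at least one episode is required")
    intermediate = sum(e.hijacked for e in runs) / len(runs)
    end_to_end = sum(e.attacker_goal_met for e in runs) / len(runs)
    return intermediate, end_to_end


# Toy data (made up for illustration, not results from the paper):
# three of four runs are hijacked, but only one completes the attacker's goal.
toy_runs = [
    Episode(hijacked=True, attacker_goal_met=True),
    Episode(hijacked=True, attacker_goal_met=False),
    Episode(hijacked=True, attacker_goal_met=False),
    Episode(hijacked=False, attacker_goal_met=False),
]

asr_intermediate, asr_end_to_end = attack_success_rates(toy_runs)
print(f"ASR-intermediate: {asr_intermediate:.2f}")
print(f"ASR-end-to-end:   {asr_end_to_end:.2f}")
```

The gap between the two printed numbers is the quantity the abstract describes as security by incompetence: the agent starts following the injected instruction but does not carry the attacker's task through to completion.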