Table of Contents
Fetching ...

MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks

Georgios Syros, Evan Rose, Brian Grinstead, Christoph Kerschbaumer, William Robertson, Cristina Nita-Rotaru, Alina Oprea

TL;DR

MUZZLE addresses the risk of indirect prompt injection in LLM-based web agents by providing an automated, end-to-end red-teaming framework that operates in a sandboxed web environment. It identifies high-salience injection surfaces from the agent's trajectories, synthesizes context-aware malicious payloads, and iteratively refines attacks through feedback. The approach uncovers 37 end-to-end IPI attacks across 4 web applications and 10 adversarial objectives, including cross-application and agent-tailored phishing scenarios, demonstrating broad applicability and reduced human involvement. The results underscore MUZZLE's potential to systematically stress test web agents, reveal new attack surfaces, and guide the development of defenses and safer agent designs in real-world, interconnected web ecosystems.

Abstract

Large language model (LLM) based web agents are increasingly deployed to automate complex online tasks by directly interacting with web sites and performing actions on users' behalf. While these agents offer powerful capabilities, their design exposes them to indirect prompt injection attacks embedded in untrusted web content, enabling adversaries to hijack agent behavior and violate user intent. Despite growing awareness of this threat, existing evaluations rely on fixed attack templates, manually selected injection surfaces, or narrowly scoped scenarios, limiting their ability to capture realistic, adaptive attacks encountered in practice. We present MUZZLE, an automated agentic framework for evaluating the security of web agents against indirect prompt injection attacks. MUZZLE utilizes the agent's trajectories to automatically identify high-salience injection surfaces, and adaptively generate context-aware malicious instructions that target violations of confidentiality, integrity, and availability. Unlike prior approaches, MUZZLE adapts its attack strategy based on the agent's observed execution trajectory and iteratively refines attacks using feedback from failed executions. We evaluate MUZZLE across diverse web applications, user tasks, and agent configurations, demonstrating its ability to automatically and adaptively assess the security of web agents with minimal human intervention. Our results show that MUZZLE effectively discovers 37 new attacks on 4 web applications with 10 adversarial objectives that violate confidentiality, availability, or privacy properties. MUZZLE also identifies novel attack strategies, including 2 cross-application prompt injection attacks and an agent-tailored phishing scenario.

MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks

TL;DR

MUZZLE addresses the risk of indirect prompt injection in LLM-based web agents by providing an automated, end-to-end red-teaming framework that operates in a sandboxed web environment. It identifies high-salience injection surfaces from the agent's trajectories, synthesizes context-aware malicious payloads, and iteratively refines attacks through feedback. The approach uncovers 37 end-to-end IPI attacks across 4 web applications and 10 adversarial objectives, including cross-application and agent-tailored phishing scenarios, demonstrating broad applicability and reduced human involvement. The results underscore MUZZLE's potential to systematically stress test web agents, reveal new attack surfaces, and guide the development of defenses and safer agent designs in real-world, interconnected web ecosystems.

Abstract

Large language model (LLM) based web agents are increasingly deployed to automate complex online tasks by directly interacting with web sites and performing actions on users' behalf. While these agents offer powerful capabilities, their design exposes them to indirect prompt injection attacks embedded in untrusted web content, enabling adversaries to hijack agent behavior and violate user intent. Despite growing awareness of this threat, existing evaluations rely on fixed attack templates, manually selected injection surfaces, or narrowly scoped scenarios, limiting their ability to capture realistic, adaptive attacks encountered in practice. We present MUZZLE, an automated agentic framework for evaluating the security of web agents against indirect prompt injection attacks. MUZZLE utilizes the agent's trajectories to automatically identify high-salience injection surfaces, and adaptively generate context-aware malicious instructions that target violations of confidentiality, integrity, and availability. Unlike prior approaches, MUZZLE adapts its attack strategy based on the agent's observed execution trajectory and iteratively refines attacks using feedback from failed executions. We evaluate MUZZLE across diverse web applications, user tasks, and agent configurations, demonstrating its ability to automatically and adaptively assess the security of web agents with minimal human intervention. Our results show that MUZZLE effectively discovers 37 new attacks on 4 web applications with 10 adversarial objectives that violate confidentiality, availability, or privacy properties. MUZZLE also identifies novel attack strategies, including 2 cross-application prompt injection attacks and an agent-tailored phishing scenario.
Paper Structure (26 sections, 5 equations, 9 figures, 11 tables)

This paper contains 26 sections, 5 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: The three execution phases of Muzzle.
  • Figure 2: System architecture overview of Muzzle.
  • Figure 3: Example task spec for Muzzle. The agent is initialized with the provided information via the dependencies and requirements fields. Muzzle finds attacks that achieve each adversarial objective of the spec.
  • Figure 4: An end-to-end example of a cross-app attack discovery for Classifieds. The adversary instructs the web agent to navigate to Northwind and damage contents of the database.
  • Figure 5: Agentic phishing attack on The Zoo's Postmill web application. An adversary exploits the web agent’s task-following behavior to induce it to submit user credentials to a spoofed authentication interface, resulting in credential exfiltration.
  • ...and 4 more figures