WAInjectBench: Benchmarking Prompt Injection Detections for Web Agents
Yinuo Liu, Ruohan Xu, Xilong Wang, Yuqi Jia, Neil Zhenqiang Gong
TL;DR
WAInjectBench delivers the first systematic benchmark for prompt injection detections targeting web agents, addressing a gap where prior defenses were evaluated outside agent contexts. By introducing a fine-grained attack taxonomy, a large multi-modal dataset (text and image) of malicious and benign samples, and a broad survey of detectors (text-based, image-based, and ensembles), the work benchmarks 12 detectors and analyzes performance across diverse scenarios. Key findings show detectors reliably catching explicit instructions and visible perturbations but struggling with inconspicuous or imperceptible attacks, and that ensembling improves coverage at the cost of higher false positives. The resulting benchmark, datasets, and open-source code provide a foundation for developing more robust, cross-modal defenses to improve the safety and trustworthiness of web agents in real-world deployments.
Abstract
Multiple prompt injection attacks have been proposed against web agents. At the same time, various methods have been developed to detect general prompt injection attacks, but none have been systematically evaluated for web agents. In this work, we bridge this gap by presenting the first comprehensive benchmark study on detecting prompt injection attacks targeting web agents. We begin by introducing a fine-grained categorization of such attacks based on the threat model. We then construct datasets containing both malicious and benign samples: malicious text segments generated by different attacks, benign text segments from four categories, malicious images produced by attacks, and benign images from two categories. Next, we systematize both text-based and image-based detection methods. Finally, we evaluate their performance across multiple scenarios. Our key findings show that while some detectors can identify attacks that rely on explicit textual instructions or visible image perturbations with moderate to high accuracy, they largely fail against attacks that omit explicit instructions or employ imperceptible perturbations. Our datasets and code are released at: https://github.com/Norrrrrrr-lyn/WAInjectBench.
