AdvAgent: Controllable Blackbox Red-teaming on Web Agents

Chejian Xu, Mintong Kang, Jiawei Zhang, Zeyi Liao, Lingbo Mo, Mengqi Yuan, Huan Sun, Bo Li

TL;DR

AdvAgent is a black-box red-teaming framework that trains an adversarial prompter to inject prompts hidden in invisible HTML fields, steering web agents toward targeted actions. It uses a two-stage training pipeline (SFT followed by DPO) with RL from AI feedback, optimizing prompts using only the responses of the black-box agent. Experiments on Mind2Web tasks report high attack success rates (up to 97.5–99.8%), and common prompt-based defenses offer only limited protection. The work exposes significant vulnerabilities in current web agents and underscores the need for stronger, real-time defense mechanisms and robust blue-teaming strategies.
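
For readers unfamiliar with the second training stage, the standard DPO objective is reproduced here as a reference point. This is the generic formulation, not necessarily the exact loss used in the paper. The symbols are assumptions for illustration: $x$ is the prompter input (for example, the task and page context), $y_w$ is an adversarial string that received positive feedback from the black-box agent, $y_l$ is one that received negative feedback, $\pi_{\mathrm{ref}}$ is the prompter after the SFT stage, and $\beta$ is a temperature hyperparameter.

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

Minimizing this loss raises the likelihood of successful strings relative to failed ones for the same input, which only requires success or failure signals from the target agent and no access to its internals.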

Abstract

Foundation model-based agents are increasingly used to automate complex tasks, enhancing efficiency and productivity. However, their access to sensitive resources and autonomous decision-making also introduce significant security risks, where successful attacks could lead to severe consequences. To systematically uncover these vulnerabilities, we propose AdvAgent, a black-box red-teaming framework for attacking web agents. Unlike existing approaches, AdvAgent employs a reinforcement learning-based pipeline to train an adversarial prompter model that optimizes adversarial prompts using feedback from the black-box agent. With careful attack design, these prompts effectively exploit agent weaknesses while maintaining stealthiness and controllability. Extensive evaluations demonstrate that AdvAgent achieves high success rates against state-of-the-art GPT-4-based web agents across diverse web tasks. Furthermore, we find that existing prompt-based defenses provide only limited protection, leaving agents vulnerable to our framework. These findings highlight critical vulnerabilities in current web agents and emphasize the urgent need for stronger defense mechanisms. We release code at https://ai-secure.github.io/AdvAgent/.

Paper Structure

This paper contains 20 sections, 3 equations, 5 figures, 6 tables, 2 algorithms.

Figures (5)

  • Figure 1: Overview of AdvAgent. We train an adversarial prompter model to generate adversarial strings that are added to the website. The injected string is hidden in invisible HTML fields and does not change how the website renders. Web agents operating on the injected website are misled into performing targeted actions: for example, a request to buy Microsoft stock is redirected into buying NVIDIA stock instead, leading to severe consequences.
  • Figure 2: AdvAgent Prompter Model Training. During data collection, we first build the training dataset with an LLM-based attack prompter, following the prompter algorithm given in the paper's additional-algorithms section, and then collect positive and negative feedback from the target black-box model. During prompter model training, the first stage performs SFT on the positive subset. The model is then further trained in the second stage with DPO, using both positive and negative feedback.
  • Figure 3: Comparison of AdvAgent attack success rate (ASR) across training stages. We show the ASR of AdvAgent when trained with only the SFT stage versus when trained with both the SFT and DPO stages. Incorporating the DPO stage, which leverages both positive and negative feedback, leads to a significant improvement in ASR over SFT alone.
  • Figure 4: Subtle differences in adversarial injections lead to different attack results. We show two pairs of adversarial prompts with minimal differences that result in different attack results. In the first pair, changing "you" to "I" makes the attack successful. In the second pair, adding the word "previous" successfully misleads the target agent.
  • Figure 5: Qualitative results of AdvAgent. We present two tasks from our test set. In the first task, the user instructs the agent to buy stocks from Microsoft. However, after the adversarial injection $q$ generated by AdvAgent, the agent purchases stocks from NVIDIA instead. In the second task, the user requests information on the side effects of Tylenol, but after the adversarial injection, the agent searches for Aspirin instead.
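
The data flow described in the captions for Figures 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Feedback` record, the function names, and the placeholder strings are all hypothetical, and ASR is assumed here to be the fraction of attempts in which the agent takes the targeted action.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Feedback:
    """One black-box query result (hypothetical record layout)."""
    task_id: str   # identifier of the web task (assumed field)
    prompt: str    # candidate adversarial string (placeholder text only)
    success: bool  # True if the agent performed the targeted action


def build_sft_set(records: Iterable[Feedback]) -> list[tuple[str, str]]:
    """Stage 1 data: keep only candidates with positive feedback."""
    return [(r.task_id, r.prompt) for r in records if r.success]


def build_dpo_pairs(records: Iterable[Feedback]) -> list[tuple[str, str, str]]:
    """Stage 2 data: pair each successful candidate with each failed
    candidate for the same task as (task_id, chosen, rejected)."""
    by_task: dict[str, tuple[list[str], list[str]]] = {}
    for r in records:
        positives, negatives = by_task.setdefault(r.task_id, ([], []))
        (positives if r.success else negatives).append(r.prompt)

    pairs = []
    for task_id, (positives, negatives) in by_task.items():
        for chosen in positives:
            for rejected in negatives:
                pairs.append((task_id, chosen, rejected))
    return pairs


def attack_success_rate(outcomes: Iterable[bool]) -> float:
    """Fraction of attack attempts that succeeded (assumed ASR definition)."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("ASR is undefined for an empty set of attempts")
    return sum(outcomes) / len(outcomes)


if __name__ == "__main__":
    records = [
        Feedback("task-1", "candidate-a", True),
        Feedback("task-1", "candidate-b", False),
        Feedback("task-2", "candidate-c", False),
    ]
    print(build_sft_set(records))
    print(build_dpo_pairs(records))
    print(attack_success_rate(r.success for r in records))
```

Tasks with no successful candidate, such as `task-2` above, contribute nothing to either stage in this sketch. Whether the paper handles such tasks differently is not stated in the captions.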