Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents

Priyanshu Kumar, Elaine Lau, Saranya Vijayakumar, Tu Trinh, Scale Red Team, Elaine Chang, Vaughn Robinson, Sean Hendryx, Shuyan Zhou, Matt Fredrikson, Summer Yue, Zifan Wang

TL;DR

This paper investigates whether safety refusals learned by refusal-trained LLMs in chat translate to browser-based agent use. It introduces BrowserART, a 100-item red-teaming toolkit with synthetic and real-world websites to test harmful content and interactions for browser agents. Empirical results show a large drop in safety alignment for browser agents compared to chatbots, with direct-ask and jailbreaking techniques transferring effectively and, in some cases, achieving near-100% attack success rates. The work highlights a critical gap in agent safety, advocates broader collaboration for defense, and publicly releases BrowserART to spur progress in safeguarding LLM-enabled agents.

Abstract

For safety reasons, large language models (LLMs) are trained to refuse harmful user instructions, such as assisting with dangerous activities. We study an open question in this work: does the desired safety refusal, typically enforced in chat contexts, generalize to non-chat and agentic use cases? Unlike chatbots, LLM agents equipped with general-purpose tools, such as web browsers and mobile devices, can directly influence the real world, making it even more crucial to refuse harmful instructions. In this work, we primarily focus on red-teaming browser agents, LLMs that manipulate information via web browsers. To this end, we introduce the Browser Agent Red teaming Toolkit (BrowserART), a comprehensive test suite designed specifically for red-teaming browser agents. BrowserART consists of 100 diverse browser-related harmful behaviors (including original behaviors and ones sourced from HarmBench [Mazeika et al., 2024] and AirBench 2024 [Zeng et al., 2024b]) across both synthetic and real websites. Our empirical study on state-of-the-art browser agents reveals that, while the backbone LLM refuses harmful instructions as a chatbot, the corresponding agent does not. Moreover, attack methods designed to jailbreak refusal-trained LLMs in chat settings transfer effectively to browser agents. With human rewrites, GPT-4o and o1-preview-based browser agents attempted 98 and 63 harmful behaviors (out of 100), respectively. We publicly release BrowserART and call on LLM developers, policymakers, and agent developers to collaborate on improving agent safety.

Paper Structure

This paper contains 37 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Top (motivation of our proposed red teaming suite BrowserART): while refusal-trained LLMs as chatbots are generally expected to refuse harmful instructions from malicious users, providing them with web browser access and prompting them as agents can significantly decrease the alignment. Bottom (result preview): We directly ask (i.e., w/o attacks) all LLMs and agents to fulfill harmful behaviors. We also employ LLM attack techniques to further jailbreak browser agents. A preview of results for GPT-4o and o1-preview is shown here. Attack Success Rate (ASR): the percentage of harmful behaviors attempted by an LLM or a browser agent. For full results, see Figure \ref{fig:direct-ask} and Table \ref{tab:openhands}.
  • Figure 2: Overall distribution of behavior sources. We sample behaviors from HarmBench [Mazeika et al., 2024] and AirBench 2024 [Zeng et al., 2024b] when creating the chat behaviors dataset in BrowserART.
  • Figure 3: Left: the distribution of behavior categories. Right: the distribution of websites in BrowserART.
  • Figure 4: Example behaviors, associated websites, and harm classification methods. For any behavior incorporating harmful content, we use an LLM to classify the content generated by the agent. For behaviors incorporating harmful interaction, the harm classification method uses an LLM as a classifier on the agent's trajectory.
  • Figure 5: We compute Attack Success Rate (ASR) by directly asking an LLM to fulfill the harmful chat behaviors in BrowserART. For an LLM browser agent (implemented with OpenHands), we use the corresponding browser behaviors in BrowserART. Because LLMs are refusal-trained, the ASRs are expected to be 0 here. Since the agentic use often includes a long-context observation and the action history in the user prompt, AXTree + Chat Behavior is a sanity check to see whether prepending a 25K-token prefix to a direct ask can, by itself, jailbreak LLMs.
  • ...and 1 more figure
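
The Figure 1 caption defines Attack Success Rate (ASR) as the percentage of harmful behaviors that an LLM or browser agent attempts. A minimal sketch of that arithmetic follows. The function name `attack_success_rate` and the per-behavior boolean lists are assumptions made for illustration and are not part of BrowserART; only the counts (98 and 63 attempted out of 100, with human rewrites) come from the abstract.

```python
from typing import Sequence


def attack_success_rate(attempted: Sequence[bool]) -> float:
    """Return ASR as a percentage: the share of behaviors that were attempted.

    `attempted` holds one flag per harmful behavior; True means the model or
    agent attempted the behavior instead of refusing it.
    """
    if not attempted:
        raise ValueError("need at least one behavior to compute ASR")
    return 100.0 * sum(attempted) / len(attempted)


# Hypothetical per-behavior outcomes that reproduce the counts reported in the
# abstract. Which individual behaviors were attempted is NOT taken from the paper.
gpt4o_agent = [True] * 98 + [False] * 2
o1_preview_agent = [True] * 63 + [False] * 37

print(f"GPT-4o agent ASR:     {attack_success_rate(gpt4o_agent):.1f}%")
print(f"o1-preview agent ASR: {attack_success_rate(o1_preview_agent):.1f}%")
```

Because the suite has exactly 100 behaviors, the ASR percentage equals the raw count of attempted behaviors, which is why the abstract can report counts and the figures can report rates interchangeably.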