Jailbreaking to Jailbreak

Jeremy Kritz, Vaughn Robinson, Robert Vacareanu, Bijan Varjavand, Michael Choi, Bobby Gogov, Scale Red Team, Summer Yue, Willow E. Primack, Zifan Wang

TL;DR

The paper investigates jailbreaking-to-jailbreak ($J_2$), a failure mode in which refusal-trained LLMs are coaxed into assisting jailbreaks against other models, including copies of themselves. It proposes a model-agnostic workflow with planning, attack, and debrief phases, plus an in-context learning loop, to turn strong LLMs into versatile red-teamers that generate transferable jailbreak prompts. Across diverse API models and HarmBench harms, $J_2$ attackers show notable transferability and high attack success rates (ASRs). Frontier models improve rapidly and, even without human-written strategies, match or surpass several baselines in many cases. The findings reveal a scalable, human-in-the-loop red-teaming paradigm that can rapidly identify and exploit guardrail weaknesses, underscoring the need for robust safeguards and ongoing monitoring as LLM capabilities evolve. The work highlights practical implications for model builders and safety research, advocating controlled deployment and stronger defenses against evolving automated red-teaming approaches.

Abstract

Large Language Models (LLMs) can be used to red team other models (e.g. jailbreaking) to elicit harmful content. Prior work commonly employs open-weight models or private uncensored models for jailbreaking, because refusal-trained strong LLMs (e.g. OpenAI o3) refuse to help with it; our work instead turns (almost) any black-box LLM into an attacker. The resulting $J_2$ (jailbreaking-to-jailbreak) attackers can effectively jailbreak the safeguards of target models using various strategies, either created by themselves or provided by expert human red teamers. In doing so, we show their strong but under-researched jailbreaking capabilities. Our experiments demonstrate that 1) prompts used to create $J_2$ attackers transfer across almost all black-box models; 2) a $J_2$ attacker can jailbreak a copy of itself, and this vulnerability has developed rapidly over the past 12 months; 3) reasoning models, such as Sonnet-3.7, are strong $J_2$ attackers compared to others. For example, when used against the safeguard of GPT-4o, $J_2$ (Sonnet-3.7) achieves an attack success rate (ASR) of 0.975, which matches expert human red teamers and surpasses the state-of-the-art algorithm-based attacks. Among $J_2$ attackers, $J_2$ (o3) achieves the highest ASR (0.605) against Sonnet-3.5, one of the most robust models.

Paper Structure

This paper contains 71 sections, 10 equations, 13 figures, 12 tables.

Figures (13)

  • Figure 1: We focus on jailbreaking to jailbreak, unleashing refusal-trained LLMs to attack other models (including a copy of themselves). We provide the proposed workflow (middle) and a preview of results (right).
  • Figure 2: An overview of our red teaming workflow. We first create $J_2$ attackers. Second, $J_2$ jailbreaks the target LLM in multi-turn conversations with hard-coded prompts to do planning and debriefing. We iterate over different red teaming strategies until a successful jailbreak is found or we exhaust our strategy set.
  • Figure 3: A plot of the self-attack success rates, using $J_2$ (model A) to attack model A, against the release date of the model API endpoint. Results are over 50 selected HarmBench text behaviors.
  • Figure 4: Human strategies employed in Section \ref{sec:eval:capability-eval}, which are provided to $J_2$ attackers in the planning phase, following the sequence shown. Detailed descriptions of each strategy are in Appendix \ref{appendix:strategies}.
  • Figure 5: Attack success rates on the safeguard of GPT-4o (left) and Sonnet-3.5 (right) with different attack methods. For each $J_2$ attacker, the darker bar corresponds to the case when it succeeds with dealers_choice (i.e. it picks its own strategy), and the lighter bar to the case when it fails with dealers_choice but later finds a successful jailbreak with one of the human-curated strategies from Figure \ref{fig:strategies}.
  • ...and 8 more figures
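
The captions of Figures 2 and 5 describe an evaluation loop (try strategies in order, stop at the first success or when the set is exhausted) and a two-part ASR breakdown. The Python sketch below illustrates only that bookkeeping and is not the authors' implementation. The function names, the `attempt` callback, the placeholder strategy names `strategy_a` and `strategy_b`, and the toy outcome table are all hypothetical. It also assumes that dealers_choice is tried first, as the Figure 5 caption suggests.

```python
from typing import Callable, Dict, Optional, Sequence

# Hypothetical callback: returns True if the attempt with this strategy on this
# behavior was judged successful. How that judgment is made is outside this sketch.
Attempt = Callable[[str, str], bool]


def first_successful_strategy(
    behavior: str, strategies: Sequence[str], attempt: Attempt
) -> Optional[str]:
    """Try strategies in order; return the first that succeeds, else None."""
    for strategy in strategies:
        if attempt(strategy, behavior):
            return strategy
    return None  # strategy set exhausted without success


def summarize(
    behaviors: Sequence[str],
    strategies: Sequence[str],
    attempt: Attempt,
    own_choice: str = "dealers_choice",
) -> Dict[str, float]:
    """Overall ASR, split into own-choice successes and later successes."""
    if not behaviors:
        raise ValueError("need at least one behavior")
    own = later = 0
    for behavior in behaviors:
        winner = first_successful_strategy(behavior, strategies, attempt)
        if winner is None:
            continue
        if winner == own_choice:
            own += 1  # darker bar in Figure 5
        else:
            later += 1  # lighter bar in Figure 5
    total = len(behaviors)
    return {
        "asr": (own + later) / total,
        "dealers_choice": own / total,
        "human_strategy": later / total,
    }


# Toy data only: these outcomes are invented for illustration, not paper results.
behaviors = ["b1", "b2", "b3", "b4"]
strategies = ["dealers_choice", "strategy_a", "strategy_b"]
toy_successes = {("dealers_choice", "b1"), ("strategy_b", "b2"), ("dealers_choice", "b3")}


def attempt(strategy: str, behavior: str) -> bool:
    return (strategy, behavior) in toy_successes


print(summarize(behaviors, strategies, attempt))
```

Because the loop stops at the first success, each behavior is counted in at most one bar, so the two parts add up to the overall ASR.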