LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet

Nathaniel Li, Ziwen Han, Ian Steneker, Willow Primack, Riley Goodside, Hugh Zhang, Zifan Wang, Cristina Menghini, Summer Yue

TL;DR

The paper argues that defenses that perform well against automated, single-turn attacks do not generalize to realistic multi-turn jailbreaks conducted by humans. It introduces MHJ, a large dataset of 2,912 prompts across 537 multi-turn jailbreak conversations, and demonstrates that multi-turn human attackers achieve significantly higher attack success rates on HarmBench and unlearned models than automated baselines. By combining human red teaming with a harm classifier and a public tactic taxonomy, the work reveals systematic defense vulnerabilities and provides a resource for evaluating and strengthening LLM safety. The findings advocate for expanded threat models and more robust automated adversaries to improve real-world resilience of LLM defenses, and the authors release MHJ to spur further research in this domain.

Abstract

Recent large language model (LLM) defenses have greatly improved models' ability to refuse harmful queries, even when adversarially attacked. However, LLM defenses are primarily evaluated against automated adversarial attacks in a single turn of conversation, an insufficient threat model for real-world malicious use. We demonstrate that multi-turn human jailbreaks uncover significant vulnerabilities, exceeding 70% attack success rate (ASR) on HarmBench against defenses that report single-digit ASRs with automated single-turn attacks. Human jailbreaks also reveal vulnerabilities in machine unlearning defenses, successfully recovering dual-use biosecurity knowledge from unlearned models. We compile these results into Multi-Turn Human Jailbreaks (MHJ), a dataset of 2,912 prompts across 537 multi-turn jailbreaks. We publicly release MHJ alongside a compendium of jailbreak tactics developed across dozens of commercial red teaming engagements, supporting research towards stronger LLM defenses.

Paper Structure (53 sections, 8 figures, 3 tables)

Figures (8)

  • Figure 1: (Left) Attack success rate (ASR) of humans and six automated attacks against LLM defenses on HarmBench behaviors (n=240); full results are in the detailed results figure and the main results table. Ensemble Automated Attack is an upper bound on automated attack ASR: it counts a behavior as successfully jailbroken if any of the six automated attacks achieves a jailbreak. *CYGNET is closed-source; automated attack results are cited from Zou et al. (2024) and should not be directly compared with human ASR (see the appendix on the CYGNET HarmBench evaluation). (Right) Example of a multi-turn jailbreak employing the Obfuscation tactic, where the Opposite Day prompt uses Unicode characters that visually resemble normal text to obfuscate the harmful request.
  • Figure 2: Our human jailbreak pipeline. Up to two independent red teamers attempt a jailbreak in the "Attempt" phase, followed by a "Validate" phase to verify the jailbreak, with the possibility of a third red teamer for potential false positives. GPT-4o is used as a final filter for improved precision.
  • Figure 3: Attack success rate of human and automated attacks on HarmBench test questions (n=240); ASR percentages are in the main results table. *CYGNET is closed-source, so results for AutoDAN, GCG, and PAIR are cited from the original paper (Zou et al., 2024) and should not be directly compared against human ASR (see the appendix on the CYGNET HarmBench evaluation).
  • Figure 4: ASR against the RMU unlearning method, on open-ended WMDP-Bio questions (n=43).
  • Figure 5: Distribution of primary tactics for successful human attacks on HarmBench.
  • ...and 3 more figures
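
The Ensemble Automated Attack upper bound described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes each attack's results are available as a set of successfully jailbroken behavior IDs; the attack names, behavior IDs, and helper functions below are hypothetical toy values, not data or code from the paper.

```python
from typing import Dict, Set


def asr(successes: Set[str], behaviors: Set[str]) -> float:
    """Attack success rate: fraction of behaviors that were jailbroken."""
    if not behaviors:
        raise ValueError("behaviors must be non-empty")
    return len(successes & behaviors) / len(behaviors)


def ensemble_successes(per_attack: Dict[str, Set[str]]) -> Set[str]:
    """A behavior counts as jailbroken if ANY attack succeeded on it."""
    combined: Set[str] = set()
    for successes in per_attack.values():
        combined |= successes
    return combined


# Hypothetical toy data: four behaviors, three attacks.
behaviors = {"b1", "b2", "b3", "b4"}
per_attack = {
    "attack_a": {"b1"},
    "attack_b": {"b1", "b3"},
    "attack_c": set(),
}

ensemble = ensemble_successes(per_attack)
ensemble_asr = asr(ensemble, behaviors)
per_attack_asr = {name: asr(s, behaviors) for name, s in per_attack.items()}
```

Because the ensemble takes the union of successes, its ASR is never lower than that of any single attack, which is why the caption describes it as an upper bound on automated attack ASR.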