JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs

Junjie Chu, Yugeng Liu, Ziqing Yang, Xinyue Shen, Michael Backes, Yang Zhang

TL;DR

JailbreakRadar provides a unified, large-scale benchmark for evaluating jailbreak attacks against aligned LLMs. It compiles 17 attack methods, a forbidden-question dataset of 160 questions, nine LLMs, and eight defenses under a new taxonomy. It demonstrates that none of the evaluated LLMs is fully secure: seed-based attacks can be highly effective but are easily mitigated by defenses, while seedless and feedback-driven approaches maintain substantial risk. The work includes ablation and transferability studies and shows how defenses perform differently across attack types and violation categories. By delivering a broad, reproducible evaluation framework and a diverse dataset aligned to multiple providers’ policies, it aims to help the community avoid incremental work and guide the development of safer, more trustworthy LLMs.

Abstract

Jailbreak attacks aim to bypass LLMs' safeguards. While researchers have proposed different jailbreak attacks in depth, they have done so in isolation -- either with unaligned settings or comparing a limited range of methods. To fill this gap, we present a large-scale evaluation of various jailbreak attacks. We collect 17 representative jailbreak attacks, summarize their features, and establish a novel jailbreak attack taxonomy. We then conduct comprehensive measurement and ablation studies across nine aligned LLMs on 160 forbidden questions from 16 violation categories. We also test jailbreak attacks under eight advanced defenses. Based on our taxonomy and experiments, we identify some important patterns; for example, heuristic-based attacks can achieve high attack success rates but are easily mitigated by defenses, which limits their practicality. Our study offers valuable insights for future research on jailbreak attacks and defenses. We hope our work can help the community avoid incremental work and serve as an effective benchmark tool for practitioners.

Paper Structure

This paper contains 59 sections, 1 equation, 12 figures, 14 tables.

Figures (12)

  • Figure 1: Examples of different jailbreak settings.
  • Figure 2: Overview of our assessment process.
  • Figure 3: Fine-grained ASRs for direct attacks of each method on various violation categories (Llama3.1).
  • Figure 4: Fine-grained ASRs for transfer attacks of each method on various violation categories (closed-source settings).
  • Figure 5: Fine-grained ASRs for transfer attacks of each method on various violation categories (open-source settings).
  • ...and 7 more figures
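
The "fine-grained ASRs" named in the figure captions can be read as attack success rates broken down by attack method and violation category. A minimal sketch of that bookkeeping follows, assuming ASR is the fraction of attempts labeled successful and that a success label is already available for each attempt. How success is judged is not stated here, and the function name, record layout, and toy records are hypothetical, not taken from the paper.

```python
from collections import defaultdict


def fine_grained_asr(records):
    """Return {(method, category): ASR} from (method, category, success) records.

    ASR is assumed here to be successes / attempts within each group.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for method, category, success in records:
        key = (method, category)
        totals[key] += 1
        successes[key] += int(success)
    return {key: successes[key] / totals[key] for key in totals}


# Hypothetical toy records for illustration only; not data from the paper.
records = [
    ("method_a", "category_1", True),
    ("method_a", "category_1", False),
    ("method_a", "category_2", False),
    ("method_b", "category_1", True),
]
asr = fine_grained_asr(records)
for (method, category), rate in sorted(asr.items()):
    print(f"{method} / {category}: ASR = {rate:.2f}")
```

Grouping by the (method, category) pair keeps the same records reusable for coarser views: aggregating over categories gives a per-method ASR, and aggregating over methods gives a per-category ASR.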