JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models

Haibo Jin; Leyang Hu; Xinnuo Li; Peiyan Zhang; Chonghan Chen; Jun Zhuang; Haohan Wang

TL;DR

This survey provides a unified framework for jailbreaking LLMs and VLMs, categorizing attack methods into seven types and surveying corresponding defenses. It links textual and multimodal security perspectives, highlighting evaluation gaps and suggesting future directions for robust, aligned, and secure AI systems. The study catalogs detailed attack and defense taxonomies, analyzes existing benchmarks, and identifies cross-modal vulnerabilities requiring coordinated defenses. The results offer a roadmap for researchers and practitioners to enhance safety and reliability in next-generation language models.

Abstract

The rapid evolution of artificial intelligence (AI) through developments in Large Language Models (LLMs) and Vision-Language Models (VLMs) has brought significant advancements across various technological domains. While these models enhance capabilities in natural language processing and visual interactive tasks, their growing adoption raises critical concerns regarding security and ethical alignment. This survey provides an extensive review of the emerging field of jailbreaking--deliberately circumventing the ethical and operational boundaries of LLMs and VLMs--and the consequent development of defense mechanisms. Our study categorizes jailbreaks into seven distinct types and elaborates on defense strategies that address these vulnerabilities. Through this comprehensive examination, we identify research gaps and propose directions for future studies to enhance the security frameworks of LLMs and VLMs. Our findings underscore the necessity for a unified perspective that integrates both jailbreak strategies and defensive solutions to foster a robust, secure, and reliable environment for the next generation of language models. More details can be found on our website: https://chonghan-chen.com/llm-jailbreak-zoo-survey/.

Paper Structure (36 sections, 11 equations, 23 figures)

Introduction
Background
Ethical Alignment
Prompt-tuning Alignment
Reinforcement Learning from Human Feedback
Jailbreaking process of Large Language and Vision-Language Models
Jailbreaking Large Language Models
Jailbreaking Vision-Language Models
Threats in Large Language Models
Jailbreak Strategies on Large Language Models
Gradient-based Jailbreaks
Evolutionary-based Jailbreaks
Demonstration-based Jailbreaks
Rule-based Jailbreaks
Multi-agent-based Jailbreaks
...and 21 more sections

Figures (23)

Figure : An illustrative case of a successful jailbreak on an LLM: The jailbreak prompt is highlighted in orange, while the jailbreak response is marked in red.
Figure : Overall structure of our paper. The figure categorizes the ethical alignment techniques, jailbreak processes, threats, and defense mechanisms for LLMs and VLMs, and shows how the sections are organized: background information and ethical alignment techniques come first, followed by the jailbreak processes for LLMs and VLMs, and then the respective threats and defense strategies for both types of models.
Figure : Overview of Jailbreak Strategies for LLMs: This figure delineates the five principal approaches to jailbreaking LLMs. Gradient-based Jailbreaks exploit model gradients to create prompts that compel LLMs to produce harmful responses. Evolutionary-based Jailbreaks utilize genetic algorithms and evolutionary strategies to generate effective adversarial prompts. Demonstration-based Jailbreaks craft specific, static system prompts to direct LLM responses toward desired outcomes. Rule-based Jailbreaks decompose and redirect malicious prompts through predefined rules to evade detection and produce intended outputs. Multi-agent-based Jailbreaks rely on the cooperation of multiple LLMs to iteratively refine and enhance jailbreak prompts.
Figure : An example of gradient-based jailbreaks. The process begins with the selection of an initial token from the vocabulary, followed by gradient-based optimization to generate candidate tokens. The left box within the blue box represents the candidate tokens that need to be selected, while the right box represents the tokens selected after one optimization process. These candidate tokens are iteratively refined through left-to-right generation until the desired malicious response is achieved, ensuring convergence and concatenation to form the final harmful output.
Figure : An example of evolutionary-based jailbreaks. The process begins with an attacker providing a prototype prompt that initializes the model by setting aside previous guidelines. This initialization phase is followed by a fitness evaluation, where responses are assessed for their alignment with malicious intent. The hierarchical genetic policy phase then employs paragraph-level and sentence-level crossover, along with LLM-based mutations, to refine and optimize the prompts. This iterative process continues until a harmful response is successfully produced.
...and 18 more figures
