A Domain-Based Taxonomy of Jailbreak Vulnerabilities in Large Language Models

Carlos Peláez-González, Andrés Herrera-Poyatos, Cristina Zuheros, David Herrera-Poyatos, Virilo Tejedor, Francisco Herrera

TL;DR

This paper addresses jailbreak vulnerabilities in large language models by proposing a domain-based taxonomy that ties attacks to underlying training domains and alignment weaknesses. It reframes jailbreaks away from surface-level prompt templates toward fundamental deficiencies: mismatched generalization, competing objectives, and lack of robustness, with a fourth category for mixed attacks. The authors formalize explicit versus implicit training domains, map various attack classes to these domains, and provide cross-modal considerations for text and vision modalities. The framework offers a principled basis for evaluating defenses, guiding the development of more resilient, multimodal alignment strategies and highlighting open research questions.

Abstract

The study of large language models (LLMs) is a key area in open-world machine learning. Although LLMs demonstrate remarkable natural language processing capabilities, they also face several challenges, including consistency issues, hallucinations, and jailbreak vulnerabilities. Jailbreaking refers to the crafting of prompts that bypass alignment safeguards, leading to unsafe outputs that compromise the integrity of LLMs. This work specifically focuses on the challenge of jailbreak vulnerabilities and introduces a novel taxonomy of jailbreak attacks grounded in the training domains of LLMs. It characterizes alignment failures through generalization, objectives, and robustness gaps. Our primary contribution is a perspective on jailbreak, framed through the different linguistic domains that emerge during LLM training and alignment. This viewpoint highlights the limitations of existing approaches and enables us to classify jailbreak attacks on the basis of the underlying model deficiencies they exploit. Unlike conventional classifications that categorize attacks based on prompt construction methods (e.g., prompt templating), our approach provides a deeper understanding of LLM behavior. We introduce a taxonomy with four categories -- mismatched generalization, competing objectives, adversarial robustness, and mixed attacks -- offering insights into the fundamental nature of jailbreak vulnerabilities. Finally, we present key lessons derived from this taxonomic study.

Paper Structure

This paper contains 36 sections, 5 figures, 2 tables, and 1 algorithm.

Figures (5)

  • Figure 1: A summarized version of our proposed taxonomy. The complete taxonomy is described in the taxonomy section.
  • Figure 2: Characterization of an LLM’s training domains. The self-supervised domain encompasses the core model knowledge. The helpful and harmful domains are part of the alignment dataset. The overlap between helpful and harmful domains presents the alignment process as a multi-objective optimization task.
  • Figure 3: Our proposed taxonomy of jailbreak attacks on Large Language Models. There are four groups of attacks: mismatched generalization, competing objectives, adversarial robustness, and mixed attacks. BB and WB stand for Black-Box and White-Box access, respectively.
  • Figure 4: Prompt structure for chat models, showcasing the system prompt, the user queries and model generated tokens.
  • Figure 5: A common vision model architecture. Text and image embeddings are concatenated because they share a common latent space.