Mission Impossible: A Statistical Perspective on Jailbreaking LLMs

Jingtong Su, Julia Kempe, Karen Ullrich

TL;DR

This work analyzes jailbreaking of LLMs from a statistical viewpoint, showing that high-quality pretraining will cause models to imitate harmful behavior present in the training data, and that jailbreaking remains inevitable after alignment under reasonable assumptions. It introduces a theoretically grounded framework that decouples prompts into query-concept pairs and uses a PAC-Bayesian bound to link training loss to generalization on harmful prompts. To mitigate jailbreaking, the authors propose E-RLHF, a simple safe-prefix transformation within the RLHF objective (and its E-DPO variant), which expands the safety zone without extra training cost. Empirical results on HarmBench and AdvBench demonstrate safer behavior with maintained or improved usefulness (MT-Bench), suggesting a practical, theoretically motivated path to more robust LLM safety. The work contributes both rigorous insights into the limits of alignment and a concrete method to enhance safety in real-world deployments.

Abstract

Large language models (LLMs) are trained on a deluge of text data with limited quality control. As a result, LLMs can exhibit unintended or even harmful behaviours, such as leaking information, fake news or hate speech. Countermeasures, commonly referred to as preference alignment, include fine-tuning the pretrained LLMs with carefully crafted text examples of desired behaviour. Even then, empirical evidence shows preference-aligned LLMs can be enticed into harmful behaviour. This so-called jailbreaking of LLMs is typically achieved by adversarially modifying the input prompt to the LLM. Our paper provides theoretical insights into the phenomenon of preference alignment and jailbreaking from a statistical perspective. Under our framework, we first show that pretrained LLMs will mimic harmful behaviour if present in the training corpus. Under that same framework, we then introduce a statistical notion of alignment, and lower-bound the jailbreaking probability, showing that it is unpreventable under reasonable assumptions. Based on our insights, we propose an alteration to the currently prevalent alignment strategy RLHF. Specifically, we introduce a simple modification to the RLHF objective, which we call E-RLHF, that aims to increase the likelihood of safe responses. E-RLHF brings no additional training cost, and is compatible with other methods. Empirically, we demonstrate that E-RLHF outperforms RLHF on all alignment problems put forward by the AdvBench and HarmBench projects without sacrificing model performance as measured by the MT-Bench project.
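
The safe-prefix idea behind E-RLHF and its E-DPO variant can be illustrated with a small sketch. This is not the authors' implementation: it assumes that the only change to a DPO-style loss is that the reference model's log-probabilities are computed on a safe-prefixed version of prompts flagged as harmful, while the policy still sees the original prompt. The prefix wording, the `is_harmful` flag, and all function names are hypothetical.

```python
import math

# Hypothetical prefix wording; the summary above does not give the actual text.
SAFE_PREFIX = "Provide a responsible and ethical answer to the following request. "


def safe_transform(prompt: str, is_harmful: bool) -> str:
    """Return the prompt that the reference model is conditioned on.

    Harmful prompts get the safety prefix; all other prompts are unchanged.
    """
    return SAFE_PREFIX + prompt if is_harmful else prompt


def neg_log_sigmoid(z: float) -> float:
    """Numerically stable -log(sigmoid(z))."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def e_dpo_style_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen_safe: float,
    ref_logp_rejected_safe: float,
    beta: float = 0.1,
) -> float:
    """DPO-style loss for one preference pair.

    Assumption: the policy log-probs are conditioned on the original prompt,
    the reference log-probs on safe_transform(prompt, is_harmful).
    """
    margin = beta * (
        (policy_logp_chosen - ref_logp_chosen_safe)
        - (policy_logp_rejected - ref_logp_rejected_safe)
    )
    return neg_log_sigmoid(margin)
```

Under this assumption the training loop itself is untouched: the only difference from a standard DPO step is which prompt string is passed to the frozen reference model, which is consistent with the claim of no additional training cost.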

Paper Structure

This paper contains 39 sections, 9 theorems, 58 equations, 2 figures, 9 tables.

Key Result

Theorem 1

(PAC-Bayesian Generalization Bound for Language Models.) With $\alpha$ as in Definition def:notion_safety_alignment, consider a set of language models $\mathds{LM}$, with prior distribution $\pi$ over $\mathds{LM}$. Given any $\delta \in (0, 1)$, for any probability measure $\rho$ over $\mathds{LM}$ ...

Figures (2)

  • Figure 1: Our framework in a nutshell: We define a language model, $p_{LM}$, as a map from prompts to a distribution over a subset of all possible explanations $\mathcal{E}$. To later be able to bound the strength of the adversarial attacker, we split the text inputs into concepts and queries $(q,c)$. We assume that (i) the text corpus only covers a part of the domain of the LM: $\mathrm{supp}(D_{\mathcal{P}}) \subsetneq \mathrm{dom}(p_{LM})$, (ii) the size of the domain of the output distribution, denoted $|\mathrm{dom}(p_{LM}(q,c))|$, is small compared to the size of $\mathcal{E}$, and (iii) only concepts determine the output.
  • Figure 2: Conceptual illustration of our framework for jailbreaking introduced in Section \ref{sec:jailbreak}, with a fixed harmful concept $c$. The triangle represents the probability simplex. This figure showcases a typical successful jailbreaking attempt by the adversary: although safety alignment makes the sampled LM safe under the direct prompt input, the adversary is able to move the output to the harmful zone $\mathcal{H}_h$ by manipulating the query $q$.

Theorems & Definitions (20)

  • Definition 2.1
  • Definition 3.1
  • Theorem 1
  • Definition 4.1
  • Definition 4.2
  • Definition 4.3
  • Definition 4.4
  • Theorem 2
  • Definition B.1
  • Lemma B.1
  • ...and 10 more