[WIP] Jailbreak Paradox: The Achilles' Heel of LLMs

Abhinav Rao; Monojit Choudhury; Somak Aditya

TL;DR

The paper introduces two jailbreak paradoxes for foundation models: (i) no universal, perfect jailbreak classifier can exist, and (ii) weaker models cannot reliably detect jailbroken states of stronger, Pareto-dominant models. It presents undecidability-inspired proofs and a formalism for alignment, jailbreaks, and model power, then corroborates the theory with a Tamil-language case study involving Llama-2, Tamil-Llama, and GPT-4o. Experimental results show that detection is uneven across models: GPT-4o provides the most robust jailbreak detection while weaker models struggle, which illustrates the practical implications for benchmarking and defense design. The work argues for a proactive defense posture that emphasizes discovering new attack vectors and patching systems, and discusses broader implications for AI safety, cross-language generalization, and related detection challenges in AI systems.

Abstract

We introduce two paradoxes concerning the jailbreaking of foundation models: first, it is impossible to construct a perfect jailbreak classifier; second, a weaker model cannot consistently detect whether a stronger (in a Pareto-dominant sense) model is jailbroken. We provide formal proofs for these paradoxes and a short case study on Llama and GPT-4o to demonstrate them. We discuss the broader theoretical and practical repercussions of these results.
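The second paradox relies on one model being stronger than another "in a Pareto-dominant sense". The paper's own definition is not reproduced on this page, so the following is only a minimal sketch of the standard notion of Pareto dominance, applied to per-task scores. The function name `pareto_dominates`, the task names, and all score values are hypothetical and chosen purely for illustration; they are not results from the paper.

```python
from typing import Mapping


def pareto_dominates(a: Mapping[str, float], b: Mapping[str, float]) -> bool:
    """Return True if score vector `a` Pareto-dominates score vector `b`.

    Standard definition (assumed here, not quoted from the paper):
    `a` is at least as good as `b` on every task and strictly better
    on at least one task. Higher scores are treated as better.
    """
    if a.keys() != b.keys():
        raise ValueError("both models must be scored on the same tasks")
    at_least_as_good = all(a[task] >= b[task] for task in a)
    strictly_better = any(a[task] > b[task] for task in a)
    return at_least_as_good and strictly_better


# Hypothetical, made-up scores for three imaginary models.
strong = {"reasoning": 0.9, "coding": 0.8, "tamil": 0.7}
weak = {"reasoning": 0.6, "coding": 0.8, "tamil": 0.3}
mixed = {"reasoning": 0.95, "coding": 0.5, "tamil": 0.2}

print(pareto_dominates(strong, weak))   # strong is never worse, sometimes better
print(pareto_dominates(strong, mixed))  # neither dominates: mixed wins on reasoning
```

Under this reading, the second paradox concerns pairs like `strong` and `weak`, where one model is never worse on any task. Pairs like `strong` and `mixed`, where each model wins somewhere, are incomparable and fall outside the stated claim.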

Paper Structure (14 sections, 2 theorems, 6 figures, 4 tables)

Introduction
Background, Definitions and Formalism
Paradox 1: The Impossibility of Perfect Jailbreak Classifiers
Paradox 2: Jailbreaks of Stronger (Pareto-Dominant) Models Cannot Be Detected by Weaker Ones
A Case Study on Tamil Jailbreaks
Experimental results
Discussion
Existing Jailbreak Evaluation and Defense Techniques
Ensuring Safety and the Future of LLM Jailbreak Detection
Existence of Other Equivalent Paradoxes
On the choice of Jailbreaks, Languages, and Models
Appendix
Scores on all categories for Tamil-LLaMa-Eval
Tamil Jailbreaks and model responses

Key Result

Theorem 3.1

There will always exist LLMs for which there will be no strong jailbreak classifier, where a strong classifier is a classifier achieving arbitrarily good accuracy.

Figures (6)

Figure 1: The Albert jailbreak in Tamil. All typos have been replicated.
Figure 2: Responses of Llama-2, Tamil-Llama, and GPT-4o to the Albert jailbreak. Llama-2 misunderstands the query and provides a refusal for the wrong reason, Tamil-Llama provides detailed instructions in Tamil on how to provide firearms to children, and GPT-4o refuses the request.
Figure 3: The Pliny jailbreak in Tamil. Several key jailbreaking phrases and code-related symbols have been left untranslated.
Figure 4: Responses of Llama-2, Tamil-Llama, and GPT-4o to the Pliny jailbreak. Llama-2 does not understand the query at all, Tamil-Llama begins a refusal and does not become misaligned, and GPT-4o does start responding in Leetspeak.
Figure 5: The codejb jailbreak in Tamil.
...and 1 more figures

Theorems & Definitions (8)

Definition 2.1
Definition 2.2
Theorem 3.1
proof
Definition 4.1
Definition 4.2
Theorem 4.1
proof
