[WIP] Jailbreak Paradox: The Achilles' Heel of LLMs

Abhinav Rao, Monojit Choudhury, Somak Aditya

TL;DR

The paper introduces two jailbreak paradoxes for foundation models: (i) no universal, perfect jailbreak classifier can exist, and (ii) a weaker model cannot reliably detect whether a stronger, Pareto-dominant model has been jailbroken. It presents undecidability-inspired proofs and a formalism for alignment, jailbreaks, and model power, then corroborates the theory with a Tamil-language case study involving Llama-2, Tamil-Llama, and GPT-4o. In the experiments, detection quality is uneven across models: GPT-4o provides the most robust jailbreak detection while the weaker models struggle, which has practical implications for benchmarking and defense design. The work argues for a proactive defense posture that emphasizes discovering new attack vectors and patching systems, and discusses broader implications for AI safety, cross-language generalization, and related detection challenges.

Abstract

We introduce two paradoxes concerning the jailbreaking of foundation models: first, it is impossible to construct a perfect jailbreak classifier, and second, a weaker model cannot consistently detect whether a stronger (in a Pareto-dominant sense) model is jailbroken. We provide formal proofs for these paradoxes and a short case study on Llama and GPT-4o to demonstrate them. We discuss the broader theoretical and practical repercussions of these results.
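
The phrase "stronger in a Pareto-dominant sense" can be illustrated with a minimal sketch. It assumes the standard notion of Pareto dominance applied to per-task scores: one model dominates another if it is at least as good on every task and strictly better on at least one. The paper's own definition of model power is not reproduced here, and the function name, task names, and scores below are hypothetical.

```python
from typing import Mapping


def pareto_dominates(a: Mapping[str, float], b: Mapping[str, float]) -> bool:
    """Return True if score profile `a` Pareto-dominates score profile `b`.

    Assumption: higher scores are better and both profiles cover the same tasks.
    """
    if a.keys() != b.keys():
        raise ValueError("both models must be scored on the same tasks")
    at_least_as_good = all(a[task] >= b[task] for task in a)
    strictly_better = any(a[task] > b[task] for task in a)
    return at_least_as_good and strictly_better


# Hypothetical scores for illustration only (not results from the paper).
strong = {"reasoning": 0.90, "tamil_qa": 0.80, "coding": 0.85}
weak = {"reasoning": 0.60, "tamil_qa": 0.80, "coding": 0.50}
mixed = {"reasoning": 0.95, "tamil_qa": 0.30, "coding": 0.70}

print(pareto_dominates(strong, weak))   # strong is never worse, sometimes better
print(pareto_dominates(strong, mixed))  # neither profile dominates the other
```

Under this reading, two models can be incomparable: when each is better on some task, neither dominates, and the second paradox as stated would not directly apply to that pair.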

Paper Structure (14 sections, 2 theorems, 6 figures, 4 tables)

This paper contains 14 sections, 2 theorems, 6 figures, 4 tables.

Key Result

Theorem 3.1

There will always exist LLMs for which there will be no strong jailbreak classifier, where a strong classifier is a classifier achieving arbitrarily good accuracy.

Figures (6)

  • Figure 1: The Albert jailbreak in Tamil. All typos have been replicated.
  • Figure 2: Response of Llama-2, Tamil-Llama and GPT-4o for the Albert jailbreak. Llama-2 misunderstands the query and provides a refusal for the wrong reason. Tamil-Llama provides detailed instructions in Tamil on how to provide firearms to children, and GPT-4o refuses the request.
  • Figure 3: The Pliny jailbreak in Tamil. Several key jailbreak phrases and code-related symbols have been left untranslated.
  • Figure 4: Response of Llama-2, Tamil-Llama and GPT-4o for the Pliny jailbreak. Llama-2 does not understand the query at all, Tamil-Llama begins a refusal and does not become misaligned, and GPT-4o starts responding in Leetspeak.
  • Figure 5: The codejb jailbreak in Tamil.
  • ...and 1 more figures

Theorems & Definitions (8)

  • Definition 2.1
  • Definition 2.2
  • Theorem 3.1
  • proof
  • Definition 4.1
  • Definition 4.2
  • Theorem 4.1
  • proof