[WIP] Jailbreak Paradox: The Achilles' Heel of LLMs
Abhinav Rao, Monojit Choudhury, Somak Aditya
TL;DR
The paper introduces two jailbreak paradoxes for foundation models: (i) no universal, perfect jailbreak classifier can exist, and (ii) a weaker model cannot reliably detect whether a stronger, Pareto-dominant model has been jailbroken. It presents undecidability-inspired proofs and a formalism for alignment, jailbreaks, and model power, then corroborates the theory with a Tamil-language case study involving Llama-2, Tamil-Llama, and GPT-4o. In the experiments, detection quality is uneven across models: GPT-4o provides the most robust jailbreak detection, while the weaker models struggle, which has practical implications for benchmarking and defense design. The work argues for a proactive defense posture that emphasizes discovering new attack vectors and patching systems, and it discusses broader implications for AI safety, cross-language generalization, and related detection challenges in AI systems.
Abstract
We introduce two paradoxes concerning the jailbreaking of foundation models. First, it is impossible to construct a perfect jailbreak classifier. Second, a weaker model cannot consistently detect whether a stronger (in a Pareto-dominant sense) model is jailbroken. We provide formal proofs for these paradoxes and a short case study on Llama and GPT-4o to illustrate them. We discuss the broader theoretical and practical repercussions of these results.
