Jailbreaking is (Mostly) Simpler Than You Think
Mark Russinovich, Ahmed Salem
TL;DR
The paper addresses how safety mechanisms in AI systems can be bypassed by exploiting context rather than prompts. It introduces Context Compliance Attack (CCA), an optimization-free method that manipulates conversation history to trigger unsafe outputs. Through broad experiments across numerous models, the authors show that most systems are vulnerable, with some resistance observed (e.g., Llama-2), and that the vulnerability can compound after initial breaches. They propose server-side history management and cryptographic history signatures as mitigations and call for stronger context integrity validation in future AI safety research.
Abstract
We introduce the Context Compliance Attack (CCA), a novel, optimization-free method for bypassing AI safety mechanisms. Unlike current approaches -- which rely on complex prompt engineering and computationally intensive optimization -- CCA exploits a fundamental architectural vulnerability inherent in many deployed AI systems. By subtly manipulating conversation history, CCA convinces the model to comply with a fabricated dialogue context, thereby triggering restricted behavior. Our evaluation across a diverse set of open-source and proprietary models demonstrates that this simple attack can circumvent state-of-the-art safety protocols. We discuss the implications of these findings and propose practical mitigation strategies to fortify AI systems against such elementary yet effective adversarial tactics.
