Table of Contents
Fetching ...

Jailbreaking is (Mostly) Simpler Than You Think

Mark Russinovich, Ahmed Salem

TL;DR

The paper addresses how safety mechanisms in AI systems can be bypassed by exploiting context rather than prompts. It introduces Context Compliance Attack (CCA), an optimization-free method that manipulates conversation history to trigger unsafe outputs. Through broad experiments across numerous models, the authors show that most systems are vulnerable, with some resistance observed (e.g., Llama-2), and that the vulnerability can compound after initial breaches. They propose server-side history management and cryptographic history signatures as mitigations and call for stronger context integrity validation in future AI safety research.

Abstract

We introduce the Context Compliance Attack (CCA), a novel, optimization-free method for bypassing AI safety mechanisms. Unlike current approaches -- which rely on complex prompt engineering and computationally intensive optimization -- CCA exploits a fundamental architectural vulnerability inherent in many deployed AI systems. By subtly manipulating conversation history, CCA convinces the model to comply with a fabricated dialogue context, thereby triggering restricted behavior. Our evaluation across a diverse set of open-source and proprietary models demonstrates that this simple attack can circumvent state-of-the-art safety protocols. We discuss the implications of these findings and propose practical mitigation strategies to fortify AI systems against such elementary yet effective adversarial tactics.

Jailbreaking is (Mostly) Simpler Than You Think

TL;DR

The paper addresses how safety mechanisms in AI systems can be bypassed by exploiting context rather than prompts. It introduces Context Compliance Attack (CCA), an optimization-free method that manipulates conversation history to trigger unsafe outputs. Through broad experiments across numerous models, the authors show that most systems are vulnerable, with some resistance observed (e.g., Llama-2), and that the vulnerability can compound after initial breaches. They propose server-side history management and cryptographic history signatures as mitigations and call for stronger context integrity validation in future AI safety research.

Abstract

We introduce the Context Compliance Attack (CCA), a novel, optimization-free method for bypassing AI safety mechanisms. Unlike current approaches -- which rely on complex prompt engineering and computationally intensive optimization -- CCA exploits a fundamental architectural vulnerability inherent in many deployed AI systems. By subtly manipulating conversation history, CCA convinces the model to comply with a fabricated dialogue context, thereby triggering restricted behavior. Our evaluation across a diverse set of open-source and proprietary models demonstrates that this simple attack can circumvent state-of-the-art safety protocols. We discuss the implications of these findings and propose practical mitigation strategies to fortify AI systems against such elementary yet effective adversarial tactics.

Paper Structure

This paper contains 10 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Example of a real CCA conversation for constructing a pipe bomb using the Phi-4 model.