Jailbreak Transferability Emerges from Shared Representations

Rico Angell; Jannik Brinkmann; He He

TL;DR

We show that deliberately increasing representational similarity through benign-only distillation causally increases jailbreak transfer, and we reframe transfer as a consequence of representation alignment rather than a fragile byproduct of safety training.

Abstract

Jailbreak transferability is the surprising phenomenon in which an adversarial attack that compromises one model also elicits harmful responses from other models. Despite widespread demonstrations, there is little consensus on why transfer is possible: is it a quirk of safety training, an artifact of model families, or a more fundamental property of representation learning? We present evidence that transferability emerges from shared representations rather than incidental flaws. Across 20 open-weight models and 33 jailbreak attacks, we find two factors that systematically shape transfer: (1) representational similarity under benign prompts, and (2) the strength of the jailbreak on the source model. To move beyond correlation, we show that deliberately increasing similarity through benign-only distillation causally increases transfer. Our qualitative analyses reveal systematic transferability patterns across different types of jailbreaks. For example, persona-style jailbreaks transfer far more often than cipher-based prompts, consistent with the idea that natural-language attacks exploit models' shared representation space, whereas cipher-based attacks rely on idiosyncratic quirks that do not generalize. Together, these results reframe jailbreak transfer as a consequence of representation alignment rather than a fragile byproduct of safety training.
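
The abstract does not say how representational similarity under benign prompts is measured. As an illustration only, the sketch below uses linear centered kernel alignment (CKA), one common similarity measure; choosing it here is an assumption, not a statement of the authors' method. The arrays stand in for hidden activations that two models would produce on the same benign prompts, and the names `linear_cka`, `acts_a`, `acts_b`, and `acts_c` are hypothetical.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two activation matrices with shape (n_prompts, dim).

    Assumption: rows of x and y correspond to the same benign prompts.
    The two models may have different hidden dimensions.
    """
    # Center each feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by
    # the Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Synthetic stand-ins for real activations (illustrative data only).
rng = np.random.default_rng(0)
acts_a = rng.normal(size=(64, 16))

# "Model B" is an orthogonal rotation of model A: same geometry, new basis.
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
acts_b = acts_a @ rotation

# "Model C" has unrelated activations.
acts_c = rng.normal(size=(64, 16))

sim_ab = linear_cka(acts_a, acts_b)
sim_ac = linear_cka(acts_a, acts_c)
```

Linear CKA is invariant to orthogonal rotations of the feature space, so `sim_ab` equals 1 up to floating-point error while `sim_ac` is lower. Under the abstract's claim, a pair like A and B would be expected to show more jailbreak transfer than a pair like A and C, all else being equal.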
