
Jailbreak Transferability Emerges from Shared Representations

Rico Angell, Jannik Brinkmann, He He

TL;DR

We show that deliberately increasing similarity through benign-only distillation causally increases transfer, and we reframe jailbreak transfer as a consequence of representation alignment rather than a fragile byproduct of safety training.

Abstract

Jailbreak transferability is the surprising phenomenon in which an adversarial attack that compromises one model also elicits harmful responses from other models. Despite widespread demonstrations, there is little consensus on why transfer is possible: is it a quirk of safety training, an artifact of model families, or a more fundamental property of representation learning? We present evidence that transferability emerges from shared representations rather than incidental flaws. Across 20 open-weight models and 33 jailbreak attacks, we find two factors that systematically shape transfer: (1) representational similarity under benign prompts, and (2) the strength of the jailbreak on the source model. To move beyond correlation, we show that deliberately increasing similarity through benign-only distillation causally increases transfer. Our qualitative analyses reveal systematic transferability patterns across different types of jailbreaks. For example, persona-style jailbreaks transfer far more often than cipher-based prompts, consistent with the idea that natural-language attacks exploit models' shared representation space, whereas cipher-based attacks rely on idiosyncratic quirks that do not generalize. Together, these results reframe jailbreak transfer as a consequence of representation alignment rather than a fragile byproduct of safety training.

Paper Structure

This paper contains 41 sections, 7 equations, 22 figures, 2 tables.

Figures (22)

  • Figure 1: Model similarity causally influences jailbreak transferability. Given a jailbreak that elicits a harmful response from the pink model and a refusal from the dissimilar green model, we can causally influence the transferability to a third model. The purple model is the result of fine-tuning the green model on benign data sampled from the pink model. Distilling on benign data increases the model similarity, and thus, increases the chances the jailbreak transfers to the purple model.
  • Figure 2: Pairwise model similarity is a strong predictor of jailbreak transferability. (left) Each point corresponds to one of the 380 pairs drawn from the 20 open-weight models we study. The blue points represent model pairs where both models are from the same model family and red points represent model pairs where the models are from different families. The x-axis shows representational similarity (mutual $k$-nearest neighbor overlap of hidden representations for 10K Alpaca prompts, $k = 100$); the y-axis shows the symmetric transfer AUROC obtained by averaging the transfer directions of StrongREJECT jailbreaks. We observe that highly similar models never exhibit weak transfer (shaded region). (right) The same data but subsampled to models with 14B parameters or more, with a least-squares fit shown as a purple dashed line. The upward trend confirms a roughly monotonic relationship: models that "think alike" (higher representational similarity) are consistently more vulnerable to the same jailbreaks (higher symmetric transfer AUROC).
  • Figure 3: Distillation on benign data causally increases transferability. In this example, the evil_confidant jailbreak elicits a harmful response from Gemma2-27B and a refusal from Qwen2.5-14B. When we fine-tune Qwen2.5-14B on benign prompt-response pairs sampled from Gemma2-27B, the resulting model becomes susceptible to the evil_confidant jailbreak, because distillation increases the similarity between the distilled model and Gemma2-27B.
  • Figure 4: Distillation increases representation similarity. Each panel shows the evolution of model similarity (solid line, left y-axis) and training loss (dashed line, right y-axis) across a single epoch of distillation for three teacher-student pairs. In all cases, model similarity increases sharply early in training and then plateaus. This suggests that most of the representational alignment happens early in the distillation process.
  • Figure 5: Distillation improves transferability of passive jailbreaks across models and strength thresholds. Each panel shows the mean transfer score (see Equation \ref{eq:distill_transfer_metric}) over the course of distillation for jailbreaks on the source model and evaluated on the target model. Lines indicate different strength thresholds $\tau$ used to filter strong source jailbreaks. Across all three teacher-student pairs, distillation improves transferability, particularly at higher thresholds. These trends do not map one-to-one onto the model similarity curves in Figure 4; the differences likely arise because representational similarity is measured at one layer, whereas transfer success reflects global model behavior.
  • ...and 17 more figures
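
The Figure 2 caption measures representational similarity as the mutual $k$-nearest neighbor overlap of hidden representations. The following is a minimal sketch of one common way to compute such a score, not the authors' implementation. It assumes cosine similarity as the neighbor metric, one representation vector per prompt with no all-zero rows, and the hypothetical function names `knn_indices` and `mutual_knn_overlap`. The layer choice and how token representations are pooled into one vector are not specified in the text above and are left out.

```python
import numpy as np


def knn_indices(feats: np.ndarray, k: int) -> np.ndarray:
    """Return, for each row, the indices of its k nearest rows by cosine similarity."""
    # L2-normalize rows so that the inner product equals cosine similarity.
    # Assumption: no row is all zeros.
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = normed @ normed.T
    # A prompt is never counted as its own neighbor.
    np.fill_diagonal(sims, -np.inf)
    return np.argsort(-sims, axis=1)[:, :k]


def mutual_knn_overlap(feats_a: np.ndarray, feats_b: np.ndarray, k: int) -> float:
    """Average fraction of shared k-nearest neighbors between two models.

    feats_a and feats_b hold representations of the same prompts, in the same
    row order, from two different models. Their feature dimensions may differ.
    """
    if feats_a.shape[0] != feats_b.shape[0]:
        raise ValueError("both models must be evaluated on the same prompts")
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    # For each prompt, count the neighbors that both models agree on.
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlaps))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    model_a = rng.normal(size=(200, 16))   # stand-in for model A's hidden states
    model_b = rng.normal(size=(200, 32))   # stand-in for model B's hidden states
    print(mutual_knn_overlap(model_a, model_a, k=10))
    print(mutual_knn_overlap(model_a, model_b, k=10))
```

The score lies in [0, 1]: it equals 1 when both models induce identical neighborhoods over the prompts and falls toward the chance level of roughly $k/(n-1)$ for unrelated representations. This sketch builds the full $n \times n$ similarity matrix, so at the caption's scale (10K prompts, $k = 100$) a chunked or approximate neighbor search may be preferable.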