Jailbreak Attack Initializations as Extractors of Compliance Directions

Amit Levi, Rom Himelstein, Yaniv Nemcovsky, Avi Mendelson, Chaim Baskin

TL;DR

Safety-aligned LLMs exhibit distinct compliance and refusal directions in the activation space, and jailbreak attacks exploit this separation. The paper introduces CRI, a clustering-based initialization framework that pre-computes self-transfer initializations and, for each prompt, selects the one with the lowest loss-in-first-step (LFS), projecting the prompt toward the compliance subspace. Across HarmBench and AdvBench, with multiple attacks and models, CRI achieves higher attack success rates at lower computational overhead and transfers across datasets, underscoring the fragility of safety-aligned LLMs. These findings motivate robustness evaluations that target alignment vulnerabilities and inform defenses that impede convergence to common compliance directions.

Abstract

Safety-aligned LLMs respond to prompts with either compliance or refusal, each corresponding to distinct directions in the model's activation space. Recent work shows that initializing attacks via self-transfer from other prompts significantly enhances their performance. However, the underlying mechanisms of these initializations remain unclear, and attacks rely on arbitrary or hand-picked initializations. This work shows that each gradient-based jailbreak attack and subsequent initialization gradually converge to a single compliance direction that suppresses refusal, thereby enabling an efficient transition from refusal to compliance. Based on this insight, we propose CRI, an initialization framework that aims to project unseen prompts further along compliance directions. We demonstrate our approach on multiple attacks, models, and datasets, achieving an increased attack success rate (ASR) and reduced computational overhead, highlighting the fragility of safety-aligned LLMs. A reference implementation is available at: https://amit1221levi.github.io/CRI-Jailbreak-Init-LLMs-evaluation.

Paper Structure

This paper contains 73 sections, 13 equations, 31 figures, 7 tables, 1 algorithm.

Figures (31)

  • Figure 1: Visualization of CRI compared to standard initialization on the HarmBench dataset over the Llama-2 model.
  • Figure 2: Illustration of the K-CRI framework. First, $N$ harmful prompts are clustered into $K$ groups, and a candidate attack is trained for each cluster (left). When a new prompt arrives, CRI evaluates the LFS of the $K$ attacks and selects the one with the lowest value as the initialization for a refined attack (right).
  • Figure 3: Comparison of the directions' cosine similarity during GCG's optimization process on the HarmBench dataset over the Llama-2 model. We compare the refusal direction with attacks and self-transfer initializations (left), and present the similarity matrices of attacks (center) and initializations (right).
  • Figure 4: Distribution of the per-prompt Spearman correlation between LFS and the number of steps to success, on Llama-2 (median $r=0.74$, $p=4.61\times10^{-4}$).
  • Figure 5: Comparison of ASR and number of steps on the HarmBench dataset for the GCG (top) and Embedding (bottom) attacks over the Llama-2 (left) and Vicuna (right) models.
  • ...and 26 more figures
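
The two-stage selection described in the Figure 2 caption (cluster, then pick the candidate with the lowest LFS) can be sketched on toy data as follows. This is only an illustrative sketch, not the authors' implementation: the function names `kmeans` and `select_initialization`, the synthetic embeddings, the placeholder candidate labels, and the hard-coded loss values are all assumptions made for the example. The clustering algorithm is also an assumption, since the caption does not say which one the paper uses.

```python
import numpy as np


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on an (N, d) array; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points (fancy indexing returns a copy).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Distance from every point to every center, shape (N, k).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels, centers


def select_initialization(candidates, first_step_loss):
    """Return (index of the lowest-loss candidate, list of all losses)."""
    losses = [first_step_loss(c) for c in candidates]
    return int(np.argmin(losses)), losses


# Toy stand-in for prompt embeddings: two well-separated synthetic blobs.
rng = np.random.default_rng(0)
prompt_embeddings = np.vstack(
    [rng.normal(0.0, 0.1, (10, 4)), rng.normal(5.0, 0.1, (10, 4))]
)
labels, centers = kmeans(prompt_embeddings, k=2)

# One placeholder candidate per cluster; the losses are made-up numbers
# standing in for an LFS evaluation on a new prompt (hypothetical values).
candidates = [f"init_for_cluster_{j}" for j in range(len(centers))]
toy_losses = {"init_for_cluster_0": 2.3, "init_for_cluster_1": 0.7}
best, losses = select_initialization(candidates, toy_losses.__getitem__)
print(candidates[best], losses)
```

The selection step costs one loss evaluation per candidate, so its overhead grows linearly in $K$ and is independent of the number $N$ of prompts used to build the clusters.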