Jailbreak Attack Initializations as Extractors of Compliance Directions
Amit Levi, Rom Himelstein, Yaniv Nemcovsky, Avi Mendelson, Chaim Baskin
TL;DR
Safety-aligned LLMs exhibit distinct compliance and refusal directions in the activation space, enabling jailbreaks. The paper introduces CRI, a clustering-based initialization framework that pre-computes self-transfer initializations and, for each prompt, selects the best one using the loss-in-first-step metric $LFS$, projecting the prompt toward the compliance subspace. Across HarmBench and AdvBench with multiple attacks and models, CRI achieves higher attack success rates with lower computational overhead and demonstrates cross-dataset transferability, underscoring the fragility of safety-aligned LLMs. These findings motivate robustness evaluations that target alignment vulnerabilities and inform defenses that impede convergence to common compliance directions.
Abstract
Safety-aligned LLMs respond to prompts with either compliance or refusal, each corresponding to distinct directions in the model's activation space. Recent works show that initializing attacks via self-transfer from other prompts significantly enhances their performance. However, the underlying mechanisms of these initializations remain unclear, and attacks rely on arbitrary or hand-picked initializations. This work shows that gradient-based jailbreak attacks and their subsequent initializations gradually converge to a single compliance direction that suppresses refusal, thereby enabling an efficient transition from refusal to compliance. Based on this insight, we propose CRI, an initialization framework that aims to project unseen prompts further along compliance directions. We demonstrate our approach on multiple attacks, models, and datasets, achieving an increased attack success rate (ASR) and reduced computational overhead, highlighting the fragility of safety-aligned LLMs. A reference implementation is available at: https://amit1221levi.github.io/CRI-Jailbreak-Init-LLMs-evaluation.
