ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack
Yein Park, Jungwoo Park, Jaewoo Kang
TL;DR
The paper tackles targeted jailbreaking vulnerabilities in LLM safety by revealing a mechanistic, circuit-level bottleneck: tense-sensitive attention heads that trigger past-tense jailbreaks. It introduces ASGuard, a three-stage framework that first locates vulnerable heads through circuit analysis, then applies a channel-wise activation-scaling vector to suppress the harmful pathway, and finally uses Preventative Fine-Tuning to embed a robust refusal behavior. Empirical results across three open-source LLMs show ASGuard significantly reduces ASR on tense jailbreaks while preserving general capabilities, achieving Pareto-optimal safety-utility trade-offs compared to SFT, DPO, and other interventions. The work demonstrates that careful, interpretable interventions at the internals of models can yield practical safety gains with limited collateral damage, and it emphasizes the value of mechanistic interpretability for robust AI safety.
Abstract
Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. As tense jailbreaking demonstrates that models refusing harmful requests often comply when rephrased in past tense, a critical generalization gap is revealed in current alignment methods whose underlying mechanisms are poorly understood. In this work, we introduce Activation-Scaling Guard (ASGuard), an insightful, mechanistically-informed framework that surgically mitigates this specific vulnerability. For the first step, we use circuit analysis to identify the specific attention heads causally linked to the targeted jailbreaking, the tense-changing attack. Second, we train a precise, channel-wise scaling vector to recalibrate the activation of tense vulnerable heads. Lastly, we apply it into a "preventative fine-tuning", forcing the model to learn a more robust refusal mechanism. Across three LLMs, ASGuard effectively reduces the attack success rate of targeted jailbreaking while preserving general capabilities and minimizing over refusal, achieving a Pareto-optimal balance between safety and utility. Our findings underscore how adversarial suffixes suppress the propagation of the refusal-mediating direction, based on mechanistic analysis. Furthermore, our work showcases how a deep understanding of model internals can be leveraged to develop practical, efficient, and targeted methods for adjusting model behavior, charting a course for more reliable and interpretable AI safety.
