ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

Yein Park, Jungwoo Park, Jaewoo Kang

TL;DR

The paper tackles targeted jailbreaking vulnerabilities in LLM safety by revealing a mechanistic, circuit-level bottleneck: tense-sensitive attention heads that trigger past-tense jailbreaks. It introduces ASGuard, a three-stage framework that first locates vulnerable heads through circuit analysis, then applies a channel-wise activation-scaling vector to suppress the harmful pathway, and finally uses Preventative Fine-Tuning to embed a robust refusal behavior. Empirical results across three open-source LLMs show that ASGuard significantly reduces the attack success rate (ASR) on tense jailbreaks while preserving general capabilities, achieving Pareto-optimal safety-utility trade-offs compared to SFT, DPO, and other interventions. The work demonstrates that careful, interpretable interventions on model internals can yield practical safety gains with limited collateral damage, and it emphasizes the value of mechanistic interpretability for robust AI safety.

Abstract

Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. Tense jailbreaking shows that models which refuse harmful requests often comply when the same requests are rephrased in the past tense, revealing a critical generalization gap in current alignment methods, whose underlying mechanisms are poorly understood. In this work, we introduce Activation-Scaling Guard (ASGuard), an insightful, mechanistically-informed framework that surgically mitigates this specific vulnerability. First, we use circuit analysis to identify the specific attention heads causally linked to the targeted jailbreak, the tense-changing attack. Second, we train a precise, channel-wise scaling vector to recalibrate the activations of tense-vulnerable heads. Lastly, we apply it in a "preventative fine-tuning" stage, forcing the model to learn a more robust refusal mechanism. Across three LLMs, ASGuard effectively reduces the attack success rate of targeted jailbreaking while preserving general capabilities and minimizing over-refusal, achieving a Pareto-optimal balance between safety and utility. Based on mechanistic analysis, our findings underscore how adversarial suffixes suppress the propagation of the refusal-mediating direction. Furthermore, our work showcases how a deep understanding of model internals can be leveraged to develop practical, efficient, and targeted methods for adjusting model behavior, charting a course for more reliable and interpretable AI safety.
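
The channel-wise scaling step in the abstract can be illustrated with a minimal sketch. It only shows how a per-channel scaling vector could be applied to the outputs of selected attention heads; it does not train the vector and is not the authors' implementation. The function name `scale_head_outputs`, the dictionary layout, and all numeric values are assumptions made for illustration. The head index (13, 25) mirrors head L13H25 named in the figure captions; (2, 7) is a made-up untargeted head.

```python
from typing import Dict, List, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def scale_head_outputs(
    head_outputs: Dict[Head, List[float]],
    scaling: Dict[Head, List[float]],
) -> Dict[Head, List[float]]:
    """Multiply each targeted head's output channel-wise by its scaling vector.

    Heads without an entry in `scaling` pass through unchanged.
    """
    scaled: Dict[Head, List[float]] = {}
    for head, activation in head_outputs.items():
        vector = scaling.get(head)
        if vector is None:
            # Untargeted head: copy the activation as-is.
            scaled[head] = list(activation)
            continue
        if len(vector) != len(activation):
            raise ValueError(f"scaling vector for head {head} has wrong length")
        # Element-wise (channel-wise) product of activation and scaling vector.
        scaled[head] = [a * s for a, s in zip(activation, vector)]
    return scaled


# Hypothetical activations with a toy head dimension of 4.
outputs = {(13, 25): [0.5, -2.0, 1.0, 4.0], (2, 7): [1.0, 1.0, 1.0, 1.0]}
# Hypothetical scaling vector for the single targeted head.
scaling = {(13, 25): [1.0, 0.5, 0.0, 2.0]}
scaled = scale_head_outputs(outputs, scaling)
```

In this sketch a channel scaled by 0.0 is fully suppressed and a channel scaled by 1.0 is left untouched, which is one way to read "recalibrating" a head without removing it entirely.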

Paper Structure

This paper contains 34 sections, 13 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Overview of ASGuard. We first localize jailbreak-vulnerable attention heads through circuit construction using successful attack cases. After filtering for heads that appear only in tense-vulnerable circuits, by comparing them against attack-failure circuits, we train an attention-head scaling vector that steers activations toward a predefined refusal answer. Lastly, we freeze the vector, attach it to the LLM, and fine-tune the model on a tense refusal dataset. The LLM learns a more robust refusal behavior while preserving general capabilities and minimizing over-refusal. The scaling vector is then no longer needed, so we detach it to avoid over-boosting refusal. The results in Table \ref{table:main-results} show that our method decreases the attack success rate of the targeted jailbreak with a more balanced safety-utility trade-off.
  • Figure 2: Safety–Utility Pareto frontier across base models. Each panel plots the ASR reduction in percentage points, normalized to the base model, on the $x$-axis and the R-Score on the $y$-axis; points denote methods (icons in legend). Non-dominated sets are connected by a solid line. Dashed guide lines indicate Overall scores. ASGuard is labeled; axes and scales are identical across panels.
  • Figure 3: Linear probe analysis of Llama3.1 8B. (A) shows the classification accuracy of a linear probe trained on the activations of each identified vulnerable head in Llama3.1 to distinguish between past and present tense. High accuracy confirms that these heads specialize in processing tense information. The arrow indicates the accuracy change after ASGuard. (B) shows the distribution of dot-product scores between the activation of head L13H25 and its corresponding linear probe vector. The distinct separation between past- and present-tense prompts confirms the head's specialized function.
  • Figure 4: Safety attention heads of Llama3.1-8B identified with the Safety Attention Head AttRibution Algorithm (Sahara) \cite{zhou2025on}. White boxes denote safety-related attention heads found through Sahara. Red boxes are heads from targeted jailbreak success cases in the "False-to-True" category of the EAP-IG circuits, and blue boxes are general jailbreak-related heads common to both jailbreak success circuits ("False-to-True") and failed circuits ("Always-False"), following §\ref{subsec:tense_circuit}. Dashed boxes are tense-vulnerable heads, as listed in Table \ref{table:tense-heads}, and the highlighted heads are those that distinguish linguistic past and present tense with more than 50% linear probing accuracy (§\ref{subsec:linear}). General jailbreak heads often overlap with the Sahara list, whose main purpose is to find general safety-related heads, whereas targeted vulnerable heads are hard to find with the same method.
  • Figure 5: Results for Qwen2.5 7B. (A) shows the classification accuracy of a linear probe trained on the activations of each identified vulnerable head in Llama3.1 to distinguish between past and present tense. High accuracy confirms that these heads specialize in processing tense information. The arrow indicates the accuracy change after ASGuard. (B) shows the distribution of dot-product scores between the activation of head L13H25 and its corresponding linear probe vector. The distinct separation between past- and present-tense prompts confirms the head's specialized function.
  • ...and 4 more figures
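
The non-dominated sets connected in Figure 2 can be computed with a short sketch. It assumes that a larger ASR reduction and a larger R-Score are both better, which the caption does not state explicitly. The method names and scores are hypothetical placeholders, not results from the paper, and `pareto_front` is an illustrative helper name.

```python
from typing import Dict, List, Tuple


def pareto_front(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the names of non-dominated points, ordered by the first coordinate.

    A point is dominated if some other point is at least as good on both
    axes and strictly better on at least one (higher is better on both).
    """
    front: List[str] = []
    for name, (x, y) in points.items():
        dominated = any(
            ox >= x and oy >= y and (ox > x or oy > y)
            for other, (ox, oy) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front, key=lambda n: points[n][0])


# Hypothetical (ASR reduction, R-Score) pairs for four placeholder methods.
scores = {
    "method_a": (10.0, 90.0),
    "method_b": (40.0, 80.0),
    "method_c": (30.0, 70.0),  # dominated by method_b on both axes
    "method_d": (50.0, 60.0),
}
front = pareto_front(scores)
print(front)  # ['method_a', 'method_b', 'method_d']
```

The pairwise comparison is quadratic in the number of methods, which is adequate for the handful of points a plot like this contains.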