Surgery: Mitigating Harmful Fine-Tuning for Large Language Models via Attention Sink

Guozhi Liu, Weiwei Lin, Tiansheng Huang, Ruichao Mo, Qi Mu, Xiumin Wang, Li Shen

TL;DR

This work addresses the safety risks of harmful fine-tuning by exploiting the attention sink mechanism and proposing a separable sink-divergence hypothesis. It defines a per-head sink divergence $d_h$ and proposes Surgery, a fine-tuning-stage regularizer that minimizes $f(\bm w) + \lambda \frac{1}{|\mathcal H|} \sum_{h} \text{ReLU}(d_h)$ to steer attention heads toward the negative sink-divergence group, thereby reducing the model's tendency to learn harmful patterns. Empirical results across multiple models, datasets, and benchmarks show that Surgery substantially reduces harmful outputs while preserving task performance, with low overhead relative to baselines. The study highlights the practical potential of using intrinsic attention mechanisms to improve safety in LLM deployment, and points to early-layer dynamics and robustness as avenues for further improvement.
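The objective above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names (`relu`, `sink_divergence_penalty`, `surgery_objective`) are hypothetical, the per-head sink divergences $d_h$ are assumed to be already computed (their definition is not given in this summary), and `task_loss` stands in for $f(\bm w)$.

```python
from typing import Sequence


def relu(x: float) -> float:
    """Rectified linear unit: returns x when positive, otherwise 0."""
    return x if x > 0.0 else 0.0


def sink_divergence_penalty(sink_divergences: Sequence[float]) -> float:
    """Mean of ReLU(d_h) over all attention heads h in the head set H.

    Heads with d_h <= 0 contribute nothing; only heads in the positive
    sink-divergence group are penalized.
    """
    if not sink_divergences:
        raise ValueError("need at least one attention head")
    return sum(relu(d) for d in sink_divergences) / len(sink_divergences)


def surgery_objective(task_loss: float,
                      sink_divergences: Sequence[float],
                      lam: float) -> float:
    """Regularized objective: f(w) + lambda * (1/|H|) * sum_h ReLU(d_h).

    task_loss         -- stand-in for f(w) (assumed: the fine-tuning loss)
    sink_divergences  -- hypothetical precomputed d_h values, one per head
    lam               -- regularization strength lambda
    """
    return task_loss + lam * sink_divergence_penalty(sink_divergences)


# Toy example with four heads: two negative, two positive.
d = [-0.5, 0.25, 0.75, -1.0]
penalty = sink_divergence_penalty(d)      # (0.25 + 0.75) / 4
total = surgery_objective(2.0, d, 0.5)    # 2.0 + 0.5 * penalty
```

Because of the ReLU, heads that are already in the negative group leave the objective unchanged, so the penalty only pushes on heads with $d_h > 0$. In actual training the $d_h$ values would need to be differentiable functions of the model weights (for example, tensors in an autodiff framework) rather than plain floats.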

Abstract

Harmful fine-tuning can invalidate the safety alignment of large language models, exposing significant safety risks. In this paper, we utilize the attention sink mechanism to mitigate harmful fine-tuning. Specifically, we first measure a statistic named sink divergence for each attention head and observe that different attention heads exhibit two different signs of sink divergence. To understand its safety implications, we conduct experiments and find that the number of attention heads with positive sink divergence increases as the model's harmfulness increases during harmful fine-tuning. Based on this finding, we propose a separable sink divergence hypothesis: attention heads associated with learning harmful patterns during fine-tuning are separable by the sign of their sink divergence. Based on the hypothesis, we propose a fine-tuning-stage defense, dubbed Surgery. Surgery utilizes a regularizer for sink divergence suppression, which steers attention heads toward the negative sink divergence group, thereby reducing the model's tendency to learn and amplify harmful patterns. Extensive experiments demonstrate that Surgery improves defense performance by 5.90%, 11.25%, and 9.55% on the BeaverTails, HarmBench, and SorryBench benchmarks, respectively. Source code is available at https://github.com/Lslland/Surgery.

Paper Structure

This paper contains 24 sections, 9 equations, 10 figures, 9 tables, 1 algorithm.

Figures (10)

  • Figure 1: Illustration of the sink value of two different attention heads. Left/Right: attention heads with high/low sink value.
  • Figure 2: Sink divergence separates two groups of attention heads.
  • Figure 3: Illustration of the relationship between the two groups of attention heads and model safety. Left: increasing the harmful ratio increases the model's harmfulness. Middle: increasing the harmful ratio shifts attention heads from the negative sink divergence group toward the positive sink divergence group (e.g., for Lisa, the number of heads with sink divergence $> 0$ increases from 553 to 580 as the harmful ratio rises from 0 to 0.5). Right: disabling attention heads with positive sink divergence suppresses the model's harmfulness.
  • Figure 4: The proposed Surgery performs sink divergence suppression during the fine-tuning stage, steering attention heads toward the sink divergence $< 0$ group.
  • Figure 5: The sink divergence of each attention head. Left: before Surgery training. Right: after Surgery training.
  • ...and 5 more figures