AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin

Shuo Yang, Qihui Zhang, Yuyang Liu, Xiaojun Jia, Kunpeng Ning, Jiayu Yao, Jigang Wang, Hailiang Dai, Yibing Song, Li Yuan

TL;DR

This work identifies a safety-vulnerability phenomenon in LLM fine-tuning, revealing a narrow safety basin in parameter space in which updates along the alignment direction $d_{\text{aligned}}$ preserve safety while orthogonal updates $d^{\perp}_{\text{harm}}$ quickly degrade it. To exploit this, AsFT penalizes updates orthogonal to the alignment direction, anchoring fine-tuning within the safety basin without sacrificing performance. Across eight datasets and four models, AsFT achieves a favorable safety-performance balance, reducing harmful outputs and maintaining or improving task accuracy, and demonstrates robustness to datasets, sample sizes, and learning-rate variations. The approach generalizes to full-parameter tuning and is adaptable to open-world deployment with different vendor model configurations, offering practical defenses against harmful fine-tuning attacks and jailbreak attempts. Overall, AsFT provides a data-free, continuous optimization strategy that leverages the latent alignment direction to stabilize safety during LLM fine-tuning.

Abstract

Fine-tuning large language models (LLMs) improves performance but introduces critical safety vulnerabilities: even minimal harmful data can severely compromise safety measures. We observe that perturbations orthogonal to the alignment direction - defined by weight differences between aligned (safe) and unaligned models - rapidly compromise model safety. In contrast, updates along the alignment direction largely preserve it, revealing the parameter space as a "narrow safety basin". To address this, we propose AsFT (Anchoring Safety in Fine-Tuning) to maintain safety by explicitly constraining update directions during fine-tuning. By penalizing updates orthogonal to the alignment direction, AsFT effectively constrains the model within the "narrow safety basin," thus preserving its inherent safety. Extensive experiments on multiple datasets and models show that AsFT reduces harmful behaviors by up to 7.60%, improves task performance by 3.44%, and consistently outperforms existing methods across multiple tasks.
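The idea of penalizing updates orthogonal to the alignment direction can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes that each weight tensor is flattened to a vector, that the alignment direction is the normalized difference between aligned and unaligned weights, and that the penalty is the squared norm of the orthogonal component scaled by a coefficient `lam`. The function names (`alignment_direction`, `split_update`, `regularized_loss`) are hypothetical.

```python
import numpy as np


def alignment_direction(w_aligned, w_unaligned):
    """Unit vector pointing from the unaligned weights to the aligned weights.

    Assumption: the weight tensor is flattened and treated as one vector.
    """
    d = (np.asarray(w_aligned, dtype=float) - np.asarray(w_unaligned, dtype=float)).ravel()
    norm = np.linalg.norm(d)
    if norm == 0.0:
        raise ValueError("aligned and unaligned weights are identical")
    return d / norm


def split_update(delta_w, d_unit):
    """Split an update into components parallel and orthogonal to d_unit."""
    delta_w = np.asarray(delta_w, dtype=float)
    flat = delta_w.ravel()
    parallel = np.dot(flat, d_unit) * d_unit
    orthogonal = flat - parallel
    return parallel.reshape(delta_w.shape), orthogonal.reshape(delta_w.shape)


def regularized_loss(task_loss, delta_w, d_unit, lam):
    """Task loss plus lam times the squared norm of the orthogonal component.

    Assumption: a squared-norm penalty; the paper's exact regularizer may differ.
    """
    _, orthogonal = split_update(delta_w, d_unit)
    return float(task_loss) + lam * float(np.sum(orthogonal ** 2))


# Toy example: the alignment direction is the first coordinate axis.
w_unaligned = np.zeros((2, 2))
w_aligned = np.array([[1.0, 0.0], [0.0, 0.0]])
d_unit = alignment_direction(w_aligned, w_unaligned)

delta_w = np.array([[3.0, 4.0], [0.0, 0.0]])
parallel, orthogonal = split_update(delta_w, d_unit)
loss = regularized_loss(task_loss=1.0, delta_w=delta_w, d_unit=d_unit, lam=0.5)
```

In this toy case only the entry `4.0` lies off the alignment direction, so it alone contributes to the penalty, while the component along the alignment direction is left unpenalized.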

Paper Structure

This paper contains 30 sections, 2 equations, 6 figures, 14 tables.

Figures (6)

  • Figure 1: (a) The Safety Basin [peng2024navigating] shows a region where perturbations along $d_{\text{random}}$ preserve model safety, while safety sharply declines outside this area. (b) The Narrow Safety Basin demonstrates the asymmetry between $d_{\text{aligned}}$ and $d_{\text{harm}}$: $d_{\text{aligned}}$ tolerates larger perturbations, while $d_{\text{harm}}$ causes sharp safety declines. In both subfigures, lower values indicate higher safety.
  • Figure 2: The proposed AsFT decomposes parameter updates into $d_{\text{aligned}}$ and $d^{\perp}_{\text{harm}}$, suppresses harmful updates along $d^{\perp}_{\text{harm}}$ by regularization and constrains updates within the narrow safety basin.
  • Figure 3: Safety landscape of Qwen-2-7B (left) and Gemma-2-9B (right) anchored along $d_{\text{aligned}}$.
  • Figure 4: (a) Restricting updates along $d^{\perp}_{\text{harm}}$ (AsFT) significantly reduces harmful scores as $\lambda$ increases, while maintaining fine-tuning accuracy. (b) Restricting updates along $d_{\text{aligned}}$ results in consistently high harmful scores. (c) Comparison of robustness to learning rate variations shows that AsFT achieves a broader effective range than data-driven methods (SafeInstr [bianchi2023safety] and BEA [wangbackdooralign]).
  • Figure 5: Safety landscape of Qwen-2-7B (left) and Gemma-2-9B (right) anchored along $d_{\text{aligned}}$.
  • ...and 1 more figure
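The safety landscapes in the captions above plot a harmfulness score as the weights are perturbed along a chosen direction. A minimal sketch of that kind of one-dimensional probe follows. It assumes the landscape is obtained by evaluating a score at `theta + alpha * direction` for a range of step sizes `alpha`. The `toy_harm_score` function is a hypothetical stand-in for a real harmfulness evaluator, chosen only so that the basin is wide along one axis and narrow along the other, and `probe_landscape` is likewise an illustrative name.

```python
import numpy as np


def probe_landscape(theta, direction, alphas, score_fn):
    """Evaluate score_fn at theta + alpha * unit(direction) for each alpha."""
    theta = np.asarray(theta, dtype=float)
    direction = np.asarray(direction, dtype=float)
    unit = direction / np.linalg.norm(direction)
    return [float(score_fn(theta + alpha * unit)) for alpha in alphas]


def toy_harm_score(w):
    """Hypothetical score (lower is safer): flat along axis 0, steep along axis 1."""
    return 0.01 * w[0] ** 2 + w[1] ** 2


theta = np.zeros(2)                 # stand-in for the aligned model's weights
d_aligned = np.array([1.0, 0.0])    # assumed wide direction of the basin
d_harm = np.array([0.0, 1.0])       # assumed narrow direction of the basin
alphas = [-2.0, 0.0, 2.0]

along_aligned = probe_landscape(theta, d_aligned, alphas, toy_harm_score)
along_harm = probe_landscape(theta, d_harm, alphas, toy_harm_score)
```

With this toy score, the same step size raises the score far more along `d_harm` than along `d_aligned`, which mirrors the asymmetry described for the narrow safety basin without reproducing any of the paper's measurements.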