Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization

Chengcan Wu, Zhixin Zhang, Zeming Wei, Yihao Zhang, Meng Sun

TL;DR

The paper addresses safety degradation during fine-tuning of large language models, which it attributes to entanglement between safety-critical and usefulness-critical gradient directions. It introduces Safety-Aware Probing (SAP), a framework that injects a safety-aware probe into the hidden states during gradient propagation, guided by a contrastive safety objective and a bi-level optimization procedure. SAP reduces harmfulness below that of the original fine-tuned model while achieving test loss comparable to standard fine-tuning, and it shows robustness under adversarial fine-tuning as well as compatibility with existing safety defenses. The method applies across multiple fine-tuning paradigms and datasets, offering a scalable approach to safer deployment of open-source LLMs.

Abstract

The significant progress of large language models (LLMs) has led to remarkable achievements across numerous applications. However, their ability to generate harmful content has sparked substantial safety concerns. Despite the implementation of safety alignment techniques during the pre-training phase, recent research indicates that fine-tuning LLMs on adversarial or even benign data can inadvertently compromise their safety. In this paper, we re-examine the fundamental issue of why fine-tuning on non-harmful data still results in safety degradation. We introduce a safety-aware probing (SAP) optimization framework designed to mitigate the safety risks of fine-tuning LLMs. Specifically, SAP incorporates a safety-aware probe into the gradient propagation process, mitigating the model's risk of safety degradation by identifying potential pitfalls in gradient directions, thereby enhancing task-specific performance while successfully preserving model safety. Our extensive experimental results demonstrate that SAP effectively reduces harmfulness below the original fine-tuned model and achieves comparable test loss to standard fine-tuning methods. Our code is available at https://github.com/ChengcanWu/SAP.

Paper Structure

This paper contains 25 sections, 1 theorem, 20 equations, 7 figures, 10 tables, 1 algorithm.

Key Result

Theorem A.1

In an optimization step for $W$ and $V$ with step sizes $\alpha$ and $\beta$, respectively, we claim that the gradient directions of $L_{su}$ and $-L_\text{safety}$ are approximately the same. That is:

Figures (7)

  • Figure 1: A brief overview of SAP and its comparison with standard fine-tuning. The key design of SAP lies in perturbing the hidden state along safety-critical directions, which helps optimization steer away from potentially harmful regions in advance.
  • Figure 2: Loss of the model on the harmful and useful datasets during training. The model is trained on the useful dataset.
  • Figure 3: The average cosine similarity between the useful-critical and harmful-critical ($+\nabla_W L_\text{safety}$) directions over epochs when fine-tuning on $D_\text{useful}$ (Alpaca). Each bin on the X-axis represents a layer.
  • Figure 4: Aggregated $L_{su}$ during fine-tuning on Llama-2. The plot shows $\sum_{t=1}^nL_{su}^t$, where $L_{su}^t$ is $L_{su}$ on the $t$-th epoch.
  • Figure 5: Harmful scores during adversarial fine-tuning for reasoning tasks. Results for instruction-following tasks and other reasoning tasks (HellaSwag and Agnews) are in the appendix.
  • ...and 2 more figures
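
The per-layer cosine similarity reported in Figure 3 can be illustrated with a minimal sketch. This is not the authors' code: the function names, the dictionary layout (layer name mapped to a flattened gradient), and the toy numbers are all hypothetical, and the sketch assumes the two gradients being compared have already been computed and flattened for each layer.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length gradient vectors."""
    if len(u) != len(v):
        raise ValueError("gradient vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumed convention: report 0 when either gradient vanishes.
        return 0.0
    return dot / (norm_u * norm_v)


def layerwise_cosine(useful_grads, safety_grads):
    """Map each layer name to the cosine similarity of its two gradients.

    Both arguments are hypothetical dicts: layer name -> flattened gradient.
    """
    return {
        name: cosine_similarity(useful_grads[name], safety_grads[name])
        for name in useful_grads
    }


# Toy gradients (made up for illustration, not taken from the paper).
useful_grads = {"layer0": [1.0, 0.0], "layer1": [1.0, 1.0]}
safety_grads = {"layer0": [0.0, 2.0], "layer1": [2.0, 2.0]}

similarity = layerwise_cosine(useful_grads, safety_grads)
for name, value in similarity.items():
    print(f"{name}: {value:.3f}")
```

In this toy input, `layer0` holds orthogonal gradients (similarity 0) and `layer1` holds parallel ones (similarity close to 1). A value near 1 in a given layer would indicate that the two directions are strongly aligned there, which is the kind of entanglement the figure is meant to visualize.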

Theorems & Definitions (4)

  • Definition 3.1: Contrastive safety loss
  • Definition 3.2: Safety-critical direction
  • Theorem A.1: The connection between $L_{su}$ and $L_\text{safety}$
  • Proof: Proof of Theorem A.1