Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation

Yibo Wang, Tiansheng Huang, Li Shen, Huanjin Yao, Haotian Luo, Rui Liu, Naiqiang Tan, Jiaxing Huang, Dacheng Tao

TL;DR

Panacea tackles harmful fine-tuning by learning an adaptive perturbation during fine-tuning and applying it to the model after fine-tuning. The perturbation is obtained through a max-max optimization that increases harmful loss while preserving downstream performance. The inner optimization yields a closed-form perturbation direction within a norm bound, and the outer update steers model parameters toward safety without sacrificing fine-tuning accuracy. Across multiple datasets, tasks, and LLMs, Panacea reduces average harmful scores by up to 21.2% while keeping fine-tuning accuracy essentially unchanged or slightly improved, outperforming existing post-fine-tuning and alignment-stage defenses. The work also reveals that different layers have distinct safety affinity, which suggests layer-specific defenses and offers insights for safety-oriented layer analysis.
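
As a rough illustration of the closed-form inner step, the sketch below assumes the harmful loss is linearized around the current parameters. Under that assumption, the perturbation that maximizes it inside an L2 ball of radius rho is the harmful-loss gradient rescaled to length rho. The function name `closed_form_perturbation`, the parameter `rho`, and the toy gradient values are hypothetical and are not taken from the paper's code.

```python
import math

def closed_form_perturbation(harmful_grad, rho):
    """Return eps maximizing <harmful_grad, eps> subject to ||eps||_2 <= rho.

    Assumption: the harmful loss is approximated to first order, so the
    maximizer is the gradient direction rescaled to length rho.
    """
    norm = math.sqrt(sum(g * g for g in harmful_grad))
    if norm == 0.0:
        # No ascent direction is available; return a zero perturbation.
        return [0.0 for _ in harmful_grad]
    return [rho * g / norm for g in harmful_grad]

# Toy gradient of the harmful loss (hypothetical values).
grad = [3.0, 4.0]
eps = closed_form_perturbation(grad, rho=0.5)
```

Here `eps` points along `grad` and has L2 norm equal to `rho`, so it stays on the boundary of the norm bound mentioned above.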

Abstract

Harmful fine-tuning attacks introduce significant security risks to fine-tuning services. Mainstream defenses aim to vaccinate the model so that a later harmful fine-tuning attack is less effective. However, our evaluation results show that such defenses are fragile: with a few fine-tuning steps, the model can still learn the harmful knowledge. To this end, we conduct further experiments and find that an embarrassingly simple solution, adding purely random perturbations to the fine-tuned model, can recover the model from harmful behaviors, though it degrades the model's fine-tuning performance. To address this degradation, we further propose Panacea, which optimizes an adaptive perturbation that is applied to the model after fine-tuning. Panacea maintains the model's safety alignment performance without compromising downstream fine-tuning performance. Comprehensive experiments are conducted on different harmful ratios, fine-tuning tasks, and mainstream LLMs, where the average harmful scores are reduced by up to 21.2% while maintaining fine-tuning performance. As a by-product, we analyze the adaptive perturbation and show that different layers in various LLMs have distinct safety affinity, which coincides with findings from several previous studies. Source code is available at https://github.com/w-yibo/Panacea.
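
The random-perturbation baseline described in the abstract can be sketched as follows. This is a minimal illustration that assumes zero-mean Gaussian noise scaled by a single intensity value; the noise distribution, the scaling, and the names `add_random_perturbation`, `intensity`, and `seed` are assumptions made here, not details confirmed by the text.

```python
import random

def add_random_perturbation(weights, intensity, seed=0):
    """Return a copy of `weights` with Gaussian noise added to each entry.

    Assumption: noise is drawn i.i.d. from N(0, 1) and scaled by `intensity`.
    An intensity of 0 leaves the weights unchanged.
    """
    rng = random.Random(seed)  # local generator keeps the sketch reproducible
    return [w + intensity * rng.gauss(0.0, 1.0) for w in weights]

# Toy stand-in for the fine-tuned model's parameters (hypothetical values).
fine_tuned = [0.12, -0.40, 0.05]
perturbed = add_random_perturbation(fine_tuned, intensity=0.01)
```

Because this noise ignores both the harmful loss and the fine-tuning loss, it can move the model away from harmful behavior and away from the fine-tuned solution alike, which is the trade-off the adaptive perturbation is meant to avoid.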

Paper Structure

This paper contains 25 sections, 11 equations, 14 figures, 19 tables, 1 algorithm.

Figures (14)

  • Figure 1: The harmful fine-tuning attack for fine-tuning-as-a-service scenarios. Pretrained LLMs are aligned using alignment data to produce aligned LLMs. Aligned LLMs are further fine-tuned using fine-tuning data that may contain harmful data, leading to unsafe fine-tuned models.
  • Figure 2: Post-fine-tuning perturbation. The fine-tuned model exhibits a high harmful score (HS:$\downarrow$). Adding random perturbation reduces the harmful score but also decreases fine-tuning accuracy (FA:$\uparrow$). In contrast, incorporating our post-fine-tuning perturbation (see Algorithm 1) effectively lowers the harmful score while maintaining fine-tuning performance.
  • Figure 3: Model statistics (Left: harmful loss of three methods, Right: harmful score of three methods) after fine-tuning on the fine-tuning dataset (10% of the data is harmful) for different numbers of steps.
  • Figure 4: Model statistics (Left: harmful loss, Right: harmful score$\downarrow$ and fine-tuning accuracy$\uparrow$) for the fine-tuned model with noise intensities of 0 (no noise), 0.001, 0.01, 0.05, and 0.1. (The FA at noise intensity 0.1 is 0.7 and is not shown.)
  • Figure 5: Parameter weights of different LLMs. The parameters in the earlier layers of Llama2-7B (blue) have larger weights, while Gemma2-9B (yellow) and Qwen2-7B (purple) have larger weights in the middle and later layers.
  • ...and 9 more figures