Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning

Tiansheng Huang, Gautam Bhattacharya, Pratik Joshi, Josh Kimball, Ling Liu

TL;DR

Antidote introduces a post-fine-tuning realignment step based on one-shot pruning to counter harmful fine-tuning attacks on safety-aligned LLMs, addressing the hyper-parameter sensitivity of prior defenses. By computing the Wanda score on a realignment dataset, it identifies and removes harmful parameters, reducing the harmful score (HS) with only a modest loss in fine-tuning accuracy (FA) across multiple models and datasets, and with minimal system overhead. The approach is validated through extensive experiments, including generalization to different tasks, datasets, and model sizes, and is shown to be complementary to existing defense strategies. This work provides a practical, hyper-parameter-agnostic defense suitable for real-world fine-tuning services.

Abstract

Safety-aligned Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks: a few harmful samples mixed into the fine-tuning dataset can break the LLM's safety alignment. While several defenses have been proposed, our evaluation shows that existing defenses fail when some specific training hyper-parameters are chosen: a large learning rate or a large number of training epochs in the fine-tuning stage can easily invalidate the defense. To this end, we propose Antidote, a post-fine-tuning stage solution, which remains agnostic to the training hyper-parameters in the fine-tuning stage. Antidote relies on the philosophy that by removing the harmful parameters, the harmful model can be recovered from the harmful behaviors, regardless of how those harmful parameters are formed in the fine-tuning stage. With this philosophy, we introduce a one-shot pruning stage after harmful fine-tuning to remove the harmful weights that are responsible for the generation of harmful content. Despite its embarrassing simplicity, empirical results show that Antidote can reduce harmful score while maintaining accuracy on downstream tasks. Code is available at https://github.com/git-disl/Antidote.
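
The one-shot pruning idea can be illustrated with a minimal NumPy sketch for a single linear layer. It is not the authors' implementation (see the repository linked above for that). It assumes the standard Wanda score, i.e. the weight magnitude multiplied by the L2 norm of the corresponding input feature, computed here over activations from the realignment dataset, and it assumes that the highest-scoring weights are the ones treated as harmful and zeroed. The function names and the `fraction` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def wanda_score(weight, activations):
    """Wanda importance score for one linear layer.

    weight:      (out_features, in_features) weight matrix.
    activations: (n_tokens, in_features) layer inputs collected while
                 running the realignment dataset (assumed to be given).
    """
    # L2 norm of each input feature across all tokens: shape (in_features,)
    feature_norms = np.linalg.norm(activations, axis=0)
    # |W_ij| * ||X_j||_2, broadcast across the rows of the weight matrix
    return np.abs(weight) * feature_norms


def prune_top_fraction(weight, score, fraction):
    """Zero the `fraction` of weights with the highest score (one shot).

    Returns the pruned weights and the boolean mask of removed entries.
    `fraction` is a hypothetical knob, not a value taken from the paper.
    """
    k = int(fraction * weight.size)
    mask = np.zeros(weight.size, dtype=bool)
    if k > 0:
        # Indices of the k largest scores (order within the top-k is irrelevant)
        top_idx = np.argpartition(score.ravel(), -k)[-k:]
        mask[top_idx] = True
    mask = mask.reshape(weight.shape)
    return np.where(mask, 0.0, weight), mask
```

Because the mask is computed once, after fine-tuning has finished, nothing in this step depends on the learning rate or the number of epochs used during fine-tuning, which is the sense in which the abstract calls the defense hyper-parameter agnostic.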

Paper Structure (18 sections, 3 equations, 6 figures, 14 tables, 1 algorithm)


Figures (6)

  • Figure 1: Antidote with a three-stage pipeline, i.e., i) safety alignment, ii) user fine-tuning, iii) one-shot pruning. While existing defenses focus on the first stage (e.g., huang2024vaccine; rosati2024representation) or the second stage (huang2024lazy; mukhoti2023fine), Antidote utilizes the post-fine-tuning stage to prune the harmful weights and recover the model from harmful behaviors.
  • Figure 2: Harmful score and finetune accuracy with different learning rates after fine-tuning. The number of fine-tuning epochs is fixed to 20.
  • Figure 3: Harmful score and finetune accuracy with different numbers of fine-tuning epochs after user fine-tuning. The fine-tuning learning rate is fixed to 1e-5.
  • Figure 4: Detailed procedure of Antidote. In Stage III, after the model has been fine-tuned, Antidote extracts the importance mask over the realignment dataset. This mask is then applied to purify the harmful fine-tuned model.
  • Figure 5: Harmful embedding drift (HED) under different learning rates and numbers of epochs in the fine-tuning stage. Antidote obtains a relatively small HED.
  • ...and 1 more figures