Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation

Tiansheng Huang; Sihao Hu; Fatih Ilhan; Selim Furkan Tekin; Ling Liu

Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation

Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu

TL;DR

This work identifies harmful perturbation during fine-tuning as a key factor causing alignment breakdown in LLMs under harmful fine-tuning attacks. It proposes Booster, an alignment-stage regularizer that minimizes the harmful loss reduction after a simulated harmful gradient step, solved via a three-pass iterative gradient method. Empirical results across multiple models and tasks show Booster substantially lowers harmful scores while preserving downstream finetune accuracy, and it can be combined with existing defenses like Vaccine or Lisa. The approach offers a practical, one-time alignment-stage defense for fine-tuning-as-a-service with broad applicability and potential for integration with other safety mechanisms.

Abstract

Harmful fine-tuning attack poses serious safety concerns for large language models' fine-tuning-as-a-service. While existing defenses have been proposed to mitigate the issue, their performances are still far away from satisfactory, and the root cause of the problem has not been fully recovered. To this end, we in this paper show that harmful perturbation over the model weights could be a probable cause of alignment-broken. In order to attenuate the negative impact of harmful perturbation, we propose an alignment-stage solution, dubbed Booster. Technically, along with the original alignment loss, we append a loss regularizer in the alignment stage's optimization. The regularizer ensures that the model's harmful loss reduction after the simulated harmful perturbation is attenuated, thereby mitigating the subsequent fine-tuning risk. Empirical results show that Booster can effectively reduce the harmful score of the fine-tuned models while maintaining the performance of downstream tasks. Our code is available at https://github.com/git-disl/Booster.

Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation

TL;DR

Abstract

Paper Structure (28 sections, 8 equations, 6 figures, 17 tables, 2 algorithms)

This paper contains 28 sections, 8 equations, 6 figures, 17 tables, 2 algorithms.

Introduction
Related Work
Preliminaries
Harmful fine-tuning
Harmful Perturbation
Methodology
Experiment
Setup
Main Experiments
Statistical/System Analysis
Hyper-parameter Analysis
Alternative Design
Visualization
Conclusion
Ethics statement
...and 13 more sections

Figures (6)

Figure 1: A common two-stage pipeline for fine-tuning-as-a-service. Fine-tuning on harmful user data on Stage ② compromises alignment performance. Our proposed solution optimizes over Stage ①, which jointly utilizes the alignment dataset and harmful dataset to vaccinate the model such that it is robust to the later fine-tuning attack.
Figure 2: Model statistics (Left: harmful score, Middle: harmful training loss, Right: harmful testing loss) after fine-tuning on pure SST2/harmful data for different steps. Specially, harmful score measures how harmful the model is (the smaller the better), harmful training loss refers to the loss over the harmful data used in fine-tuning, while harmful testing loss refers to that over the testing harmful data that the model never sees in fine-tuning stage.
Figure 3: Model Statistics (Left: harmful score, Middle: harmful training loss, Right: harmful testing loss) after fine-tuning on 10% of harmful data for different steps. Specially, harmful training loss refers to the loss over the harmful data used in training, while harmful testing loss refers to that over the testing harmful data which the model never see in fine-tuning phase.
Figure 4: Model statistics (Left: harmful score, Middle: harmful training loss, Right: harmful testing loss) after fine-tuning on pure SST2/harmful data for different steps. Specially, harmful training loss refers to the loss over the harmful data used in training, while harmful testing loss refers to that over the testing harmful data that the model never sees in fine-tuning stage.
Figure 5: Model Statistics (Left: harmful score, Middle: harmful training loss, Right: harmful testing loss) after fine-tuning on 10% of harmful data for different steps. Specially, harmful training loss refers to the loss over the harmful data used in fine-tuning, while harmful testing loss refers to that over the testing harmful data which the model never see in fine-tuning phase.
...and 1 more figures

Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation

TL;DR

Abstract

Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation

Authors

TL;DR

Abstract

Table of Contents

Figures (6)