Antibody: Strengthening Defense Against Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Gradient Influence

Quoc Minh Nguyen, Trung Le, Jing Wu, Anh Tuan Bui, Mehrtash Harandi

TL;DR

Antibody is a defense strategy that first establishes robust safety alignment before fine-tuning, then applies a safety-preserving learning algorithm during fine-tuning to mitigate the impact of harmful fine-tuning attacks.

Abstract

Fine-tuning-as-a-service introduces a threat to Large Language Models' safety when service providers fine-tune their models on poisoned user-submitted datasets, a process known as harmful fine-tuning attacks. In this work, we show that by regularizing the gradient contribution of harmful samples encountered during fine-tuning, we can effectively mitigate the impact of harmful fine-tuning attacks. To this end, we introduce Antibody, a defense strategy that first ensures robust safety alignment for the model before fine-tuning, and then applies a safety-preservation learning algorithm during fine-tuning. Specifically, in the alignment stage before fine-tuning, we propose optimizing the model to be in a flat loss region with respect to harmful samples, which makes the safety alignment more resilient to subsequent harmful fine-tuning. Then, in the fine-tuning stage, we design a fine-tuning algorithm that applies a weighting scheme to all samples in each training batch to inhibit the model from learning from harmful samples while encouraging learning from benign samples. Experimental results demonstrate that Antibody successfully mitigates harmful fine-tuning attacks while boosting fine-tuning performance on the user-submitted dataset.
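
The fine-tuning-stage idea of weighting every sample in a batch can be illustrated with a minimal sketch. The abstract does not give the weight formula, so everything concrete here is an assumption: the per-sample scores are hypothetical stand-ins for whatever signal separates benign from harmful samples, and softmax normalization with a `temperature` parameter is one plausible choice, not necessarily the paper's.

```python
import math

def batch_weights(scores, temperature=1.0):
    """Softmax-normalize per-sample scores into weights that sum to 1.

    Assumption: a higher score means the sample looks more benign.
    """
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_batch_loss(losses, scores, temperature=1.0):
    """Weighted sum of per-sample losses; low-score samples contribute little."""
    weights = batch_weights(scores, temperature)
    return sum(w * l for w, l in zip(weights, losses))

# Hypothetical batch: two benign-looking samples and one harmful-looking sample.
scores = [2.0, 2.0, -4.0]
losses = [0.5, 0.7, 3.0]
weights = batch_weights(scores)
loss = weighted_batch_loss(losses, scores)
```

In this toy batch the third sample receives a near-zero weight, so its large loss barely affects the batch loss, and therefore barely affects the gradient. That is the qualitative behavior the abstract describes.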

Paper Structure

This paper contains 36 sections, 3 theorems, 21 equations, 5 figures, and 13 tables.

Key Result

Theorem 4.1

The optimal solution to the optimization problem in eq:delta_opt is $\delta_t^{*} = \nabla_{\theta} {\mathcal{L}}_{\text{align}}(\theta_t) + \lambda_{t} \nabla_{\theta} {\mathcal{L}}_{\text{sharp}}(\theta_{t})$, where $\lambda_{t} = \operatorname{max} \left\{0, \frac{a_{t} - \nabla_{\theta} {\mathca

Figures (5)

  • Figure 1: Fine-tuning on GSM8K (Cobbe et al., 2021) with varying sample sizes and a fixed harmful ratio of $20\%$. Larger sample sizes improve fine-tuning accuracy (higher FA) but degrade model safety (higher HS).
  • Figure 2: The effect of our proposed fine-tuning method. Left and middle plots show the score ($r_{\theta}$) distribution before and after fine-tuning, while the right plot compares the fine-tuning loss of our method (Antibody) against SFT on benign and harmful samples in the fine-tuning dataset.
  • Figure 3: Harmful score with different fine-tuning epochs (Left) and learning rates (Right).
  • Figure 4: Effect of our flatness-regularized alignment from sec:robust_align_flat. We plot the distribution of per-sample gradient norms at the beginning of the fine-tuning stage (before fine-tuning) for models aligned with Antibody's alignment-stage solution (left) and with standard SFT (right).
  • Figure 5: The effect of our proposed fine-tuning method under a harmful fine-tuning attack with OOD harmful data. The plots illustrate the distribution of the score $r_\theta$ (the pre-normalized weight) in eq:weight_formula for benign samples versus harmful samples before and after fine-tuning on a dataset poisoned with PureBad (Qi et al., 2024).
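
Figure 4 compares distributions of per-sample gradient norms. As a rough illustration of that quantity only, the sketch below computes it for a toy logistic-regression model, where the gradient of the binary cross-entropy loss for one sample has the closed form $(p - y)\,x$. The model, data, and function names are hypothetical. For a language model the same quantity would come from back-propagating each sample's loss separately.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def per_sample_grad_norm(w, x, y):
    """L2 norm of the loss gradient w.r.t. w for a single sample (x, y).

    For logistic regression with binary cross-entropy the gradient is
    (p - y) * x, so its norm is |p - y| * ||x||.
    """
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return abs(p - y) * math.sqrt(sum(xi * xi for xi in x))

# Toy weights and two toy samples (all values are illustrative).
w = [0.0, 0.0]
samples = [([1.0, 0.0], 1), ([3.0, 4.0], 0)]
norms = [per_sample_grad_norm(w, x, y) for x, y in samples]
```

A sample with a small gradient norm moves the parameters very little in a gradient step. This is why a distribution of harmful-sample gradient norms concentrated near zero is consistent with a flat loss region for those samples.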

Theorems & Definitions (5)

  • Theorem 4.1
  • Proposition 4.2
  • Proposition 4.3
  • proof
  • proof