Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation

Guozhi Liu, Weiwei Lin, Tiansheng Huang, Ruichao Mo, Qi Mu, Li Shen

TL;DR

Harmful fine-tuning poses a risk to aligned large language models, and existing alignment-stage defenses incur high memory costs. The authors introduce Targeted Vaccine (T-Vaccine), which identifies safety-critical layers via the gradient norm on harmful data and perturbs only a subset of layers while freezing the rest, improving defense performance while saving substantial memory. Comparative results show that T-Vaccine outperforms Vaccine, TAR, and RepNoise across models and datasets, and that it allows 7B-scale models to be trained on consumer GPUs. The work offers a practical, memory-efficient alignment-stage defense against harmful fine-tuning and provides guidance on hyperparameters for layer selection. Potential extensions include applying the approach to multimodal settings and RLHF pipelines.

Abstract

Harmful fine-tuning attacks pose a serious threat to online fine-tuning services. Vaccine, a recent alignment-stage defense, applies uniform perturbation to the embeddings of all layers to make the model robust to simulated embedding drift. However, applying uniform perturbation to every layer may over-perturb safety-irrelevant layers, resulting in degraded defense performance and unnecessary memory consumption. To address this limitation, we propose Targeted Vaccine (T-Vaccine), a memory-efficient safety alignment method that applies perturbation to only selected layers of the model. T-Vaccine follows two core steps: First, it uses the gradient norm as a statistical metric to identify the safety-critical layers. Second, instead of applying uniform perturbation across all layers, T-Vaccine applies perturbation only to the safety-critical layers while keeping the other layers frozen during training. Results show that T-Vaccine outperforms Vaccine in terms of both defense effectiveness and resource efficiency. Comparisons with other defense baselines, e.g., RepNoise and TAR, also demonstrate the superiority of T-Vaccine. Notably, T-Vaccine is the first defense that can address harmful fine-tuning issues for 7B pre-trained models trained on consumer GPUs with limited memory (e.g., RTX 4090). Our code is available at https://github.com/Lslland/T-Vaccine.
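The first step described in the abstract (ranking layers by gradient norm and sampling a subset) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `layer_sampling_probs` and `sample_layers`, the plain sum-normalization of norms into probabilities, and the uniform fallback for all-zero norms are all assumptions made for the example.

```python
import numpy as np

def layer_sampling_probs(grad_norms):
    """Turn per-layer gradient norms (measured on harmful data) into
    sampling probabilities. Assumption: probability is proportional to
    the norm; the paper may use a different normalization."""
    g = np.asarray(grad_norms, dtype=float)
    if np.any(g < 0):
        raise ValueError("gradient norms must be non-negative")
    total = g.sum()
    if total == 0:
        # No signal at all: fall back to uniform sampling (assumption).
        return np.full(g.shape, 1.0 / g.size)
    return g / total

def sample_layers(probs, gamma, rng):
    """Draw gamma distinct layer indices according to probs."""
    if gamma > np.count_nonzero(probs):
        raise ValueError("gamma exceeds the number of layers with non-zero probability")
    picked = rng.choice(len(probs), size=gamma, replace=False, p=probs)
    return np.sort(picked)

# Hypothetical norms for a 4-layer model; layer 1 has the largest norm
# and is therefore the most likely to be selected.
rng = np.random.default_rng(0)
probs = layer_sampling_probs([0.5, 2.0, 1.0, 0.5])
selected = sample_layers(probs, gamma=2, rng=rng)
```

Sampling rather than always taking the top-ranked layers means every layer with a non-zero norm still gets perturbed occasionally, while the layers with the largest harmful-gradient norms are perturbed most often.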

Paper Structure

This paper contains 23 sections, 6 equations, 4 figures, 7 tables, 1 algorithm.

Figures (4)

  • Figure 1: The model's harmful score vs. GPU memory cost. Here, Llama2-7B is used as the pre-trained model, with a batch size of 10. T-Vaccine-5 and T-Vaccine-8 denote the results with 5 and 8 sampled layers, respectively. T-Vaccine handles harmful fine-tuning well while remaining memory-efficient (trainable on an RTX 4090 GPU).
  • Figure 2: Left: Harmful score by adding perturbation to different numbers of layers. Right: The gradient norm of different hidden embedding layers over harmful data.
  • Figure 3: Vaccine vs. T-Vaccine. In contrast to Vaccine, which applies perturbations uniformly across all layers, T-Vaccine first calculates a sampling probability for each layer and then, in each training step, randomly selects $\gamma$ safety-critical layers to participate in training and receive perturbations.
  • Figure 4: Memory breakdown of various methods on Llama2-7B, Gemma2-2B, Vicuna-7B, and Qwen2-7B.
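
The contrast drawn in Figure 3 (perturbing only the sampled layers instead of all of them) can be illustrated with a small sketch. The details are assumptions, not taken from the text above: the perturbation is a gradient-direction step scaled to a total norm of `rho`, normalized jointly over the selected layers, and the name `targeted_perturbation` is hypothetical. Unselected layers receive a zero perturbation, standing in for "frozen".

```python
import numpy as np

def targeted_perturbation(layer_grads, selected, rho):
    """Build one perturbation array per layer.

    layer_grads: list of arrays, the loss gradient w.r.t. each layer's
                 hidden embedding.
    selected:    indices of the layers sampled for this training step.
    rho:         perturbation budget (total L2 norm over selected layers).

    Assumption: perturbation = rho * grad / ||grad over selected layers||.
    """
    chosen = set(int(i) for i in selected)
    sq_sum = sum(float(np.sum(layer_grads[i] ** 2)) for i in chosen)
    norm = np.sqrt(sq_sum)
    perturbations = []
    for i, g in enumerate(layer_grads):
        if i in chosen and norm > 0:
            perturbations.append(rho * g / norm)
        else:
            # Layer not sampled this step: no perturbation is applied.
            perturbations.append(np.zeros_like(g))
    return perturbations

# Hypothetical 4-layer model with 3x8 hidden embeddings per layer.
rng = np.random.default_rng(1)
grads = [rng.normal(size=(3, 8)) for _ in range(4)]
eps = targeted_perturbation(grads, selected=[1, 3], rho=2.0)
```

Because only the selected layers need their embedding gradients stored and perturbed in a given step, the per-step memory footprint scales with the number of sampled layers rather than the full depth, which is consistent with the memory savings the paper reports.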