NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning

Xin Yi, Shunfan Zheng, Linlin Wang, Gerard de Melo, Xiaoling Wang, Liang He

TL;DR

The paper addresses the risk of safety degradation when LLMs undergo harmful fine-tuning in a finetuning-as-a-service setting. It introduces Neuron-Level Safety Realignment (NLSR), a training-free framework that constructs a safety reference model via LoRA-based pre-amplification, identifies safety-critical neurons with a low-rank projection, and selectively patches only the damaged neurons using adaptive layer pruning. Key contributions include a concrete three-step process (safety reference construction, safety-critical neuron recognition, and neuron-level restoration), an adaptive pruning scheme, and extensive ablations showing strong safety improvements with minimal task-performance loss across SST-2, AGNEWS, GSM8K, and multiple base models and alignment methods. The results demonstrate the practical effectiveness and transferability of safety patches, offering a scalable defense against harmful fine-tuning without additional training. Overall, NLSR provides a targeted, model-agnostic approach to restoring safety in personalized LLMs by transplanting safety-critical neurons from a reference model. The method is formalized mathematically, including the safety reference fusion weights $W_{medium}$ and $W_{e}$, neuron identification via truncated SVD, and per-layer similarity-driven pruning.

Abstract

The emergence of finetuning-as-a-service has revealed a new vulnerability in large language models (LLMs). A mere handful of malicious data uploaded by users can subtly manipulate the finetuning process, resulting in an alignment-broken model. Existing methods to counteract fine-tuning attacks typically require substantial computational resources. Even with parameter-efficient techniques like LoRA, gradient updates remain essential. To address these challenges, we propose Neuron-Level Safety Realignment (NLSR), a training-free framework that restores the safety of LLMs based on the similarity difference of safety-critical neurons before and after fine-tuning. The core of our framework is first to construct a safety reference model from an initially aligned model to amplify safety-related features in neurons. We then utilize this reference model to identify safety-critical neurons, which we prepare as patches. Finally, we selectively restore only those neurons that exhibit significant similarity differences by transplanting these prepared patches, thereby minimally altering the fine-tuned model. Extensive experiments demonstrate significant safety enhancements in fine-tuned models across multiple downstream tasks, while greatly maintaining task-level accuracy. Our findings suggest regions of some safety-critical neurons show noticeable differences after fine-tuning, which can be effectively corrected by transplanting neurons from the reference model without requiring additional training. The code will be available at https://github.com/xinykou/NLSR.
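
The restoration step described above (compare safety-critical neurons before and after fine-tuning, then transplant only where they have drifted) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the safety-critical masks are already given (the paper derives them with a low-rank projection, which is not reproduced here), treats each layer as a flat list of weights, and uses a fixed cosine-similarity threshold `tau` as a stand-in for the paper's adaptive layer pruning. The names `patch_layers`, `cosine`, `masks`, and `tau` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def patch_layers(reference, finetuned, masks, tau=0.9):
    """Transplant masked (safety-critical) weights from the reference model
    into every layer whose masked region drifted below similarity tau.

    reference, finetuned: dict layer name -> flat list of weights
    masks: dict layer name -> list of bools marking safety-critical entries
    tau: hypothetical fixed threshold (an assumption, not the paper's rule)
    """
    patched, repaired = {}, []
    for name, w_ft in finetuned.items():
        w_ref, mask = reference[name], masks[name]
        idx = [i for i, m in enumerate(mask) if m]
        # Similarity is measured only over the safety-critical entries.
        sim = cosine([w_ref[i] for i in idx], [w_ft[i] for i in idx])
        new_w = list(w_ft)
        if sim < tau:
            # Layer counts as safety-broken: copy reference values back,
            # leaving all other fine-tuned weights untouched.
            for i in idx:
                new_w[i] = w_ref[i]
            repaired.append(name)
        patched[name] = new_w
    return patched, repaired


# Toy example: layer1's safety-critical entries flipped sign during fine-tuning.
reference = {"layer0": [1.0, 0.0, 2.0, 0.5], "layer1": [0.5, 1.5, -1.0, 0.2]}
finetuned = {"layer0": [1.1, 0.3, 1.9, 0.4], "layer1": [-0.5, 1.4, 1.0, 0.9]}
masks = {"layer0": [True, False, True, False], "layer1": [True, False, True, False]}

patched, repaired = patch_layers(reference, finetuned, masks)
print(repaired)           # layers that received a patch
print(patched["layer1"])  # masked entries restored, others kept from fine-tuning
```

In this toy input only `layer1` is patched, and its unmasked entries keep their fine-tuned values, which mirrors the stated goal of minimally altering the fine-tuned model.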

Paper Structure

This paper contains 48 sections, 18 equations, 11 figures, 11 tables.

Figures (11)

  • Figure 1: The harmful fine-tuning attack for fine-tuning-as-a-service scenarios and our neuron-level safety realignment approach to mitigate it.
  • Figure 2: A neuron-level safety realignment framework against harmful fine-tuning when adapted to new tasks or domains.
  • Figure 3: The impact of the proportion of safety-critical neurons and the safety alignment methods on the congruence of safe regions following fine-tuning for downstream tasks.
  • Figure 4: (a) The similarity of the safety broken layers identified by the three safety-critical neuron identification methods across different layer pruning rates. (b) The overlap ratio of neurons in the broken layers identified by different methods. The default sparsity rate and pruning rate are 0.7 and 0.5, respectively.
  • Figure 5: The impact of pre-amplification on the model's utility and safety.
  • ...and 6 more figures