NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning
Xin Yi, Shunfan Zheng, Linlin Wang, Gerard de Melo, Xiaoling Wang, Liang He
TL;DR
The paper addresses the safety degradation that occurs when LLMs undergo harmful fine-tuning in a finetuning-as-a-service setting. It introduces Neuron-Level Safety Realignment (NLSR), a training-free framework that constructs a safety reference model via LoRA-based pre-amplification, identifies safety-critical neurons with a low-rank projection, and selectively patches only the damaged neurons using adaptive layer pruning. Key contributions include a concrete three-step process (safety reference construction, safety-neuron recognition, and neuron-level restoration), an adaptive pruning scheme, and extensive ablations showing strong safety improvements with minimal task-performance loss across SST-2, AGNEWS, GSM8K, and multiple base models and alignment methods. The results demonstrate the practical effectiveness and transferability of safety patches, offering a scalable defense against harmful fine-tuning without additional training. Overall, NLSR provides a targeted, model-agnostic approach to restoring safety in personalized LLMs by transplanting safety-critical neurons from a reference model. The method is formalized mathematically, including the safety reference fusion weights $W_{medium}$ and $W_{e}$, neuron identification via truncated SVD, and per-layer similarity-driven pruning.
Abstract
The emergence of finetuning-as-a-service has revealed a new vulnerability in large language models (LLMs). A mere handful of malicious examples uploaded by users can subtly manipulate the fine-tuning process, resulting in an alignment-broken model. Existing methods to counteract fine-tuning attacks typically require substantial computational resources. Even with parameter-efficient techniques like LoRA, gradient updates remain essential. To address these challenges, we propose Neuron-Level Safety Realignment (NLSR), a training-free framework that restores the safety of LLMs based on the similarity difference of safety-critical neurons before and after fine-tuning. The core of our framework is first to construct a safety reference model from an initially aligned model to amplify safety-related features in neurons. We then use this reference model to identify safety-critical neurons, which we prepare as patches. Finally, we selectively restore only those neurons that exhibit significant similarity differences by transplanting these prepared patches, thereby minimally altering the fine-tuned model. Extensive experiments demonstrate significant safety enhancements in fine-tuned models across multiple downstream tasks, while largely preserving task-level accuracy. Our findings suggest that some regions of safety-critical neurons show noticeable differences after fine-tuning, which can be effectively corrected by transplanting neurons from the reference model without requiring additional training. The code will be available at https://github.com/xinykou/NLSR.
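The identify-then-patch idea can be illustrated with a minimal NumPy sketch on a single weight matrix. This is not the authors' implementation: the function names, the weight-only truncated-SVD scoring rule, the row-as-neuron convention, and the fixed similarity threshold are all assumptions chosen for illustration, and the actual method may score neurons and select layers differently.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between two arrays, compared as flat vectors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def safety_neuron_mask(w_ref, rank, keep_ratio):
    """Hypothetical scoring rule (an assumption, not the paper's exact one).

    Build a rank-`rank` truncated-SVD approximation of the reference weights
    and mark the rows (treated here as neurons) with the largest norm in
    that approximation as safety-critical.
    """
    u, s, vt = np.linalg.svd(w_ref, full_matrices=False)
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
    scores = np.linalg.norm(low_rank, axis=1)
    k = max(1, int(round(keep_ratio * w_ref.shape[0])))
    mask = np.zeros(w_ref.shape[0], dtype=bool)
    mask[np.argsort(scores)[-k:]] = True
    return mask


def patch_layer(w_ref, w_ft, mask, sim_threshold):
    """Transplant masked rows from the reference only if they drifted.

    Returns (weights, similarity, was_patched). Rows outside the mask are
    never modified, so the fine-tuned layer changes as little as possible.
    """
    sim = cosine_similarity(w_ref[mask], w_ft[mask])
    if sim >= sim_threshold:
        return w_ft.copy(), sim, False
    patched = w_ft.copy()
    patched[mask] = w_ref[mask]
    return patched, sim, True


# Toy data: one reference layer and two fine-tuned variants of it.
rng = np.random.default_rng(0)
w_ref = rng.normal(size=(8, 4))
mask = safety_neuron_mask(w_ref, rank=2, keep_ratio=0.25)

# Layer A: fine-tuning barely moved the weights.
w_ft_a = w_ref + 1e-3 * rng.normal(size=w_ref.shape)
# Layer B: the safety-critical rows were flipped by fine-tuning.
w_ft_b = w_ref.copy()
w_ft_b[mask] = -w_ref[mask]

p_a, sim_a, changed_a = patch_layer(w_ref, w_ft_a, mask, sim_threshold=0.9)
p_b, sim_b, changed_b = patch_layer(w_ref, w_ft_b, mask, sim_threshold=0.9)
print("layer A patched:", changed_a)
print("layer B patched:", changed_b)
```

In this toy setup only layer B is patched, and only its masked rows are overwritten. Applying the same check to every layer of a fine-tuned model would give a crude version of similarity-driven layer selection.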
