Alleviating the Fear of Losing Alignment in LLM Fine-tuning
Kang Yang, Guanhong Tao, Xun Chen, Jun Xu
TL;DR
This work investigates how fine-tuning can erode the alignment of large language models (LLMs) and introduces a direction-based alignment-recovery method. By identifying and restoring a small subset of weights from the original aligned model to the fine-tuned model, guided by the harmful direction ${\boldsymbol \nabla}_{harmful}$, the approach restores the model’s tendency to refuse harmful prompts while preserving downstream task performance. A gradient-guided, sparse restoration algorithm with a rollback mechanism achieves substantial alignment recovery, reducing harmful responses from around 33% to under 2% with minimal utility loss (about 2-3%), outperforming SoftSFT and RESTA baselines. The method generalizes across diverse models, datasets, and even newer architectures, and is demonstrated under two practical scenarios, including adversarial fine-tuning, highlighting its practical impact for safer, more reliable LLM deployment.
Abstract
Large language models (LLMs) have demonstrated revolutionary capabilities in understanding complex contexts and performing a wide range of tasks. However, LLMs can also answer questions that are unethical or harmful, raising concerns about their applications. A training strategy called alignment regulates LLMs' responses to such questions. Yet alignment can be unexpectedly compromised when an LLM is fine-tuned for downstream tasks. This paper focuses on recovering the alignment lost during fine-tuning. We observe that there are two distinct directions inherent in an aligned LLM: the aligned direction and the harmful direction. An LLM is inclined to answer questions in the aligned direction while refusing queries in the harmful direction. Therefore, we propose to recover the harmful direction of the fine-tuned model that has been compromised. Specifically, we restore a small subset of the fine-tuned model's weight parameters from the original aligned model using gradient descent. We also introduce a rollback mechanism to avoid aggressive recovery and maintain downstream task performance. Our evaluation on 125 fine-tuned LLMs demonstrates that our method can reduce their harmful rate (the percentage of harmful questions answered) from 33.25% to 1.74% without substantially sacrificing task performance. In contrast, existing methods either reduce the harmful rate only to a limited extent or significantly impair normal functionality. Our code is available at https://github.com/kangyangWHU/LLMAlignment
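To make the idea of sparse restoration with rollback concrete, the following is a minimal toy sketch in plain Python, not the authors' algorithm. It assumes weights are flat lists of floats, that a gradient of some harmfulness loss is available, and that coordinates are chosen by largest gradient magnitude. The names `restore_sparse`, `recover_with_rollback`, `grad_fn`, `utility_fn`, `k`, and `max_drop`, as well as the toy loss and utility, are hypothetical and chosen only for illustration.

```python
from typing import Callable, List

Weights = List[float]


def restore_sparse(finetuned: Weights, aligned: Weights,
                   grad: Weights, k: int) -> Weights:
    """Copy aligned values into the k coordinates with the largest |grad|.

    Assumption: gradient magnitude is used as the selection score.
    """
    top = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    restored = list(finetuned)
    for i in top:
        restored[i] = aligned[i]
    return restored


def recover_with_rollback(finetuned: Weights, aligned: Weights,
                          grad_fn: Callable[[Weights], Weights],
                          utility_fn: Callable[[Weights], float],
                          k: int = 1, steps: int = 5,
                          max_drop: float = 0.05) -> Weights:
    """Restore k weights per step; discard a step that costs too much utility."""
    weights = list(finetuned)
    baseline = utility_fn(weights)
    for _ in range(steps):
        candidate = restore_sparse(weights, aligned, grad_fn(weights), k)
        if baseline - utility_fn(candidate) > max_drop:
            break  # rollback: keep the weights from before this step
        weights = candidate
    return weights


# Toy setup (all values are made up for illustration).
aligned = [0.0, 0.0, 0.0, 0.0]
finetuned = [1.0, 0.5, -2.0, 0.1]
harm_sensitivity = [1.0, 0.1, 1.0, 0.1]   # how much each weight affects harm
task_importance = [0.0, 1.0, 0.0, 1.0]    # how much each weight affects the task


def grad_fn(w: Weights) -> Weights:
    # Gradient of the toy loss sum_i s_i * (w_i - aligned_i)^2.
    return [2 * s * (wi - ai) for s, wi, ai in zip(harm_sensitivity, w, aligned)]


def utility_fn(w: Weights) -> float:
    # Toy utility: drops when task-relevant weights move away from fine-tuned values.
    return 1.0 - sum(u * abs(wi - fi)
                     for u, wi, fi in zip(task_importance, w, finetuned))


result = recover_with_rollback(finetuned, aligned, grad_fn, utility_fn)
print(result)  # prints [0.0, 0.5, 0.0, 0.1]
```

In this toy run, the two weights that matter most for the harmfulness loss are restored first, and the step that would overwrite a task-relevant weight is rolled back, so the fine-tuned values at indices 1 and 3 survive.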
