Alleviating the Fear of Losing Alignment in LLM Fine-tuning

Kang Yang, Guanhong Tao, Xun Chen, Jun Xu

TL;DR

This work investigates how fine-tuning can erode the alignment of large language models (LLMs) and introduces a direction-based alignment-recovery method. By identifying and restoring a small subset of weights from the original aligned model to the fine-tuned model, guided by the harmful direction $\boldsymbol{\nabla}_{harmful}$, the approach restores the model’s tendency to refuse harmful prompts while preserving downstream task performance. A gradient-guided, sparse restoration algorithm with a rollback mechanism achieves substantial alignment recovery, reducing harmful responses from around 33% to under 2% with minimal utility loss (about 2-3%), outperforming SoftSFT and RESTA baselines. The method generalizes across diverse models, datasets, and even newer architectures, and is demonstrated under two practical scenarios, including adversarial fine-tuning, highlighting its practical impact for safer, more reliable LLM deployment.

Abstract

Large language models (LLMs) have demonstrated revolutionary capabilities in understanding complex contexts and performing a wide range of tasks. However, LLMs can also answer questions that are unethical or harmful, raising concerns about their applications. A training strategy called alignment regulates LLMs' responses to such questions. Yet, alignment can be unexpectedly compromised when fine-tuning an LLM for downstream tasks. This paper focuses on recovering the alignment lost during fine-tuning. We observe that there are two distinct directions inherent in an aligned LLM: the aligned direction and the harmful direction. An LLM is inclined to answer questions in the aligned direction while refusing queries in the harmful direction. Therefore, we propose to recover the harmful direction of the fine-tuned model that has been compromised. Specifically, we restore a small subset of the fine-tuned model's weight parameters from the original aligned model using gradient descent. We also introduce a rollback mechanism to avoid aggressive recovery and maintain downstream task performance. Our evaluation on 125 fine-tuned LLMs demonstrates that our method can reduce their harmful rate (the percentage of harmful questions they answer) from 33.25% to 1.74% without significantly sacrificing task performance. In contrast, existing methods either reduce the harmful rate only to a limited extent or significantly impair normal functionality. Our code is available at https://github.com/kangyangWHU/LLMAlignment
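
The restore-then-rollback idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see their repository for that). It assumes weights are a flat list of floats, that a hypothetical `grad_fn` returns the gradient of some alignment loss, and that a hypothetical `task_score_fn` measures downstream utility. The selection rule (a first-order estimate of the loss reduction) and the tolerance `max_drop` are likewise illustrative assumptions.

```python
def restore_sparse(w_aligned, w_ft, grad_fn, task_score_fn,
                   k=1, steps=3, max_drop=0.05):
    """Toy sketch: copy a few weights per step from the aligned model
    back into the fine-tuned model, undoing any step that hurts the task."""
    w = list(w_ft)
    base_score = task_score_fn(w)
    tried, restored = set(), set()
    for _ in range(steps):
        grad = grad_fn(w)
        # First-order estimate of how much the alignment loss would fall
        # if weight i were reset to its aligned value (assumed criterion).
        gain = {i: -grad[i] * (w_aligned[i] - w[i])
                for i in range(len(w)) if i not in tried}
        picks = sorted((i for i in gain if gain[i] > 0),
                       key=gain.get, reverse=True)[:k]
        if not picks:
            break
        previous = {i: w[i] for i in picks}
        for i in picks:
            w[i] = w_aligned[i]
        tried.update(picks)
        if base_score - task_score_fn(w) > max_drop:
            # Rollback: this restoration cost too much task performance.
            for i, value in previous.items():
                w[i] = value
        else:
            restored.update(picks)
    return w, sorted(restored)


# Hypothetical toy setup: the alignment loss is 0.5 * ||w - w_aligned||^2,
# and the task score depends only on weight 0 staying at its tuned value.
w_aligned = [0.0, 0.0, 0.0, 0.0]
w_ft = [1.0, 0.5, 0.1, 2.0]
grad_fn = lambda w: [wi - ai for wi, ai in zip(w, w_aligned)]
task_score_fn = lambda w: 1.0 - 0.5 * abs(w[0] - 1.0)

w_new, restored = restore_sparse(w_aligned, w_ft, grad_fn, task_score_fn)
print(w_new, restored)
```

In this toy run, weight 3 is restored first, the attempt to restore weight 0 is rolled back because it lowers the task score, and weight 1 is restored next. The result is a partially restored model that leaves the task-critical weight untouched.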

Paper Structure

This paper contains 45 sections, 8 equations, 7 figures, 19 tables, 1 algorithm.

Figures (7)

  • Figure 1: Illustration of LLM alignment. LLM alignment ensures that the model’s outputs align with human values.
  • Figure 2: Workflow of our method.
  • Figure 3: Alignment recovery results of our method. The y-axis represents the harmful rate and the x-axis shows the number of harmful samples injected into the fine-tuning process. In the figures, Fine-tuned stands for the fine-tuned model, Recovered refers to the alignment-recovered model, and Original denotes the original aligned model.
  • Figure 4: Visualization of the trade-off between harmful rate and task performance. FT represents the original fine-tuned model, and SoftSFT and RESTA stand for the two baselines. In the figure, we merge all the different models together. Each node represents a unique combination of the target model, the fine-tuning dataset, and the number of injected harmful prompts. Points further right on the x-axis have better task performance, and points higher on the y-axis have a lower harmful rate. Hence, the upper-right corner represents the optimal trade-off.
  • Figure 5: The alignment recovery results and the time cost of our method when using a direction layer at various positions. The values $\frac{2}{6}$, $\frac{3}{6}$, $\frac{4}{6}$, and $\frac{5}{6}$ on the x-axis indicate that the direction layer is picked from the corresponding position among the hidden layers. We omit the time cost for Qwen 7B on the CHEAT task at the $\frac{3}{6}$ position, as it triggers rollback and disproportionately increases the time cost.
  • ...and 2 more figures

Theorems & Definitions (1)

  • Definition 1: Direction