Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models

Di Wu, Xin Lu, Yanyan Zhao, Bing Qin

TL;DR

This work addresses safety drift in fine-tuned LLMs by introducing IRR, a post-hoc method that identifies unsafe delta parameters via a safety vector and Fisher-based importance, removes those deltas, and recalibrates retained parameters using Hessian-based compensation. IRR achieves Pareto improvements by increasing safety while largely preserving downstream task performance across full and LoRA fine-tuning on multiple models and datasets. The approach is model-agnostic, demonstrates cross-language safety gains, and remains effective under harmful-fine-tuning scenarios, offering a practical, scalable solution for maintaining safety in deployed LLMs.

Abstract

Although large language models (LLMs) achieve effective safety alignment at the time of release, they still face various safety challenges. A key issue is that fine-tuning often compromises the safety alignment of LLMs. To address this issue, we propose a method named IRR (Identify, Remove, and Recalibrate for Safety Realignment) that performs safety realignment for LLMs. The core of IRR is to identify and remove unsafe delta parameters from the fine-tuned models, while recalibrating the retained ones. We evaluate the effectiveness of IRR across various datasets, including both full fine-tuning and LoRA methods. Our results demonstrate that IRR significantly enhances the safety performance of fine-tuned models on safety benchmarks, such as harmful queries and jailbreak attacks, while maintaining their performance on downstream tasks. The source code is available at: https://anonymous.4open.science/r/IRR-BD4F.
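
The identify, remove, and recalibrate steps described above can be illustrated with a toy sketch on a flat vector of delta parameters. This is not the authors' implementation. The function name `irr_sketch`, the `ratio` parameter, the sign-conflict test against the safety vector, and the use of an explicit dense Hessian are all assumptions made for illustration. The recalibration step uses the standard Optimal Brain Surgeon-style closed form as a stand-in for the paper's Hessian-based compensation.

```python
import numpy as np


def irr_sketch(delta, safety_vec, fisher, hessian, ratio=0.5):
    """Toy identify-remove-recalibrate pass (illustrative, not the paper's code).

    delta      : fine-tuned weights minus aligned weights (flat vector)
    safety_vec : assumed direction in which safety alignment moved the weights
    fisher     : assumed per-parameter Fisher-style safety importance scores
    hessian    : assumed symmetric positive-definite Hessian of a proxy loss
    ratio      : hypothetical fraction of conflicting deltas to remove
    """
    # Identify: treat a delta as a removal candidate when its sign opposes
    # the safety vector (an assumed criterion), then rank by importance.
    conflict = np.sign(delta) * np.sign(safety_vec) < 0
    scores = np.where(conflict, fisher, -np.inf)
    k = int(ratio * conflict.sum())
    unsafe = np.argsort(-scores)[:k]
    if k == 0:
        return delta.copy(), unsafe

    # Remove + recalibrate: find the update u that minimizes 0.5 * u^T H u
    # subject to u[unsafe] = -delta[unsafe] (OBS-style closed form).
    h_inv = np.linalg.inv(hessian)
    lam = np.linalg.solve(h_inv[np.ix_(unsafe, unsafe)], delta[unsafe])
    update = -h_inv[:, unsafe] @ lam

    new_delta = delta + update
    new_delta[unsafe] = 0.0  # clear floating-point residue on removed entries
    return new_delta, unsafe
```

The re-aligned weights would then be the aligned weights plus `new_delta`. Under these assumptions, the compensated update has a second-order loss increase no larger than simply zeroing the selected entries, which is the reason to recalibrate the retained deltas rather than only prune.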

Paper Structure

This paper contains 47 sections, 7 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: The illustration presents post-hoc approaches for safety realignment. Our method, IRR, first identifies and removes unsafe delta parameters, then recalibrates the remaining ones.
  • Figure 2: During the fine-tuning phase, safety-aligned models acquire delta parameters that enhance downstream task performance, but these parameters may compromise model safety. In the post-hoc phase, IRR identifies and removes unsafe delta parameters. It then computes compensatory values and adds them to the retained parameters, restoring safety while preserving model performance on downstream tasks.
  • Figure 3: We show the trend of “downstream task performance vs. safety score” based on the Harmful Benchmark. Our method, IRR, outperforms baseline methods, maintaining downstream task performance as safety improves.
  • Figure 4: We show the trend of “downstream task performance vs. safety score” based on the Jailbreak Attack. Our method, IRR, outperforms baseline methods, maintaining downstream task performance as safety improves.
  • Figure 5: We present the results of the IRR ablation study using “downstream task performance vs. safety” curves.
  • ...and 10 more figures