SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging

Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Syed Zawad, Holger Boche

TL;DR

This work tackles the degradation of safety alignment that occurs when large language models are fine-tuned for domain-specific tasks. It introduces SafeMERGE, a post-fine-tuning defense that merges only the degraded layers of a task-tuned model with the corresponding layers of a safety-aligned model. Degraded layers are identified with a cosine-similarity criterion in a per-layer safety subspace defined by $V^i$ and $C^i$. A threshold $\tau$ decides which layers to merge, and a linear merging rule with weight $\alpha$ balances fine-tuned and safety-aligned information, aiming to preserve utility while reducing harmful outputs. Across three models and two tasks, SafeMERGE consistently yields better trade-offs between accuracy and safety than baselines such as SafeInstruct, RESTA, and SafeLoRA, with ablations showing robust performance for $\tau$ around 0.7–0.75 and linear merging. The findings indicate that selective, layer-wise post-fine-tuning merging offers a practical, generalizable safeguard for maintaining safety in adapted LLMs without requiring complex training pipelines or sacrificing performance.

Abstract

Fine-tuning large language models (LLMs) is a common practice to adapt generalist models to specialized domains. However, recent studies show that fine-tuning can erode safety alignment, causing LLMs to respond to harmful or unethical prompts. Many methods to realign safety have been proposed, but they often introduce custom algorithms that are difficult to implement or that compromise task utility. In this work, we propose SafeMERGE, a lightweight, post-fine-tuning framework that preserves safety while maintaining downstream performance. SafeMERGE selectively merges fine-tuned layers with safety-aligned model layers only when they deviate from safe behavior, measured by a cosine similarity criterion. Across three LLMs and two tasks, SafeMERGE consistently reduces harmful outputs compared to other defenses, with negligible or even positive impact on utility. Our results demonstrate that selective layer-wise merging offers an effective safeguard against the inadvertent loss of safety during fine-tuning, establishing SafeMERGE as a simple post-fine-tuning defense.
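
The selective merge described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the summary does not spell out how $C^i$ is built from $V^i$, so the code assumes $C^i = V^i (V^i)^\top / \lVert V^i \rVert_F$; it also assumes that $\alpha$ weights the fine-tuned update and $1 - \alpha$ the safety-aligned one. The function names (`safety_projection`, `selective_merge`) and the default values for `tau` and `alpha` are hypothetical.

```python
import numpy as np


def safety_projection(v):
    """Assumed form of C^i: V V^T scaled by the Frobenius norm of V."""
    return v @ v.T / np.linalg.norm(v)


def cosine_similarity(a, b):
    """Cosine similarity between two matrices, treated as flat vectors."""
    a, b = a.ravel(), b.ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # A zero projection carries no safety-aligned component, so score it 0.
    return float(a @ b / denom) if denom > 0 else 0.0


def selective_merge(delta_task, delta_safe, align, tau=0.75, alpha=0.8):
    """Merge only the layers whose update drifts from the safety subspace.

    delta_task / delta_safe: per-layer weight updates of the task-tuned and
    safety-aligned models; align: per-layer alignment matrices V^i.
    Returns the merged updates and the indices of the layers that were merged.
    """
    merged, flagged = [], []
    for i, (dw_task, dw_safe, v) in enumerate(zip(delta_task, delta_safe, align)):
        projected = safety_projection(v) @ dw_task
        rho = cosine_similarity(dw_task, projected)
        if rho < tau:
            # Layer deviates from safe behavior: linear merge (assumed weighting).
            merged.append(alpha * dw_task + (1.0 - alpha) * dw_safe)
            flagged.append(i)
        else:
            # Layer is close enough to the safety subspace: keep it unchanged.
            merged.append(dw_task)
    return merged, flagged
```

Layers whose similarity stays at or above `tau` pass through untouched, which is what lets the approach keep task utility while repairing only the layers that drifted.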

Paper Structure

This paper contains 52 sections, 12 equations, 13 figures, 10 tables.

Figures (13)

  • Figure 1: SafeMERGE merges harmful and safe LoRA adapters if the layers deviate from safe behavior, measured by a projection-based cosine similarity.
  • Figure 2: Trade-off between task utility and safety (DirectHarm) for Llama-2 (GSM8K) with varying weights.
  • Figure 3: SafeInstruct utility vs. safety for Llama-2-7B-Chat (GSM8K), evaluated on HexPhi prompts.
  • Figure 4: SafeInstruct utility vs. safety for Llama-3.1-8B-Instruct (GSM8K), evaluated on HexPhi prompts.
  • Figure 5: SafeInstruct utility vs. safety for Qwen-2-7B-Instruct (GSM8K), evaluated on HexPhi prompts.
  • ...and 8 more figures