Model Merging and Safety Alignment: One Bad Model Spoils the Bunch

Hasan Abed Al Kader Hammoud; Umberto Michieli; Fabio Pizzati; Philip Torr; Adel Bibi; Bernard Ghanem; Mete Ozay

TL;DR

This work addresses the risk that merging domain-expert LLMs propagates safety misalignment to the merged model. It introduces a two-step safety-aware merging pipeline: generate synthetic safety data and domain data, then use that data in data-driven task-weighting methods (EvoMM and LM-Cocktail) so that alignment is optimized alongside domain performance. Incorporating alignment data during merging yields merged models with high safety alignment and competitive or superior domain accuracy across multiple benchmarks, including when more than two models are merged. The study also discusses its limitations and calls for careful consideration of alignment requirements and prompt-template constraints in real-world deployments.

Abstract

Merging Large Language Models (LLMs) is a cost-effective technique for combining multiple expert LLMs into a single versatile model, retaining the expertise of the original ones. However, current approaches often overlook the importance of safety alignment during merging, leading to highly misaligned models. This work investigates the effects of model merging on alignment. We evaluate several popular model merging techniques, demonstrating that existing methods not only transfer domain expertise but also propagate misalignment. We propose a simple two-step approach to address this problem: (i) generating synthetic safety and domain-specific data, and (ii) incorporating these generated data into the optimization process of existing data-aware model merging techniques. This allows us to treat alignment as a skill that can be maximized in the resulting merged LLM. Our experiments illustrate the effectiveness of integrating alignment-related data during merging, resulting in models that excel in both domain expertise and alignment.
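The idea of treating alignment as one more skill in data-aware merging can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the paper's implementation: the "experts" are small parameter vectors rather than LLMs, the loss is a squared error standing in for a language-modeling loss, and a plain grid search over convex merge weights replaces the search procedures of EvoMM or LM-Cocktail. The names `merge`, `mse`, `total_loss`, `search_weights`, and `make_dataset` are hypothetical helpers.

```python
import itertools
import random

def merge(experts, weights):
    """Linear merge: weighted sum of the experts' parameter vectors."""
    return [sum(w * theta[i] for w, theta in zip(weights, experts))
            for i in range(len(experts[0]))]

def mse(theta, dataset):
    """Toy stand-in for the loss of a merged model on one dataset."""
    errors = [(sum(x_i * t_i for x_i, t_i in zip(x, theta)) - y) ** 2
              for x, y in dataset]
    return sum(errors) / len(errors)

def total_loss(theta, datasets):
    """Average loss over all datasets (here: one domain set, one safety set)."""
    return sum(mse(theta, d) for d in datasets) / len(datasets)

def search_weights(experts, datasets, steps=10):
    """Grid-search convex merge weights that minimize the average loss."""
    grid = [k / steps for k in range(steps + 1)]
    best_w, best_loss = None, float("inf")
    for head in itertools.product(grid, repeat=len(experts) - 1):
        last = 1.0 - sum(head)
        if last < -1e-9:  # skip points outside the simplex
            continue
        weights = list(head) + [max(last, 0.0)]
        loss = total_loss(merge(experts, weights), datasets)
        if loss < best_loss:
            best_w, best_loss = weights, loss
    return best_w, best_loss

def make_dataset(rng, theta, n=200):
    """Random inputs labeled by the given 'teacher' parameters."""
    data = []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in theta]
        data.append((x, sum(x_i * t_i for x_i, t_i in zip(x, theta))))
    return data

rng = random.Random(0)
domain_expert = [rng.gauss(0, 1) for _ in range(4)]   # fits the domain data exactly
aligned_expert = [rng.gauss(0, 1) for _ in range(4)]  # fits the safety data exactly
datasets = [make_dataset(rng, domain_expert),         # toy "domain" data
            make_dataset(rng, aligned_expert)]        # toy "safety" data

weights, loss = search_weights([domain_expert, aligned_expert], datasets)
print("merge weights:", weights, "average loss:", loss)
```

Because the safety set is part of the objective, the search cannot ignore the aligned expert: weights that discard it are penalized on the safety data in the same way that weights discarding the domain expert are penalized on the domain data.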

Paper Structure (37 sections, 6 equations, 5 figures, 5 tables)

This paper contains 37 sections, 6 equations, 5 figures, 5 tables.

Introduction
Related work
LLM Alignment
Model Merging
Alignment Evaluation
Preliminaries
Background on Model Merging
Automatic Task Weighting
Safety-Aware Merging
Motivation
Safety Data Generation
Domain Data Generation
Merging
Experiments
Experimental Setup
...and 22 more sections

Figures (5)

Figure 1: Safety-aware merging. Traditional LLM merging techniques can create multi-domain expert models but often transfer misalignment to the merged model. Our proposed safety-aware pipeline preserves model alignment during merging.
Figure 2: Data generation. We generate both safety data $\mathcal{D}_\text{safety}$ (top) and expert domain data $\mathcal{D}_\text{expert}$ (bottom). For safety data, we use an uncensored LLM to generate harmful questions, and collect refusals of the $\mathcal{F}$ experts with LLaMA-Guard [metallamaguard2]. For domain data, we use the $\mathcal{F}$ experts to generate questions in different domains (self-questioning) and collect responses.
Figure 3: Varying loss combination factor $\alpha$. For $\alpha \le 0.5$, merging yields good results in both accuracy and alignment. For greater $\alpha$ (e.g., 1.0), alignment degrades significantly while accuracy does not improve.
Figure 4: Domain data prompt. Prompt employed for domain-specific data generation $\mathcal{D}_\text{expert}$.
Figure 5: Alignment data prompt. Prompt employed for alignment data generation $\mathcal{D}_\text{safety}$.
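
The captions above do not state the objective that $\alpha$ controls. One reading consistent with Figure 3, where $\alpha = 1.0$ hurts alignment without improving accuracy, is a convex combination of a domain loss and a safety loss over the merge weights. The symbols $\lambda$ (merge weights), $\theta(\lambda)$ (merged parameters), $\mathcal{L}_\text{expert}$, and $\mathcal{L}_\text{safety}$ are assumptions introduced here for illustration; the paper's exact formulation may differ.

$$
\lambda^\star = \arg\min_{\lambda} \; \alpha \, \mathcal{L}_\text{expert}\big(\theta(\lambda); \mathcal{D}_\text{expert}\big) + (1 - \alpha) \, \mathcal{L}_\text{safety}\big(\theta(\lambda); \mathcal{D}_\text{safety}\big)
$$

Under this reading, $\alpha = 1.0$ removes the safety term entirely, so the search is driven by $\mathcal{D}_\text{expert}$ alone, which would match the alignment drop reported in the caption.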
