AlignMerge - Alignment-Preserving Large Language Model Merging via Fisher-Guided Geometric Constraints

Aniruddha Roy, Jyoti Patel, Aman Chadha, Vinija Jain, Amitava Das

TL;DR

This work addresses the alignment drift observed when merging fine-tuned LLM checkpoints. It introduces AlignMerge, a geometry-aware merging framework that operates in a local Fisher chart around an aligned base, coupling Fisher-geodesic proximity to experts with an explicit alignment shield that penalizes motion in an alignment-sensitive subspace and a soft alignment budget based on AQI. Across five model families, AlignMerge achieves higher alignment metrics (AQI, toxicity safeguards, LLM-judge alignment) while preserving or closely matching expert instruction-following and reasoning, and it exhibits reduced drift and budget violations compared with prior methods. The authors frame alignment as a latent-space geometric invariant and position alignment-preserving merging as a reusable, information-geometric composition primitive with potential extensions to multimodal and federated scenarios. While providing strong mid-scale empirical evidence, they acknowledge limitations such as local guarantees, proxy-based metrics, and evaluation scope, suggesting the approach as a robust foundation for geometry-aware model composition in future foundation-model systems.

Abstract

Merging large language models (LLMs) is a practical way to compose capabilities from multiple fine-tuned checkpoints without retraining. Yet standard schemes (linear weight soups, task vectors, and Fisher-weighted averaging) can preserve loss while quietly destroying alignment. We argue that merging is not a numerical trick but a geometry-constrained operation around an already-aligned anchor: fusion must be steered to respect safety geometry, not validated post hoc. We introduce AlignMerge, a geometry-aware merging framework that makes alignment an explicit invariant. In a local Fisher chart around an instruction-tuned base, we estimate an alignment subspace with projector P_A and optimize: L_AlignMerge = L_geo + lambda_align * L_align + lambda_bud * L_bud, where L_geo keeps the merge close to its experts in Fisher-Rao geometry, L_align penalizes motion along alignment-sensitive directions, and L_bud enforces a soft alignment budget. As the alignment functional we use the decoding-invariant Alignment Quality Index (AQI), a latent-space criterion that captures how cleanly aligned and misaligned behaviors separate in representation space. Across five model families (LLaMA-3 8B, Mistral 7B, Qwen 2, Phi-3.5, Gemma 2), merging safety anchors with task experts, AlignMerge improves alignment metrics (AQI, toxicity, LLM-judge alignment) while matching or exceeding the best expert on instruction-following, reasoning, and helpfulness. It also exhibits smaller alignment-subspace drift and fewer budget violations than Fisher soups, TIES, SafeMerge, and MergeAlign. These results make alignment-preserving merging a first-class design goal and suggest a path to geometry-aware composition of future foundation models.
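The three-term objective in the abstract can be made concrete with a small numerical sketch. The abstract does not spell out the individual terms, so the quadratic forms below are assumptions chosen to match its description: a Fisher-weighted squared distance to each expert for L_geo, a Fisher penalty on the projected displacement for L_align, and a quadratic hinge on A(theta) >= A_min for L_bud. All function names, the toy matrices, and the stand-in alignment score `toy_alignment` are hypothetical and are not taken from the paper.

```python
import numpy as np


def alignmerge_loss(delta, expert_deltas, weights, G, F_A, P_A,
                    alignment_fn, a_min, lam_align, lam_bud):
    """Toy L_AlignMerge = L_geo + lam_align * L_align + lam_bud * L_bud.

    `delta` is the merge displacement from the base; each entry of
    `expert_deltas` is an expert's displacement from the same base.
    """
    # L_geo (assumed form): Fisher-weighted squared distance to each expert.
    l_geo = 0.0
    for w, d_k in zip(weights, expert_deltas):
        r = delta - d_k
        l_geo += 0.5 * w * (r @ G @ r)
    # L_align (assumed form): penalise the part of delta inside the alignment subspace.
    p = P_A @ delta
    l_align = 0.5 * (p @ F_A @ p)
    # L_bud (assumed form): quadratic hinge, zero whenever A(theta) >= a_min.
    gap = max(0.0, a_min - alignment_fn(delta))
    l_bud = 0.5 * gap ** 2
    total = l_geo + lam_align * l_align + lam_bud * l_bud
    return total, (l_geo, l_align, l_bud)


def shielded_barycenter(expert_deltas, weights, G, F_A, P_A, lam_align):
    """Closed-form minimiser of L_geo + lam_align * L_align (budget term off).

    Setting the gradient to zero gives
    (sum_k w_k * G + lam_align * P_A^T F_A P_A) delta = G sum_k w_k delta_k.
    """
    w = np.asarray(weights, dtype=float)
    D = np.asarray(expert_deltas, dtype=float)
    rhs = G @ (w @ D)
    H = w.sum() * G + lam_align * (P_A.T @ F_A @ P_A)
    return np.linalg.solve(H, rhs)


# Toy 4-parameter problem; the first coordinate plays the alignment-sensitive direction.
G = np.diag([1.0, 2.0, 0.5, 1.5])          # stand-in Fisher metric at the base
F_A = np.eye(4)                            # stand-in alignment Fisher
u = np.array([1.0, 0.0, 0.0, 0.0])
P_A = np.outer(u, u)                       # projector onto the alignment subspace
expert_deltas = [np.array([1.0, 0.2, 0.0, -0.1]),
                 np.array([0.6, -0.2, 0.4, 0.3])]
weights = [0.5, 0.5]


def toy_alignment(delta):
    # Hypothetical alignment score that degrades with motion along u.
    return 0.7 - 0.5 * abs(delta[0])


plain = shielded_barycenter(expert_deltas, weights, G, F_A, P_A, lam_align=0.0)
shielded = shielded_barycenter(expert_deltas, weights, G, F_A, P_A, lam_align=10.0)
_, (_, _, bud_plain) = alignmerge_loss(plain, expert_deltas, weights, G, F_A, P_A,
                                       toy_alignment, 0.6, 10.0, 1.0)
_, (_, _, bud_shielded) = alignmerge_loss(shielded, expert_deltas, weights, G, F_A, P_A,
                                          toy_alignment, 0.6, 10.0, 1.0)
```

With `lam_align=0.0` the solution reduces to the weighted mean of the expert displacements. With a positive `lam_align` the component along the alignment direction shrinks while the remaining coordinates are unchanged in this diagonal toy setup, which is the qualitative behaviour the abstract attributes to the alignment term.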

Paper Structure

This paper contains 92 sections, 184 equations, 27 figures, 1 table.

Figures (27)

  • Figure 1: Alignment training increases AQI by reshaping the latent geometry of safe vs. unsafe prompts. borah-etal-2025-alignment show pooled activation embeddings of safe (green) and unsafe (red) prompts at successive checkpoints along the alignment fine-tuning trajectory. Early on, safe and unsafe activations are heavily entangled, yielding low AQI, which combines the Xie--Beni compactness--separation index and the Calinski--Harabasz dispersion index over latent clusters. As alignment training progresses, intra-class clusters tighten and inter-class distance grows, and AQI rises accordingly. Thus, alignment fine-tuning does not just change surface refusals; it progressively improves the latent cluster structure that AQI measures.
  • Figure 2: Safety fine-tuning amplifies geometric separation between safe and unsafe prompts. Following NEURIPS2024_a9bef53e, we report the mean layerwise separation $\tau(\mathbf{x}, \mu_L^S, \mu_L^U) = \left\| \hat{a}_L^\circ(\mathbf{x})[q] - \mu_L^U \right\|_2 - \left\| \hat{a}_L^\circ(\mathbf{x})[q] - \mu_L^S \right\|_2$, where $\hat{a}_L^\circ(\mathbf{x})[q]$ is the post-GELU MLP activation at position $q$ in layer $L$, and $\mu_L^S,\mu_L^U$ are mean activations over safe vs. unsafe clusters. We show mean $\tau$ across layers 1–6 for: (i) instruction-tuned, (ii) unlearning-tuned ($\eta_M$), and (iii) DPO-tuned ($\eta_M$) models. Green and red denote safe and unsafe completions; larger $\tau$ means stronger separation.
  • Figure 3: Geometric AlignMerge objective. We optimise the merge displacement $\delta\theta$ in a Fisher chart around the base $\theta_{\mathrm{IT}}$. (1) Fisher--geodesic proximity pulls the merged model $\theta = \theta_{\mathrm{IT}} + \delta\theta$ toward expert checkpoints $\{\theta_k\}$ via the Fisher metric $G = F_{\theta_{\mathrm{IT}}}$, recovering a local Riemannian barycenter when other terms vanish. (2) Alignment-subspace Fisher penalty uses the alignment Fisher $F_A = F_{\theta_{\mathrm{IT}}}^{\mathrm{align}}$ and projector $P_A$ onto the alignment subspace $\mathcal{S}_A$ to penalise motion in alignment-critical directions, implementing the "alignment shield" from §\ref{subsec:lalign}. (3) Soft alignment budget encodes the constraint $\mathcal{A}(\theta) \ge \mathcal{A}_{\min}$ via a quadratic penalty on violations of the alignment functional $\mathcal{A}$. The summarised form $\mathcal{L}_{\mathrm{AlignMerge}}$ highlights the trade-off between Fisher--geodesic fit to experts, movement in alignment-sensitive directions, and adherence to an alignment budget.
  • Figure 4: Overall performance of AlignMerge vs existing merging schemes. For each model family (LLaMA-3 8B, Mistral 7B, Qwen 2, Phi-3.5, Gemma 2) we merge a safety-aligned anchor with one or more specialised experts using naive averaging, task-vector / delta arithmetic, Fisher-weighted merging, TIES / sparse-mask merging, SafeMerge, MergeAlign, and AlignMerge. We report alignment & safety metrics (AQI, mean toxicity, toxicity rate, LLM-judge alignment score), utility metrics (instruction-following, reasoning, and helpfulness, plus relative change vs the best expert), and geometric diagnostics (alignment-subspace drift, fraction of alignment-budget violations, and Fisher--geodesic proximity $\mathcal{L}_{\mathrm{geo}}$).
  • Figure 5: Per–model performance of AlignMerge on Mistral 7B. We report alignment and safety metrics (AQI, mean toxicity, toxicity rate, and G-Eval alignment), utility metrics (instruction following, reasoning, helpfulness, and relative utility change $\Delta$Utility vs. the best expert), and geometric diagnostics (alignment-subspace distance $\lVert P_A(\theta-\theta_{\text{SAFE}})\rVert$, fraction of AQI-budget violations, and Fisher–geodesic length $L_{\text{geo}}$) for the instruction-tuned base $\theta_{\text{IT}}$, safety anchor $\theta_{\text{SAFE}}$, two specialised experts, standard merging baselines, SafeMerge, MergeAlign, our full AlignMerge, and two ablations without $L_{\text{align}}$ and $L_{\text{bud}}$. On Mistral 7B, AlignMerge raises AQI from $0.58$ (base) / $0.69$ (safety anchor) to $0.75$, while reducing mean toxicity from $0.11$ to $0.041$ and toxicity rate from $23.5\%$ to $8.9\%$. Budget violations drop from up to $19.2\%$ for $\theta_{\text{SAFE}}$ to $4.9\%$ for AlignMerge, and $L_{\text{geo}}$ contracts from $0.072$ to $0.035$, with instruction and reasoning scores staying within $\approx 0.3$--$0.8$ points of the best expert. Ablations that remove $L_{\text{align}}$ or $L_{\text{bud}}$ show higher budget-violation rates and weaker AQI gains, illustrating the importance of both the alignment-subspace and budget terms.
  • ...and 22 more figures
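
The separation score $\tau$ defined in the Figure 2 caption is simple enough to compute directly. The sketch below is a minimal illustration, assuming that per-layer activations for the prompt and for the safe and unsafe reference sets have already been extracted into arrays. The function names, array shapes, and toy numbers are hypothetical and do not come from the paper.

```python
import numpy as np


def separation_tau(act, mu_safe, mu_unsafe):
    """tau = ||a - mu_U||_2 - ||a - mu_S||_2 for one activation vector.

    Positive when the activation lies closer to the safe mean than to
    the unsafe mean, negative in the opposite case.
    """
    return float(np.linalg.norm(act - mu_unsafe) - np.linalg.norm(act - mu_safe))


def mean_tau_across_layers(prompt_acts, safe_acts, unsafe_acts):
    """Average tau over layers.

    prompt_acts: (n_layers, d) activation of one prompt at the chosen position.
    safe_acts:   (n_layers, n_safe, d) reference activations for safe prompts.
    unsafe_acts: (n_layers, n_unsafe, d) reference activations for unsafe prompts.
    """
    taus = []
    for a, s, u in zip(prompt_acts, safe_acts, unsafe_acts):
        mu_s = s.mean(axis=0)   # mean activation of the safe cluster at this layer
        mu_u = u.mean(axis=0)   # mean activation of the unsafe cluster at this layer
        taus.append(separation_tau(a, mu_s, mu_u))
    return float(np.mean(taus))


# Toy data: two layers, 2-d activations, cluster means at (+1, 0) and (-1, 0).
safe_acts = np.array([[[1.0, 0.5], [1.0, -0.5]],
                      [[1.0, 0.5], [1.0, -0.5]]])
unsafe_acts = np.array([[[-1.0, 0.5], [-1.0, -0.5]],
                        [[-1.0, 0.5], [-1.0, -0.5]]])
prompt_acts = np.array([[1.0, 0.0],    # layer 1: sits on the safe mean
                        [0.0, 0.0]])   # layer 2: midway between the two means

tau_layer1 = separation_tau(prompt_acts[0], np.array([1.0, 0.0]), np.array([-1.0, 0.0]))
tau_mean = mean_tau_across_layers(prompt_acts, safe_acts, unsafe_acts)
```

In this toy case the prompt sits exactly on the safe mean at the first layer (tau of 2) and midway between the means at the second (tau of 0), so the layer average is 1.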