AlignMerge - Alignment-Preserving Large Language Model Merging via Fisher-Guided Geometric Constraints
Aniruddha Roy, Jyoti Patel, Aman Chadha, Vinija Jain, Amitava Das
TL;DR
This work addresses the alignment drift that arises when fine-tuned LLM checkpoints are merged. It introduces AlignMerge, a geometry-aware merging framework that operates in a local Fisher chart around an aligned base. The framework combines three components: Fisher-geodesic proximity to the experts, an explicit alignment shield that penalizes motion in an alignment-sensitive subspace, and a soft alignment budget based on the Alignment Quality Index (AQI). Across five model families, AlignMerge achieves higher alignment metrics (AQI, toxicity, LLM-judge alignment) while preserving or closely matching expert instruction-following and reasoning, and it shows less drift and fewer budget violations than prior merging methods. The authors frame alignment as a latent-space geometric invariant and position alignment-preserving merging as a reusable, information-geometric composition primitive with potential extensions to multimodal and federated scenarios. The empirical evidence is at mid scale, and the authors acknowledge limitations: guarantees that hold only locally, proxy-based metrics, and a limited evaluation scope. They present the approach as a foundation for geometry-aware model composition in future foundation-model systems.
Abstract
Merging large language models (LLMs) is a practical way to compose capabilities from multiple fine-tuned checkpoints without retraining. Yet standard schemes (linear weight soups, task vectors, and Fisher-weighted averaging) can preserve loss while quietly destroying alignment. We argue that merging is not a numerical trick but a geometry-constrained operation around an already-aligned anchor: fusion must be steered to respect safety geometry, not validated post hoc. We introduce AlignMerge, a geometry-aware merging framework that makes alignment an explicit invariant. In a local Fisher chart around an instruction-tuned base, we estimate an alignment subspace with projector P_A and optimize: L_AlignMerge = L_geo + lambda_align * L_align + lambda_bud * L_bud, where L_geo keeps the merge close to its experts in Fisher-Rao geometry, L_align penalizes motion along alignment-sensitive directions, and L_bud enforces a soft alignment budget. As the alignment functional we use the decoding-invariant Alignment Quality Index (AQI), a latent-space criterion that captures how cleanly aligned and misaligned behaviors separate in representation space. Across five model families (LLaMA-3 8B, Mistral 7B, Qwen 2, Phi-3.5, Gemma 2), merging safety anchors with task experts, AlignMerge improves alignment metrics (AQI, toxicity, LLM-judge alignment) while matching or exceeding the best expert on instruction-following, reasoning, and helpfulness. It also exhibits smaller alignment-subspace drift and fewer budget violations than Fisher soups, TIES, SafeMerge, and MergeAlign. These results make alignment-preserving merging a first-class design goal and suggest a path to geometry-aware composition of future foundation models.
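To make the three-term objective concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not give closed forms for the individual terms, so every choice here is an assumption made for illustration: a diagonal Fisher approximation (`fisher_diag`), a local quadratic surrogate for the Fisher-Rao distance in L_geo, a Fisher-weighted norm of the projected displacement for L_align, a squared hinge on a precomputed AQI score for L_bud, and the hypothetical names `projector` and `alignmerge_loss`.

```python
import numpy as np


def projector(basis):
    """Orthogonal projector P_A onto the column span of `basis` (shape d x r).

    Assumption: the alignment subspace is given by a full-column-rank basis.
    """
    q, _ = np.linalg.qr(basis)  # orthonormalize the basis columns
    return q @ q.T


def alignmerge_loss(theta, base, experts, fisher_diag, P_A, aqi, aqi_min,
                    lam_align=1.0, lam_bud=1.0, weights=None):
    """Illustrative L_geo + lam_align * L_align + lam_bud * L_bud.

    theta, base : flattened parameter vectors (candidate merge, aligned base)
    experts     : list of flattened expert parameter vectors
    fisher_diag : diagonal Fisher estimate at the base (assumed form)
    aqi         : AQI score of the candidate merge, computed elsewhere
    aqi_min     : alignment budget, i.e. the lowest acceptable AQI (assumed form)
    """
    if weights is None:
        weights = np.full(len(experts), 1.0 / len(experts))

    # L_geo: local quadratic surrogate for the squared Fisher-Rao distance
    # between the merge and each expert, averaged with `weights`.
    l_geo = sum(w * np.sum(fisher_diag * (theta - e) ** 2)
                for w, e in zip(weights, experts))

    # L_align: Fisher-weighted energy of the displacement from the base
    # that falls inside the alignment-sensitive subspace.
    delta_aligned = P_A @ (theta - base)
    l_align = np.sum(fisher_diag * delta_aligned ** 2)

    # L_bud: soft budget, zero while AQI stays at or above aqi_min.
    l_bud = max(0.0, aqi_min - aqi) ** 2

    return l_geo + lam_align * l_align + lam_bud * l_bud


# Toy usage: two experts in 3-D, alignment subspace spanned by the third axis.
base = np.zeros(3)
experts = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
P_A = projector(np.array([[0.0], [0.0], [1.0]]))
midpoint = np.array([0.5, 0.5, 0.0])
print(alignmerge_loss(midpoint, base, experts, np.ones(3), P_A,
                      aqi=0.9, aqi_min=0.8))
```

In this toy setup, a candidate that stays outside the alignment subspace and within the budget pays only the L_geo term, while any motion along the third axis or any AQI shortfall adds a penalty. A real merge would minimize this loss over `theta` and recompute AQI from model representations at each step, which the sketch leaves out.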
