A safety realignment framework via subspace-oriented model fusion for large language models
Xin Yi, Shunfan Zheng, Linlin Wang, Xiaoling Wang, Liang He
TL;DR
The paper tackles safety degradation in LLMs after downstream fine-tuning by proposing a subspace-oriented model fusion (SOMF) framework. SOMF disentangles task vectors, identifies a safety subspace via masking using a Concrete distribution, and fuses masked task vectors with an initially aligned safe model to yield a realigned model that preserves task performance while reducing unsafe outputs. Extensive experiments across single-task and multi-task scenarios demonstrate that SOMF improves harmlessness against jailbreaks and attack prompts, with robustness across Chinese, English, and Hindi, as well as code and math tasks, while maintaining downstream task accuracy. The approach highlights the value of reusing safety information through a shared subspace during fusion, providing a scalable path for safe multi-task LLM deployment and offering competitive gains over existing safety fine-tuning and model-fusion methods.
Abstract
Current safeguard mechanisms for large language models (LLMs) are susceptible to jailbreak attacks, which makes them inherently fragile. Even fine-tuning on apparently benign data for downstream tasks can jeopardize safety. One potential solution is to conduct safety fine-tuning after downstream fine-tuning. However, safety fine-tuning carries a risk of catastrophic forgetting: the LLM may regain its safety measures but lose the task-specific knowledge acquired during downstream fine-tuning. In this paper, we introduce a safety realignment framework through subspace-oriented model fusion (SOMF), which aims to combine the safeguard capabilities of the initially aligned model and the current fine-tuned model into a realigned model. Our approach begins by disentangling all task vectors from the weights of each fine-tuned model. We then identify safety-related regions within these vectors using subspace masking techniques. Finally, we explore the fusion of the initially safety-aligned LLM with all task vectors based on the identified safety subspace. We validate that our safety realignment framework satisfies the safety requirements of a single fine-tuned model as well as of multiple models during their fusion. Our findings confirm that SOMF preserves safety without notably compromising performance on downstream tasks, including instruction following in Chinese, English, and Hindi, as well as problem-solving capabilities in code and math.
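The three steps described in the abstract (task-vector disentangling, safety-subspace masking, and fusion) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the `scale` parameter, and the specific fusion rule (keep the aligned weights where the mask is 1, apply the task delta where it is 0) are assumptions made for illustration. The abstract does not state the exact formula, and the sketch uses the standard binary Concrete relaxation as a stand-in for the learned mask.

```python
import numpy as np

def task_vector(finetuned, aligned):
    """Task vector: fine-tuned weights minus the initially aligned weights."""
    return {name: finetuned[name] - aligned[name] for name in aligned}

def sample_concrete_mask(log_alpha, temperature, rng):
    """Draw a relaxed (soft) binary mask from a binary Concrete distribution.

    log_alpha holds per-weight logits (hypothetical learnable parameters);
    a lower temperature pushes samples closer to 0 or 1.
    """
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=log_alpha.shape)
    logistic_noise = np.log(u) - np.log1p(-u)
    return 1.0 / (1.0 + np.exp(-(log_alpha + logistic_noise) / temperature))

def fuse(aligned, task_vectors, safety_masks, scale=1.0):
    """Fuse masked task vectors into the aligned model.

    Assumed rule: where the safety mask is 1 the aligned weights are kept,
    elsewhere the summed task deltas are added.
    """
    fused = {}
    for name, base in aligned.items():
        delta = sum(tv[name] for tv in task_vectors)
        fused[name] = base + scale * (1.0 - safety_masks[name]) * delta
    return fused

# Toy usage with a single 4-weight "layer" and one fine-tuned model.
rng = np.random.default_rng(0)
aligned = {"w": np.zeros(4)}
finetuned = {"w": np.array([1.0, 2.0, 3.0, 4.0])}
tau = task_vector(finetuned, aligned)
soft_mask = sample_concrete_mask(np.zeros(4), temperature=0.5, rng=rng)
hard_mask = {"w": np.array([1.0, 0.0, 0.0, 1.0])}  # hand-picked for the demo
realigned = fuse(aligned, [tau], hard_mask)
```

In the multi-task setting, `task_vectors` would hold one entry per fine-tuned model, and the same mask would be shared across all of them during fusion.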
