A safety realignment framework via subspace-oriented model fusion for large language models

Xin Yi, Shunfan Zheng, Linlin Wang, Xiaoling Wang, Liang He

TL;DR

The paper tackles safety degradation in LLMs after downstream fine-tuning by proposing a subspace-oriented model fusion (SOMF) framework. SOMF disentangles task vectors, identifies a safety subspace via masking using a Concrete distribution, and fuses masked task vectors with an initially aligned safe model to yield a realigned model that preserves task performance while reducing unsafe outputs. Extensive experiments across single-task and multi-task scenarios demonstrate that SOMF improves harmlessness against jailbreaks and attack prompts, with robustness across Chinese, English, and Hindi, as well as code and math tasks, while maintaining downstream task accuracy. The approach highlights the value of reusing safety information through a shared subspace during fusion, providing a scalable path for safe multi-task LLM deployment and offering competitive gains over existing safety fine-tuning and model-fusion methods.
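
The summary mentions masks drawn from a Concrete distribution without giving details. As a rough illustration only, the sketch below samples a relaxed binary mask using the standard binary Concrete (relaxed Bernoulli) reparameterization. The function names, the `log_alpha` logits, and the temperature value are assumptions made for this example and are not taken from the paper.

```python
import math
import random


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sample_binary_concrete(log_alpha, temperature, rng):
    """Draw one relaxed Bernoulli (binary Concrete) sample in [0, 1].

    log_alpha is a (hypothetical) learnable logit; a lower temperature
    pushes samples closer to 0 or 1.
    """
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-9), 1.0 - 1e-9)
    logistic_noise = math.log(u) - math.log(1.0 - u)
    return _sigmoid((log_alpha + logistic_noise) / temperature)


def sample_mask(log_alphas, temperature=0.5, seed=0):
    """Sample one relaxed mask entry per parameter position."""
    rng = random.Random(seed)
    return [sample_binary_concrete(a, temperature, rng) for a in log_alphas]


# Positions with large positive logits tend to be kept (mask near 1),
# positions with large negative logits tend to be dropped (mask near 0).
mask = sample_mask([4.0, -4.0, 0.0])
```

Because the sample is a differentiable function of `log_alpha`, a relaxation of this kind lets mask logits be trained by gradient descent, which is presumably why a Concrete distribution is used for subspace masking.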

Abstract

The current safeguard mechanisms for large language models (LLMs) are susceptible to jailbreak attacks, making them inherently fragile. Even fine-tuning on apparently benign data for downstream tasks can jeopardize safety. One potential solution is to conduct safety fine-tuning after downstream fine-tuning. However, safety fine-tuning carries a risk of catastrophic forgetting: LLMs may regain their safety measures but lose the task-specific knowledge acquired during downstream fine-tuning. In this paper, we introduce a safety realignment framework through subspace-oriented model fusion (SOMF), aiming to combine the safeguard capabilities of the initially aligned model and the current fine-tuned model into a realigned model. Our approach begins by disentangling all task vectors from the weights of each fine-tuned model. We then identify safety-related regions within these vectors using subspace masking techniques. Finally, we explore the fusion of the initial safely aligned LLM with all task vectors based on the identified safety subspace. We validate that our safety realignment framework satisfies the safety requirements of a single fine-tuned model as well as multiple models during their fusion. Our findings confirm that SOMF preserves safety without notably compromising performance on downstream tasks, including instruction following in Chinese, English, and Hindi, as well as problem-solving capabilities in Code and Math.
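
The three steps in the abstract (task vectors, subspace masking, fusion) can be sketched on toy weights as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: task vectors are taken relative to the aligned model, the mask is supplied by hand rather than learned, a mask value of 1 keeps the fine-tuned update while 0 falls back to the aligned weight, and the names `task_vector`, `fuse`, and `scale` are hypothetical.

```python
def task_vector(finetuned, reference):
    """Step 1: per-parameter difference between a fine-tuned and a reference model."""
    return {
        name: [f - r for f, r in zip(finetuned[name], reference[name])]
        for name in reference
    }


def fuse(aligned, task_vectors, masks, scale=1.0):
    """Step 3: add masked task vectors onto the aligned model's weights.

    masks[k][name][i] in [0, 1] says how much of task k's update to keep
    at position i (assumed convention: 0 means revert to the aligned weight).
    """
    fused = {}
    for name, base in aligned.items():
        merged = list(base)
        for tv, mask in zip(task_vectors, masks):
            for i, (delta, keep) in enumerate(zip(tv[name], mask[name])):
                merged[i] += scale * keep * delta
        fused[name] = merged
    return fused


# Toy example with a single three-element weight tensor named "w".
aligned = {"w": [1.0, 2.0, 3.0]}
finetuned = {"w": [1.5, 2.0, 2.0]}

tv = task_vector(finetuned, aligned)   # {"w": [0.5, 0.0, -1.0]}
# Step 2 (hand-written here): treat the last position as safety-related,
# so its fine-tuned update is masked out.
mask = {"w": [1.0, 1.0, 0.0]}

realigned = fuse(aligned, [tv], [mask])
print(realigned)  # {'w': [1.5, 2.0, 3.0]}
```

Passing several task vectors and masks to `fuse` corresponds to the multi-model case the abstract describes, where each fine-tuned model contributes its own masked update to the same aligned starting point.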

Paper Structure (32 sections, 10 equations, 8 figures, 14 tables)

Figures (8)

  • Figure 1: A framework for the safety realignment of LLMs via subspace-oriented model fusion (SOMF). The safety level of an LLM is likened to the structural integrity of protective armor. A securely aligned model is depicted as clad in immaculate armor, while SFT compromises the model's safety, akin to damaging or destroying that armor. We aim to realign safety in three steps (constructing task vectors, subspace masking, and model fusion), restoring the compromised defenses much as one would repair damaged armor.
  • Figure 2: Helpfulness evaluation. We compare with the base model to quantify the helpfulness preference rate.
  • Figure 3: Relevance of different layers of the model to safety. We conduct a comparison of Pearson correlation coefficients in the attention layer's $W_v$ matrix. Our analysis involves two models as the default selection: one subjected to task-specific fine-tuning and the other undergoing safety realignment.
  • Figure 4: Safety-related regions in task vectors by PEFT training strategy. We examine the task vector corresponding to the parameters of the model's first layer. A visual representation of 225 randomly sampled positions is provided.
  • Figure 5: Model safety across the harmful question categories of the CATQA dataset after fine-tuning on the Hindi task. We compare the model's safety profile before and after realignment with a continuous mask, under both PEFT and Full-FT training strategies.
  • ...and 3 more figures