Towards Comprehensive Post Safety Alignment of Large Language Models via Safety Patching

Weixiang Zhao, Yulin Hu, Zhuojun Li, Yang Deng, Jiahe Guo, Xingyu Sui, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu

TL;DR

This work introduces SafePatching, a post safety alignment framework for LLMs that concurrently achieves Safety Enhancement, Over-Safety Mitigation, and Utility Preservation. It derives two patches from harmful data—one via gradient ascent to erase unsafe content and another via gradient descent to loosen safety constraints—and merges them into the backbone using a controllable patching scheme that preserves utility, including random retention and SNIP-guided difference-set targeting. Extensive experiments across multiple aligned LLMs show SafePatching outperforms baselines in safety, reduces false refusals, and maintains or even improves task performance, with demonstrated effectiveness in continual PSA settings. The method offers scalable, efficient PSA with principled handling of patch conflicts and a detailed analysis of patch distributions and hyper-parameters, providing a practical benchmark for PSA in real-world LLM deployments.

Abstract

Safety alignment of large language models (LLMs) has been gaining increasing attention. However, current safety-aligned LLMs suffer from fragile and imbalanced safety mechanisms: they can still be induced to generate unsafe responses, exhibit over-safety by rejecting safe user inputs, and fail to preserve general utility after safety alignment. To this end, we propose a novel post safety alignment (PSA) method to address these inherent and emerging safety challenges, covering safety enhancement, over-safety mitigation, and utility preservation. Specifically, we introduce SafePatching, a novel framework for comprehensive PSA, in which two distinct safety patches are developed on harmful data to enhance safety and mitigate over-safety, and are then seamlessly integrated into the target LLM backbone without compromising its utility. Extensive experiments on four representative aligned LLMs, including LLaMA-2/3, Gemma and Mistral, show that SafePatching achieves a more comprehensive PSA than baseline methods, further optimizing the balance between being helpful and harmless in current aligned LLMs. SafePatching also demonstrates its superiority in continual PSA scenarios.

Paper Structure (61 sections, 13 equations, 10 figures, 14 tables)

This paper contains 61 sections, 13 equations, 10 figures, 14 tables.

Figures (10)

  • Figure 1: The three main objectives for post safety alignment of LLMs: (1) Safety Enhancement, (2) Over-Safety Mitigation and (3) Utility Preservation.
  • Figure 2: The overall architecture of our proposed SafePatching. (1) In Patch Derivation, starting from the aligned LLM $\theta$, we separately perform gradient ascent and gradient descent on the same set of harmful data $D_h$. Patches for safety enhancement $\boldsymbol{P}^{\textrm{SE}}$ and over-safety mitigation $\boldsymbol{P}^{\textrm{OSM}}$ are derived from the difference between each of the two resulting models and the target LLM backbone. (2) In Controllable Patching, random parameter retention is first performed on each patch to mitigate potential conflicts between the patches and the backbone, for utility preservation. Control is then further exerted on the two patches to retain the difference set of their most important parameters.
  • Figure 3: Visualization of the difference set of the most important parameters from the safety patch (left) and the over-safety patch (right). The backbone is LLaMA-2-7B-Chat.
  • Figure 4: The performance on safety, over-safety, and utility of SafePatching under different parameter retention rates. The backbone is LLaMA-2-7B-Chat.
  • Figure 5: The performance of SafePatching in terms of safety, over-safety, and utility under different top rates and scale weights. The x-axis represents the settings of $(a, b)$ and $(\alpha, \beta)$. The settings adopted in our main experiments are in bold, and the backbone is LLaMA-2-7B-Chat.
  • ...and 5 more figures
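
The two stages described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. All function names are hypothetical, weights are treated as flat arrays, and the two fine-tuned models (gradient ascent and gradient descent on $D_h$) are assumed to be already available. Importance scores are taken as given inputs (the TL;DR mentions SNIP, which is not computed here). How the paper combines random retention with the difference-set masks is not spelled out in this section, so the sketch simply applies both in sequence, which is an assumption.

```python
import numpy as np


def derive_patch(theta_tuned, theta_base):
    """Patch = tuned weights minus backbone weights."""
    return theta_tuned - theta_base


def random_retention(patch, keep_rate, rng):
    """Keep each patch entry with probability keep_rate and zero the rest.

    Assumption: no rescaling of the retained entries is applied here.
    """
    mask = rng.random(patch.shape) < keep_rate
    return patch * mask


def top_mask(scores, rate):
    """Boolean mask selecting the `rate` fraction of highest-scoring entries."""
    k = int(round(rate * scores.size))
    flat = np.zeros(scores.size, dtype=bool)
    if k > 0:
        order = np.argsort(scores.ravel(), kind="stable")[::-1]
        flat[order[:k]] = True
    return flat.reshape(scores.shape)


def difference_set_masks(score_se, score_osm, a, b):
    """Keep, for each patch, its top entries that are NOT top entries of the other."""
    top_se = top_mask(score_se, a)
    top_osm = top_mask(score_osm, b)
    return top_se & ~top_osm, top_osm & ~top_se


def apply_patches(theta, p_se, p_osm, m_se, m_osm, alpha, beta):
    """Add the masked patches to the backbone with scale weights alpha and beta."""
    return theta + alpha * (p_se * m_se) + beta * (p_osm * m_osm)


# Toy usage with four "parameters"; all values are made up for illustration.
rng = np.random.default_rng(0)
theta = np.zeros(4)
p_se = derive_patch(np.ones(4), theta)    # stand-in for the gradient-ascent model
p_osm = derive_patch(-np.ones(4), theta)  # stand-in for the gradient-descent model
p_se = random_retention(p_se, keep_rate=1.0, rng=rng)
p_osm = random_retention(p_osm, keep_rate=1.0, rng=rng)
m_se, m_osm = difference_set_masks(
    np.array([0.9, 0.8, 0.1, 0.2]), np.array([0.7, 0.1, 0.9, 0.2]), a=0.5, b=0.5
)
patched = apply_patches(theta, p_se, p_osm, m_se, m_osm, alpha=1.0, beta=0.5)
```

In this toy case, index 0 is important to both patches and is therefore dropped from both masks, so only indices 1 and 2 are modified. The top rates `a`, `b` and the scale weights `alpha`, `beta` correspond to the $(a, b)$ and $(\alpha, \beta)$ settings named in the Figure 5 caption.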