Decoupling Safety into Orthogonal Subspace: Cost-Efficient and Performance-Preserving Alignment for Large Language Models

Yutao Mou, Xiaoling Zhou, Yuxiao Luo, Shikun Zhang, Wei Ye

TL;DR

The paper addresses the high cost and performance degradation of safety alignment for large language models by introducing LoRA-based Refusal-training that uses only safety data. It offers a theoretical account, transformation subspace orthogonality, showing that safety updates occupy a low-rank subspace largely orthogonal to the model's intrinsic transformations and therefore interfere little with core capabilities. Empirically, LoRA-based safety patches yield strong jailbreak defense with minimal loss in general performance, outperform full-parameter methods in data-balancing scenarios, and enable plug-and-play, lifelong safety patching. The work also demonstrates cross-domain advantages, analyzes the role of LoRA rank, and discusses limitations and future directions concerning adaptive attackers and broader model types, pointing to cost-efficient, scalable safety alignment for evolving AI systems.

Abstract

Safety alignment is essential for building trustworthy artificial intelligence, yet it remains challenging to enhance model safety without degrading general performance. Current approaches require computationally expensive searches for the optimal proportion of safety-critical and general-purpose data to balance safety and general performance, incurring high costs with limited gains. In this work, we show that LoRA-based Refusal-training enables performance-preserving safety alignment even when trained solely on safety data, demonstrating that LoRA serves as cost-efficient, performance-preserving, and plug-and-play safety patches. Beyond empirical findings, we provide both theoretical and experimental evidence that LoRA effectively decouples safety into a low-rank subspace largely orthogonal to the model's intrinsic transformation space, ensuring that safety enhancements do not interfere with inherent capabilities.
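
The orthogonality claim in the abstract can be made concrete with a toy calculation. The sketch below is illustrative only and is not the paper's measurement protocol: it assumes the "transformation space" of a weight matrix can be summarized by its top left singular vectors, and it scores overlap as the mean squared cosine of the principal angles between two such subspaces. The function names, the matrix sizes, and the synthetic `w0`, `delta_orthogonal`, and `delta_interfering` matrices are all hypothetical.

```python
import numpy as np


def top_left_singular_basis(matrix, k):
    """Orthonormal basis (as columns) for the top-k left singular directions."""
    u, _, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k]


def subspace_overlap(basis_a, basis_b):
    """Mean squared cosine of the principal angles between two subspaces.

    Both inputs must have orthonormal columns. The score is 0 when the
    subspaces are orthogonal and 1 when the smaller one lies inside the larger.
    """
    k = min(basis_a.shape[1], basis_b.shape[1])
    return float(np.linalg.norm(basis_a.T @ basis_b, "fro") ** 2 / k)


rng = np.random.default_rng(0)
d, k, r = 64, 8, 4  # hidden size, rank of toy W0, LoRA rank (all assumed)

# Random orthonormal directions: the first k span W0's output space,
# the next r are orthogonal to it by construction.
q, _ = np.linalg.qr(rng.standard_normal((d, d)))
u0, u_perp = q[:, :k], q[:, k:k + r]

# Toy pretrained weight whose column space is exactly span(u0).
w0 = u0 @ np.diag(np.linspace(10.0, 5.0, k)) @ rng.standard_normal((k, d))

# Two low-rank updates Delta W = B A with different choices of B.
a = rng.standard_normal((r, d))
delta_orthogonal = u_perp @ a      # B lies outside W0's output space
delta_interfering = u0[:, :r] @ a  # B lies inside W0's output space

basis_w0 = top_left_singular_basis(w0, k)
overlap_orthogonal = subspace_overlap(
    basis_w0, top_left_singular_basis(delta_orthogonal, r)
)
overlap_interfering = subspace_overlap(
    basis_w0, top_left_singular_basis(delta_interfering, r)
)

print(f"orthogonal update overlap:  {overlap_orthogonal:.3e}")
print(f"interfering update overlap: {overlap_interfering:.3f}")
```

In this construction the first score is close to 0 and the second is close to 1, which mirrors the contrast the abstract draws between an update that leaves the model's own transformations untouched and one that competes with them. For a real checkpoint, `w0` would be a pretrained weight matrix and the update would be the trained product of the LoRA factors; the choice of `k` is a further assumption that affects the score.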

Paper Structure

This paper contains 31 sections, 12 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: (a) LoRA-based SFT achieves better safety–utility trade-off than full-parameter training. (b) Schematic illustration of transformation spaces induced by LoRA (right) and full-parameter (left) training. The $\Delta W$ subspace from LoRA training is orthogonal to the model’s $W_0$ subspace, avoiding interference, while full-parameter training produces non-orthogonal and interfering subspaces.
  • Figure 2: Impact of different choices of general-purpose data in Refusal-SFT on LLM safety and general capabilities. Higher scores indicate better performance, yet achieving an optimal balance between safety and general performance remains challenging.
  • Figure 3: Effect of different general-purpose data ratios during Refusal-SFT training on safety and general capabilities of LLMs. Varying the proportion of general-purpose data reveals an inherent trade-off between safety and general performance.
  • Figure 4: LoRA for safety alignment: (a) cost-efficient and performance-preserving alternative to full-parameter fine-tuning; (b) plug-and-play safety patching in multi-round red-teaming and continuous learning.
  • Figure 5: Comparison of parameter update magnitudes across different safety alignment methods.
  • ...and 8 more figures
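
The plug-and-play patching described in the Figure 4 caption can be sketched as a frozen weight with removable low-rank updates. This is a minimal illustration under stated assumptions, not the authors' implementation: the `PatchedLinear` class, its method names, the patch names, and the idea of summing one patch per red-teaming round are all hypothetical.

```python
import numpy as np


class PatchedLinear:
    """Frozen weight W0 plus named low-rank patches Delta W = scale * B @ A."""

    def __init__(self, w0):
        self.w0 = w0        # shape (d_out, d_in), never modified
        self.patches = {}   # name -> (B, A, scale)

    def attach(self, name, b, a, scale=1.0):
        """Register a patch with B of shape (d_out, r) and A of shape (r, d_in)."""
        self.patches[name] = (b, a, scale)

    def detach(self, name):
        """Remove a patch; the base weight is untouched, so behavior reverts."""
        self.patches.pop(name)

    def effective_weight(self):
        """Dense weight equivalent to the base plus all attached patches."""
        w = self.w0.copy()
        for b, a, scale in self.patches.values():
            w += scale * (b @ a)
        return w

    def forward(self, x):
        """Compute x @ W^T without materializing the patched weight."""
        y = x @ self.w0.T
        for b, a, scale in self.patches.values():
            y = y + scale * ((x @ a.T) @ b.T)
        return y


rng = np.random.default_rng(0)
d_out, d_in, r = 16, 12, 2  # toy sizes, assumed for illustration
layer = PatchedLinear(rng.standard_normal((d_out, d_in)))
x = rng.standard_normal((5, d_in))
base_output = layer.forward(x)

# Attach one hypothetical patch per red-teaming round.
for round_name in ("round_1", "round_2"):
    layer.attach(
        round_name,
        rng.standard_normal((d_out, r)),
        rng.standard_normal((r, d_in)),
        scale=0.1,
    )
patched_output = layer.forward(x)
merged_output = x @ layer.effective_weight().T

# Detaching every patch restores the original behavior.
for round_name in ("round_1", "round_2"):
    layer.detach(round_name)
restored_output = layer.forward(x)
```

Because the base weight is never overwritten, removing a patch restores the original outputs, and the factored forward pass matches the merged dense weight up to floating-point error. How patches from successive rounds are actually combined in the paper is not stated in this summary; simple summation is an assumption made here.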