LoRA is All You Need for Safety Alignment of Reasoning LLMs

Yihao Xue; Baharan Mirzasoleiman

TL;DR

This work addresses the "Safety Tax", the loss of reasoning ability observed when reasoning-capable LLMs are fine-tuned for safety, by using Low-Rank Adaptation (LoRA) for supervised safety fine-tuning on a direct refusal dataset. Across 7B and 14B DeepSeek-derived models, LoRA reaches safety levels comparable to full-model fine-tuning while largely preserving the original reasoning capabilities, and at lower computational cost. Ablations show that rank-1 updates suffice, that applying LoRA to only the MLP up projections works best, and that middle layers are more effective than early or late ones. The authors also find that LoRA updates overlap less with the initial weights than full-model fine-tuning updates do. Post-hoc attempts to reduce this overlap further give only modest gains on some tasks, but they point to future methods for optimizing the reasoning-safety tradeoff.

Abstract

Reasoning LLMs have demonstrated remarkable breakthroughs in solving complex problems that were previously out of reach. To ensure LLMs do not assist with harmful requests, safety alignment fine-tuning is necessary in the post-training phase. However, safety alignment fine-tuning has recently been shown to significantly degrade reasoning abilities, a phenomenon known as the "Safety Tax". In this work, we show that using LoRA for SFT on refusal datasets effectively aligns the model for safety without harming its reasoning capabilities. This is because restricting the safety weight updates to a low-rank space minimizes the interference with the reasoning weights. Our extensive experiments across four benchmarks covering math, science, and coding show that this approach produces highly safe LLMs--with safety levels comparable to full-model fine-tuning--without compromising their reasoning abilities. Our ablation studies further identify three key factors in LoRA: (1) rank-$1$ updates are sufficient to achieve the best reasoning and safety performance, (2) the up projection layers are the most critical modules, with LoRA applied to them alone achieving even better results, and (3) middle layers are more effective than early or late layers. Together, these findings show that strong safety and reasoning can be achieved at minimal computational cost when updates are applied in the right places. Additionally, we observe that LoRA induces weight updates with smaller overlap with the initial weights compared to full-model fine-tuning. Finally, while our attempts to further reduce this overlap yield only modest improvements on some tasks, they highlight the potential of developing methods that more reliably optimize the reasoning-safety tradeoff.
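
To make the rank-1 claim concrete, here is a minimal NumPy sketch of what a rank-1 LoRA update to a single up-projection matrix looks like and how few parameters it trains. The matrix sizes, the `alpha` scaling, and the function names are illustrative assumptions, not the authors' configuration or code.

```python
import numpy as np

def rank1_delta(b, a, alpha=1.0):
    """Rank-1 LoRA update: delta = alpha * b a^T (alpha is an assumed scale)."""
    return alpha * np.outer(b, a)

def param_counts(d_out, d_in, rank=1):
    """Trainable parameters: full fine-tuning vs. LoRA on one weight matrix."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora

rng = np.random.default_rng(0)
d_in, d_out = 8, 32                      # toy sizes, not real model dimensions
W = rng.standard_normal((d_out, d_in))   # frozen "up projection" weight

# Standard LoRA initialization: one factor starts at zero,
# so the adapted weight equals the frozen weight before training.
a = rng.standard_normal(d_in)
b = np.zeros(d_out)
W_init = W + rank1_delta(b, a)

# Stand-in for a trained factor (random here, purely illustrative).
b = rng.standard_normal(d_out)
delta = rank1_delta(b, a)
W_adapted = W + delta

full, lora = param_counts(d_out, d_in)
print(f"full fine-tuning params: {full}, rank-1 LoRA params: {lora}")
print(f"rank of update: {np.linalg.matrix_rank(delta)}")
```

Only the two vectors `a` and `b` are trained while `W` stays frozen, which is why the update is confined to a one-dimensional subspace per adapted matrix.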
