LoRA is All You Need for Safety Alignment of Reasoning LLMs

Yihao Xue, Baharan Mirzasoleiman

TL;DR

This work tackles the "Safety Tax", the loss of reasoning ability observed when reasoning-capable LLMs are aligned for safety, by using Low-Rank Adaptation (LoRA) for supervised safety fine-tuning on a direct refusal dataset. Across 7B and 14B DeepSeek-derived models, LoRA reaches safety levels comparable to full-model fine-tuning while largely preserving the original reasoning capabilities, and does so at lower computational cost. Ablations show that rank-1 updates, LoRA applied to the up projections in the MLPs, and updates targeted at middle layers yield the best reasoning–safety tradeoffs. The authors also analyze the geometry of LoRA updates, finding smaller overlap with the initial weights than under full-model fine-tuning, and explore post-hoc methods to reduce this overlap further; these give only modest gains on some tasks but point to future work on optimizing the reasoning–safety tradeoff more reliably.

Abstract

Reasoning LLMs have demonstrated remarkable breakthroughs in solving complex problems that were previously out of reach. To ensure LLMs do not assist with harmful requests, safety alignment fine-tuning is necessary in the post-training phase. However, safety alignment fine-tuning has recently been shown to significantly degrade reasoning abilities, a phenomenon known as the "Safety Tax". In this work, we show that using LoRA for SFT on refusal datasets effectively aligns the model for safety without harming its reasoning capabilities. This is because restricting the safety weight updates to a low-rank space minimizes the interference with the reasoning weights. Our extensive experiments across four benchmarks covering math, science, and coding show that this approach produces highly safe LLMs--with safety levels comparable to full-model fine-tuning--without compromising their reasoning abilities. Our ablation studies further identify three key factors in LoRA: (1) rank-$1$ updates are sufficient to achieve the best reasoning and safety performance, (2) the up projection layers are the most critical modules, with LoRA applied to them alone achieving even better results, and (3) middle layers are more effective than early or late layers. Together, these findings show that strong safety and reasoning can be achieved at minimal computational cost when updates are applied in the right places. Additionally, we observe that LoRA induces weight updates with smaller overlap with the initial weights compared to full-model fine-tuning. Finally, while our attempts to further reduce this overlap yield only modest improvements on some tasks, they highlight the potential of developing methods that more reliably optimize the reasoning-safety tradeoff.
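The low-rank restriction described in the abstract can be illustrated with a minimal sketch. It assumes the standard LoRA formulation W' = W + (alpha / r) · B · A, which the abstract does not spell out; the function names, the scaling factor `alpha`, and the matrix sizes are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of a rank-r LoRA update, assuming the standard formulation
# W' = W + (alpha / r) * B @ A. All names and sizes here are illustrative.

def lora_param_count(d_in, d_out, r):
    """Trainable parameters of one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def full_param_count(d_in, d_out):
    """Trainable parameters when the whole (d_out x d_in) matrix is updated."""
    return d_in * d_out


def lora_delta(B, A, alpha=1.0):
    """Return the weight update (alpha / r) * B @ A as nested lists."""
    r = len(A)  # rank = number of rows of A = number of columns of B
    scale = alpha / r
    return [
        [scale * sum(B[i][k] * A[k][j] for k in range(r)) for j in range(len(A[0]))]
        for i in range(len(B))
    ]


# Hypothetical projection size, not taken from the paper.
d_in, d_out = 4096, 11008
rank1_params = lora_param_count(d_in, d_out, r=1)
full_params = full_param_count(d_in, d_out)

# Rank-1 update for a tiny 2 x 3 matrix: every row is a multiple of A's single row.
delta = lora_delta(B=[[1.0], [2.0]], A=[[3.0, 4.0, 5.0]])
```

With these hypothetical sizes, a rank-1 adapter trains 15,104 parameters for the projection, against 45,088,768 when the full matrix is updated, which gives a rough sense of why low-rank updates are cheap and confined to a small subspace.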

Paper Structure

This paper contains 24 sections, 1 equation, 9 figures.

Figures (9)

  • Figure 1: We compute the stable rank of the difference between the full-model fine-tuned model’s weights and those of the original DeepSeek-R1-Distill-Qwen-14B for each layer. Here, the colors indicate the module types, and the x-axis shows the layer indices. We observe that the stable rank is quite high, ranging from around 40 to 100 for most layers.
  • Figure 2: LoRA bypasses the "Safety Tax", achieving safety comparable to that of the full-model fine-tuned model and reasoning performance comparable to the original reasoning model. We plot reasoning performance, measured by Pass@1, against safety scores for different models. For the fine-tuned models, we report results for checkpoints at all epochs. Results on the base versions of HumanEval and MBPP are provided in Figure \ref{fig:lora_vs_full_code_base} in the Appendix, where the same patterns hold, but with higher accuracy.
  • Figure 3: In (a) and (b), we show reasoning and safety performance at different LoRA ranks $r$ for the 14B model, respectively. Full-model fine-tuning is included as the rightmost point for reference. Reasoning performance decreases as $r$ increases, while safety first decreases and then increases. Overall, very low ranks are recommended, and $r=1$ already achieves the best performance in both metrics. In (c), we visualize the Pareto frontiers of the reasoning–safety tradeoff when the training epoch is varied and observe that $r=1$ is sufficient to achieve an excellent tradeoff, outperforming other ranks, especially on AIME.
  • Figure 4: Applying LoRA to MLP modules alone is sufficient. Faded points indicate non–Pareto-frontier points.
  • Figure 5: We compare applying LoRA to different projections within the MLP layers. The results show that applying it only to the up projection achieves the best tradeoff, and even outperforms applying it to the full MLP on the coding benchmarks.
  • ...and 4 more figures
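
The stable rank reported in Figure 1 can be sketched as follows. The page does not state the formula, so this assumes the common definition, the squared Frobenius norm divided by the squared spectral norm; the function name and the test matrices are illustrative and do not come from the paper.

```python
import numpy as np


def stable_rank(M):
    """Stable rank under the assumed definition ||M||_F^2 / ||M||_2^2."""
    fro_sq = np.linalg.norm(M, "fro") ** 2   # sum of squared singular values
    spec_sq = np.linalg.norm(M, 2) ** 2      # largest singular value, squared
    return float(fro_sq / spec_sq)


# Illustrative inputs: a rank-1 outer product and a 4 x 4 identity matrix.
rng = np.random.default_rng(0)
u = rng.standard_normal((8, 1))
v = rng.standard_normal((1, 6))
rank1_delta = u @ v          # has a single nonzero singular value
identity_delta = np.eye(4)   # has four equal singular values
```

Under this definition a rank-1 matrix has stable rank 1 and an n × n identity has stable rank n, so values of 40 to 100 for the full fine-tuning weight difference indicate an update spread over many directions. In practice `M` would be the difference between a fine-tuned weight matrix and the original one for a given layer and module.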