Safety Subspaces are Not Linearly Distinct: A Fine-Tuning Case Study

Kaustubh Ponkshe, Shaan Shah, Raghav Singhal, Praneeth Vepakomma

TL;DR

The paper investigates whether safety alignment in LLMs can be confined to linearly separable weight or activation subspaces. It introduces a formal framework to decompose updates into alignment/safety and task-specific components, then analyzes these via top-$k$ subspaces, energy-kept ratios $E_k$, and Mode Subspace Overlap (MSO) across five open-source models. In weight space, it finds that subspaces derived from alignment or safety updates amplify both safe and useful behaviors, with no distinct safety-specific subspace emerging; activation analyses likewise reveal overlapping representations rather than dedicated safety regions. The results imply fundamental limits for subspace-based defenses and call for alternative strategies to preserve safety under continued training, as safety and general learning appear deeply entangled in high-impact, shared subspaces.

Abstract

Large Language Models (LLMs) rely on safety alignment to produce socially acceptable responses. However, this behavior is known to be brittle: further fine-tuning, even on benign or lightly contaminated data, can degrade safety and reintroduce harmful behaviors. A growing body of work suggests that alignment may correspond to identifiable directions in weight space, forming subspaces that could, in principle, be isolated or preserved to defend against misalignment. In this work, we conduct a comprehensive empirical study of this perspective. We examine whether safety-relevant behavior is concentrated in specific linear subspaces, whether it can be separated from general-purpose learning, and whether harmfulness arises from distinguishable patterns in activations. Across both weight and activation spaces, our findings are consistent: subspaces that amplify safe behaviors also amplify useful ones, and prompts with different safety implications activate overlapping representations. Rather than residing in distinct directions, we show that safety is highly entangled with the general learning components of the model. This suggests that subspace-based defenses face fundamental limitations and underscores the need for alternative strategies to preserve safety under continued training. We corroborate these findings with multiple experiments on five open-source LLMs from the Llama and Qwen families. Our code is publicly available at: https://github.com/CERT-Lab/safety-subspaces.

Paper Structure

This paper contains 43 sections, 9 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: The base model $W_0$ is aligned/safety-tuned to produce the model $W_{A/S}$. Step 1: The difference $\Delta_{A/S} = W_{A/S} - W_0$ defines an alignment/safety-specific direction, from which projection matrices $P_k$ (top-$k$ subspace) and $P_k^\perp$ (orthogonal subspace) are derived. $W_{A/S}$ is then fine-tuned on three datasets (helpful, harmful, and contaminated) to yield $W_{\text{useful}}$, $W_{\text{harmful}}$, and $W_{\text{contaminated}}$, with updates $\Delta_{t_j}$. Step 2: Project $\Delta_{t_j}$ using $P_k$ and $P_k^\perp$, and add the result back to $W_{A/S}$ to obtain projected models for evaluation. In addition, SVD is performed on the task-specific updates, and the Mode Subspace Overlap (MSO) is computed between the top-$k$ singular vectors.
  • Figure 2: Parallel projection-based update schemes across varying SVD fractions. We report the energy-kept ratio for models fine-tuned on Full Useful and Full Harmful data, utility for models fine-tuned on Full Useful, and harmfulness for models fine-tuned on Full Harmful.
  • Figure 3: Parallel projection-based update schemes across varying SVD fractions. We report the energy-kept ratio for models fine-tuned on Full Useful, Full Harmful and Contaminated data; and utility and harmfulness for models fine-tuned on Contaminated.
  • Figure 4: Mode Subspace Overlap (MSO) at the 70th- and 85th-percentile layers for pairwise comparisons of the dominant subspaces from Harmful fine-tuned (H), Aligned (A), and Base (B) models.
  • Figure 5: Average Mode Subspace Overlap (MSO) across layers in the 65–90% depth range for pairwise comparisons of activations from Useful (U) and multiple Harmful (H1, H2) prompt sets.
  • ...and 4 more figures
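
The two-step procedure in the Figure 1 caption can be sketched in NumPy for a single weight matrix. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the projectors are built from the left singular vectors of $\Delta_{A/S}$ and applied by left-multiplication, the energy-kept ratio is taken to be a ratio of Frobenius norms, and MSO is taken to be the squared Frobenius norm of the product of the two top-$k$ bases divided by $k$. The paper's exact definitions may differ; the released code at the URL in the abstract is the authoritative reference.

```python
import numpy as np

def top_k_projectors(delta_align, k):
    """Build P_k and P_k^perp from the top-k left singular vectors of delta_align.
    Assumption: the subspace is spanned by left singular vectors."""
    U, _, _ = np.linalg.svd(delta_align, full_matrices=False)
    U_k = U[:, :k]
    P_k = U_k @ U_k.T                                   # projector onto top-k subspace
    P_perp = np.eye(delta_align.shape[0]) - P_k         # projector onto its complement
    return P_k, P_perp

def energy_kept(P, delta_task):
    """Assumed definition: share of the update's Frobenius norm that survives projection."""
    return np.linalg.norm(P @ delta_task) / np.linalg.norm(delta_task)

def mode_subspace_overlap(A, B, k):
    """Assumed definition: ||U_A^T U_B||_F^2 / k over the top-k left singular vectors."""
    U_a = np.linalg.svd(A, full_matrices=False)[0][:, :k]
    U_b = np.linalg.svd(B, full_matrices=False)[0][:, :k]
    return np.linalg.norm(U_a.T @ U_b) ** 2 / k

# Toy stand-ins for one layer's weights (random data, purely illustrative).
rng = np.random.default_rng(0)
d, k = 16, 4
W_0 = rng.normal(size=(d, d))                # base model weights
W_AS = W_0 + 0.1 * rng.normal(size=(d, d))   # aligned / safety-tuned weights
delta_AS = W_AS - W_0                        # Step 1: alignment direction
delta_task = 0.1 * rng.normal(size=(d, d))   # update from further fine-tuning

P_k, P_perp = top_k_projectors(delta_AS, k)

# Step 2: keep only the in-subspace or out-of-subspace part of the task update.
W_parallel = W_AS + P_k @ delta_task
W_orthogonal = W_AS + P_perp @ delta_task

E_k = energy_kept(P_k, delta_task)
mso = mode_subspace_overlap(delta_AS, delta_task, k)
```

In this sketch, `W_parallel` and `W_orthogonal` play the role of the projected models that are evaluated for utility and harmfulness, while `E_k` and `mso` are the diagnostic quantities reported in the figures.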