The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Analysis of Orthogonal Safety Directions

Wenbo Pan, Zhichao Liu, Qiguang Chen, Xiangyang Zhou, Haining Yu, Xiaohua Jia

TL;DR

This work reframes LLM safety alignment as a multi-dimensional problem, revealing a low-rank Safety Residual Space in which orthogonal feature directions jointly govern refusal behavior. The dominant direction is the main predictor of refusal, while non-dominant directions encode indirect safety features such as jailbreak patterns; these secondary directions can also influence the dominant signal and expose a vulnerability through trigger tokens. Using Partial Layer-wise Relevance Propagation (PLRP), the authors interpret these directions in terms of training tokens and study their layer-wise dynamics, showing a developmental trajectory across layers, with safety semantics stabilizing in later layers. Experiments on Llama 3.1-8B-Instruct with SSFT and DPO show that manipulating non-dominant directions or removing trigger tokens can alter the model’s safety behavior, and that Trigger Removal attacks remain surprisingly resilient to standard safety fine-tuning. The findings offer practical guidance for designing more robust alignment and highlight the need to account for multi-directional, potentially spurious correlations in safety datasets and model updates.

Abstract

Large Language Models' safety-aligned behaviors, such as refusing harmful queries, can be represented by linear directions in activation space. Previous research modeled safety behavior with a single direction, limiting mechanistic understanding to an isolated safety feature. In this work, we discover that safety-aligned behavior is jointly controlled by multi-dimensional directions. Namely, we study the vector space of representation shifts during safety fine-tuning on Llama 3 8B for refusing jailbreaks. By studying orthogonal directions in the space, we first find that a dominant direction governs the model's refusal behavior, while multiple smaller directions represent distinct and interpretable features like hypothetical narrative and role-playing. We then measure how different directions promote or suppress the dominant direction, showing the important role of secondary directions in shaping the model's refusal representation. Finally, we demonstrate that removing certain trigger tokens in harmful queries can mitigate these directions to bypass the learned safety capability, providing new insights on understanding safety alignment vulnerability from a multi-dimensional perspective. Code and artifacts are available at https://github.com/BMPixel/safety-residual-space.
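
The idea of studying orthogonal directions in the space of representation shifts can be illustrated with a minimal sketch. It assumes the shift is simply the per-prompt difference between one layer's activations after and before safety fine-tuning, and uses a plain SVD on synthetic data; the function name `residual_directions`, the array shapes, and the planted directions are all hypothetical and are not taken from the authors' code, whose exact construction may differ.

```python
import numpy as np

def residual_directions(h_base, h_tuned):
    """Orthonormal directions spanning the activation shifts of one layer.

    h_base, h_tuned: arrays of shape (n_prompts, d_model) with activations
    for the same prompts before and after fine-tuning (assumed layout).
    """
    shifts = h_tuned - h_base  # one representation shift per prompt
    # Rows of vt are orthonormal directions, sorted by singular value.
    _, singular_values, vt = np.linalg.svd(shifts, full_matrices=False)
    return singular_values, vt

# Synthetic stand-in for real activations (illustration only).
rng = np.random.default_rng(0)
n_prompts, d_model = 200, 64
h_base = rng.normal(size=(n_prompts, d_model))

# Plant three orthonormal directions: one strong, two weaker.
planted, _ = np.linalg.qr(rng.normal(size=(d_model, 3)))
strengths = np.array([10.0, 3.0, 1.0])
coeffs = rng.normal(size=(n_prompts, 3)) * strengths
h_tuned = h_base + coeffs @ planted.T

singular_values, directions = residual_directions(h_base, h_tuned)
dominant = directions[0]

# Absolute cosine between the recovered and the planted dominant direction.
alignment = abs(float(dominant @ planted[:, 0]))
```

In this toy setup the shifts have rank three by construction, so only the first three singular values are meaningfully non-zero, and the first recovered direction aligns closely with the strongest planted one. On real model activations the spectrum would decay gradually rather than drop to zero.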

Paper Structure

This paper contains 57 sections, 1 theorem, 6 equations, 14 figures, 8 tables, 1 algorithm.

Key Result

Corollary 3.3

The safety residual space is the span of feature directions developed during safety training.

Figures (14)

  • Figure 1: Illustration of the Safety Residual Space. The safety residual space is the linear span of representation shifts during safety fine-tuning. In our experiments, the dominant direction predicts safety behavior, while non-dominant directions capture different indirect safety features.
  • Figure 2: Effective rank of the residual space by layer.
  • Figure 3: Model output prediction accuracy by layer.
  • Figure 4: Intervention results after removing the direction of the 6th component of layer 14 (L14-C6) from the hidden states during generation. L14-C6 is identified as representing the specific ability to recognize the PAIR Attack. Additionally, we remove the dominant direction (L25-C1), which completely eliminates the fine-tuned model's ability to refuse. In comparison, L14-C4 and L14-C3 also affect model behavior but do not exhibit clear selectiveness.
  • Figure 5: Top 3: Adjacent layer relevance scores among top directions. Rel Comp 1: relevance scores to first component in next layer. Bottom: Log-likelihood of predicting aligned behavior with different directions.
  • ...and 9 more figures
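
Two operations mentioned in the captions above, measuring the effective rank of a residual space (Figure 2) and removing a direction from the hidden states (Figure 4), can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: effective rank is computed here as the exponential of the Shannon entropy of the normalized singular values, which is one common definition and may differ from the one used in the paper, and the helper names `effective_rank` and `ablate_direction` are hypothetical.

```python
import numpy as np

def effective_rank(singular_values, eps=1e-12):
    """Entropy-based effective rank (assumed definition, see lead-in)."""
    p = singular_values / (singular_values.sum() + eps)
    p = p[p > eps]  # drop zero entries so log() stays finite
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))

def ablate_direction(hidden, direction):
    """Remove the component of `hidden` along `direction`.

    hidden: array of shape (..., d_model); direction: array of shape (d_model,).
    """
    unit = direction / np.linalg.norm(direction)
    projection = (hidden @ unit)[..., None] * unit
    return hidden - projection

# Toy hidden states: two token positions, three dimensions.
hidden = np.array([[3.0, 4.0, 0.0],
                   [1.0, 0.0, 2.0]])
direction = np.array([0.0, 2.0, 0.0])  # need not be unit length

ablated = ablate_direction(hidden, direction)

# Four equal singular values give an effective rank of about 4;
# a single non-zero singular value gives about 1.
flat = effective_rank(np.array([1.0, 1.0, 1.0, 1.0]))
peaked = effective_rank(np.array([1.0, 0.0, 0.0, 0.0]))
```

After ablation the hidden states have no component along the removed direction, while everything orthogonal to it is left untouched. In an actual intervention this projection would be applied to a chosen layer's hidden states at every generation step, for example through a forward hook.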

Theorems & Definitions (2)

  • Definition 3.1: Safety Residual Space
  • Corollary 3.3