
Householder Pseudo-Rotation: A Novel Approach to Activation Editing in LLMs with Direction-Magnitude Perspective

Van-Cuong Pham, Thien Huu Nguyen

TL;DR

This paper introduces Householder Pseudo-Rotation (HPR), a norm-preserving activation-editing method for LLMs that adopts a direction-magnitude view of activation spaces. HPR learns one global separating hyperplane per edited layer with a linear probe and predicts rotation angles with an angle predictor. It reflects negative activations through the hyperplane and then adjusts the reflections toward desirable directions, approximating a rotation while preserving activation magnitudes. Empirical results on TruthfulQA and safety-related benchmarks show that HPR outperforms steering-vector baselines and other editing methods, and ablations underscore the importance of the angle predictor and the magnitude-preserving property. The work reports improved truthfulness, reduced bias and toxicity, and maintained generation quality, while offering practical efficiency via shared hyperplanes and regression-based angle computation.
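The reflect-then-adjust idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the hyperplane passes through the origin with normal `w` (standing in for the linear probe's weight vector), an activation counts as negative when `w . a < 0`, the angle is passed in directly instead of coming from a learned angle predictor, and the adjustment step uses spherical interpolation between the activation and its reflection. The names `householder_reflect`, `angle_between`, and `pseudo_rotate` are hypothetical.

```python
import numpy as np


def householder_reflect(a, w):
    """Reflect `a` through the hyperplane through the origin with normal `w`."""
    w_unit = w / np.linalg.norm(w)
    return a - 2.0 * np.dot(w_unit, a) * w_unit


def angle_between(u, v):
    """Angle in radians between two nonzero vectors."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))


def pseudo_rotate(a, w, angle):
    """Turn a negative activation by `angle` radians toward its reflection.

    Assumption: "negative" means w . a < 0; other activations are
    returned unchanged.
    """
    if np.dot(w, a) >= 0:
        return a.copy()
    r = householder_reflect(a, w)
    full = angle_between(a, r)
    if np.isclose(np.sin(full), 0.0):
        return r  # a is parallel to the normal, so the reflection is -a
    # Spherical interpolation between a and r; both have the same norm,
    # so the result keeps that norm as well.
    return (np.sin(full - angle) * a + np.sin(angle) * r) / np.sin(full)


# Toy data: a random hyperplane normal and one activation on its negative side.
rng = np.random.default_rng(0)
w = rng.normal(size=8)
a = rng.normal(size=8)
if np.dot(w, a) > 0:
    a = -a

reflected = householder_reflect(a, w)
target_angle = 0.5 * angle_between(a, reflected)  # stand-in for a predicted angle
edited = pseudo_rotate(a, w, target_angle)

print("norm before:", np.linalg.norm(a))
print("norm after: ", np.linalg.norm(edited))
```

In this sketch the edited vector stays in the plane spanned by the activation and its reflection, so the edit changes direction only and leaves the magnitude untouched.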

Abstract

Activation Editing, which involves directly editing the internal representations of large language models (LLMs) to alter their behaviors and achieve desired properties, has emerged as a promising area of research. Existing works primarily treat LLMs' activations as points in space and modify them by adding steering vectors. However, this approach is limited in its ability to achieve greater performance improvement while maintaining the necessary consistency of activation magnitudes. To overcome these issues, we propose a novel editing method that views activations in terms of their directions and magnitudes. Our method, named Householder Pseudo-Rotation (HPR), mimics the rotation transformation, thus preserving activation norms and resulting in improved performance on various safety benchmarks.
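The magnitude argument can be made concrete with a short derivation. The symbols are illustrative and not taken from the paper: $a$ is an activation, $v$ a steering vector, $\alpha$ a scaling coefficient, and $w$ the normal of a hyperplane through the origin. Adding a steering vector gives

$$
\|a + \alpha v\|^2 = \|a\|^2 + 2\alpha\, a^\top v + \alpha^2 \|v\|^2 ,
$$

which equals $\|a\|^2$ only when $\alpha = 0$ or $2\, a^\top v = -\alpha \|v\|^2$, so in general the norm changes with $\alpha$. A Householder reflection, by contrast, is orthogonal:

$$
H = I - \frac{2\, w w^\top}{w^\top w}, \qquad H^\top H = I - \frac{4\, w w^\top}{w^\top w} + \frac{4\, w (w^\top w) w^\top}{(w^\top w)^2} = I ,
$$

hence $\|H a\| = \|a\|$ for every $a$. The same holds for any rotation, which is why a rotation-like edit keeps activation norms fixed.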

Paper Structure (18 sections, 21 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Comparison of points-in-space view (a) and direction-magnitude view (b). Positive activations are colored green and negative activations are colored red. The editing methods are depicted in blue. Our proposed method (c) approximates the rotation transformation by first reflecting negative activations through a learned separating hyperplane and then adjusting the reflections to reach the right angle.
  • Figure 2: Probe accuracy of HPR-edited Llama2-7B-Chat on TruthfulQA. A linear probe is trained for each layer using positive-negative pairs of the training data and then evaluated on the validation data.
  • Figure 3: The activation norms in $\log_{10}$ scale across $32$ transformer blocks of three popular LLMs. Each box plot represents the norm distribution in a layer of the LLMs.
  • Figure 4: Activation norm distributions of the $14^{th}$ layer of Llama2 before and after being edited. We use the $14^{th}$ layer as it has the highest probe accuracy in Figure 2. Similar trends can be seen for other layers and models.
  • Figure 5: Illustration of the two cases when rotating a vector in a $2$-D plane.
  • ...and 1 more figures