Householder Pseudo-Rotation: A Novel Approach to Activation Editing in LLMs with Direction-Magnitude Perspective
Van-Cuong Pham, Thien Huu Nguyen
TL;DR
This paper introduces Householder Pseudo-Rotation (HPR), a norm-preserving activation-editing method for LLMs that adopts a direction-magnitude view of activation spaces. For each edited layer, HPR learns a global separating hyperplane with a linear probe and predicts a rotation angle with an angle predictor. It then reflects negative activations across the hyperplane and moves them toward desirable directions by the predicted angle, approximating a rotation while preserving activation magnitudes. Empirical results on TruthfulQA and safety-related benchmarks show that HPR outperforms steering-vector baselines and other editing methods, and ablations underscore the importance of the angle predictor and the magnitude-preserving property. The work demonstrates improved truthfulness, reduced bias and toxicity, and maintained generation quality, while offering practical efficiency via shared hyperplanes and regression-based angle computation. Overall, HPR advances activation editing by using a direction-magnitude perspective to achieve robust and scalable behavioral edits in LLMs.
Abstract
Activation Editing, which involves directly editing the internal representations of large language models (LLMs) to alter their behaviors and achieve desired properties, has emerged as a promising area of research. Existing works primarily treat LLMs' activations as points in space and modify them by adding steering vectors. However, this approach is limited in its ability to achieve greater performance improvement while maintaining the necessary consistency of activation magnitudes. To overcome these issues, we propose a novel editing method that views activations in terms of their directions and magnitudes. Our method, named Householder Pseudo-Rotation (HPR), mimics the rotation transformation, thus preserving activation norms and yielding improved performance on various safety benchmarks.
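
The contrast between adding a steering vector and a norm-preserving edit can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a hyperplane through the origin with a hypothetical normal vector `w` (standing in for a linear probe's weights) and a hand-picked `angle` (standing in for the angle predictor's output). The function names `householder_reflect` and `pseudo_rotate` are illustrative, and the interpolation used here is the standard spherical form for two equal-norm vectors, chosen as an assumption about how a reflection can be turned into a rotation-like edit.

```python
import numpy as np

def householder_reflect(a, w):
    """Reflect a across the hyperplane through the origin with normal w."""
    w = w / np.linalg.norm(w)
    return a - 2.0 * np.dot(w, a) * w

def pseudo_rotate(a, w, angle):
    """Move a by `angle` radians toward its reflection, keeping its norm.

    Works in the plane spanned by a and its Householder reflection r.
    Since |r| == |a|, the spherical interpolation below preserves the norm.
    """
    r = householder_reflect(a, w)
    cos_t = np.dot(a, r) / (np.linalg.norm(a) * np.linalg.norm(r))
    theta = np.arccos(np.clip(cos_t, -1.0, 1.0))  # angle between a and r
    if np.sin(theta) < 1e-8:
        # a lies on the hyperplane or along its normal: the plane is degenerate
        return a.copy()
    return (np.sin(theta - angle) * a + np.sin(angle) * r) / np.sin(theta)

# Toy activation and hypothetical probe normal (assumed values)
a = np.array([3.0, 4.0, 0.0])
w = np.array([0.0, 1.0, 0.0])

steered = a + 4.0 * w               # steering-vector edit: norm changes
rotated = pseudo_rotate(a, w, 0.5)  # rotation-like edit: norm is kept

print("original norm:", np.linalg.norm(a))
print("steered norm: ", np.linalg.norm(steered))
print("rotated norm: ", np.linalg.norm(rotated))
```

In this toy setting the steering-vector edit changes the activation's length, whereas the rotation-like edit changes only its direction, which is the property the abstract attributes to HPR.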
