Spherical Steering: Geometry-Aware Activation Rotation for Language Models

Zejia You, Chunyuan Deng, Hanjie Chen

TL;DR

The paper addresses reliable, training-free control of language models at inference time by avoiding the magnitude distortion inherent in additive activation edits. It introduces Spherical Steering, a norm-preserving rotation on the unit hypersphere toward a contrastive truthfulness axis, with a von Mises-Fisher (vMF) confidence gate that makes the steering input-adaptive. Offline prototype construction yields a direction $\mu_T$ that encodes truthfulness, and inference-time rotation along geodesics preserves activation magnitude while sharpening decision boundaries. Across LLaMA-3.1-8B-Instruct and Qwen-2.5-7B-Instruct, the method achieves Pareto improvements in multiple-choice accuracy and open-ended generation quality. Analyses show that truthfulness is encoded directionally and that rotation achieves larger accuracy gains than addition for a comparable drop in representation rank. The work presents a geometry-aware primitive for precise inference-time control and points to multi-layer calibration and per-layer tuning as future directions.

Abstract

Inference-time steering has emerged as a promising paradigm for controlling language models (LMs) without the cost of retraining. However, standard approaches typically rely on activation addition, a geometric operation that inevitably alters the magnitude of hidden representations. This raises concerns about representation collapse and degradation of open-ended generation capabilities. In this work, we explore Spherical Steering, a training-free primitive that resolves this trade-off through activation rotation. Rather than shifting activations with a fixed vector, our method rotates them along a geodesic toward a target direction, guiding the activation toward the target concept while preserving the integrity of the signal. To further enhance adaptivity, we incorporate a confidence gate that dynamically modulates steering strength based on input uncertainty. Extensive experiments across multiple-choice benchmarks demonstrate that Spherical Steering significantly outperforms addition-based baselines (notably by +10% on TruthfulQA, COPA, and Storycloze), while simultaneously maintaining the model's general open-ended generation quality. This work highlights the value of geometric consistency, suggesting that norm-preserving rotation is a robust and effective primitive for precise inference-time control.
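
The contrast between additive shifts and geodesic rotation can be made concrete with a small sketch. The abstract does not give the paper's exact update rule, so the code below is only an illustration that assumes standard spherical linear interpolation (slerp) between the normalized activation and a unit target direction. The function name `spherical_steer` and the parameter `t` (fraction of the angle to rotate) are hypothetical names chosen for this example.

```python
import numpy as np

def spherical_steer(h, mu, t):
    """Rotate h toward direction mu along the great circle, keeping ||h|| fixed.

    Assumption: slerp is used as the geodesic rotation; t in [0, 1] is the
    fraction of the angle between h and mu that is traversed.
    """
    h = np.asarray(h, dtype=float)
    mu = np.asarray(mu, dtype=float)
    norm = np.linalg.norm(h)
    if norm == 0.0:
        return h.copy()  # nothing to rotate
    u = h / norm                      # activation direction on the unit sphere
    m = mu / np.linalg.norm(mu)       # target direction on the unit sphere
    theta = np.arccos(np.clip(u @ m, -1.0, 1.0))
    sin_theta = np.sin(theta)
    if sin_theta < 1e-8:
        # Parallel or antiparallel: the geodesic is trivial or not unique,
        # so this sketch leaves the activation unchanged.
        return h.copy()
    w_u = np.sin((1.0 - t) * theta) / sin_theta
    w_m = np.sin(t * theta) / sin_theta
    # The slerp result has unit norm; rescale to the original magnitude.
    return norm * (w_u * u + w_m * m)

def additive_steer(h, mu, alpha):
    """Baseline for comparison: shift h by a fixed vector (changes the norm in general)."""
    return np.asarray(h, dtype=float) + alpha * np.asarray(mu, dtype=float)
```

In this sketch the rotated activation always has the same $\ell_2$ norm as the input, whereas `additive_steer` generally changes it, which is the magnitude distortion the abstract refers to.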

Paper Structure

This paper contains 50 sections, 19 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Spherical Steering improves both multiple-choice accuracy and generation quality. (a) Our rotation-based intervention moves results toward the upper-right compared to activation addition baselines on LLaMA-3.1-8B-Instruct (circles) and Qwen-2.5-7B-Instruct (triangles). (b) Case study showing that activation addition can be overly conservative, while Spherical Steering produces a more informative and grounded response.
  • Figure 2: Overview of Spherical Steering. Contrastive pairs define a truthfulness axis $(\mu_T,\mu_H)$. We steer by a norm-preserving geodesic rotation of hidden activations toward $\mu_T$, and use a vMF confidence gate to selectively apply steering with input-adaptive strength $t$.
  • Figure 3: Activation-norm analysis on TruthfulQA for LLaMA-3.1-8B-Instruct. Top: mean $\ell_2$ norm of last-token activations at each residual layer for correct (denoted as Truthful) vs. incorrect (denoted as Hallucinated) answers; shaded areas indicate mean$\pm$std. The curves nearly overlap, indicating similar activation magnitudes within each layer. Bottom: $\Delta\text{Norm}$, defined as the mean norm of correct answers minus that of incorrect answers at each layer.
  • Figure 4: Analysis of efficient rank on LLaMA-3.1-8B-Instruct. We sweep the intervention strength for two types of steering mechanisms. Spherical Steering achieves much larger performance gains than addition steering at a comparable rank drop.
  • Figure 5: Ablation of vMF gating on Qwen2.5-7B-Instruct. We sweep intervention strength $\alpha$. Gating improves multiple-choice accuracy while maintaining generation quality.
  • ...and 1 more figures
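
The Figure 2 caption mentions contrastive prototypes $(\mu_T,\mu_H)$ and a vMF confidence gate that sets an input-adaptive strength $t$, but this page does not show the paper's formulas. The sketch below is one possible reading, not the authors' definition. It assumes that prototypes are mean directions of unit-normalized activations, that the gate is the posterior of two vMF components with equal concentration and equal priors (which reduces to a sigmoid of a cosine difference), and that steering strength grows as the truthful posterior shrinks. The names `build_prototypes`, `gate_strength`, `kappa`, and `alpha` are hypothetical.

```python
import numpy as np

def unit(x):
    """Normalize vectors along the last axis to unit length."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def build_prototypes(truthful_acts, hallucinated_acts):
    """Assumed prototype construction: mean direction of each contrastive set."""
    mu_T = unit(unit(truthful_acts).mean(axis=0))
    mu_H = unit(unit(hallucinated_acts).mean(axis=0))
    return mu_T, mu_H

def gate_strength(h, mu_T, mu_H, kappa=20.0, alpha=0.5):
    """Assumed gate: steer more when the activation looks less truthful.

    With equal concentration kappa and equal priors, the vMF normalizers
    cancel and the posterior of the truthful component is
    sigmoid(kappa * (u . mu_T - u . mu_H)).
    """
    u = unit(h)
    logit = kappa * (u @ mu_T - u @ mu_H)
    p_truthful = 1.0 / (1.0 + np.exp(-logit))
    return alpha * (1.0 - p_truthful)  # strength t in [0, alpha]
```

Under these assumptions, an activation already aligned with $\mu_T$ receives a strength near zero, while one aligned with $\mu_H$ receives a strength near `alpha`.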