Curveball Steering: The Right Direction To Steer Isn't Always Linear

Shivam Raval, Hae Jin Song, Linlin Wu, Abir Harrasse, Jeff M. Phillips, Amirali Abdullah

TL;DR

We propose Curveball steering, a nonlinear steering method based on polynomial kernel PCA that performs interventions in a feature space and thereby better respects the learned activation geometry. The results suggest that geometry-aware, nonlinear steering provides a principled alternative to global, linear interventions.

Abstract

Activation steering is a widely used approach for controlling large language model (LLM) behavior by intervening on internal representations. Existing methods largely rely on the Linear Representation Hypothesis, assuming behavioral attributes can be manipulated using global linear directions. In practice, however, such linear interventions often behave inconsistently. We question this assumption by analyzing the intrinsic geometry of LLM activation spaces. Measuring geometric distortion via the ratio of geodesic to Euclidean distances, we observe substantial and concept-dependent distortions, indicating that activation spaces are not well-approximated by a globally linear geometry. Motivated by this, we propose "Curveball steering", a nonlinear steering method based on polynomial kernel PCA that performs interventions in a feature space, better respecting the learned activation geometry. Curveball steering consistently outperforms linear PCA-based steering, particularly in regimes exhibiting strong geometric distortion, suggesting that geometry-aware, nonlinear steering provides a principled alternative to global, linear interventions.
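
To make the idea of intervening in a polynomial kernel PCA feature space concrete, the following is a minimal sketch and not the paper's implementation. It assumes scikit-learn's `KernelPCA` with its learned (approximate) pre-image map, and it steers by adding the difference of class means in the kernel PCA coordinates. The function name `kpca_steer_sketch` and the parameters `strength`, `n_components`, `degree`, and `ridge` are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.decomposition import KernelPCA


def kpca_steer_sketch(acts_neg, acts_pos, x, strength=1.0,
                      n_components=8, degree=2, ridge=1e-3):
    """Illustrative nonlinear steering in polynomial kernel PCA space.

    acts_neg, acts_pos: (n, d) activations for the two classes.
    x: (m, d) activations to steer from the negative toward the positive class.
    All names and defaults here are assumptions, not taken from the paper.
    """
    train = np.vstack([acts_neg, acts_pos])
    kpca = KernelPCA(
        n_components=n_components,
        kernel="poly",
        degree=degree,
        coef0=1.0,
        fit_inverse_transform=True,  # learn an approximate pre-image map
        alpha=ridge,                 # ridge penalty of that pre-image map
    )
    kpca.fit(train)

    # Steering direction: difference of class means in kernel PCA coordinates.
    direction = (kpca.transform(acts_pos).mean(axis=0)
                 - kpca.transform(acts_neg).mean(axis=0))

    # A straight step in feature space ...
    z_steered = kpca.transform(x) + strength * direction

    # ... maps back to a generally nonlinear displacement in activation space.
    return kpca.inverse_transform(z_steered)
```

Because the pre-image is only approximate, `strength=0.0` returns a reconstruction of `x` rather than `x` itself; a practical variant might add only the difference between the steered and unsteered reconstructions to the original activation.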

Paper Structure (50 sections, 31 equations, 18 figures, 3 tables, 1 algorithm)

Figures (18)

  • Figure 1: Overview of Curveball steering and empirical results. Through the nonlinear mapping $\phi$, a linear path between Classes A and B in kernel space corresponds to a nonlinear trajectory in the original activation space; this is our Curveball steering method. We evaluate two models (Llama-3.2-1B-It and Phi-3.5-mini-It) on safety-related behavioral and linguistic trait steering tasks. Curveball steering consistently outperforms linear steering across multiple behavioral attributes (top right). The improved performance corresponds to consistently higher curvature in the activation manifolds of the datasets. For open-ended generation steering across different emotional traits, measured as $\Delta$ judge score, Curveball steering shows substantial improvements for many features (bottom right). The examples show a binary choice question where steering influences the model's probability of selecting the power-seeking response, and a general question prompt with a neutral and an enthusiastic response.
  • Figure 2: Evidence of geometric distortions in LLM activation spaces. (a) Illustration of Euclidean distance $d_{\mathrm{Euc}}$ versus geodesic distance $d_{\mathrm{geo}}$ on a curved manifold, motivating the distortion ratio $R = d_{\mathrm{geo}} / d_{\mathrm{Euc}}$. (b) Empirical distributions of distortion ratios computed on LLM activation datasets corresponding to different concepts ("self-awareness", "wealth-seeking", "corrigible-more", and "power-seeking"). To test the Linear Representation Hypothesis of LLM activations (Park et al., 2024), we learn latent manifolds and associated Riemannian metrics using pullback metrics from ensembles of variational autoencoders (Syrota et al., 2024; Arvanitidis et al., 2021) (Sec. \ref{sec:geometry_motivation} and Appendix \ref{app:vae-geometry}), and estimate geometric distortion as the ratio $d_{\mathrm{geo}} / d_{\mathrm{Euc}}$ over randomly sampled activation pairs. A Euclidean (locally linear and isometric) activation space would concentrate near $R = 1$ (dashed line). Consistent deviations of $R$ from $1$ provide quantitative evidence that straight-line interpolation does not preserve intrinsic distances, rejecting the linearity hypothesis for LLM activation spaces. These results motivate geometry-aware, nonlinear steering methods that respect their manifold structure, in contrast to global linear directions such as PCA-based steering.
  • Figure 3: Curveball steering is most effective for high curvature manifolds. We create synthetic datasets where the curvature is parametrized by $\kappa \in \{0.1, 1.0, 5.0, 10.0, 20.0\}$. As curvature increases, the distortion metric (geodesic-to-Euclidean distance ratio) increases (bottom). Performance comparison heatmap showing the difference in target distance ($\Delta$ Target distance) between Curveball and linear steering. Blue regions indicate that Curveball achieves lower distance to the target class centroid (better steering effectiveness). Curveball consistently outperforms linear steering in high-curvature regimes ($\kappa > 8$). Manifold tangent space deviation comparison showing $\Delta$ Tangent space deviation between methods. Blue regions indicate that Curveball maintains lower deviation from the local tangent space of the learned manifold than linear steering.
  • Figure 4: Steering response curves across behavioral concepts show Curveball steering achieves stronger behavioral control. We measure steerability as the probability of selecting the behavior-matching answer option for four behavioral concepts: corrigible (blue), wealth-seeking (teal), power-seeking (purple), and self-awareness (red), for Llama-3.2-1B-Instruct (dashed lines) and Phi-3.5-mini-Instruct (solid lines). Curveball steering achieves substantial behavioral shifts across most concepts, while linear steering shows weaker control.
  • Figure 5: Curveball steering induces locally adaptive steering trajectories in the activation space: a case study with the Corrigible dataset. Cosine similarity between the global linear steering vector and steering vectors computed on each cluster (obtained via k-means on the negative-label activations) indicates that optimal local steering directions deviate substantially from the global linear direction. Curveball steering adapts the magnitude of steering in ambient activation space despite uniform steering strength in latent KPCA space, as evidenced by a wide spread in steering magnitudes. 2D PCA projection of subcluster-specific steering vectors (blue squares) versus the global linear steering vector (black star). Subcluster vectors form distinct clusters far from the global direction, demonstrating that different activation regions require different steering directions. Analysis of point-wise KPCA steering displacements via local perturbations. Left: The distribution of cosine similarities to the global mean-difference direction shows bimodal structure with high variance, indicating diverse local steering directions. Right: 2D projection onto the global PCA direction and the dominant orthogonal direction confirms the multi-modal nature of Curveball steering, which adapts to local manifold geometry.
  • ...and 13 more figures
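
The distortion ratio $R = d_{\mathrm{geo}} / d_{\mathrm{Euc}}$ from Figure 2 can be illustrated with a small sketch. Note the substitution: the paper estimates geodesics from VAE pullback metrics, whereas the code below uses an Isomap-style approximation (shortest paths on a k-nearest-neighbor graph), chosen here only because it is short and self-contained. The function name `distortion_ratios` and its parameters are hypothetical.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph


def distortion_ratios(points, n_neighbors=10, n_pairs=200, seed=0):
    """Estimate R = d_geo / d_Euc over randomly sampled point pairs.

    Geodesic distances are approximated by shortest paths on a k-NN graph
    (an assumption of this sketch, not the paper's VAE-based estimator).
    """
    # Sparse graph whose edge weights are Euclidean distances to neighbors.
    graph = kneighbors_graph(points, n_neighbors=n_neighbors, mode="distance")

    # All-pairs graph geodesics (Dijkstra), treating edges as undirected.
    geo = shortest_path(graph, method="D", directed=False)

    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(points), size=n_pairs)
    j = rng.integers(0, len(points), size=n_pairs)

    # Drop identical pairs and pairs in disconnected graph components.
    keep = (i != j) & np.isfinite(geo[i, j])
    i, j = i[keep], j[keep]

    euc = np.linalg.norm(points[i] - points[j], axis=1)
    return geo[i, j] / euc


# Usage: points on a half circle are curved, points on a line are flat.
theta = np.linspace(0.0, np.pi, 200)
arc = np.column_stack([np.cos(theta), np.sin(theta)])
line = np.column_stack([np.linspace(0.0, 1.0, 200), np.zeros(200)])
arc_ratios = distortion_ratios(arc)
line_ratios = distortion_ratios(line)
```

On the flat line the ratios stay at 1 up to floating-point error, while on the arc they exceed 1 for distant pairs, which is the qualitative signature the figure uses to argue against a globally linear geometry.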