The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features

Jeremias Ferrao, Matthijs van der Lende, Ilija Lichkovski, Clement Neo

TL;DR

This work proposes Feature Steering with Reinforcement Learning (FSRL), a framework that aligns frozen LLMs by training a lightweight adapter to steer interpretable sparse features from Sparse Autoencoders (SAEs). The authors prove that FSRL’s activation-space corrections are functionally equivalent to a restricted LoRA update and validate the method on UltraFeedback. Because the adapter acts on interpretable features, the learned alignment pressures can be analyzed mechanistically at the feature level. A causal analysis reveals that preference optimization often relies on stylistic and formatting cues as proxies for quality, a phenomenon termed style hacking, which can degrade generation coherence; ablating style features partially mitigates this degradation. The framework offers an interpretable, auditable alternative to full fine-tuning and model-diffing, enabling targeted diagnostics and more controllable alignment workflows. Overall, FSRL demonstrates how decomposing alignment into sparse, interpretable features can illuminate the mechanics of preference optimization and guide improvements in future systems.

Abstract

Prevailing alignment methods induce opaque parameter changes, obscuring what models truly learn. To address this, we introduce Feature Steering with Reinforcement Learning (FSRL), a framework that trains a lightweight adapter to steer model behavior by modulating interpretable sparse features. First, we theoretically demonstrate that this mechanism is expressive enough to approximate the behavioral shifts of post-training processes. We then apply FSRL to preference optimization and perform a causal analysis of the learned policy. Our analysis reveals a crucial insight: the model learns to reward stylistic presentation as a proxy for quality, disproportionately relying on features related to style and formatting over those tied to alignment concepts like honesty. By effectively optimizing the preference objective, FSRL serves as a transparent proxy for observing the alignment process. Overall, FSRL offers an interpretable control interface and a practical way to diagnose how preference optimization pressures manifest at the feature level.

Paper Structure

This paper contains 66 sections, 28 equations, 5 figures, and 17 tables.

Figures (5)

  • Figure 1: The FSRL Framework for Interpretable Alignment. (a) FSRL Architecture: At a given layer, the original activation vector is processed by a trainable adapter. The adapter outputs a sparse vector of steered features, which are transformed by a frozen SAE decoder into a correction vector. This correction is added to the original activation to steer the model's behavior. (b) Application for Mechanistic Insight: FSRL replaces opaque alignment processes with a transparent one by learning a policy over a basis of interpretable, monosemantic SAE features. This allows the learned alignment pressures to be decomposed into concrete actions on meaningful concepts.
  • Figure 2: Results of the two-stage hyperparameter sweep for the Gemma-2-2B model. Top Row: Sparsity sweep performed on layer 12, showing the trade-off between final SimPO validation loss (left) and the resulting $\ell_0$ norm of the steering vector (right) for different $\alpha$ penalty coefficients. Bottom Row: Layer sweep showing the final SimPO validation loss (left) and $\ell_0$ norm (right) when intervening at different model depths (layers 6, 12, 18, 24).
  • Figure 3: SimPO training and validation loss curves for our adapters of Gemma-2-2B-it (left) and Gemma-2-9B-it (right). Both models exhibit stable convergence, effectively minimizing the preference loss over the course of training.
  • Figure 4: Comparison of static vs. dynamic steering performance. The blue line traces the validation loss for a static steering policy that activates a fixed top-k% of features, plotted on a logarithmic x-axis with sparsity levels doubled at each step from 0.1% to 12.8%. Within the tested range, this heuristic performs best at 1.60% sparsity (loss of 2.69). The isolated purple point shows the performance of our learned dynamic policy, which achieves a lower loss (2.60) with a much smaller average activation of only 0.55%, demonstrating the clear efficiency benefit of a learned, context-dependent approach.
  • Figure 5: Distribution of steered feature usage across the validation set. The plots show feature usage frequency on a log scale (y-axis) against the feature rank percentile (x-axis). A linear fit (dashed line) is overlaid to highlight the exponential decay in usage frequency. This distribution is shown for three contexts: activations from prompt tokens only, from prompt and chosen response tokens, and from prompt and rejected response tokens.
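
The data flow described in the Figure 1(a) caption can be sketched in a few lines of dependency-free Python. This is only an illustrative sketch, not the authors' implementation: the adapter is assumed here to be a single affine map followed by a soft threshold (a simple stand-in for whatever sparsity mechanism the paper uses), and all names (`fsrl_steer`, `w_adapter`, `b_adapter`, `w_dec`, `threshold`) and the toy dimensions are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def soft_threshold(values, threshold):
    """Zero out entries with magnitude <= threshold; shrink the rest toward zero.

    ASSUMPTION: a stand-in sparsifier chosen for illustration only.
    """
    out = []
    for x in values:
        if abs(x) <= threshold:
            out.append(0.0)
        else:
            out.append(x - threshold if x > 0 else x + threshold)
    return out


def fsrl_steer(h, w_adapter, b_adapter, w_dec, threshold):
    """Steer one activation vector at one layer and token position.

    h          -- original activation (length d_model)
    w_adapter  -- trainable adapter weights (n_features x d_model), assumed affine
    b_adapter  -- trainable adapter bias (length n_features)
    w_dec      -- frozen SAE decoder (d_model x n_features)
    Returns (steered activation, sparse steered-feature vector).
    """
    # Trainable adapter: activation -> sparse vector of steered features.
    pre = [p + b for p, b in zip(matvec(w_adapter, h), b_adapter)]
    features = soft_threshold(pre, threshold)
    # Frozen SAE decoder: steered features -> correction in activation space.
    correction = matvec(w_dec, features)
    # The correction is added to the original activation.
    steered = [a + c for a, c in zip(h, correction)]
    return steered, features


# Toy example: d_model = 2, n_features = 3 (values chosen by hand).
h = [1.0, -2.0]
w_adapter = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.1]]
b_adapter = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]

steered, features = fsrl_steer(h, w_adapter, b_adapter, w_dec, threshold=0.5)
l0 = sum(1 for f in features if f != 0.0)  # number of active steered features
print(features)  # [0.5, -1.5, 0.0]
print(steered)   # [1.5, -3.5]
print(l0)        # 2
```

Only `w_adapter` and `b_adapter` would be trained in this sketch; the base model and `w_dec` stay fixed, which is what lets each nonzero entry of `features` be read as an action on a single SAE feature. The count `l0` corresponds to the $\ell_0$ norm of the steering vector reported in Figure 2.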