The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features
Jeremias Ferrao, Matthijs van der Lende, Ilija Lichkovski, Clement Neo
TL;DR
This work proposes Feature Steering with Reinforcement Learning (FSRL), a framework that aligns frozen LLMs by training a lightweight adapter to steer interpretable sparse features from Sparse Autoencoders (SAEs). The authors prove that FSRL's activation-space corrections are functionally equivalent to a restricted LoRA update, validate the method on UltraFeedback, and use it for mechanistic, feature-level analysis of alignment pressures. A causal analysis reveals that preference optimization often relies on stylistic and formatting cues as proxies for quality, a phenomenon the authors term style hacking, which can degrade generation coherence; ablating style features partially mitigates this degradation. The framework offers an interpretable, auditable alternative to full fine-tuning and model diffing, supporting targeted diagnostics and more controllable alignment workflows. Overall, FSRL shows how decomposing alignment into sparse, interpretable features can illuminate the mechanics of preference optimization and guide improvements to future systems.
Abstract
Prevailing alignment methods induce opaque parameter changes, obscuring what models truly learn. To address this, we introduce Feature Steering with Reinforcement Learning (FSRL), a framework that trains a lightweight adapter to steer model behavior by modulating interpretable sparse features. First, we theoretically demonstrate that this mechanism is expressive enough to approximate the behavioral shifts of post-training processes. We then apply FSRL to preference optimization and perform a causal analysis of the learned policy. Our analysis reveals a crucial insight: the model learns to reward stylistic presentation as a proxy for quality, disproportionately relying on features related to style and formatting over those tied to alignment concepts like honesty. By effectively optimizing the preference objective, FSRL serves as a transparent proxy for observing the alignment process. Overall, FSRL offers an interpretable control interface and a practical way to diagnose how preference optimization pressures manifest at the feature level.
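To make the steering mechanism described above concrete, here is a minimal NumPy sketch of one plausible form it could take. It is not the authors' implementation. The abstract does not spell out the exact equations, so everything below is an assumption made for illustration: the ReLU encoder and affine decoder for the SAE, the purely linear adapter, the re-added reconstruction error, and all names (sae_encode, sae_decode, steer, linear_adapter) and dimensions.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Sparse feature activations: ReLU of an affine map (assumed SAE form)."""
    return np.maximum(W_enc @ x + b_enc, 0.0)

def sae_decode(f, W_dec, b_dec):
    """Affine decoder mapping features back to the model's activation space."""
    return W_dec @ f + b_dec

def linear_adapter(x, A, b):
    """Hypothetical lightweight adapter: a steering vector over SAE features."""
    return A @ x + b

def steer(x, v, W_enc, b_enc, W_dec, b_dec):
    """Add v to the SAE features, decode, and re-add the SAE reconstruction
    error so that a zero steering vector leaves the activation unchanged."""
    f = sae_encode(x, W_enc, b_enc)
    recon_error = x - sae_decode(f, W_dec, b_dec)
    return sae_decode(f + v, W_dec, b_dec) + recon_error

# Toy dimensions and random weights, chosen only for this demonstration.
rng = np.random.default_rng(0)
d_model, d_sae = 8, 32
W_enc = rng.normal(size=(d_sae, d_model))
b_enc = rng.normal(size=d_sae)
W_dec = rng.normal(size=(d_model, d_sae))
b_dec = rng.normal(size=d_model)
A = 0.01 * rng.normal(size=(d_sae, d_model))  # adapter weights (trainable in practice)
b = np.zeros(d_sae)

x = rng.normal(size=d_model)      # stand-in for a frozen model activation
v = linear_adapter(x, A, b)       # steering vector in feature space
x_steered = steer(x, v, W_enc, b_enc, W_dec, b_dec)

# With an affine decoder, steering reduces to an additive correction W_dec @ v.
assert np.allclose(x_steered, x + W_dec @ v)
```

Under these assumptions the correction to the activation is W_dec @ (A @ x + b): a learned map composed with a fixed decoder. This is one informal way to read the claim that the activation-space correction behaves like a restricted low-rank-style update, and since only the adapter is trained, each coordinate of v can be inspected against the SAE feature it steers.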
