DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment

James Wedgwood; Aashiq Muhamed; Mona T. Diab; Virginia Smith

Abstract

Preference alignment is usually achieved by weight-updating training on preference data, which adds substantial alignment-stage compute and provides limited mechanistic visibility. We propose Dynamic SAE Steering for Preference Alignment (DSPA), an inference-time method that makes sparse autoencoder (SAE) steering prompt-conditional. From preference triples, DSPA computes a conditional-difference map linking prompt features to generation-control features; during decoding, it modifies only token-active latents, without base-model weight updates. Across Gemma-2-2B/9B and Qwen3-8B, DSPA improves MT-Bench and is competitive on AlpacaEval while preserving multiple-choice accuracy. Under restricted preference data, DSPA remains robust and can rival the two-stage RAHF-SCIT pipeline while requiring up to $4.47\times$ fewer alignment-stage FLOPs. Finally, we audit the SAE features DSPA modifies, finding that preference directions are dominated by discourse and stylistic signals, and provide theory clarifying the conditional-difference map estimate and when top-$k$ ablation is principled.
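The two stages described in the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption for illustration, not the paper's implementation: binary prompt gates, a map defined as the mean chosen-minus-rejected output-SAE activation among triples where a gate fires, a scoring rule that sums the rows of active gates, and the hypothetical function names `conditional_difference_map`, `select_ablation_features`, and `ablate_token_active`.

```python
import numpy as np

def conditional_difference_map(gates, z_chosen, z_rejected):
    """Assumed form of the map: A[p, f] is the mean (chosen - rejected)
    activation of output feature f over triples where prompt gate p fires.
    gates: (N, P) binary; z_chosen, z_rejected: (N, F)."""
    gates = gates.astype(float)
    diff = z_chosen - z_rejected                  # (N, F)
    counts = gates.sum(axis=0)                    # (P,) triples per gate
    sums = gates.T @ diff                         # (P, F)
    return sums / np.maximum(counts, 1.0)[:, None]

def select_ablation_features(A, prompt_gates, k):
    """Sum the rows of A for the gates active on this prompt, then keep
    up to k features with the most negative (dispreferred) scores."""
    scores = prompt_gates.astype(float) @ A       # (F,)
    picked = np.argsort(scores)[:k]               # ascending order
    return picked[scores[picked] < 0]

def ablate_token_active(z_token, feature_ids):
    """Zero the selected latents that are active (> 0) at this token;
    inactive latents are left untouched."""
    z = z_token.copy()
    ids = np.asarray(feature_ids, dtype=int)
    active = ids[z[ids] > 0]
    z[active] = 0.0
    return z, active

# Toy data: 4 preference triples, 2 prompt gates, 3 output features.
gates = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
z_chosen = np.array([[0, 1, 0], [0, 1, 0], [1, 0, 0], [1, 0, 0]], dtype=float)
z_rejected = np.array([[2, 0, 0], [2, 0, 0], [0, 0, 3], [0, 0, 3]], dtype=float)

A = conditional_difference_map(gates, z_chosen, z_rejected)
picked = select_ablation_features(A, np.array([1, 0]), k=1)
z_new, edited = ablate_token_active(np.array([0.5, 0.2, 0.7]), picked)
```

In this toy setup, a prompt that fires only the first gate selects feature 0 for ablation, while a prompt that fires only the second gate would select feature 2, which is the sense in which the steering is prompt-conditional. No model weights are touched; only SAE latents are edited.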

Paper Structure (68 sections, 4 theorems, 16 equations, 6 figures, 12 tables)

Introduction
Background and Related Work
Sparse Autoencoders.
Preference Alignment.
SAEs for Preference Alignment.
Dynamic SAE Steering for Preference Alignment
Relation to representation engineering.
Identifying Conditional Preference Features
Why two SAEs?
Notation.
Activation densities.
Prompt gates.
Conditional-difference map.
Sparsification.
Inference-Time Intervention
...and 53 more sections

Key Result

Theorem 3.3

Under assumptions ass:lda and ass:gating, $\mathbf{A}^\top = c\,\Sigma\,B\,M$. Consequently, rows of $\mathbf{A}$ mix multiple preference templates when gates co-activate. On subspaces where $M$ is invertible, $\Sigma^{-1}\mathbf{A}^\top M^{-1} \propto B$.
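The final proportionality follows from a one-line rearrangement of the factorization. The sketch below additionally assumes that $\Sigma$ is invertible, which the statement implies by writing $\Sigma^{-1}$ but does not spell out here:

```latex
\mathbf{A}^\top = c\,\Sigma\,B\,M
\;\Longrightarrow\;
\Sigma^{-1}\mathbf{A}^\top M^{-1}
  = c\,\Sigma^{-1}\Sigma\,B\,M\,M^{-1}
  = c\,B
  \;\propto\; B .
```

Read this way, $M$ is the factor through which co-activating gates mix templates in the rows of $\mathbf{A}$, and undoing it (together with $\Sigma$) recovers $B$ up to the scalar $c$.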

Figures (6)

Figure 1: DSPA overview: offline conditional-difference map construction and prompt-conditional SAE steering at inference. Prompts and responses are illustrative. DSPA replaces costly weight updates with prompt- and token-conditional, directly auditable SAE feature edits.
Figure 2: MT-Bench (turn 1) example on Gemma-2-9B: Base Model vs. DSPA. DSPA can improve open-ended response quality over the SFT base without weight updates.
Figure 3: MT-Bench score (higher is better) vs. preference triples $N$ under data restriction (Gemma-2-2B): DSPA vs. RAHF-SCIT. DSPA degrades gracefully down to $N{=}100$, while RAHF-SCIT drops sharply.
Figure 4: Layer choice ablations (MT-Bench with GPT-OSS-120B judge; higher is better). A. Gemma-2-2B score vs. $(\ell_{\text{input}}, \ell_{\text{output}})$ (ablate only). B. Gemma-2-9B score vs. $(\ell_{\text{input}}, \ell_{\text{output}})$ (ablate only; augment+ablate). Dashed = Base Model. An early--mid input layer and a late output layer yield the strongest scores.
Figure 5: Steering-mode and SAE ablations (MT-Bench with GPT-OSS-120B judge; higher is better). A. Score by steering mode (ablate/augment/both). 2B-Chosen and 9B-Chosen use the best $(\ell_{\text{input}}, \ell_{\text{output}})$ per model; 9B-Average averages over the layer grid in Figure 4B. B. Base Gemma Scope SAE vs. HH-RLHF fine-tuned SAE. Ablation-only is most reliable, and fine-tuned SAEs improve steering performance.
...and 1 more figure

Theorems & Definitions (7)

Theorem 3.3: Factorization of $\mathbf{A}$
Corollary 3.4: Weak co-activation bound
proof
proof
Theorem A.4: Top-$k$ ablation is optimal for linear utility
proof
Lemma A.5: Row-wise concentration
