
DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment

James Wedgwood, Aashiq Muhamed, Mona T. Diab, Virginia Smith

Abstract

Preference alignment is usually achieved by weight-updating training on preference data, which adds substantial alignment-stage compute and provides limited mechanistic visibility. We propose Dynamic SAE Steering for Preference Alignment (DSPA), an inference-time method that makes sparse autoencoder (SAE) steering prompt-conditional. From preference triples, DSPA computes a conditional-difference map linking prompt features to generation-control features; during decoding, it modifies only token-active latents, without base-model weight updates. Across Gemma-2-2B/9B and Qwen3-8B, DSPA improves MT-Bench and is competitive on AlpacaEval while preserving multiple-choice accuracy. Under restricted preference data, DSPA remains robust and can rival the two-stage RAHF-SCIT pipeline while requiring up to $4.47\times$ fewer alignment-stage FLOPs. Finally, we audit the SAE features DSPA modifies, finding that preference directions are dominated by discourse and stylistic signals, and provide theory clarifying the conditional-difference map estimate and when top-$k$ ablation is principled.
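
As a rough illustration of the pipeline described in the abstract, the sketch below estimates a conditional-difference map from toy preference triples and then ablates only token-active latents. It is a minimal sketch under stated assumptions, not the authors' implementation: the estimator (mean chosen-minus-rejected activation, conditioned on a prompt feature being active), the pooling of activations, the function names `conditional_difference_map` and `steer_token`, and the parameter `k` are all hypothetical choices made for this example.

```python
import numpy as np

def conditional_difference_map(prompt_acts, chosen_acts, rejected_acts, eps=1e-8):
    """Hypothetical estimator of a conditional-difference map A.

    A[i, j] is the mean (chosen - rejected) activation of output feature j,
    averaged over the preference triples whose prompt activates input feature i.

    prompt_acts:   (N, d_in)  nonnegative SAE activations pooled over the prompt
    chosen_acts:   (N, d_out) SAE activations pooled over the chosen response
    rejected_acts: (N, d_out) SAE activations pooled over the rejected response
    """
    gates = (prompt_acts > 0).astype(float)       # (N, d_in) binary prompt gates
    diff = chosen_acts - rejected_acts            # (N, d_out)
    counts = gates.sum(axis=0)[:, None]           # (d_in, 1) triples per gate
    return (gates.T @ diff) / (counts + eps)      # (d_in, d_out)

def steer_token(latents, prompt_gates, A, k=2):
    """Zero out up to k latents that are active at this token and that the
    prompt-conditional scores mark as most dispreferred (most negative)."""
    scores = prompt_gates @ A                     # (d_out,) prompt-conditional scores
    candidates = np.flatnonzero((latents > 0) & (scores < 0))
    worst = candidates[np.argsort(scores[candidates])[:k]]
    steered = latents.copy()                      # leave the input untouched
    steered[worst] = 0.0                          # ablation only; no weight updates
    return steered
```

Inactive latents are never modified in this sketch, which mirrors the abstract's statement that only token-active latents are edited; the ablate-only design is an illustrative simplification.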

Paper Structure (68 sections, 4 theorems, 16 equations, 6 figures, 12 tables)

This paper contains 68 sections, 4 theorems, 16 equations, 6 figures, 12 tables.

Key Result

Theorem 3.3

Under the LDA and gating assumptions, $\mathbf{A}^\top = c\,\Sigma\,B\,M$. Consequently, rows of $\mathbf{A}$ mix multiple preference templates when gates co-activate. On subspaces where $M$ is invertible, $\Sigma^{-1}\mathbf{A}^\top M^{-1} \propto B$.
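
The two consequences stated in the theorem follow from the factorization by direct algebra. The steps below are a sketch that additionally assumes $\Sigma$ is invertible, and that reads the columns $b_k$ of $B$ as the preference templates and the entries $M_{ki}$ as gate co-activation weights; that reading is inferred from the theorem's wording and is not confirmed by this excerpt.

```latex
% Assumes \Sigma is invertible and M is invertible on the subspace considered.
\begin{align*}
\mathbf{A}^\top &= c\,\Sigma\,B\,M \\
\Sigma^{-1}\mathbf{A}^\top M^{-1}
  &= c\,\Sigma^{-1}\Sigma\,B\,M\,M^{-1} = c\,B \\
% Row i of A, with b_k the k-th column of B and M_{ki} the (k,i) entry of M:
\mathbf{A}_{i,:} &= c \sum_{k} M_{ki}\,(\Sigma\,b_k)^\top
\end{align*}
```

If $M$ were diagonal, the sum in the last line would collapse to the single term $k = i$, so row $i$ of $\mathbf{A}$ would reflect one template only. Off-diagonal entries of $M$ are what introduce the mixing across templates.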

Figures (6)

  • Figure 1: DSPA overview: offline conditional-difference map construction and prompt-conditional SAE steering at inference. Prompts and responses are illustrative. DSPA replaces costly weight updates with prompt- and token-conditional, directly auditable SAE feature edits.
  • Figure 2: MT-Bench (turn 1) example on Gemma-2-9B: Base Model vs. DSPA. DSPA can improve open-ended response quality over the SFT base without weight updates.
  • Figure 3: MT-Bench score (higher is better) vs. preference triples $N$ under data restriction (Gemma-2-2B): DSPA vs. RAHF-SCIT. DSPA degrades gracefully down to $N{=}100$, while RAHF-SCIT drops sharply.
  • Figure 4: Layer choice ablations (MT-Bench with GPT-OSS-120B judge; higher is better). A. Gemma-2-2B score vs. $(\ell_{\text{input}}, \ell_{\text{output}})$ (ablate only). B. Gemma-2-9B score vs. $(\ell_{\text{input}}, \ell_{\text{output}})$ (ablate only; augment+ablate). Dashed = Base Model. An early-to-mid input layer and a late output layer yield the strongest scores.
  • Figure 5: Steering-mode and SAE ablations (MT-Bench with GPT-OSS-120B judge; higher is better). A. Score by steering mode (ablate/augment/both). 2B-Chosen and 9B-Chosen use the best $(\ell_{\text{input}}, \ell_{\text{output}})$ per model; 9B-Average averages over the layer grid in Figure 4B. B. Base Gemma Scope SAE vs. HH-RLHF fine-tuned SAE. Ablation-only is most reliable, and fine-tuned SAEs improve steering performance.
  • ...and 1 more figure

Theorems & Definitions (7)

  • Theorem 3.3: Factorization of $\mathbf{A}$
  • Corollary 3.4: Weak co-activation bound
  • proof
  • proof
  • Theorem A.4: Top-$k$ ablation is optimal for linear utility
  • proof
  • Lemma A.5: Row-wise concentration
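
The intuition behind Theorem A.4 can be checked numerically on a toy case. The sketch below assumes one simple formalization that is not taken from the paper: ablating a set of features changes utility by the sum of per-feature gains (a linear, additive utility), and exactly $k$ features must be ablated. Under that assumption, picking the $k$ largest gains matches an exhaustive search. The names `topk_ablation` and `brute_force_best` and the gain values are hypothetical.

```python
from itertools import combinations

import numpy as np

def topk_ablation(gains, k):
    """Indices of the k features with the largest per-feature ablation gain."""
    order = np.argsort(-np.asarray(gains), kind="stable")
    return sorted(order[:k].tolist())

def brute_force_best(gains, k):
    """Exhaustively search all size-k subsets for the largest summed gain."""
    best = max(
        combinations(range(len(gains)), k),
        key=lambda subset: sum(gains[j] for j in subset),
    )
    return sorted(best)

# Toy per-feature gains (assumed values, for illustration only).
gains = np.array([0.3, -0.1, 0.9, 0.5, 0.2])
greedy = topk_ablation(gains, k=2)
exhaustive = brute_force_best(gains, k=2)
```

With additive gains, swapping any selected feature for an unselected one cannot increase the total, which is why the greedy and exhaustive selections coincide here. The paper's precise statement and conditions may differ from this simplified setup.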