Identifiable Steering via Sparse Autoencoding of Multi-Concept Shifts

Shruti Joshi, Andrea Dittadi, Sébastien Lachapelle, Dhanya Sridhar

TL;DR

This work tackles unsupervised steering of LLM representations by learning steering vectors from data in which multiple concepts vary simultaneously. It introduces Sparse Shift Autoencoders (SSAEs), which map embedding differences $\boldsymbol{\delta}^z$ to sparse concept shifts $\boldsymbol{\delta}^c_V$. Per-concept steering vectors are recovered via the decoder $\hat{q}$, giving steering functions $\hat{\phi}_k(\mathbf{z}) = \mathbf{z} + \hat{q}(\mathbf{e}_k)$. The core theoretical contribution is an identifiability result: the encoder $\hat{r}$ and decoder $\hat{q}$ identify $\boldsymbol{\delta}^c_V$ up to permutation and scaling under sparsity and standard assumptions from causal representation learning (CRL), which enables extraction of meaningful steering directions. Empirically, SSAEs trained on Llama-3.1 embeddings achieve high mean correlation coefficients (MCCs) and robust steering across semi-synthetic and real datasets, outperform affine baselines, and remain effective under increased entanglement and in out-of-distribution settings. Overall, the paper provides both a principled identifiability foundation and a practical unsupervised method for steering LLMs with multi-concept shifts, highlighting potential for faster, supervision-light alignment research.
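The steering function $\hat{\phi}_k(\mathbf{z}) = \mathbf{z} + \hat{q}(\mathbf{e}_k)$ can be illustrated with a minimal NumPy sketch. It assumes the learned decoder is linear, so that $\hat{q}(\mathbf{e}_k)$ is simply the $k$-th column of a weight matrix. The names `decoder_weights`, `steering_vector`, and `steer`, the random stand-in weights, and the optional `scale` argument are hypothetical and chosen for illustration only; they are not taken from the paper's code.

```python
import numpy as np


def steering_vector(decoder_weights: np.ndarray, k: int) -> np.ndarray:
    """Return q_hat(e_k) for a linear decoder, i.e. the k-th column of its weights."""
    n_concepts = decoder_weights.shape[1]
    e_k = np.zeros(n_concepts)
    e_k[k] = 1.0  # one-hot vector selecting concept k
    return decoder_weights @ e_k


def steer(z: np.ndarray, decoder_weights: np.ndarray, k: int, scale: float = 1.0) -> np.ndarray:
    """Apply phi_hat_k(z) = z + scale * q_hat(e_k); scale=1.0 matches the formula above."""
    return z + scale * steering_vector(decoder_weights, k)


# Hypothetical stand-ins: a random "learned" decoder and a random embedding.
rng = np.random.default_rng(0)
embedding_dim, n_concepts = 8, 3
Q = rng.normal(size=(embedding_dim, n_concepts))
z = rng.normal(size=embedding_dim)

z_steered = steer(z, Q, k=1)
```

In this sketch, steering along concept `k` changes the embedding by exactly one decoder column, which is why a disentangled decoder matters: if a column mixed several concepts, the same addition would shift all of them at once.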

Abstract

Steering methods manipulate the representations of large language models (LLMs) to induce responses that have desired properties, e.g., truthfulness, offering a promising approach for LLM alignment without the need for fine-tuning. Traditionally, steering has relied on supervision, such as from contrastive pairs of prompts that vary in a single target concept, which is costly to obtain and limits the speed of steering research. An appealing alternative is to use unsupervised approaches such as sparse autoencoders (SAEs) to map LLM embeddings to sparse representations that capture human-interpretable concepts. However, without further assumptions, SAEs may not be identifiable: they could learn latent dimensions that entangle multiple concepts, leading to unintentional steering of unrelated properties. We introduce Sparse Shift Autoencoders (SSAEs) that instead map the differences between embeddings to sparse representations. Crucially, we show that SSAEs are identifiable from paired observations that vary in multiple unknown concepts, leading to accurate steering of single concepts without the need for supervision. We empirically demonstrate accurate steering across semi-synthetic and real-world language datasets using Llama-3.1 embeddings.
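To make the idea of autoencoding embedding differences concrete, the sketch below computes a reconstruction loss plus a sparsity term on paired embeddings. It is a simplified illustration under stated assumptions: the encoder is affine, the decoder is linear, and sparsity is encouraged with an L1 penalty weighted by a hypothetical `l1_weight`. The paper's actual objective and optimisation scheme may differ (for example, it may enforce sparsity as a constraint), and all names here are illustrative.

```python
import numpy as np


def ssae_loss(z, z_tilde, W_enc, b_enc, W_dec, l1_weight=0.1):
    """Reconstruction + L1 sparsity loss on embedding differences (illustrative only).

    z, z_tilde: arrays of shape (n_pairs, embedding_dim), paired embeddings.
    W_enc: (n_concepts, embedding_dim), b_enc: (n_concepts,), W_dec: (embedding_dim, n_concepts).
    """
    delta_z = z_tilde - z                          # the autoencoder sees differences only
    delta_c_hat = delta_z @ W_enc.T + b_enc        # estimated concept shifts
    delta_z_hat = delta_c_hat @ W_dec.T            # reconstructed embedding differences
    recon = np.mean(np.sum((delta_z - delta_z_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(delta_c_hat), axis=1))
    return recon + l1_weight * sparsity, recon, sparsity


# Synthetic check: sparse concept shifts mixed linearly into embedding differences.
rng = np.random.default_rng(0)
n_pairs, embedding_dim, n_concepts = 100, 8, 3
A = rng.normal(size=(embedding_dim, n_concepts))           # hypothetical mixing matrix
mask = rng.random((n_pairs, n_concepts)) < 0.3              # few concepts change per pair
delta_c = rng.normal(size=(n_pairs, n_concepts)) * mask
z = rng.normal(size=(n_pairs, embedding_dim))
z_tilde = z + delta_c @ A.T

# Using the true mixing as decoder and its pseudoinverse as encoder reconstructs exactly.
loss, recon, sparsity = ssae_loss(z, z_tilde, np.linalg.pinv(A), np.zeros(n_concepts), A)
```

Reconstruction alone does not single out this solution; the sparsity term is what favours codes in which each coordinate tracks one concept.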

Paper Structure

This paper contains 33 sections, 7 theorems, 51 equations, 9 figures, 6 tables.

Key Result

Proposition 1

Suppose $(\hat{r}, \hat{q})$ is a solution to the unconstrained problem of eqn:recon. Under assumptions ass:lrh, ass:injective_Av, and ass:suff_var, there exists an invertible matrix $\mathbf{L} \in \mathbb{R}^{|V|\times|V|}$ such that $\hat{q} = \mathbf{A}_V\mathbf{L}$ and $\hat{r}(\mathbf{z}) = \mathbf{L}^{-1}\dots$
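The invertible matrix $\mathbf{L}$ in this statement reflects a general fact about linear autoencoders: reconstruction alone cannot distinguish a decoder $\mathbf{A}\mathbf{L}$ paired with a compensating encoder from the original pair. The sketch below checks this numerically. The encoder form used here, $\mathbf{L}^{-1}\mathbf{A}^{+}$ with $\mathbf{A}^{+}$ the pseudoinverse, is an assumption made for the illustration (the statement above is truncated in the source), and `A` is a random stand-in for $\mathbf{A}_V$.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dim, n_concepts, n_pairs = 10, 4, 200

A = rng.normal(size=(embedding_dim, n_concepts))   # stand-in for A_V (injective here)
delta_c = rng.normal(size=(n_pairs, n_concepts))   # true concept shifts
delta_z = delta_c @ A.T                            # observed embedding differences

# A well-conditioned invertible L that is neither a permutation nor a scaling.
orthogonal, _ = np.linalg.qr(rng.normal(size=(n_concepts, n_concepts)))
L = orthogonal @ np.diag([0.5, 1.0, 2.0, 3.0])

q_hat = A @ L                                      # alternative decoder
r_hat = np.linalg.inv(L) @ np.linalg.pinv(A)       # compensating encoder (assumed form)

codes = delta_z @ r_hat.T                          # equals delta_c mixed by L^{-1}
recon = codes @ q_hat.T
err = np.max(np.abs(recon - delta_z))              # reconstruction is still exact
```

Here `err` is at floating-point precision even though `codes` is a linear mixture of the true shifts, so perfect reconstruction is compatible with entangled codes. This is the gap that the sparsity-based result (identifiability up to permutation) is meant to close.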

Figures (9)

  • Figure 1: A steering function $\phi_{\lambda, k}$ is such that the diagram commutes, i.e., $\phi_{\lambda, k}(f(g(\mathbf{c}))) = f(g(\mathbf{c} + \lambda {\mathbf{e}}_k))$ for all $\mathbf{c}$ (see defn:steering_fn).
  • Figure 2: A higher MCC value of the estimated decoder is associated with a greater cosine similarity. Embeddings steered using the steering vectors from a more disentangled decoder are more similar to target embeddings, compared to embeddings steered using steering vectors from a decoder with a lower MCC value.
  • Figure 3: Embeddings steered using the proposed method show higher OOD generalisation performance.
  • Figure 4: Three illustrative examples of $\mathbb{P}_{\boldsymbol{\delta}^c_S | S}$: only distribution II satisfies assumption ass:suffsupp.
  • Figure 5: UDR scores suggest a primal_lr value of $0.005$ and a $\beta$ value of 5.
  • ...and 4 more figures
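The MCC mentioned in the Figure 2 caption is commonly computed as the mean absolute Pearson correlation between true and estimated latent coordinates under the best one-to-one matching. The sketch below follows that common definition and finds the matching by brute force over permutations, which is only practical for a handful of dimensions. Whether the paper uses exactly this variant (for instance, which quantities are correlated, or Pearson versus rank correlation) is an assumption, and the function name `mcc` is illustrative.

```python
import itertools

import numpy as np


def mcc(true_latents: np.ndarray, est_latents: np.ndarray) -> float:
    """Mean absolute Pearson correlation under the best coordinate matching.

    Both inputs have shape (n_samples, n_dims) with the same n_dims.
    """
    n_dims = true_latents.shape[1]
    # Cross-correlation block between true coordinates (rows) and estimated ones (columns).
    corr = np.corrcoef(true_latents.T, est_latents.T)[:n_dims, n_dims:]
    abs_corr = np.abs(corr)
    # Brute-force search over all one-to-one matchings (fine for small n_dims).
    best_total = max(
        sum(abs_corr[i, perm[i]] for i in range(n_dims))
        for perm in itertools.permutations(range(n_dims))
    )
    return float(best_total / n_dims)


rng = np.random.default_rng(0)
true_c = rng.normal(size=(500, 3))

# Permuted and rescaled copy: identified up to permutation and scaling.
permuted_scaled = true_c[:, [2, 0, 1]] * np.array([2.0, -0.5, 3.0])

# Entangled copy: each estimated coordinate mixes two true concepts.
mixing = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0]])
entangled = true_c @ mixing

mcc_identified = mcc(true_c, permuted_scaled)
mcc_entangled = mcc(true_c, entangled)
```

Because the score is invariant to permutation, sign, and scale, it matches the notion of identifiability discussed above: `mcc_identified` is close to 1, while the entangled estimate scores noticeably lower.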

Theorems & Definitions (11)

  • Definition 1
  • Proposition 1: Linear identifiability
  • Proposition 1: Identifiability up to permutation
  • Proposition 1: Linear identifiability
  • proof
  • Proposition 1: Identifiability up to permutation
  • proof
  • Lemma 2: lachapelle2023synergies
  • proof
  • Corollary 3
  • ...and 1 more