Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning

Jeffrey Olmo, Jared Wilson, Max Forsey, Bryce Hepner, Thomas Vin Howe, David Wingate

TL;DR

Sparse Autoencoders (SAEs) are trained to reconstruct activations without accounting for how those activations affect the downstream model. This work introduces Gradient SAEs (g-SAEs), which augment the TopK activation function with a gradient-weighted term $\beta \mathbf{z} \circ \left|W_{\text{dec}}^T \cdot \nabla_{\mathbf{x}}\mathcal{L}(\mathbf{x})\right|$, so that the $k$ selected latents are those that both activate strongly and strongly influence the loss. The resulting reconstructions preserve downstream behavior more faithfully. Empirically, g-SAEs add less downstream loss, have fewer dead latents, and learn latents that steer logits more effectively in arbitrary contexts, while remaining about as interpretable as standard TopK SAEs; these benefits persist across model sizes such as GPT-2 variants. The findings support a dual view of features as both representations and actions, and they make downstream effects an explicit part of feature discovery in dictionary learning, with potential implications for model steering and safety alongside interpretability.

Abstract

Sparse Autoencoders (SAEs) are a promising approach for extracting neural network representations by learning a sparse and overcomplete decomposition of the network's internal activations. However, SAEs are traditionally trained considering only activation values and not the effect those activations have on downstream computations. This limits the information available to learn features, and biases the autoencoder towards neglecting features which are represented with small activation values but strongly influence model outputs. To address this, we introduce Gradient SAEs (g-SAEs), which modify the $k$-sparse autoencoder architecture by augmenting the TopK activation function to rely on the gradients of the input activation when selecting the $k$ elements. For a given sparsity level, g-SAEs produce reconstructions that are more faithful to original network performance when propagated through the network. Additionally, we find evidence that g-SAEs learn latents that are on average more effective at steering models in arbitrary contexts. By considering the downstream effects of activations, our approach leverages the dual nature of neural network features as both $\textit{representations}$, retrospectively, and $\textit{actions}$, prospectively. While previous methods have approached the problem of feature discovery primarily focused on the former aspect, g-SAEs represent a step towards accounting for the latter as well.
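The selection rule in the TL;DR can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: the function name `gradient_topk`, the argument layout, the choice to add the gradient term to the raw activation when ranking, and the choice to keep the unscaled activation values for the selected latents are all assumptions made for the example.

```python
import numpy as np

def gradient_topk(z, w_dec, grad_x, k, beta=1.0):
    """Keep k latents, ranked by activation plus a gradient-weighted term.

    z:      (n_latents,) non-negative latent pre-selection activations
    w_dec:  (d_model, n_latents) decoder matrix, one column per latent
    grad_x: (d_model,) gradient of the downstream loss w.r.t. the input x
    """
    # |W_dec^T grad|: first-order sensitivity of the loss to each decoder direction
    sensitivity = np.abs(w_dec.T @ grad_x)
    # Assumption: the term beta * z * sensitivity is added to z to form the score
    score = z + beta * z * sensitivity
    # Indices of the k highest-scoring latents
    keep = np.argsort(score)[-k:]
    # Assumption: selected latents keep their original (unscaled) activation
    sparse = np.zeros_like(z)
    sparse[keep] = z[keep]
    return sparse, keep

# A weakly active latent (index 2) with a large loss gradient displaces
# a more active latent (index 1) that has no effect on the loss.
z = np.array([1.0, 0.9, 0.2, 0.1])
grad = np.array([0.0, 0.0, 10.0, 0.0])
sparse, keep = gradient_topk(z, np.eye(4), grad, k=2)
```

With `beta=0.0` the score reduces to `z` and the function behaves like a plain TopK, which makes the gradient term easy to ablate in this toy setting.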

Paper Structure

This paper contains 18 sections, 6 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: The setup of g-SAE training. Here $\mathcal{L}(\mathbf{x})$ is the function mapping the residual stream activation at the layer the SAE was trained on, to the predictive cross entropy loss it yields. The red dotted line denotes backpropagation.
  • Figure 2: Comparison of directional derivatives in GPT-2's residual stream. The plot shows the absolute value of the directional derivatives towards activation differences vs. random directions. Layer $n$ corresponds to the $n^\text{th}$ resid_post hookpoint. In later layers, feature directions exhibit consistently higher derivatives with respect to loss.
  • Figure 3: Left: Spearman correlation coefficients between change in cross-entropy loss $\Delta \mathcal{L}(\mathbf{x})$ from isotropically perturbing an MLP output, and the first order approximation of that change ($|\nabla_\mathbf{x}{\mathcal{L}(\mathbf{x})} \cdot \delta{\mathbf{x}}|$). Right: Correlations between the norm of a perturbation and the resulting effect on loss. Perturbations for each column are drawn from a uniform distribution with the displayed mean and a standard deviation of one-half the mean. Data from resid_post hookpoint of GPT-2 across diverse tokens.
  • Figure 4: $\textbf{Left:}$ Number of latents vs. Loss Added ($\mathcal{L}_{\text{added}}$) for $L_0 = 32$. $\textbf{Middle:}$ $L_0$ against NMSE, holding the number of latents fixed at 15360. $\textbf{Right:}$ $L_0$ against Loss Added, holding the number of latents fixed. All SAEs were trained on GPT-2 small with $\sim$14M tokens.
  • Figure 5: Top: The average effect of applying a steering vector in the direction of a latent $\mathbf{y}_i$ on the logits $\mathbf{y}_i$ points towards. Higher is better. Bottom: The average total probability added to all other logits when applying a steering vector in the direction of $\mathbf{y}_i$. Lower is better. Data from a random sample of latents from SAEs with $L_0 = 32$, with $\alpha$ incremented in steps of 5/3 above, and 10 below. Gaussian smoothing applied for better visibility.
  • ...and 5 more figures
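
The comparison described in the Figure 3 caption (actual change in loss from an isotropic perturbation vs. the first-order estimate $|\nabla_\mathbf{x}\mathcal{L}(\mathbf{x}) \cdot \delta\mathbf{x}|$) can be illustrated on a toy problem. The sketch below is not the paper's experiment: it replaces GPT-2 with a hypothetical random linear unembedding `W` and a softmax cross-entropy loss, and the dimensions, target index, and perturbation norm are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 16, 50
# Hypothetical stand-in for the downstream network: a random linear unembedding
W = rng.normal(size=(vocab, d_model)) / np.sqrt(d_model)
target = 3  # hypothetical index of the correct next token

def loss(x):
    """Cross-entropy of softmax(W @ x) against the target token."""
    logits = W @ x
    m = logits.max()
    return m + np.log(np.exp(logits - m).sum()) - logits[target]

def grad_loss(x):
    """Analytic gradient: W^T (softmax(W @ x) - onehot(target))."""
    logits = W @ x
    p = np.exp(logits - logits.max())
    p /= p.sum()
    p[target] -= 1.0
    return W.T @ p

def spearman(a, b):
    """Spearman correlation as the Pearson correlation of ranks (assumes no ties)."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

x = rng.normal(size=d_model)
g = grad_loss(x)

# Isotropic perturbations: Gaussian directions rescaled to a fixed small norm
deltas = rng.normal(size=(200, d_model))
deltas *= 1e-3 / np.linalg.norm(deltas, axis=1, keepdims=True)

actual = np.array([abs(loss(x + d) - loss(x)) for d in deltas])
predicted = np.abs(deltas @ g)  # |grad . delta|, the first-order estimate
rho = spearman(actual, predicted)
```

Because every perturbation here has the same norm, the perturbation norm alone carries no information about the change in loss, while the gradient-based estimate ranks the perturbations almost perfectly in this small-perturbation regime.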