SCALAR: Benchmarking SAE Interaction Sparsity in Toy LLMs

Sean P. Fillingham, Andrew Gordon, Peter Lai, Xavier Poncini, David Quarel, Stefan Heimersheim

TL;DR

This work tackles the problem that sparse autoencoders (SAEs) trained per layer do not guarantee sparse cross-layer interactions, which obscures circuit-level interpretability in neural models. It introduces SCALAR, a benchmark that quantifies inter-layer sparsity by ranking cross-SAE connections, ablating edges, and measuring downstream KL divergence to produce an area-under-curve score, with absolute and relative variants. The authors propose Staircase SAEs, which share encoder/decoder weights across layers to promote feature reuse and reduce spurious cross-layer connections, and compare them to TopK SAEs and Jacobian SAEs across a toy 216K-parameter model and GPT-2 Small. Across feedforward and transformer blocks, Staircase SAEs achieve substantial relative sparsity gains (up to ~63% relative improvement) while preserving feature interpretability, with SCALAR providing a principled, architecture-agnostic measure of circuit simplicity. The results suggest that architectural choices can meaningfully shape cross-layer sparsity without sacrificing per-layer interpretability, guiding future mechanistic interpretability research toward sparser, more tractable circuit analyses.

Abstract

Mechanistic interpretability aims to decompose neural networks into interpretable features and map their connecting circuits. The standard approach trains sparse autoencoders (SAEs) on each layer's activations. However, SAEs trained in isolation don't encourage sparse cross-layer connections, inflating extracted circuits where upstream features needlessly affect multiple downstream features. Current evaluations focus on individual SAE performance, leaving interaction sparsity unexamined. We introduce SCALAR (Sparse Connectivity Assessment of Latent Activation Relationships), a benchmark measuring interaction sparsity between SAE features. We also propose "Staircase SAEs", using weight-sharing to limit upstream feature duplication across downstream features. Using SCALAR, we compare TopK SAEs, Jacobian SAEs (JSAEs), and Staircase SAEs. Staircase SAEs improve relative sparsity over TopK SAEs by $59.67\% \pm 1.83\%$ (feedforward) and $63.15\% \pm 1.35\%$ (transformer blocks). JSAEs provide $8.54\% \pm 0.38\%$ improvement over TopK for feedforward layers but cannot train effectively across transformer blocks, unlike Staircase and TopK SAEs which work anywhere in the residual stream. We validate on a 216K-parameter toy model and GPT-2 Small (124M), where Staircase SAEs maintain interaction sparsity improvements while preserving feature interpretability. Our work highlights the importance of interaction sparsity in SAEs through benchmarking and comparing promising architectures.
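
The scoring loop described in the TL;DR (rank connections, ablate edges, measure downstream KL divergence, integrate the curve) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy "model" is a plain sum of per-edge contributions rather than a pair of SAEs, edges are ranked by contribution norm, and the names `run_with_edges`, `edge_scores`, and `area_under_curve` are hypothetical. The summary does not state how the absolute and relative variants are defined, so the sketch only shows the area computed over raw edge counts and over the fraction of edges kept, without claiming either matches the paper's definitions.

```python
import numpy as np

def kl_divergence(p_logits, q_logits):
    """KL(p || q) for two categorical distributions given as logits."""
    p_log = p_logits - np.logaddexp.reduce(p_logits)  # log-softmax of p
    q_log = q_logits - np.logaddexp.reduce(q_logits)  # log-softmax of q
    return float(np.sum(np.exp(p_log) * (p_log - q_log)))

def ablation_curve(edge_scores, run_with_edges, clean_logits, num_points=5):
    """KL to the clean output as a function of how many top-ranked edges are kept."""
    order = np.argsort(-np.abs(edge_scores))  # most important edge first
    ks = np.linspace(0, len(order), num_points).astype(int)
    kls = []
    for k in ks:
        kept = np.zeros(len(order), dtype=bool)
        kept[order[:k]] = True  # keep the top-k edges, ablate the rest
        kls.append(kl_divergence(clean_logits, run_with_edges(kept)))
    return ks.astype(float), np.array(kls)

def area_under_curve(x, y):
    """Trapezoid-rule area under the ablation curve."""
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))

# Toy stand-in for a downstream computation (assumption, not the paper's setup):
# 6 edges, each adding a fixed vector to 4 output logits.
rng = np.random.default_rng(0)
contributions = rng.normal(size=(6, 4))

def run_with_edges(kept):
    return contributions[kept].sum(axis=0)

clean_logits = run_with_edges(np.ones(6, dtype=bool))
edge_scores = np.linalg.norm(contributions, axis=1)  # hypothetical ranking criterion

ks, kls = ablation_curve(edge_scores, run_with_edges, clean_logits)
score = area_under_curve(ks, kls)               # x-axis: number of edges kept
score_frac = area_under_curve(ks / ks[-1], kls)  # x-axis: fraction of edges kept
```

Under this sketch a smaller area means the output is recovered with fewer edges, which agrees with the Figure 4 caption's reading that a lower score suggests higher sparsity.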

Paper Structure

This paper contains 41 sections, 30 equations, 14 figures, 11 tables.

Figures (14)

  • Figure 1: The staircase SAE architecture for a transformer with $L=3$ layers. Each layer $i$ uses a slice of the shared encoder $\mathbf{W}_{enc}$ and decoder $\mathbf{W}_{dec}$ weights. SAE chunks of identical colour indicate weights shared within the slices $\mathbf{W}^i_{enc}$ and $\mathbf{W}^i_{dec}$.
  • Figure 2: By measuring the number of active latents per chunk, we can see feature reuse from previous layers, as each SAE allocates some "sparsity budget" to features from previous layers.
  • Figure 3: The ablation curves for all SAEs attached at the labeled compute block. In these examples, the JSAE and Staircase SAEs clearly outperform the standard TopK SAEs.
  • Figure 4: A comparison of SCALAR scores across SAE positions and variants. A lower SCALAR score suggests higher sparsity. For example, at the transformer block at layer 1, the TopK SAE exhibits higher sparsity than the Staircase SAE under the absolute SCALAR score, whereas at the same position the Staircase SAE exhibits higher sparsity under the relative SCALAR score.
  • Figure 5: The L0 sparsity measured per chunk for a staircase SAE with $L=4$ layers and 5 activations $\mathbf{h}^0, \ldots, \mathbf{h}^4$. The model in the left panel was trained with all gradients attached, while the model in the right panel was trained with gradients from previous chunks detached. Both models use Top-$10$ SAEs. The standard staircase variant (left) spends some sparsity budget on features from previous chunks, whereas the detached-gradient variant (right) degenerates back to a standard SAE, rarely using features from previous chunks.
  • ...and 9 more figures
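
The weight sharing described in the Figure 1 and Figure 2 captions can be sketched in a few lines. This is a hedged illustration, not the authors' implementation: it assumes that the slice for site $i$ consists of the first $i+1$ equally sized chunks of the shared $\mathbf{W}_{enc}$ and $\mathbf{W}_{dec}$ (one reading of the "staircase" shape), it omits biases and training, and the class name `StaircaseSAE` and the parameters `chunk_size` and `n_sites` are hypothetical.

```python
import numpy as np

def topk(z, k):
    """Apply ReLU, then keep the k largest activations and zero the rest."""
    z = np.maximum(z, 0.0)
    out = np.zeros_like(z)
    idx = np.argpartition(z, -k)[-k:]  # indices of the k largest entries
    out[idx] = z[idx]
    return out

class StaircaseSAE:
    """Shared encoder/decoder; site i sees only the first (i + 1) chunks (assumption)."""

    def __init__(self, d_model, chunk_size, n_sites, k, seed=0):
        rng = np.random.default_rng(seed)
        n_latents = chunk_size * n_sites
        self.W_enc = rng.normal(scale=0.1, size=(d_model, n_latents))
        self.W_dec = rng.normal(scale=0.1, size=(n_latents, d_model))
        self.chunk_size = chunk_size
        self.k = k

    def encode(self, h, site):
        width = (site + 1) * self.chunk_size
        # Slicing returns a view, so earlier chunks are literally the same weights.
        return topk(h @ self.W_enc[:, :width], self.k)

    def decode(self, z):
        return z @ self.W_dec[: z.shape[0], :]

    def active_per_chunk(self, z):
        """Number of nonzero latents in each chunk (the quantity plotted in Figure 2)."""
        return (z.reshape(-1, self.chunk_size) != 0).sum(axis=1)

sae = StaircaseSAE(d_model=8, chunk_size=16, n_sites=5, k=10)
h = np.random.default_rng(1).normal(size=8)  # stand-in activation vector
z = sae.encode(h, site=2)                    # uses chunks 0, 1 and 2
h_hat = sae.decode(z)
counts = sae.active_per_chunk(z)
```

With Top-10 latents and five sites, as in the Figure 5 caption, `counts` shows how the fixed budget of active latents is split between the newest chunk and chunks inherited from earlier sites.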