SCALAR: Benchmarking SAE Interaction Sparsity in Toy LLMs
Sean P. Fillingham, Andrew Gordon, Peter Lai, Xavier Poncini, David Quarel, Stefan Heimersheim
TL;DR
Sparse autoencoders (SAEs) trained independently on each layer do not guarantee sparse cross-layer interactions, which obscures circuit-level interpretability. This work introduces SCALAR, a benchmark that quantifies inter-layer sparsity by ranking cross-SAE connections, ablating edges, and measuring the downstream KL divergence to produce an area-under-curve score, with absolute and relative variants. The authors also propose Staircase SAEs, which share encoder/decoder weights across layers to promote feature reuse and reduce spurious cross-layer connections, and compare them to TopK SAEs and Jacobian SAEs on a toy 216K-parameter model and on GPT-2 Small. Across feedforward layers and transformer blocks, Staircase SAEs achieve substantial relative sparsity gains over TopK SAEs (up to ~63%) while preserving feature interpretability, and SCALAR provides a principled, architecture-agnostic measure of circuit simplicity. The results suggest that architectural choices can meaningfully shape cross-layer sparsity without sacrificing per-layer interpretability, pointing future mechanistic interpretability research toward sparser, more tractable circuit analyses.
Abstract
Mechanistic interpretability aims to decompose neural networks into interpretable features and map their connecting circuits. The standard approach trains sparse autoencoders (SAEs) on each layer's activations. However, SAEs trained in isolation do not encourage sparse cross-layer connections, which inflates the extracted circuits: upstream features needlessly affect multiple downstream features. Current evaluations focus on individual SAE performance, leaving interaction sparsity unexamined. We introduce SCALAR (Sparse Connectivity Assessment of Latent Activation Relationships), a benchmark measuring interaction sparsity between SAE features. We also propose "Staircase SAEs", which use weight-sharing to limit upstream feature duplication across downstream features. Using SCALAR, we compare TopK SAEs, Jacobian SAEs (JSAEs), and Staircase SAEs. Staircase SAEs improve relative sparsity over TopK SAEs by $59.67\% \pm 1.83\%$ (feedforward) and $63.15\% \pm 1.35\%$ (transformer blocks). JSAEs provide an $8.54\% \pm 0.38\%$ improvement over TopK SAEs for feedforward layers but cannot train effectively across transformer blocks, unlike Staircase and TopK SAEs, which work anywhere in the residual stream. We validate on a 216K-parameter toy model and GPT-2 Small (124M), where Staircase SAEs maintain interaction sparsity improvements while preserving feature interpretability. Our work highlights the importance of interaction sparsity in SAEs through benchmarking and comparing promising architectures.
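To make the rank, ablate, and integrate idea behind the benchmark concrete, the following is a minimal Python sketch of an area-under-curve sparsity score on a toy problem. It is an illustration under stated assumptions, not the benchmark's actual implementation: the linear map `W` standing in for the connection between upstream and downstream SAE features, the `readout` matrix, the edge-scoring rule (weight magnitude times mean absolute activation), the grid of kept-edge counts `ks`, and the normalisation of the x-axis are all hypothetical choices made for this example.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    # Mean KL(p || q) over the batch dimension.
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def ablation_curve(W, readout, upstream, scores, ks):
    """KL(full || ablated) when only the k highest-scoring edges are kept."""
    order = np.argsort(-scores, axis=None)  # flat edge indices, most important first
    p_full = softmax(upstream @ W.T @ readout)
    kls = []
    for k in ks:
        mask = np.zeros(W.size, dtype=bool)
        mask[order[:k]] = True
        W_k = np.where(mask.reshape(W.shape), W, 0.0)  # zero-ablate all other edges
        kls.append(kl_divergence(p_full, softmax(upstream @ W_k.T @ readout)))
    return np.array(kls)

def auc(ks, kls):
    # Trapezoid rule over the fraction of edges kept (assumed normalisation).
    x = np.asarray(ks) / ks[-1]
    return float(np.sum(0.5 * (kls[1:] + kls[:-1]) * np.diff(x)))

# Toy data (all hypothetical): a dense and a sparsified upstream->downstream map.
rng = np.random.default_rng(0)
n, d_up, d_down, vocab = 64, 8, 8, 5
upstream = rng.normal(size=(n, d_up))          # upstream feature activations
readout = rng.normal(size=(d_down, vocab))     # downstream features -> logits
W_dense = rng.normal(size=(d_down, d_up))
W_sparse = W_dense * (rng.random(W_dense.shape) < 0.2)  # keep roughly 20% of edges
ks = np.arange(0, W_dense.size + 1, 4)         # numbers of edges kept, ending at all edges

for name, W in [("dense", W_dense), ("sparse", W_sparse)]:
    scores = np.abs(W) * np.abs(upstream).mean(axis=0)  # assumed importance score
    kls = ablation_curve(W, readout, upstream, scores, ks)
    print(name, round(auc(ks, kls), 4))
```

In this sketch a lower area means that the model's output distribution is recovered with fewer retained edges, which is the sense in which a smaller score indicates sparser interactions. A relative variant could be formed by comparing one architecture's area against a baseline's, though the exact definition used by the benchmark is not given in this section.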
