Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry

Sai Sumedh R. Hindupur, Ekdeep Singh Lubana, Thomas Fel, Demba Ba

TL;DR

The paper questions the universality of Sparse Autoencoders (SAEs) for uncovering model concepts, arguing that SAEs embed architecture-specific data assumptions that shape what they can detect. By formulating SAEs as a bilevel optimization problem and introducing a geometry-aware SAE, SpaDE, the authors demonstrate that concepts with nonlinear separability and heterogeneous intrinsic dimensionality may be missed by traditional SAEs. Across synthetic, semi-synthetic, and real data (including language and vision tasks), SpaDE outperforms standard SAEs in discovering monosemantic, well-separated concepts and reducing latent co-occurrence. The work emphasizes designing SAEs with explicit data-geometry considerations and argues against a one-size-fits-all approach to model interpretability.

Abstract

Sparse Autoencoders (SAEs) are widely used to interpret neural networks by identifying meaningful concepts from their representations. However, do SAEs truly uncover all concepts a model relies on, or are they inherently biased toward certain kinds of concepts? We introduce a unified framework that recasts SAEs as solutions to a bilevel optimization problem, revealing a fundamental challenge: each SAE imposes structural assumptions about how concepts are encoded in model representations, which in turn shapes what it can and cannot detect. This means different SAEs are not interchangeable -- switching architectures can expose entirely new concepts or obscure existing ones. To systematically probe this effect, we evaluate SAEs across a spectrum of settings: from controlled toy models that isolate key variables, to semi-synthetic experiments on real model activations, and finally to large-scale, naturalistic datasets. Across this progression, we examine two fundamental properties that real-world concepts often exhibit: heterogeneity in intrinsic dimensionality (some concepts are inherently low-dimensional, others are not) and nonlinear separability. We show that SAEs fail to recover concepts when these properties are ignored, and we design a new SAE that explicitly incorporates both, enabling the discovery of previously hidden concepts and reinforcing our theoretical insights. Our findings challenge the idea of a universal SAE and underscore the need for architecture-specific choices in model interpretability. Overall, we argue an SAE does not just reveal concepts -- it determines what can be seen at all.

Paper Structure

This paper contains 35 sections, 9 theorems, 40 equations, 35 figures, 4 tables.

Key Result

Theorem 4.1

An SAE makes implicit assumptions about the structure of concepts in data, and these assumptions are reflected in the receptive fields of its encoder. They are stated explicitly in the paper's table of implicit assumptions for ReLU, JumpReLU, and TopK SAEs, and derived in the appendix on receptive fields.
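
As a rough illustration of how an encoder's receptive fields encode a data assumption, the sketch below checks which latents fire for a given input under a ReLU encoder and under a TopK encoder. It assumes a latent's receptive field is the set of inputs on which that latent is active (the paper's Definition 3.2 is not reproduced here, so this reading is an assumption), and the weights `W`, biases `b`, and input `x` are hypothetical values chosen for the example.

```python
import numpy as np

def relu_active(x, W, b):
    """Which latents fire under a ReLU encoder: latent i is active iff W[i] @ x + b[i] > 0."""
    return (W @ x + b) > 0

def topk_active(x, W, k):
    """Which latents fire under a bias-free TopK encoder: the k largest pre-activations W @ x."""
    pre = W @ x
    mask = np.zeros(pre.shape, dtype=bool)
    mask[np.argsort(pre)[-k:]] = True
    return mask

# Hypothetical encoder and input, for illustration only.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.array([-0.5, -0.5, 0.0])
x = np.array([0.4, 0.1])

# ReLU: each receptive field is a half-space, so rescaling x can move it across the boundary.
print(relu_active(x, W, b))      # [False False False]
print(relu_active(3 * x, W, b))  # [ True False False]

# Bias-free TopK: membership depends only on the direction of x, not its norm.
print(topk_active(x, W, 1))      # [ True False False]
print(topk_active(3 * x, W, 1))  # [ True False False]
```

In this toy setting the ReLU latent's receptive field is bounded by a hyperplane, while the TopK latent's receptive field is unchanged by positive rescaling of the input, which is one way to read the "linearly separable" versus "separable by angle" distinction.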

Figures (35)

  • Figure 1: The Duality Between SAE Architectures and Their Implicit Data Assumptions. A) SAEs do not passively extract concepts; they impose constraints that shape what can be detected. Each SAE architecture inherently assumes a specific structure in how features are encoded, leading to a corresponding dual assumption about the data. B) Different SAEs rely on different assumptions: some expect features to be linearly separable (ReLU, JumpReLU), while others expect them to be separable by angle and to have uniform intrinsic dimensionality (TopK). These assumptions dictate what an SAE can successfully extract, and what it may miss entirely.
  • Figure 2: Projection As The Key Architectural Difference Between SAEs. A) SAE encoders do more than just linearly transform data: they project it onto an architecture-specific constraint set. This projection fundamentally determines which features an SAE can extract and which it will suppress. B) Different SAEs rely on different projection sets $\mathcal{S}$: ReLU projects onto the positive orthant, TopK onto $K$-sparse subspaces, and JumpReLU combines ReLU with a projection onto a hypercube (via a Heaviside step function).
  • Figure 3: Illustration of Two Reasonable Data Assumptions. A) Concepts may not be separable using hyperplanes. B) Some concepts are inherently low-dimensional, while others span higher-dimensional spaces.
  • Figure 4: SpaDE shows adaptive sparsity by projecting onto the probability simplex. In this illustrative $3D$ figure, note $\|\bm{x}\|_0=3$ for points on the face, $\|\bm{x}\|_0=2$ for points on edges along subspaces, and $\|\bm{x}\|_0=1$ for corners on coordinate axes.
  • Figure 5: Effect of Nonlinear Separability on SAEs. Each column represents a different SAE. a) $F_1$ scores of the five most monosemantic latents (those with the highest $F_1$ scores) of each SAE on two concepts: orange (linearly separable) and purple (nonlinearly separable). The shaded region is $\pm 1$ SD. SAEs that assume linear separability struggle to capture the nonlinearly separable concept. b) Receptive fields of the most monosemantic latent for each SAE, illustrating how some architectures fail to isolate the nonlinear concept cleanly. Intensity of color indicates strength of SAE latent activation. c) Matrix of pairwise cosine similarities between sparse codes of different datapoints, and data clusters obtained through spectral clustering on this matrix. In the scatter plot, points with the same color belong to one spectral cluster, which intuitively indicates that they activate a common set of SAE latents. SpaDE maintains clear concept boundaries and does not mix distinct features, while other SAEs group subsets of different features into the same spectral cluster (same color).
  • ...and 30 more figures
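
The projection view in the captions for Figures 2 and 4 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, `project_topk` keeps the K largest entries by value (a strict Euclidean projection onto K-sparse vectors would rank by magnitude), `project_jumprelu` uses the standard threshold form with an assumed scalar threshold `theta`, and `project_simplex` is the generic sort-based Euclidean projection onto the probability simplex, which may differ from SpaDE's full encoder.

```python
import numpy as np

def project_relu(v):
    """Projection onto the positive orthant: clip negative entries to zero."""
    return np.maximum(np.asarray(v, dtype=float), 0.0)

def project_topk(v, k):
    """Keep the k largest entries (by value) and zero the rest."""
    v = np.asarray(v, dtype=float)
    out = np.zeros_like(v)
    idx = np.argsort(v)[-k:]
    out[idx] = v[idx]
    return out

def project_jumprelu(v, theta):
    """Pass entries strictly above the threshold theta unchanged; zero the rest."""
    v = np.asarray(v, dtype=float)
    return np.where(v > theta, v, 0.0)

def project_simplex(v):
    """Euclidean projection onto the probability simplex (sort-based algorithm)."""
    v = np.asarray(v, dtype=float)
    u = np.sort(v)[::-1]                # entries in descending order
    css = np.cumsum(u)
    j = np.arange(1, len(v) + 1)
    support = 1.0 + j * u > css         # true on a prefix of the sorted entries
    rho = j[support][-1]                # size of the support
    tau = (css[rho - 1] - 1.0) / rho    # shift so the output sums to one
    return np.maximum(v - tau, 0.0)

# The number of nonzeros adapts to the input, as in the Figure 4 caption:
print(project_simplex([0.4, 0.3, 0.3]))   # on the face:  3 nonzeros
print(project_simplex([0.5, 0.3, -1.0]))  # on an edge:   2 nonzeros
print(project_simplex([5.0, 0.0, 0.0]))   # at a corner:  1 nonzero
```

Unlike TopK, where the number of active entries is fixed in advance by K, the simplex projection lets the input decide how many entries survive, which is the adaptive sparsity the Figure 4 caption describes.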

Theorems & Definitions (26)

  • Definition 3.1: Projection Nonlinearity
  • Claim 3.1: Bilevel optimization of SAEs
  • proof
  • Definition 3.2: Receptive Field
  • Theorem 4.1: Implicit Assumptions; Informal
  • Definition D.1: Projection Nonlinearity
  • Lemma D.2: Elementwise projections
  • proof
  • Theorem D.3: Projection Nonlinearities
  • proof
  • ...and 16 more