Which Sparse Autoencoder Features Are Real? Model-X Knockoffs for False Discovery Rate Control

Tsogt-Ochir Enkhbayar

TL;DR

The paper addresses how to validate sparse autoencoder (SAE) features for neural network interpretability while controlling false discoveries in high-dimensional settings. It adapts Model-X knockoffs to SAE latents, using Gaussian knockoffs with energy-based feature reduction and an L1-regularized logistic model to obtain knockoff statistics and FDR-controlled feature selection. In a sentiment classification experiment on SST-2 with Pythia-70M, it selects 129 of 512 candidate SAE features at a target FDR of q=0.1, with a 5.40× signal-to-noise ratio between selected and rejected features and a training accuracy of 77.4% on the augmented design. The framework provides finite-sample statistical guarantees under the Model-X assumptions, separates task-relevant signal from noise and spurious correlations, and offers a reusable, reproducible approach to mechanistic interpretability.

Abstract

Although sparse autoencoders (SAEs) are crucial for identifying interpretable features in neural networks, distinguishing real computational patterns from spurious correlations remains challenging. We introduce Model-X knockoffs to SAE feature selection, using knockoff+ to control the false discovery rate (FDR) with finite-sample guarantees under the standard Model-X assumptions (in our case, via a Gaussian surrogate for the latent distribution). Analyzing 512 high-activity SAE latents for sentiment classification with Pythia-70M, we select 129 features at a target FDR of q=0.1. The selection indicates that about 25% of the examined latents carry task-relevant signal while 75% do not, and the selected set shows a 5.40x separation in knockoff statistics relative to non-selected features. By combining SAEs with multiple-testing-aware inference, our method offers a reproducible and principled framework for reliable feature discovery, advancing the foundations of mechanistic interpretability.

Paper Structure

This paper contains 29 sections, 1 theorem, 10 equations, 1 figure, 2 tables, 2 algorithms.

Key Result

Theorem 1

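The knockoff+ threshold

$$\tau = \min\left\{ t \in \mathcal{W} : \frac{1 + \#\{j : W_j \le -t\}}{\max\left(1, \#\{j : W_j \ge t\}\right)} \le q \right\},$$

where $\mathcal{W} = \{|W_j|: j = 1, \ldots, p\}$, provides finite-sample FDR control at level $q$ under arbitrary dependence between features.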
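
As an illustration of Theorem 1, the knockoff+ threshold can be found by scanning the candidate values $|W_j|$ in increasing order and returning the smallest $t$ for which $(1 + \#\{j: W_j \le -t\}) / \max(1, \#\{j: W_j \ge t\}) \le q$, the standard definition from Barber and Candès (2015). The sketch below is a minimal pure-Python version, not the paper's implementation. The function names `knockoff_plus_threshold` and `select_features` are hypothetical, and skipping zero as a candidate threshold is an assumed convention.

```python
def knockoff_plus_threshold(w, q=0.1):
    """Return the knockoff+ threshold for statistics w at target FDR q.

    Returns float('inf') when no candidate meets the bound, in which
    case no feature is selected.
    """
    # Candidate thresholds are the distinct nonzero magnitudes |W_j|
    # (skipping zero is an assumed convention, not taken from the source).
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)  # proxy for false discoveries
        positives = sum(1 for x in w if x >= t)   # features selected at t
        # The "+1" in the numerator is what distinguishes knockoff+.
        if (1 + negatives) / max(1, positives) <= q:
            return t
    return float("inf")


def select_features(w, q=0.1):
    """Indices j with W_j >= tau, where tau is the knockoff+ threshold."""
    tau = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= tau]


# Toy statistics for illustration only (not data from the paper).
w_toy = [2.0, 1.5, 1.2, 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, -0.1, 0.05]
tau_toy = knockoff_plus_threshold(w_toy, q=0.15)
selected_toy = select_features(w_toy, q=0.15)
```

On the toy vector, the thresholds 0.05 and 0.1 give estimated ratios of 2/11 and 2/10, both above 0.15, so the scan stops at 0.4, where the ratio is 1/10. Because of the added 1 in the numerator, at least $1/q$ features must clear the threshold before anything is selected.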

Figures (1)

  • Figure 1: Knockoff statistics for SAE latents. We compute Model-X knockoff+ statistics $W$ for the top $p=512$ energy-filtered latents from Pythia-70M (layer 3) on 4,096 SST-2 sentences and select features with $W \ge \tau$ at target FDR $q=0.1$. (a) Histogram of $W$ with threshold $\tau=0.158$. (b) Sorted $W$ (waterfall); red bars indicate the 129 selected features. (c) Cumulative distribution function of $W$. Summary: $129/512$ features selected; mean $W$ (selected) $=0.363$; mean $W$ (rejected) $=0.011$; signal-to-noise $=$ mean $W_{\text{selected}}$ / mean $|W_{\text{rejected}}|$ $=5.40\times$; Cohen's $d$ (selected vs. all rejected) $=1.79$. Using only the selected features, an $\ell_1$-regularized logistic classifier achieves $77.4\%$ training accuracy.
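
The summary statistics in the Figure 1 caption can be recomputed from a vector of knockoff statistics and a threshold. The sketch below is a minimal version under stated assumptions: the signal-to-noise ratio follows the caption's definition (mean $W$ of selected features over mean $|W|$ of rejected features), while the pooled-standard-deviation form of Cohen's $d$ is an assumption, since the source does not say which variant it uses. The function name `separation_summary` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def separation_summary(w, tau):
    """Summarize how well selected features (W >= tau) separate from the rest.

    Needs at least two selected and two rejected features so that the
    sample variances are defined.
    """
    selected = [x for x in w if x >= tau]
    rejected = [x for x in w if x < tau]

    # Signal-to-noise as defined in the caption: mean W_selected / mean |W_rejected|.
    snr = mean(selected) / mean([abs(x) for x in rejected])

    # Cohen's d with a pooled standard deviation (assumed variant).
    n1, n2 = len(selected), len(rejected)
    pooled_var = ((n1 - 1) * variance(selected)
                  + (n2 - 1) * variance(rejected)) / (n1 + n2 - 2)
    cohens_d = (mean(selected) - mean(rejected)) / sqrt(pooled_var)

    return {"n_selected": n1, "snr": snr, "cohens_d": cohens_d}


# Toy statistics for illustration only (not data from the paper).
summary_toy = separation_summary([1.0, 0.8, 0.1, -0.1, 0.05, -0.05], tau=0.5)
```

Under the caption's definition, the reported values imply a mean $|W|$ of roughly $0.363 / 5.40 \approx 0.067$ for rejected features. This is larger than the signed mean of $0.011$, presumably because positive and negative statistics partly cancel in the signed average.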

Theorems & Definitions (2)

  • Definition 1: Knockoff Variables
  • Theorem 1: FDR Control (Barber and Candès, 2015)