Which Sparse Autoencoder Features Are Real? Model-X Knockoffs for False Discovery Rate Control
Tsogt-Ochir Enkhbayar
TL;DR
This paper addresses how to validate sparse autoencoder (SAE) features for neural network interpretability while controlling false discoveries in high-dimensional settings. It adapts Model-X knockoffs to SAE latents, using Gaussian knockoffs with energy-based feature reduction and an L1-regularized logistic model to compute knockoff statistics and perform FDR-controlled feature selection. In a sentiment classification experiment on SST-2 with Pythia-70M, it selects 129 of 512 candidate SAE features at a target FDR of q = 0.1, achieving a 5.40× signal-to-noise ratio (the separation in knockoff statistics between selected and non-selected features) and a training accuracy of 77.4% on the augmented design. The framework provides finite-sample FDR guarantees under the standard Model-X assumptions, helps separate genuine signal from noise and spurious correlations, and offers a reusable, reproducible approach to principled mechanistic interpretability.
Abstract
Although sparse autoencoders (SAEs) are crucial for identifying interpretable features in neural networks, distinguishing real computational patterns from spurious correlations remains challenging. We introduce Model-X knockoffs to SAE feature selection, using knockoff+ to control the false discovery rate (FDR) with finite-sample guarantees under the standard Model-X assumptions (in our case, via a Gaussian surrogate for the latent distribution). Analyzing 512 high-activity SAE latents for sentiment classification with Pythia-70M, we select 129 features at a target FDR of q = 0.1. The selected set suggests that about 25% of the examined latents carry task-relevant signal while the remaining 75% do not, and it shows a 5.40× separation in knockoff statistics relative to non-selected features. By combining SAEs with multiple-testing-aware inference, our method offers a reproducible and principled framework for reliable feature discovery, advancing the foundations of mechanistic interpretability.
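To make the knockoff+ selection step concrete, the following is a minimal sketch of the standard knockoff+ threshold rule, not the authors' implementation. It assumes a vector of knockoff statistics W has already been computed, where a large positive W_j indicates that feature j outperforms its knockoff copy (for instance, a difference of absolute L1-regularized coefficients, which is a common choice and an assumption here). The function names and the toy values are hypothetical.

```python
import numpy as np


def knockoff_plus_threshold(W, q=0.1):
    """Return the knockoff+ threshold for statistics W at target FDR q.

    The threshold is the smallest t among the nonzero |W_j| such that
        (1 + #{j : W_j <= -t}) / max(1, #{j : W_j >= t}) <= q.
    Returns np.inf when no candidate satisfies the bound.
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds: distinct nonzero magnitudes, in ascending order.
    candidates = np.unique(np.abs(W[W != 0]))
    for t in candidates:
        # The "+1" in the numerator is what distinguishes knockoff+ from
        # the plain knockoff filter.
        fdp_estimate = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_estimate <= q:
            return float(t)
    return np.inf


def knockoff_select(W, q=0.1):
    """Return indices of features whose statistic meets the threshold."""
    W = np.asarray(W, dtype=float)
    tau = knockoff_plus_threshold(W, q)
    return np.flatnonzero(W >= tau)


if __name__ == "__main__":
    # Toy statistics (hypothetical): ten strongly positive values and two
    # small negative ones that mimic null features.
    W_toy = [5, 4, 3, 2.5, 2, 1.5, 1.2, 1.1, 1.05, 1.0, -0.5, -0.2]
    print("threshold:", knockoff_plus_threshold(W_toy, q=0.25))
    print("selected:", knockoff_select(W_toy, q=0.25))
```

Because the numerator includes the extra 1, at least 1/q features must clear the threshold before anything is selected; with q = 0.1 that means at least 10 selections, so a very small candidate set can legitimately return nothing.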
