Data Whitening Improves Sparse Autoencoder Learning

Ashwin Saraswatula, David Klindt

TL;DR

This work demonstrates that PCA whitening of input activations reshapes the SAE optimization landscape, making it more isotropic and conducive to learning interpretable, sparse features. The authors provide theoretical arguments, simulations, and SAEBench-based experiments showing significant interpretability gains for ReLU and Top-K SAEs, with modest reconstruction trade-offs. Whitening consistently improves metrics such as Sparse Probing, SCR, and TPP, challenging the idea that an optimal sparsity–fidelity balance yields the most interpretable representations. The results suggest that whitening should be a default preprocessing step when interpretability is the priority, and they highlight the importance of activation geometry, beyond sparsity alone, in feature formation.

Abstract

Sparse autoencoders (SAEs) have emerged as a promising approach for learning interpretable features from neural network activations. However, the optimization landscape for SAE training can be challenging due to correlations in the input data. We demonstrate that applying PCA whitening to input activations, a standard preprocessing technique in classical sparse coding, improves SAE performance across multiple metrics. Through theoretical analysis and simulation, we show that whitening transforms the optimization landscape, making it more convex and easier to navigate. We evaluate both ReLU and Top-K SAEs across diverse model architectures, widths, and sparsity regimes. Empirical evaluation on SAEBench, a comprehensive benchmark for sparse autoencoders, reveals that whitening consistently improves interpretability metrics, including sparse probing accuracy and feature disentanglement, despite minor drops in reconstruction quality. Our results challenge the assumption that interpretability aligns with an optimal sparsity–fidelity trade-off and suggest that whitening should be considered as a default preprocessing step for SAE training, particularly when interpretability is prioritized over perfect reconstruction.
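
For readers unfamiliar with the preprocessing step named in the abstract, the following is a minimal NumPy sketch of PCA whitening applied to a batch of activations. It is an illustration under stated assumptions, not the authors' implementation: the function name `fit_pca_whitening`, the `eps` regularizer, and the choice to return an inverse transform (for mapping SAE reconstructions back to the original activation space) are all hypothetical.

```python
import numpy as np

def fit_pca_whitening(X, eps=1e-5):
    """Estimate a PCA whitening transform from activations X of shape (n, d).

    Returns (mean, W, W_inv) such that Z = (X - mean) @ W has approximately
    identity covariance, and Z @ W_inv + mean recovers X.
    `eps` is an assumed regularizer guarding against near-zero eigenvalues.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # Sample covariance of the centered activations, shape (d, d).
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    # eigh is appropriate because the covariance matrix is symmetric.
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.clip(eigvals, 0.0, None)  # remove tiny negative values from round-off
    scale = np.sqrt(eigvals + eps)
    # Rotate onto the principal axes, then rescale each axis to unit variance.
    W = eigvecs / scale
    # Exact inverse of W: undo the rescaling, then rotate back.
    W_inv = (eigvecs * scale).T
    return mean, W, W_inv

# Usage on synthetic correlated "activations" (illustrative data only).
rng = np.random.default_rng(0)
mixing = np.array([[3.0, 1.0, 0.0],
                   [0.0, 1.0, 0.5],
                   [0.0, 0.0, 0.5]])
X = rng.normal(size=(5000, 3)) @ mixing + 3.0
mean, W, W_inv = fit_pca_whitening(X)
Z = (X - mean) @ W            # whitened inputs an SAE would be trained on
X_back = Z @ W_inv + mean     # map back to the original activation space
```

In this sketch the SAE would be trained on `Z` rather than `X`; whether reconstruction metrics are then computed in the whitened or the original space is a design choice this summary does not specify.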

Paper Structure

This paper contains 51 sections, 15 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Whitening transforms the optimization landscape. 3D visualization of the sparse coding landscape over all dictionary angles $(\theta_0, \theta_1) \in [0, 2\pi]^2$. Surface height shows sparsity (higher = sparser); color indicates feature recovery quality (brighter = better). A: Without whitening, high sparsity regions (peaks) are misaligned with accurate feature recovery (bright). B: After whitening, the landscape becomes isotropic and sparsity aligns with feature quality.
  • Figure 2: Complementary view of the optimization landscape. Surface height shows feature recovery quality; color indicates sparsity level (brighter = sparser). A: Without whitening, optimizing for sparsity (moving toward bright regions) may lead to poor feature recovery. B: After whitening, pursuing sparsity naturally yields interpretable features (bright colors at peaks).
  • Figure 3: ReLU architecture. Each line connects paired runs before (left) and after whitening (right), averaged across all configurations (both models, all widths, all sparsity penalties). The figure shows significant increases in Sparse Probing, SCR, and TPP, accompanied by modest decreases in CE Loss and Explained Variance.
  • Figure 4: Top-K architecture. Each line connects paired runs before (left) and after whitening (right), averaged across all configurations (both models, widths, and target L0s). The figure shows a strong increase in Sparse Probing with no significant changes in SCR or TPP, alongside small decreases in CE Loss and Explained Variance.