Kronecker Factorization Improves Efficiency and Interpretability of Sparse Autoencoders

Vadim Kurochkin, Yaroslav Aksenov, Daniil Laptev, Daniil Gavrilov, Nikita Balagansky

TL;DR

KronSAE tackles the encoder bottleneck in sparse autoencoders used for interpreting language-model activations by factorizing the latent space with head-wise Kronecker products and introducing a differentiable mAND gate. The approach reduces encoder FLOPs from $O(Fd)$ to $O(h(m+n)d)$ while preserving reconstruction quality and improving feature disentanglement and interpretability, including lower feature absorption. Across multiple models (Qwen, Pythia, Gemma) and token budgets, KronSAE achieves on-par explained variance with fewer parameters and demonstrates clearer, more monosemantic latent structure. Ablation studies and analyses link the gains to the compositional latent architecture and AND-like gating, offering a scalable path to interpretable, efficient latent representations in large-scale language-model analyses.
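
The FLOP reduction quoted above can be checked with a little arithmetic. The sketch below assumes the dictionary size factorizes as $F = h \cdot m \cdot n$ ($h$ heads, each combining an $m$-dimensional and an $n$-dimensional pre-latent); that relation, the function names, and the concrete sizes are illustrative assumptions rather than values from the paper. It counts only the multiply-accumulates of the encoder projections and ignores the cheaper elementwise work of combining pre-latents.

```python
def dense_encoder_flops(d: int, F: int) -> int:
    """Multiply-accumulates for a dense d -> F encoder projection: O(F d)."""
    return F * d


def kron_encoder_flops(d: int, h: int, m: int, n: int) -> int:
    """Multiply-accumulates for h heads, each projecting d -> m and d -> n:
    O(h (m + n) d)."""
    return h * (m + n) * d


# Hypothetical sizes chosen only for illustration.
d, h, m, n = 1536, 4096, 4, 4
F = h * m * n  # assumed factorization of the dictionary size

dense = dense_encoder_flops(d, F)
kron = kron_encoder_flops(d, h, m, n)
print(f"F = {F}, dense = {dense}, factorized = {kron}, ratio = {dense / kron:.2f}")
```

Under this assumption the saving factor is $mn/(m+n)$, which equals 2 for $m = n = 4$ and grows as the per-head factors get larger.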

Abstract

Sparse Autoencoders (SAEs) have demonstrated significant promise in interpreting the hidden states of language models by decomposing them into interpretable latent directions. However, training and interpreting SAEs at scale remains challenging, especially when large dictionary sizes are used. While decoders can leverage sparse-aware kernels for efficiency, encoders still require computationally intensive linear operations with large output dimensions. To address this, we propose KronSAE, a novel architecture that factorizes the latent representation via Kronecker product decomposition, drastically reducing memory and computational overhead. Furthermore, we introduce mAND, a differentiable activation function approximating the binary AND operation, which improves interpretability and performance in our factorized framework.
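
A minimal NumPy sketch of the factorized encoder idea follows. It is not the authors' implementation: the weight names `W_p` and `W_q`, the tensor shapes, and the specific form of the gate (the square root of the product when both inputs are positive, zero otherwise) are assumptions made for illustration, since this page does not define mAND. Any sparsification step applied to the latents afterwards is omitted.

```python
import numpy as np


def mand(u, v):
    """AND-like gate (assumed form): sqrt(u * v) where both inputs are
    positive, 0 otherwise."""
    u_pos = np.maximum(u, 0.0)
    v_pos = np.maximum(v, 0.0)
    return np.sqrt(u_pos * v_pos)


def kron_encode(x, W_p, W_q):
    """Head-wise factorized encoder (illustrative sketch).

    x:   (d,)       input activation
    W_p: (h, m, d)  first pre-latent projection per head (hypothetical name)
    W_q: (h, n, d)  second pre-latent projection per head (hypothetical name)
    Returns a vector of h * m * n non-negative latents.
    """
    p = W_p @ x  # (h, m) pre-latents
    q = W_q @ x  # (h, n) pre-latents
    # Combine every pair (p_i, q_j) within a head, Kronecker-product style,
    # using the gate instead of a plain product.
    z = mand(p[:, :, None], q[:, None, :])  # (h, m, n)
    return z.reshape(-1)


rng = np.random.default_rng(0)
d, h, m, n = 8, 3, 2, 4
x = rng.normal(size=d)
W_p = rng.normal(size=(h, m, d))
W_q = rng.normal(size=(h, n, d))
z = kron_encode(x, W_p, W_q)
print(z.shape)
```

The encoder only ever multiplies by matrices with $h(m+n)$ output rows in total, yet it produces $h \cdot m \cdot n$ latents, and a latent is active only when both of its pre-latents are positive.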

Paper Structure

This paper contains 63 sections, 15 equations, 23 figures, 10 tables.

Figures (23)

  • Figure 1: Maximum performance of KronSAE, TopK SAE, and Matryoshka TopK SAE on Qwen-1.5B for different dictionary sizes $F$ and budgets in the iso-FLOP setting. KronSAE, with fewer parameters, is on par with the baseline, and the gap narrows as the dictionary size grows.
  • Figure 2: Dependence of explained variance (EV) on head count $h$ (x-axis) and base dimension $m$ under 500M and 1B token budgets in the iso-FLOP setup. Higher $h$ and smaller $m$ improve reconstruction quality because the pre-latents become more expressive at encoding semantics and the number of trainable parameters increases.
  • Figure 3: Maximum performance for baselines and their KronSAE modifications for different sparsity levels in iso-FLOP setting. KronSAE variants, despite using fewer trainable parameters, achieve reconstruction quality comparable to or better than the unmodified baselines.
  • Figure 4: Feature absorption metrics on Qwen-2.5 1.5B and Gemma-2 2B. KronSAE configurations (various $m,n$) exhibit lower mean absorption fractions and full-absorption scores than the selected baselines across different $\ell_0$ values.
  • Figure 5: We generate data with a covariance matrix that consists of blocks of different sizes on and off the diagonal (left panel). We then examine the decoder-weight covariance $W_{\mathrm{dec}} \cdot W_{\mathrm{dec}}^\top$ to assess feature-embedding correlations and compute the RV score to quantify the similarity between the learned and ground-truth covariance matrices. The second panel shows the feature embeddings $W_{\mathrm{enc}} \cdot W_{\mathrm{enc}}^\top$ of the trained autoencoder. The third panel demonstrates that a TopK SAE recovers these correlation structures only weakly, as indicated by a relatively low RV coefficient $(0.157)$ even after optimal atom matching. In contrast, KronSAE (right panel) reveals the original block patterns more accurately.
  • ...and 18 more figures
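
The RV score mentioned in the Figure 5 caption can be illustrated with the standard RV coefficient between two symmetric matrices, $\mathrm{RV}(S_1, S_2) = \operatorname{tr}(S_1^\top S_2) / \sqrt{\operatorname{tr}(S_1^\top S_1)\,\operatorname{tr}(S_2^\top S_2)}$. The sketch below is a generic implementation under that definition, not the paper's evaluation code: the toy block covariance, the random stand-in for decoder weights, and all variable names are assumptions, and the atom-matching step described in the caption is not reproduced.

```python
import numpy as np


def rv_coefficient(S1, S2):
    """RV coefficient between two symmetric (covariance-like) matrices."""
    num = np.trace(S1.T @ S2)
    den = np.sqrt(np.trace(S1.T @ S1) * np.trace(S2.T @ S2))
    return num / den


# Toy ground truth: two 2x2 blocks of perfectly correlated features.
S_true = np.kron(np.eye(2), np.ones((2, 2)))

# Stand-in "learned" decoder with 4 atoms in 16 dimensions (random, hypothetical).
rng = np.random.default_rng(0)
W_dec = rng.normal(size=(4, 16))
S_learned = W_dec @ W_dec.T  # (4, 4) atom-by-atom Gram matrix

print(rv_coefficient(S_true, S_true))     # identical structure
print(rv_coefficient(S_true, S_learned))  # random weights vs. ground truth
```

The coefficient is invariant to rescaling either matrix and reaches 1 only when the two matrices are proportional, which is why it serves as a similarity score between a learned and a ground-truth correlation structure.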