Kronecker Factorization Improves Efficiency and Interpretability of Sparse Autoencoders
Vadim Kurochkin, Yaroslav Aksenov, Daniil Laptev, Daniil Gavrilov, Nikita Balagansky
TL;DR
KronSAE tackles the encoder bottleneck in sparse autoencoders used for interpreting language-model activations by factorizing the latent space with head-wise Kronecker products and introducing a differentiable mAND gate. The approach reduces encoder FLOPs from $O(Fd)$ to $O(h(m+n)d)$ while preserving reconstruction quality and improving feature disentanglement and interpretability, including lower feature absorption. Across multiple models (Qwen, Pythia, Gemma) and token budgets, KronSAE achieves on-par explained variance with fewer parameters and demonstrates clearer, more monosemantic latent structure. Ablation studies and analyses link the gains to the compositional latent architecture and AND-like gating, offering a scalable path to interpretable, efficient latent representations in large-scale language-model analyses.
Abstract
Sparse Autoencoders (SAEs) have demonstrated significant promise in interpreting the hidden states of language models by decomposing them into interpretable latent directions. However, training and interpreting SAEs at scale remain challenging, especially when large dictionary sizes are used. While decoders can leverage sparse-aware kernels for efficiency, encoders still require computationally intensive linear operations with large output dimensions. To address this, we propose KronSAE, a novel architecture that factorizes the latent representation via Kronecker product decomposition, drastically reducing memory and computational overhead. Furthermore, we introduce mAND, a differentiable activation function approximating the binary AND operation, which improves interpretability and performance in our factorized framework.
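To make the encoder cost reduction quoted in the TL;DR concrete (from $O(Fd)$ to $O(h(m+n)d)$), the following is a minimal sketch, not the authors' implementation. It assumes that $d$ is the hidden-state dimension, that $F = h \cdot m \cdot n$ is the dictionary size, and that each of the $h$ heads composes two small pre-latent vectors of sizes $m$ and $n$ through an outer (Kronecker-style) product. The names `kron_encode`, `W_p`, and `W_q` are hypothetical, and the gate used here (square root of the product of two ReLU outputs) is only an AND-like stand-in chosen for illustration; the text above does not give the definition of mAND.

```python
import numpy as np


def dense_encoder_macs(d, F):
    """Multiply-accumulates for a dense encoder projection d -> F."""
    return F * d


def kron_encoder_macs(d, h, m, n):
    """Multiply-accumulates for h heads, each projecting d -> m and d -> n."""
    return h * (m + n) * d


def kron_encode(x, W_p, W_q):
    """Sketch of a head-wise factorized encoder (illustrative assumption).

    x:   (d,)       input activation
    W_p: (h, m, d)  first factor projection per head
    W_q: (h, n, d)  second factor projection per head
    Returns a latent vector of size h * m * n.
    """
    p = np.maximum(W_p @ x, 0.0)  # (h, m) non-negative pre-latents
    q = np.maximum(W_q @ x, 0.0)  # (h, n) non-negative pre-latents
    # AND-like stand-in gate: a latent is non-zero only when both of its
    # factors are positive. This is NOT taken from the paper's mAND definition.
    z = np.sqrt(p[:, :, None] * q[:, None, :])  # (h, m, n)
    return z.reshape(-1)


if __name__ == "__main__":
    d, h, m, n = 512, 4, 64, 64
    F = h * m * n
    print("dense encoder MACs:     ", dense_encoder_macs(d, F))
    print("factorized encoder MACs:", kron_encoder_macs(d, h, m, n))

    rng = np.random.default_rng(0)
    x = rng.standard_normal(d)
    W_p = rng.standard_normal((h, m, d))
    W_q = rng.standard_normal((h, n, d))
    print("latent size:", kron_encode(x, W_p, W_q).shape[0])
```

With these illustrative sizes the two projections cost 32 times fewer multiply-accumulates than the dense encoder. The counts cover only the linear projections; composing the $h \cdot m \cdot n$ latents is an additional elementwise step that is not included in the $O(h(m+n)d)$ term.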
