Interpreting CLIP with Hierarchical Sparse Autoencoders
Vladimir Zaigrajew, Hubert Baniecki, Przemyslaw Biecek
TL;DR
This work introduces Matryoshka Sparse Autoencoders (MSAE) to interpret CLIP by learning hierarchical, multi-granularity representations that optimize reconstruction fidelity and sparsity together. By applying a series of TopK operations at increasing granularity levels and combining the resulting losses, MSAE establishes a new Pareto frontier between reconstruction quality and sparsity on CLIP embeddings and enables the extraction of over 120 interpretable concepts. The authors demonstrate the utility of MSAE for concept-based similarity search and for bias analysis in downstream tasks such as CelebA, and validate the approach on CC3M and ImageNet with multiple CLIP architectures. The code is publicly released, and the ability to analyze and control CLIP representations makes the method practically relevant for interpretable multimodal AI and bias auditing.
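The mechanism named above, several TopK operations at increasing granularity whose reconstruction losses are combined, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the ReLU encoder without biases, the mean-squared-error loss, and the uniform level weights are all assumptions made for illustration, and the released codebase should be consulted for the actual design.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def top_k(z, k):
    """Keep the k largest activations and set the rest to zero."""
    keep = set(sorted(range(len(z)), key=lambda i: z[i], reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(z)]

def msae_loss(x, W_enc, W_dec, ks, weights=None):
    """Hypothetical Matryoshka-style loss for one embedding x.

    The encoder runs once; the same latent vector is then sparsified
    with each k in ks, decoded, and scored. Uniform weights are an
    assumption, not taken from the paper.
    """
    z = [max(0.0, v) for v in matvec(W_enc, x)]  # ReLU pre-activations
    weights = weights or [1.0 / len(ks)] * len(ks)
    per_level = []
    total = 0.0
    for k, w in zip(ks, weights):
        x_hat = matvec(W_dec, top_k(z, k))       # reconstruct from k latents
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        per_level.append(mse)
        total += w * mse
    return total, per_level

# Toy usage with random, untrained weights: 4-dim input, 8 latents.
random.seed(0)
d, n = 4, 8
W_enc = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
W_dec = [[random.gauss(0, 1) for _ in range(n)] for _ in range(d)]
x = [random.gauss(0, 1) for _ in range(d)]
total, per_level = msae_loss(x, W_enc, W_dec, ks=[2, 4, 8])
print(total, per_level)
```

Because every level shares one encoder pass and one decoder, the coarse levels (small k) push the most important features to the top of the activation ranking, while the fine levels (large k) push toward faithful reconstruction.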
Abstract
Sparse autoencoders (SAEs) are useful for detecting and steering interpretable features in neural networks, with particular potential for understanding complex multimodal representations. This makes SAEs especially valuable for analyzing large-scale vision-language models (e.g., CLIP and SigLIP), which are fundamental building blocks in modern systems yet remain challenging to interpret and control. However, current SAE methods struggle to optimize reconstruction quality and sparsity simultaneously, because they rely on either activation suppression or rigid sparsity constraints. To this end, we introduce Matryoshka SAE (MSAE), a new architecture that learns hierarchical representations at multiple granularities simultaneously, enabling direct optimization of both metrics without compromise. MSAE establishes a new state-of-the-art Pareto frontier between reconstruction quality and sparsity for CLIP, achieving 0.99 cosine similarity and a fraction of variance unexplained below 0.1 while maintaining ~80% sparsity. Finally, we demonstrate the utility of MSAE as a tool for interpreting and controlling CLIP by extracting over 120 semantic concepts from its representation and using them for concept-based similarity search and for bias analysis in downstream tasks such as CelebA. We make the codebase available at https://github.com/WolodjaZ/MSAE.
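The three quantities reported in the abstract can be made concrete with a short sketch. The definitions below are the common ones and are assumptions here: fraction of variance unexplained (FVU) is taken as the residual sum of squares divided by the total sum of squares around the batch mean, and sparsity as the fraction of latent activations that are exactly zero. The paper's exact normalization may differ, and the helper names are hypothetical.

```python
import math

def cosine_similarity(x, x_hat):
    """Cosine of the angle between an embedding and its reconstruction."""
    dot = sum(a * b for a, b in zip(x, x_hat))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_h = math.sqrt(sum(b * b for b in x_hat))
    return dot / (norm_x * norm_h)

def fvu(X, X_hat):
    """Fraction of variance unexplained over a batch (assumed definition).

    0.0 means perfect reconstruction; 1.0 matches always predicting
    the batch mean.
    """
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    resid = sum((a - b) ** 2
                for row, rec in zip(X, X_hat)
                for a, b in zip(row, rec))
    total = sum((a - m) ** 2 for row in X for a, m in zip(row, mean))
    return resid / total

def sparsity(z):
    """Fraction of latent activations that are exactly zero (assumed definition)."""
    return sum(1 for v in z if v == 0.0) / len(z)

# Toy batch of two 2-dimensional embeddings and slightly noisy reconstructions.
X = [[1.0, 2.0], [3.0, 4.0]]
X_hat = [[1.1, 1.9], [2.9, 4.1]]
print(cosine_similarity(X[0], X_hat[0]), fvu(X, X_hat))
```

Under these definitions, the reported operating point corresponds to reconstructions that are almost parallel to the original embeddings, with less than a tenth of the variance lost, while roughly four in five latents are inactive for a given input.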
