Interpreting CLIP with Hierarchical Sparse Autoencoders

Vladimir Zaigrajew, Hubert Baniecki, Przemyslaw Biecek

TL;DR

This work introduces Matryoshka Sparse Autoencoders (MSAE) to interpret CLIP by learning hierarchical, multi-granularity representations that jointly optimize reconstruction fidelity and sparsity. By applying a series of TopK operations at increasing granularity levels and combining the resulting losses, MSAE establishes a new Pareto frontier on CLIP embeddings and enables the extraction of over 120 interpretable concepts. The authors demonstrate the utility of MSAE for concept-based similarity search and bias analysis in downstream tasks like CelebA, validating the approach across CC3M and ImageNet with multiple CLIP architectures. The released code and the demonstrated ability to analyze and control CLIP representations suggest practical value for interpretable multimodal AI and bias auditing.
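The multi-granularity idea described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the helper names (`top_k`, `encode`, `decode`, `matryoshka_loss`), the ReLU encoder, the mean-squared-error objective, and the uniform default weights are all assumptions made for illustration. The sketch encodes an input once, applies TopK at several values of $k$, decodes each truncated latent, and combines the per-level reconstruction losses.

```python
def top_k(values, k):
    """Keep the k largest entries of `values` and zero out the rest."""
    keep = set(sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(values)]


def encode(x, W_enc, b_enc):
    """ReLU(W_enc @ x + b_enc): dense latent activations before any TopK (assumed form)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W_enc, b_enc)]


def decode(z, W_dec, b_dec):
    """W_dec @ z + b_dec: map a (sparse) latent back to the input space."""
    return [sum(w * zi for w, zi in zip(row, z)) + b
            for row, b in zip(W_dec, b_dec)]


def mse(x, x_hat):
    """Mean squared reconstruction error for one example."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def matryoshka_loss(x, W_enc, b_enc, W_dec, b_dec, ks, weights=None):
    """Weighted sum of reconstruction losses, one per granularity level k in `ks`."""
    if weights is None:
        # Assumption: uniform weighting across levels.
        weights = [1.0 / len(ks)] * len(ks)
    z = encode(x, W_enc, b_enc)  # encode once, truncate many times
    per_level = [mse(x, decode(top_k(z, k), W_dec, b_dec)) for k in ks]
    total = sum(w * loss for w, loss in zip(weights, per_level))
    return total, per_level


# Toy example: identity encoder/decoder, so the latent equals the input.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
zeros = [0.0] * 4
total, per_level = matryoshka_loss(
    [4.0, 3.0, 2.0, 1.0], identity, zeros, identity, zeros, ks=[1, 2, 4]
)
print(per_level)  # [3.5, 1.25, 0.0]
```

In the toy example the coarsest level ($k=1$) reconstructs only the largest coordinate, while the finest level ($k=d$) keeps the full latent, so a single set of weights is trained to be useful at every granularity.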

Abstract

Sparse autoencoders (SAEs) are useful for detecting and steering interpretable features in neural networks, with particular potential for understanding complex multimodal representations. Given their ability to uncover interpretable features, SAEs are particularly valuable for analyzing large-scale vision-language models (e.g., CLIP and SigLIP), which are fundamental building blocks in modern systems yet remain challenging to interpret and control. However, current SAE methods are limited in their ability to optimize reconstruction quality and sparsity simultaneously, as they rely on either activation suppression or rigid sparsity constraints. To this end, we introduce Matryoshka SAE (MSAE), a new architecture that learns hierarchical representations at multiple granularities simultaneously, enabling direct optimization of both metrics without compromise. MSAE establishes a new state-of-the-art Pareto frontier between reconstruction quality and sparsity for CLIP, achieving 0.99 cosine similarity and less than 0.1 fraction of variance unexplained while maintaining ~80% sparsity. Finally, we demonstrate the utility of MSAE as a tool for interpreting and controlling CLIP by extracting over 120 semantic concepts from its representation to perform concept-based similarity search and bias analysis in downstream tasks like CelebA. We make the codebase available at https://github.com/WolodjaZ/MSAE.
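For readers unfamiliar with the three quantities reported in the abstract, the following sketch shows one common way to compute them. The function names are hypothetical, and the definitions are assumptions: FVU is taken as the residual sum of squares divided by the total sum of squares around the per-dimension mean, and sparsity as the fraction of exactly-zero latent activations. The paper's exact normalization may differ.

```python
import math


def cosine_similarity(x, x_hat):
    """Cosine of the angle between an embedding and its reconstruction."""
    dot = sum(a * b for a, b in zip(x, x_hat))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_hat = math.sqrt(sum(b * b for b in x_hat))
    return dot / (norm_x * norm_hat)


def fvu(batch, batch_hat):
    """Fraction of variance unexplained over a batch (assumed definition).

    0.0 means perfect reconstruction; 1.0 matches predicting the batch mean.
    """
    n, d = len(batch), len(batch[0])
    mean = [sum(row[j] for row in batch) / n for j in range(d)]
    residual = sum((a - b) ** 2
                   for row, row_hat in zip(batch, batch_hat)
                   for a, b in zip(row, row_hat))
    total = sum((a - m) ** 2 for row in batch for a, m in zip(row, mean))
    return residual / total


def sparsity(z):
    """Fraction of latent activations that are exactly zero (assumed definition)."""
    return sum(1 for v in z if v == 0) / len(z)


batch = [[1.0, 2.0], [3.0, 4.0]]
print(fvu(batch, batch))                     # 0.0
print(sparsity([0.0, 0.0, 0.0, 0.0, 1.5]))   # 0.8
```

Under these definitions, higher cosine similarity and lower FVU both indicate better reconstruction, while higher sparsity means fewer active latents per input.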

Paper Structure

This paper contains 43 sections, 5 equations, 24 figures, 18 tables.

Figures (24)

  • Figure 1: (A) Matryoshka Sparse Autoencoder (MSAE) enables learning hierarchical concept representations from coarse to fine-grained features while avoiding rigid sparsity constraints in TopK and the activation shrinkage problem in ReLU SAE. (B) At training, MSAE uses multiple top-$k$ values up to dimension $d$ instead of a single $k$ like in TopK SAE, combining losses across different granularities. (C) At inference, our method uses the whole $d$-dimensional representation. (D) MSAE allows for more precise editing and manipulation in the concept space.
  • Figure 2: Comparison of sparsity--fidelity trade-offs across SAE architectures on ImageNet-1k. Each model presents results from all 3 expansion rates, comparing ReLU SAE ($\lambda=\{0.03, 0.01, 0.003\}$), TopK SAE ($k=\{64, 128, 256\}$), BatchTopK SAE ($k=\{64, 128, 256\}$) and MSAE (RW, UW). The optimal SAE would occupy the upper right corner, achieving both high sparsity and reconstruction fidelity. For extended results across both modalities, refer to Figure \ref{fig:appendix_pareto}.
  • Figure 3: Low Granularity Level Matryoshka vs. TopK SAE on ImageNet-1k. We report FVU (left) and CKNNA (right) metrics for two TopK variants ($k = 128, 256$), and Matryoshka trained on these granularities in RW and UW variants at expansion rates 8 and 16. Even at this small granularity, MSAE improves the Pareto frontier relative to both TopK variants, pushing it further as the expansion rate grows from 8 to 16. For extended results across other metrics, refer to Figure \ref{fig:appendix_matryoshkavstopk}.
  • Figure 4: Distribution of non-zero SAE activations on the ImageNet-1k validation set. Frequency histograms for ReLU ($\lambda = 0.003$), TopK ($k = 32$), and Matryoshka (RW) models at expansion rate 8. Matryoshka models exhibit a double-curvature distribution similar to ReLU models but without activation shrinkage, while TopK shows this pattern only at higher $k$ values, as can be seen in the extended Figure \ref{fig:statistic_8}. Extended results for higher expansion rates are reported in Figure \ref{fig:statistic_more}.
  • Figure 5: Progressive recovery performance on ImageNet-1k. We report FVU (left) and CKNNA (right) metrics for different SAE architectures with expansion rate 8 as functions of increasing top-$k$ values, where the $k$ SAE activations with the largest magnitudes are kept during inference. SAEs trained with TopK ($k=32, 64$) show performance plateaus beyond their training thresholds, while ReLU-based models ($\lambda=0.001, 0.003$) and Matryoshka variants (UW and RW) demonstrate continuous improvement. Extended results for higher expansion rates and across other metrics are reported in Figures \ref{fig:progressive_8} & \ref{fig:progressive_more}.
  • ...and 19 more figures
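
The progressive recovery evaluation in the Figure 5 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' evaluation code: the function names and the bias-free linear decoder are hypothetical, and the sketch reports plain squared error in place of the FVU and CKNNA metrics used in the paper. For each $k$, only the $k$ largest-magnitude latent activations are kept at inference time before decoding.

```python
def top_k_by_magnitude(z, k):
    """Keep the k activations with the largest absolute value, zero the rest."""
    keep = set(sorted(range(len(z)), key=lambda i: abs(z[i]), reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(z)]


def decode(z, W_dec):
    """Linear decoder W_dec @ z (bias omitted for brevity; an assumption)."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in W_dec]


def progressive_recovery(x, z, W_dec, ks):
    """Return (k, squared reconstruction error) for each truncation level k."""
    curve = []
    for k in ks:
        x_hat = decode(top_k_by_magnitude(z, k), W_dec)
        error = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        curve.append((k, error))
    return curve


# Toy example: identity decoder, latent equal to the input.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
x = [4.0, 3.0, 2.0, 1.0]
curve = progressive_recovery(x, x, identity, ks=[1, 2, 3, 4])
print(curve)  # [(1, 14.0), (2, 5.0), (3, 1.0), (4, 0.0)]
```

A curve that keeps falling as $k$ grows corresponds to the continuous improvement the caption attributes to the ReLU and Matryoshka models, whereas a curve that flattens past some $k$ corresponds to the plateau it describes for TopK models.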