Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment

Harrish Thasarathan, Julian Forsyth, Thomas Fel, Matthew Kowal, Konstantinos Derpanis

TL;DR

Universal Sparse Autoencoders (USAEs) address the challenge of interpreting multiple pretrained vision models by learning a shared, sparse concept space that can jointly encode and reconstruct activations across models. By training model-specific encoders and decoders to operate within a single universal code Z, USAEs enable cross-model reconstruction, concept alignment, and a new Coordinated Activation Maximization application that visualizes the same concept across architectures. The authors demonstrate that USAEs discover a spectrum of universal concepts from low-level features to high-level structures, and quantify universality via firing entropy and co-fire proportions, with cross-model reconstruction supported by $R^2$ scores. They further compare universal concepts to independently learned SAEs, showing meaningful overlaps and identifying new universal representations unique to the cross-model objective. Overall, USAEs provide a scalable, gradient-based approach for multi-model interpretability with practical benefits for understanding and coordinating multi-model AI systems.

Abstract

We present Universal Sparse Autoencoders (USAEs), a framework for uncovering and aligning interpretable concepts spanning multiple pretrained deep neural networks. Unlike existing concept-based interpretability methods, which focus on a single model, USAEs jointly learn a universal concept space that can reconstruct and interpret the internal activations of multiple models at once. Our core insight is to train a single, overcomplete sparse autoencoder (SAE) that ingests activations from any model and decodes them to approximate the activations of any other model under consideration. By optimizing a shared objective, the learned dictionary captures common factors of variation (concepts) across different tasks, architectures, and datasets. We show that USAEs discover semantically coherent and important universal concepts across vision models, ranging from low-level features (e.g., colors and textures) to higher-level structures (e.g., parts and objects). Overall, USAEs provide a powerful new method for interpretable cross-model analysis and offer novel applications, such as coordinated activation maximization, that open avenues for deeper insights into multi-model AI systems.
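
As a rough formalization of the core idea above, using the notation of the figure captions ($\bm{\Psi}_{\theta}^{(i)}$ for the encoder of model $i$, $\bm{D}^{(j)}$ for the decoder of model $j$), one training step can be sketched as follows. The linear decoding $\bm{Z}\bm{D}^{(j)}$, the squared Frobenius norm, and the model count $M$ are assumptions made for illustration; this summary does not state the exact form of the loss.

$$
\bm{Z} = \bm{\Psi}_{\theta}^{(i)}\big(\bm{A}^{(i)}\big), \qquad
\widehat{\bm{A}}^{(j)} = \bm{Z}\,\bm{D}^{(j)}, \qquad
\mathcal{L}_{\text{Universal}} = \sum_{j=1}^{M} \big\| \bm{A}^{(j)} - \widehat{\bm{A}}^{(j)} \big\|_F^2 .
$$

Under this reading, the code $\bm{Z}$ comes from model $i$ alone while the loss covers every model $j$, so the terms with $j \neq i$ are the ones that push the dictionary toward concepts shared across models.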

Paper Structure

This paper contains 25 sections, 13 equations, 16 figures.

Figures (16)

  • Figure 1: Overview of Universal Sparse Autoencoders. (A) We introduce Universal Sparse Autoencoders (USAEs), a method for discovering common concepts across multiple different deep neural networks. USAEs are simultaneously trained on the activations of multiple models and are constrained to share an aligned and interpretable dictionary of discovered concepts. (B) We also demonstrate one immediate application of USAEs, Coordinated Activation Maximization, where optimizing the inputs of multiple models to activate the same concepts reveals how different models encode the same concept. Visualization reveals interesting concepts at various levels of abstraction, such as 'curves' (top), 'animal haunch' (middle) and 'the faces of crowds' (bottom). Better viewed with zoom.
  • Figure 2: USAE training process. In each forward pass during training, an encoder of model $i$ is randomly selected to encode a batch of that model's activations, $\bm{Z} = \bm{\Psi}_{\theta}^{(i)}(\bm{A}^{(i)})$. The concept space, $\bm{Z}$, is then decoded to reconstruct every model's activations, $\widehat{\bm{A}}^{(j)}$, using their respective decoders, $\bm{D}^{(j)}$.
  • Figure 3: Training a Universal Sparse Autoencoder. During each training iteration, $\mathcal{L}_{\text{Universal}}$ is the reconstruction error aggregated over every decoded activation $\widehat{\bm{A}}^{(j)}$. We then take an optimizer step for the randomly selected encoder $\bm{\Psi}_{\theta}^{(i)}$ and its associated dictionary $\bm{D}^{(i)}$.
  • Figure 4: Qualitative results of universal concepts. We discover and visualize heatmaps of universal concepts across a broad range of visual abstractions, where bright green denotes a stronger activation of a given concept. We observe colors, basic shapes, foreground-background, parts, objects and their groupings across all considered models.
  • Figure 5: Cross-model activation reconstruction. Each entry $(i, j)$ represents the average $R^2$ score when activations $\bm{A}^{(i)}$ from model $i$ are encoded into the shared code space, $\bm{Z}$, then decoded via $\bm{D}^{(j)}$ to reconstruct $\widehat{\bm{A}}^{(j)}$. Positive off-diagonal $R^2$ scores indicate the presence of shared features across models captured by USAEs.
  • ...and 11 more figures
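
The cross-model reconstruction matrix described in the Figure 5 caption can be illustrated with a minimal NumPy sketch. Everything below is hypothetical scaffolding rather than the authors' code: the affine encoder with a top-k sparsity rule, the linear decoders, the pooled $R^2$ definition, and the names `topk_relu`, `encode`, `r2_score`, and `cross_model_r2` are all assumptions, and the weights are random (untrained), so the printed scores carry no meaning beyond showing the shape of the computation.

```python
import numpy as np

def topk_relu(x, k):
    """Apply ReLU, then keep only the k largest entries per row (assumed sparsity rule, k >= 1)."""
    x = np.maximum(x, 0.0)
    drop = np.argsort(x, axis=1)[:, :-k]       # column indices of everything outside the top k
    out = x.copy()
    np.put_along_axis(out, drop, 0.0, axis=1)  # zero those entries in place
    return out

def encode(A, W, b, k):
    """Hypothetical encoder for one model: affine map followed by top-k sparsification."""
    return topk_relu(A @ W + b, k)

def r2_score(A, A_hat):
    """Coefficient of determination pooled over all samples and dimensions (assumed definition)."""
    ss_res = np.sum((A - A_hat) ** 2)
    ss_tot = np.sum((A - A.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot

def cross_model_r2(acts, encoders, decoders, k):
    """Entry (i, j): encode model i's activations into Z, decode with model j's dictionary."""
    M = len(acts)
    R = np.zeros((M, M))
    for i in range(M):
        Z = encode(acts[i], *encoders[i], k)
        for j in range(M):
            R[i, j] = r2_score(acts[j], Z @ decoders[j])
    return R

# Toy setup: three "models" with different activation widths, evaluated on the same 64 inputs.
rng = np.random.default_rng(0)
n, dims, n_concepts, k = 64, [8, 12, 16], 32, 4
acts = [rng.normal(size=(n, d)) for d in dims]
encoders = [(0.1 * rng.normal(size=(d, n_concepts)), np.zeros(n_concepts)) for d in dims]
decoders = [0.1 * rng.normal(size=(n_concepts, d)) for d in dims]

R = cross_model_r2(acts, encoders, decoders, k)
print(R.round(3))
```

With trained encoders and decoders, the diagonal of `R` would correspond to self-reconstruction and the off-diagonal entries to the cross-model case that the caption highlights.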