Sparse autoencoders reveal selective remapping of visual concepts during adaptation

Hyesu Lim, Jinho Choi, Jaegul Choo, Steffen Schneider

TL;DR

PatchSAE introduces a sparse autoencoder framework tailored to CLIP's Vision Transformer to extract patch-level visual concepts and their spatial attributions. By analyzing how these concepts align with classification outcomes, the study reveals that adaptation methods largely reuse existing concepts and primarily remap them to downstream task classes rather than introducing many new concepts. The authors demonstrate the tool's ability to localize concepts, reveal class-discriminative latent directions, and explain the mechanisms behind prompt-based adaptation (e.g., MaPLe) across 11 datasets. This work provides a concrete framework to debug, interpret, and categorize adaptation strategies for vision-language foundation models, with implications for principled model customization and explainability.

Abstract

Adapting foundation models for specific purposes has become a standard approach to build machine learning systems for downstream applications. Yet, it is an open question which mechanisms take place during adaptation. Here we develop a new Sparse Autoencoder (SAE) for the CLIP vision transformer, named PatchSAE, to extract interpretable concepts at granular levels (e.g., shape, color, or semantics of an object) and their patch-wise spatial attributions. We explore how these concepts influence the model output in downstream image classification tasks and investigate how recent state-of-the-art prompt-based adaptation techniques change the association of model inputs to these concepts. While activations of concepts slightly change between adapted and non-adapted models, we find that the majority of gains on common adaptation tasks can be explained with the existing concepts already present in the non-adapted foundation model. This work provides a concrete framework to train and use SAEs for Vision Transformers and provides insights into explaining adaptation mechanisms.
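
As a rough illustration of the kind of objective involved, the sketch below applies a generic sparse autoencoder to a set of patch tokens and scores it with a mean-squared-error reconstruction term plus an L1 penalty on the latents. The ReLU encoder, the weight shapes, the names (`sae_forward`, `sae_loss`, `l1_coeff`), and the random inputs are all assumptions made for illustration; this is not the authors' PatchSAE implementation.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode patch tokens into non-negative latents and decode them back.

    x: (num_patches, d_model) patch tokens from a frozen backbone (assumed shape).
    Returns latents z of shape (num_patches, d_sae) and the reconstruction x_hat.
    """
    z = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU encoder (an assumption)
    x_hat = z @ W_dec + b_dec               # linear decoder
    return z, x_hat

def sae_loss(x, x_hat, z, l1_coeff):
    """MSE reconstruction loss plus an L1 sparsity penalty on the latents."""
    mse = np.mean((x - x_hat) ** 2)
    l1 = np.mean(np.sum(np.abs(z), axis=-1))  # per-patch L1 norm, averaged
    return mse + l1_coeff * l1

# Toy usage with random stand-ins for the patch tokens (hypothetical sizes).
rng = np.random.default_rng(0)
num_patches, d_model, d_sae = 5, 8, 32
x = rng.normal(size=(num_patches, d_model))
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)

z, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, x_hat, z, l1_coeff=1e-3)
```

Because the SAE is applied to every patch token independently, each latent yields one activation per patch, which is what makes patch-wise spatial attribution possible.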

Paper Structure

This paper contains 24 sections, 3 equations, 22 figures, 2 tables.

Figures (22)

  • Figure 1: Overview. (a) We train our PatchSAE on a frozen CLIP ViT with an MSE loss and an L1 sparsity regularizer using ImageNet (IN) (§\ref{sec:method_sae}). (b) We analyze the trained PatchSAE by interpreting patch- and image-level concepts of activated SAE latents (§\ref{sec:analyzing_sae} & \ref{sec:sae_outputs}). (c) We then investigate the influence of SAE latents on the model behavior in classification tasks (§\ref{sec:sae_classification_relation}) and explain how adaptation methods improve the downstream task performance (§\ref{sec:adaptation-method-sae}).
  • Figure 2: Analyzing SAE latents. (a) We take an average over patch-level activations for an image and keep the top-$k$ images with the highest mean activation as the reference images for each SAE latent. (b) From patch-level latent activations, we investigate localized concepts. Furthermore, we represent image-, class-, and dataset-wise concepts by aggregating the patch-level activations. (c) For a given concept, we provide the spatial attribution of the concept by visualizing the patch-level activations as a segmentation mask.
  • Figure 3: SAE latent statistics and reference images. Left: Scatter plot of SAE latent statistics ($x$-axis: $\log_{10}$ of activated frequency, $y$-axis: $\log_{10}$ of mean activation) colored by label entropy. Right: Reference images from ImageNet of four SAE latents in different regions.
  • Figure 4: Localizing SAE latent activations under a covariate shift. Given two input images of class hen, we show image-level aggregated SAE latent activations ($x$-axis: SAE latent index, $y$-axis: image-level activation), reference images from ImageNet, and segmentation masks for each input. Among the top 10 latents for each input, we pick three interpretable indices where (a) and (b) represent different domains (image style or background) and (c) shows the shared concept.
  • Figure 5: Top-$k$ SAE latent masking. (a) Top-$k$ SAE latent masking implementation for CLIP and MaPLe. MaPLe adds learnable prompt tokens on top of CLIP. (b) CLIP (zero-shot) classification accuracy on ImageNet-1K for different SAE latent masking strategies. Top-$k$ selection based on class-level latent activations crucially affects the accuracy, while random or dataset-level selections show marginal or no impact. (c) Example images and top-$k$ class-level masking experiment for 11 tasks.
  • ...and 17 more figures
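
The aggregation and masking steps described in the captions of Figures 2 and 5 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's code: the function names are hypothetical, class-level activations are taken as a plain mean over a class's images, and "masking" is interpreted as zeroing the selected latents before decoding.

```python
import numpy as np

def image_level_activation(patch_acts):
    """Average patch-level activations (num_patches, d_sae) into one (d_sae,) vector."""
    return patch_acts.mean(axis=0)

def top_k_reference_images(image_acts, k):
    """For each latent, return indices of the k images with the highest mean activation.

    image_acts: (num_images, d_sae). Returns an array of shape (d_sae, k).
    """
    order = np.argsort(-image_acts, axis=0)  # descending over the image axis
    return order[:k].T

def class_level_activation(image_acts, labels, num_classes):
    """Aggregate image-level activations per class (mean is an assumption)."""
    out = np.zeros((num_classes, image_acts.shape[1]))
    for c in range(num_classes):
        members = labels == c
        if members.any():  # leave zeros for classes with no images
            out[c] = image_acts[members].mean(axis=0)
    return out

def mask_top_k_latents(patch_acts, class_acts, k):
    """Zero the k latents with the highest class-level activation (assumed masking rule).

    patch_acts: (num_patches, d_sae); class_acts: (d_sae,) for one class.
    Returns the masked activations and the indices of the masked latents.
    """
    top = np.argsort(-class_acts)[:k]
    masked = patch_acts.copy()
    masked[:, top] = 0.0
    return masked, top

# Toy example: 3 images, 2 latents.
image_acts = np.array([[0.1, 0.9],
                       [0.5, 0.2],
                       [0.3, 0.4]])
refs = top_k_reference_images(image_acts, k=2)

# Toy example: 2 patches, 3 latents.
patch_acts = np.array([[1.0, 0.0, 2.0],
                       [3.0, 0.0, 0.0]])
img_act = image_level_activation(patch_acts)
masked, top = mask_top_k_latents(patch_acts, img_act, k=1)
```

In a full pipeline the masked activations would be decoded by the SAE and passed on through the rest of the model, so that the change in classification accuracy can be attributed to the masked latents.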