Analyzing Hierarchical Structure in Vision Models with Sparse Autoencoders

Matthew Lyle Olson, Musashi Hinck, Neale Ratzlaff, Changbai Li, Phillip Howard, Vasudev Lal, Shao-Yen Tseng

TL;DR

This work investigates whether vision models encode the hierarchical structure of the ImageNet taxonomy in their internal representations. It adapts Sparse Autoencoders to probe layer-wise activations in the vision foundation model DINOv2, introducing metrics such as Lowest Common Hypernym height and Ontological Coverage and employing relevancy maps for grounding. The study finds that hierarchical information emerges in deeper layers, with SAEs uncovering taxonomic relationships and higher-order concepts, while early layers show limited ontological structure. The results establish a framework for systematic hierarchical analysis of vision representations and demonstrate SAEs as a viable tool for probing semantic structure in deep networks, with implications for interpretability and representation editing.

Abstract

The ImageNet hierarchy provides a structured taxonomy of object categories, offering a valuable lens through which to analyze the representations learned by deep vision models. In this work, we conduct a comprehensive analysis of how vision models encode the ImageNet hierarchy, leveraging Sparse Autoencoders (SAEs) to probe their internal representations. SAEs have been widely used as an explanation tool for large language models (LLMs), where they enable the discovery of semantically meaningful features. Here, we extend their use to vision models to investigate whether learned representations align with the ontological structure defined by the ImageNet taxonomy. Our results show that SAEs uncover hierarchical relationships in model activations, revealing an implicit encoding of taxonomic structure. We analyze the consistency of these representations across different layers of the popular vision foundation model DINOv2 and provide insights into how deep vision models internalize hierarchical category information by increasing information in the class token through each layer. Our study establishes a framework for systematic hierarchical analysis of vision model representations and highlights the potential of SAEs as a tool for probing semantic structure in deep networks.
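As a rough illustration of the probing tool the abstract refers to, the following is a minimal sketch of a ReLU sparse autoencoder forward pass applied to a batch of activation vectors (for example, class-token activations from one layer). The source only states that a ReLU SAE is trained per layer; the pre-encoder bias subtraction, the L1 sparsity penalty, the coefficient `l1_coeff`, and all dimensions below are assumptions chosen for illustration, not the authors' configuration.

```python
import numpy as np

def relu_sae_forward(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3):
    """Forward pass of a ReLU sparse autoencoder on a batch of activations.

    x: (batch, d_model) activation vectors.
    Returns the sparse codes, the reconstruction, and the total loss.
    The L1 penalty and l1_coeff are assumed, not taken from the paper.
    """
    # Encode: non-negative codes, shape (batch, d_sae).
    z = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    # Decode: linear reconstruction, shape (batch, d_model).
    x_hat = z @ W_dec + b_dec
    # Squared reconstruction error per sample, averaged over the batch.
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    # L1 norm of the codes per sample, averaged over the batch.
    sparsity = np.mean(np.sum(np.abs(z), axis=1))
    return z, x_hat, recon + l1_coeff * sparsity

# Toy dimensions (hypothetical; real DINOv2 activations are much wider).
rng = np.random.default_rng(0)
d_model, d_sae, batch = 8, 32, 4
x = rng.normal(size=(batch, d_model))
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)

z, x_hat, loss = relu_sae_forward(x, W_enc, b_enc, W_dec, b_dec)
active_per_sample = (z > 0).sum(axis=1)  # number of active SAE units per input
```

In a setup like this, each column of `z` plays the role of one SAE unit whose top-activating images can then be compared against the taxonomy; the count in `active_per_sample` is one simple way to measure how many activations a layer requires.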

Paper Structure

This paper contains 21 sections, 8 equations, 3 figures.

Figures (3)

  • Figure 1: Results of training a ReLU SAE (or linear probe) on every layer of DINOv2's class token on ImageNet. We find the surprising result that the early layers in this model are non-informative: the representations are incredibly easy to auto-encode (right y-axis), require very few activations from an SAE (right y-axis), and are not usable for fitting a classification model (left y-axis).
  • Figure 2: Distribution of LCH Height vs Ontological Coverage for SAE Heads at Layers 24, 28, 32 and 36 of DINOv2. For each layer, we plot the distribution of LCH height and ontological coverage of the SAE heads. Darker indicates higher bin density. Not only does the vision model capture hierarchical concepts in its output, but it also shows signs of enhancing hierarchical features layer by layer throughout its processing.
  • Figure 3: Relevancy maps of the hierarchical SAE head at DINOv2 Layer 36 activating on images of whales. These relevancy maps show the model activating strongly on a hierarchical concept shared by Orcas and Grey Whales, which illustrates DINOv2's ability to focus on highly meaningful parts of an image.
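
The captions above refer to the Lowest Common Hypernym (LCH) of the classes an SAE head responds to, but this excerpt does not give the exact definition of LCH height. The sketch below shows one plausible reading on a toy taxonomy: the LCH is the deepest node that is an ancestor of every class in a set, and its height is the longest path from that node down to a leaf. The taxonomy, the node names, and the height convention are all hypothetical and are not the ImageNet/WordNet tree or the authors' metric.

```python
# Toy taxonomy as child -> parent (hypothetical, for illustration only).
PARENT = {
    "orca": "whale",
    "grey_whale": "whale",
    "whale": "mammal",
    "tabby_cat": "cat",
    "cat": "mammal",
    "mammal": "animal",
    "goldfish": "fish",
    "fish": "animal",
    "animal": None,
}

def ancestors(node):
    """Return the path from node up to the root, node included."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def lowest_common_hypernym(classes):
    """Deepest node that is an ancestor of (or equal to) every class."""
    common = set(ancestors(classes[0]))
    for c in classes[1:]:
        common &= set(ancestors(c))
    # Walking up from any class, the first shared node met is the lowest one.
    return next(n for n in ancestors(classes[0]) if n in common)

def height(node):
    """Longest downward path from node to a leaf (assumed convention: leaves are 0)."""
    children = [c for c, p in PARENT.items() if p == node]
    return 0 if not children else 1 + max(height(c) for c in children)

# A head firing on two whale species has a low, specific LCH ...
whale_lch = lowest_common_hypernym(["orca", "grey_whale"])   # "whale"
whale_height = height(whale_lch)                             # 1
# ... while a head firing on unrelated classes has a high, generic one.
mixed_lch = lowest_common_hypernym(["orca", "goldfish"])     # "animal"
mixed_height = height(mixed_lch)                             # 3
```

Under this reading, a head like the one in Figure 3 that activates on both orcas and grey whales would map to a specific shared hypernym, whereas a head mixing distant classes would map to a node near the root.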