Deciphering 'What' and 'Where' Visual Pathways from Spectral Clustering of Layer-Distributed Neural Representations

Xiao Zhang, David Yunis, Michael Maire

TL;DR

This work introduces a gradient-based spectral clustering framework that analyzes all layers of a pre-trained vision model to extract dense image regions without labels. By constructing and optimizing over a set of layer-wise affinities derived from attention (Q-K) and value (V-V) relations, the method yields eigenvector embeddings that reveal both spatial layout and object identity, effectively separating a 'where' pathway from a 'what' pathway. Retrieved regions demonstrate strong per-image segmentation across diverse backbones and, at the dataset level, uncover a robust spatial-semantic split that enables unsupervised semantic segmentation competitive with state-of-the-art methods. The approach offers a scalable, training-free tool for interpreting large foundation models and potentially guiding downstream tasks without task-specific fine-tuning.

Abstract

We present an approach for analyzing grouping information contained within a neural network's activations, permitting extraction of spatial layout and semantic segmentation from the behavior of large pre-trained vision models. Unlike prior work, our method conducts a holistic analysis of a network's activation state, leveraging features from all layers and obviating the need to guess which part of the model contains relevant information. Motivated by classic spectral clustering, we formulate this analysis in terms of an optimization objective involving a set of affinity matrices, each formed by comparing features within a different layer. Solving this optimization problem using gradient descent allows our technique to scale from single images to dataset-level analysis, including, in the latter, both intra- and inter-image relationships. Analyzing a pre-trained generative transformer provides insight into the computational strategy learned by such models. Equating affinity with key-query similarity across attention layers yields eigenvectors encoding scene spatial layout, whereas defining affinity by value vector similarity yields eigenvectors encoding object identity. This result suggests that key and query vectors coordinate attentional information flow according to spatial proximity (a 'where' pathway), while value vectors refine a semantic category representation (a 'what' pathway).
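
To make the idea of a spectral objective over several affinity matrices concrete, here is a minimal NumPy sketch. It is not the paper's objective or implementation: it assumes one symmetric affinity matrix per layer, all at the same resolution (so the rescaling across layers is omitted), averages their degree-normalized forms, and maximizes trace(X^T M X) by gradient steps followed by a QR re-orthonormalization. The function names, step size, and toy data are all hypothetical.

```python
import numpy as np

def normalized_affinity(A):
    """Symmetric degree normalization D^{-1/2} A D^{-1/2}."""
    d = A.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(d + 1e-12)
    return A * inv_sqrt[:, None] * inv_sqrt[None, :]

def multi_layer_spectral_embedding(affinities, k, steps=500, lr=0.5, seed=0):
    """Gradient steps on -trace(X^T M X), where M averages the per-layer
    normalized affinities. QR keeps the columns of X orthonormal.
    (Assumed stand-in objective, not the one used in the paper.)"""
    rng = np.random.default_rng(seed)
    n = affinities[0].shape[0]
    M = sum(normalized_affinity(A) for A in affinities) / len(affinities)
    X, _ = np.linalg.qr(rng.standard_normal((n, k)))
    for _ in range(steps):
        grad = -2.0 * M @ X                  # gradient of -trace(X^T M X)
        X, _ = np.linalg.qr(X - lr * grad)   # descent step, then re-orthonormalize
    return X

def two_block_affinity(cross):
    """Toy affinity: 8 nodes in two groups of 4, weak links across groups."""
    A = np.full((8, 8), cross)
    A[:4, :4] = 1.0
    A[4:, 4:] = 1.0
    return A

# Two toy "layers" that agree on the grouping but differ in noise level.
layers = [two_block_affinity(0.05), two_block_affinity(0.10)]
X = multi_layer_spectral_embedding(layers, k=2)
groups = X[:, 1] > 0   # sign of the second column splits the two groups
print(groups)
```

In this toy setting the first column of `X` is nearly constant and the sign of the second column recovers the two groups; which group gets the positive sign is arbitrary.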

Paper Structure (28 sections, 8 equations, 13 figures, 8 tables)

Figures (13)

  • Figure 1: Spectral clustering of layer-distributed representations. For each input image, we collect key, query, and value feature vectors from attention layers across network depth (and, for diffusion models, time). Intra- and inter-image value-value (top) and key-query (bottom) similarity define a collection of affinity matrices indexed by layer (and time). We solve for pseudo-eigenvectors ${\bm{X}}$ which, when scaled to the spatial resolution of each layer via $g(\cdot)$, best satisfy an average of per-layer spectral partitioning criteria. The leading eigenvector from value-value affinity reveals semantic category (top), while that from key-query affinity reveals spatial layout (bottom).
  • Figure 2: Features extracted from different models on PASCAL VOC (Everingham et al., 2015). Across models we extract meaningful regions, even for models like Stable Diffusion (Rombach et al., 2022), CLIP (Radford et al., 2021), or MAE (He et al., 2022) whose training is not well-aligned with segmentation.
  • Figure 3: Oracle-based semantic segmentation performance with varying region count. Across models and number of clusters (regions) returned by K-Means, our method (Ours + K-Means) yields better agreement (in mIoU) with ground-truth than running Normalized Cuts (Ncut + K-Means), or directly applying K-Means on the final output features of the model (K-Means). We observe an even more significant improvement when applying our method to MAE and CLIP, which do not produce discriminative features.
  • Figure 4: Extracted eigenvectors on COCO for both graph choices. We visualize selected components of ${\bm{X}}_\text{ortho}$, sorted by decreasing eigenvalue. Three eigenvectors at a time are rendered as RGB images. In the Q-K case, the first set of eigenvectors describes general scene spatial layout in terms of ground, subject, background, and sky. The second finds top-to-bottom part separation within objects. In the V-V case, the first set of eigenvectors partitions the image into coarse semantics like trees, ground, and sky, while the second set recognizes finer-grained categories and groups individual objects like people, animals, and vehicles.
  • Figure 5: Extracted eigenvectors on Cityscapes for both graph choices. We visualize selected components of ${\bm{X}}_\text{ortho}$, sorted by decreasing eigenvalue. Three eigenvectors at a time are rendered as RGB images. In the Q-K case, eigenvectors detect the scene spatial layout and indicate how far left or right buildings, cars, trees, and people are. In the V-V case, eigenvectors perform semantic recognition and separate trees and buildings from road, and distinguish cars, people, and road markings.
  • ...and 8 more figures
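
The captions above refer to key-query and value-value affinities and to rendering three eigenvectors at a time as an RGB image. The following NumPy sketch illustrates one way such quantities could be computed for a single attention layer. The specific choices here are assumptions, not the paper's definitions: the key-query affinity is taken as a symmetrized softmax attention map, the value-value affinity as clipped cosine similarity, and the eigenvectors come from a plain `np.linalg.eigh` call on toy random features. All function names are hypothetical.

```python
import numpy as np

def qk_affinity(Q, K):
    """Assumed key-query affinity: symmetrized row-softmax of Q K^T / sqrt(d)."""
    d = Q.shape[1]
    logits = Q @ K.T / np.sqrt(d)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=1, keepdims=True)
    return 0.5 * (attn + attn.T)

def vv_affinity(V):
    """Assumed value-value affinity: cosine similarity, negatives clipped to 0."""
    Vn = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)
    return np.clip(Vn @ Vn.T, 0.0, None)

def eigenvectors_to_rgb(X, height, width, first=0):
    """Map three consecutive columns of X to RGB channels, each min-max
    normalized to [0, 1] over the image."""
    chans = X[:, first:first + 3].reshape(height, width, 3)
    lo = chans.min(axis=(0, 1), keepdims=True)
    hi = chans.max(axis=(0, 1), keepdims=True)
    return (chans - lo) / (hi - lo + 1e-12)

# Toy features for a 4x4 token grid with 8-dimensional heads (random data).
rng = np.random.default_rng(0)
H, W, dim = 4, 4, 8
Q, K, V = (rng.standard_normal((H * W, dim)) for _ in range(3))

A_qk = qk_affinity(Q, K)
A_vv = vv_affinity(V)

# Eigenvectors of the value-value affinity, sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(A_vv)
X = eigvecs[:, ::-1]
rgb = eigenvectors_to_rgb(X, H, W, first=0)
print(rgb.shape)
```

With random features the resulting image is meaningless; the point is only the data flow from per-layer features to affinities to an RGB rendering of eigenvector triples.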