Beyond Scalars: Concept-Based Alignment Analysis in Vision Transformers

Johanna Vielhaben, Dilyara Bareeva, Jim Berend, Wojciech Samek, Nils Strodthoff

TL;DR

This work addresses the limitation of scalar-only alignment measures for Vision Transformers by introducing a concept-based alignment framework. Concepts are defined as nonlinear manifolds in feature space and discovered via a two-stage approach that applies HDBSCAN clustering to UMAP embeddings, producing soft concept proximity scores. Alignment across representations is quantified with a pseudo-metric based on a generalized Rand index, which decomposes into per-concept contributions and enables fine-grained intra- and inter-model comparisons. Across four ViTs trained with varying levels of supervision, the analysis indicates that increased supervision correlates with reduced semantic structure in the learned representations, a distinction that a single scalar alignment value cannot express.

Abstract

Vision transformers (ViTs) can be trained using various learning paradigms, from fully supervised to self-supervised. Diverse training protocols often result in significantly different feature spaces, which are usually compared through alignment analysis. However, current alignment measures quantify this relationship in terms of a single scalar value, obscuring the distinctions between common and unique features in pairs of representations that share the same scalar alignment. We address this limitation by combining alignment analysis with concept discovery, which enables a breakdown of alignment into single concepts encoded in feature space. This fine-grained comparison reveals both universal and unique concepts across different representations, as well as the internal structure of concepts within each of them. Our methodological contributions address two key prerequisites for concept-based alignment: 1) For a description of the representation in terms of concepts that faithfully capture the geometry of the feature space, we define concepts as the most general structure they can possibly form - arbitrary manifolds, allowing hidden features to be described by their proximity to these manifolds. 2) To measure distances between concept proximity scores of two representations, we use a generalized Rand index and partition it for alignment between pairs of concepts. We confirm the superiority of our novel concept definition for alignment analysis over existing linear baselines in a sanity check. The concept-based alignment analysis of representations from four different ViTs reveals that increased supervision correlates with a reduction in the semantic structure of learned representations.
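The idea of comparing two soft concept descriptions with a Rand-style measure can be illustrated with a minimal sketch. It assumes the fuzzy-partition form of the Rand index, in which each pair of samples receives a "same-concept" degree from its soft assignments and the distance is the mean disagreement of those degrees. The paper's exact definition and its per-concept decomposition are not reproduced here. The function names and the requirement that each row sums to 1 are assumptions made for illustration.

```python
import numpy as np


def pair_equivalence(P):
    """Soft 'same-concept' degree for every pair of samples.

    P has shape (n_samples, n_concepts); each row is assumed to be a
    soft assignment that sums to 1 (an assumption of this sketch).
    """
    # Half the L1 distance between two such rows lies in [0, 1].
    l1 = np.abs(P[:, None, :] - P[None, :, :]).sum(axis=-1)
    return 1.0 - 0.5 * l1


def soft_rand_distance(P, Q):
    """Mean disagreement of pairwise equivalence degrees over all sample pairs.

    P and Q describe the same n samples but may have different numbers
    of concepts. 0 means both descriptions group the samples identically.
    """
    n = P.shape[0]
    upper = np.triu_indices(n, k=1)  # each unordered pair once
    diff = np.abs(pair_equivalence(P) - pair_equivalence(Q))
    return float(diff[upper].mean())


# Two hard partitions of four samples, written as one-hot rows.
P = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
Q = P[:, ::-1]            # same grouping, concept labels swapped
R = np.ones((4, 1))       # everything in a single concept

d_same = soft_rand_distance(P, Q)
d_merged = soft_rand_distance(P, R)
```

Because only pairwise co-membership enters the computation, the measure ignores how concepts are labeled or ordered, which is what makes it usable across representations with different concept sets.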

Paper Structure

This paper contains 44 sections, 11 equations, 17 figures, 3 tables.

Figures (17)

  • Figure 1: We combine concept discovery with alignment analysis for fine-grained insights into structures within and differences between latent activations. To this end, we investigate latent activations formed by intermediate layers, which according to the manifold hypothesis can be organized in terms of low-dimensional manifolds. We recover these manifolds using density-based clustering applied to UMAP embeddings of the latent representations. The discovered structures in latent space allow us to characterize not only a single layer but also the formation of structures between layers.
  • Figure 2: We evaluate the quality of concept discovery. RMSE is the root-mean-square error between the distance matrices of the original and embedded activations and shows how faithfully the UMAP embedding captures the geometry of the representation. DBCV is a density-based clustering validity index that contrasts intra- and inter-cluster density. The noise rate is the fraction of points that HDBSCAN classifies as noise. Robustness is the concept-based alignment (CBA, Eq. \ref{eq:CBA}) between two runs. Results are shown across layers for CLS (dotted) and SEQ (solid) token representations of the models in Table \ref{tab:models}.
  • Figure 3: Concept formation graph for the concept "apple(s)" in layer 9 of the FS model. Each concept is represented by six randomly sampled images containing a token assigned to that concept (highlighted in a yellow frame).
  • Figure 4: Intra-model relationships based on SEQ representations across layers. In the upper row, we show CBA from Eq. \ref{eq:CBA} to visualize how representations are transformed across the layers of the models from Table \ref{tab:models} (darker pixels correspond to higher alignment). We observe a nucleation process between layers 9 and 10 in FS and smoother processing split into two major blocks, layers 1-6 and 6-11, in CLIP, DINO and FS. In the center and bottom rows, we zoom into the representations at layers 6 and 11 of each model and partition the scalar CBA alignment into single concepts. We show a UMAP embedding constructed from the pairwise distances between concepts, measured by $d_{cross}(P^{\alpha},P^{\beta})$ from Eq. \ref{eq:CBA_dist_c}. Each point in this concept atlas corresponds to a distinct concept $P^{\alpha}$. To convey its meaning, we show four random input tokens from the members of the concept cluster $P^{\alpha}$, each marked by a yellow box within the full image. The higher the level of supervision in ViT training, which ranges from FS over CLIP to DINO and MAE, the less semantically organized the representations at layer 11 are.
  • Figure 5: Class label alignment, token location alignment (both based on CBA from Eq. \ref{eq:CBA}), concept count, and the average intrinsic dimensionality across concepts (estimated following Facco et al., 2017) supplement the intra-model alignment analysis. They provide insights into how well the model aligns with ImageNet-1k labels, how tokens are organized spatially, and how the complexity of the learned concepts evolves through the layers.
  • ...and 12 more figures
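
Two of the diagnostics named in the Figure 2 caption are simple enough to sketch directly. The following is an illustrative sketch rather than the authors' evaluation code: it assumes Euclidean pairwise distances without any rescaling, and it assumes the common HDBSCAN convention that noise points carry the label -1. The function names `distance_rmse` and `noise_rate` are hypothetical.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distance matrix for the rows of X, shape (n, n)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def distance_rmse(X_orig, X_emb):
    """RMSE between the pairwise distances of original and embedded points.

    Lower values mean the embedding preserves the geometry more faithfully.
    No normalization of either space is applied (an assumption of this sketch).
    """
    n = X_orig.shape[0]
    upper = np.triu_indices(n, k=1)  # each unordered pair once
    d_orig = pairwise_distances(X_orig)[upper]
    d_emb = pairwise_distances(X_emb)[upper]
    return float(np.sqrt(np.mean((d_orig - d_emb) ** 2)))


def noise_rate(labels):
    """Fraction of points labeled as noise (label -1, assumed convention)."""
    labels = np.asarray(labels)
    return float(np.mean(labels == -1))


# Toy example: two points at distance 5 are embedded at distance 4.
X_orig = np.array([[0.0, 0.0], [3.0, 4.0]])
X_emb = np.array([[0.0], [4.0]])
rmse = distance_rmse(X_orig, X_emb)

# Two of four points are marked as noise.
rate = noise_rate([0, -1, 1, -1])
```

In practice `X_orig` would hold the layer activations and `X_emb` their low-dimensional embedding, and `labels` would come from the clustering step.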

Theorems & Definitions (1)

  • Definition 1