A Data-driven Typology of Vision Models from Integrated Representational Metrics

Jialin Wu, Shreya Saha, Yiqing Bo, Meenakshi Khosla

TL;DR

The paper tackles how to distinguish universal versus family-specific representations across diverse vision models. It introduces a data-driven framework that combines multiple representational metrics with Similarity Network Fusion to produce robust, composite model signatures and a typology that transcends traditional architecture-based grouping. Geometry- and tuning-preserving metrics show strong family discrimination, while linearly decodable information is more shared; SNF integration yields dramatically improved separation (e.g., $d' \approx 11.84$) and coherent model clusters. The resulting typology reveals that self-supervised models form a unified cluster across architectures, while hybrids align with MAE strategies, highlighting emergent computational signatures shaped by architecture and training objective. This approach provides a principled, scalable way to compare new vision models, predict transfer behavior, and understand the representational principles underlying successful visual processing.
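
The $d'$ value quoted above measures how far within-family similarities sit from between-family similarities. The paper's exact formula is not reproduced in this summary, so the sketch below assumes the standard sensitivity index (difference of means divided by the pooled standard deviation). The function name `family_dprime`, the toy matrix, and the family labels are hypothetical and chosen only for illustration.

```python
import numpy as np

def family_dprime(similarity, families):
    """Sensitivity index between within- and between-family similarities.

    Assumption: d' = (mean_within - mean_between) / pooled standard deviation.
    """
    similarity = np.asarray(similarity, dtype=float)
    families = np.asarray(families)
    # Visit each unordered model pair once (upper triangle, no diagonal).
    rows, cols = np.triu_indices(len(families), k=1)
    same_family = families[rows] == families[cols]
    within = similarity[rows, cols][same_family]
    between = similarity[rows, cols][~same_family]
    pooled_sd = np.sqrt(0.5 * (within.var(ddof=1) + between.var(ddof=1)))
    return (within.mean() - between.mean()) / pooled_sd

# Hypothetical 4-model similarity matrix: two "cnn" and two "vit" models.
toy_similarity = np.array([
    [1.0, 0.9, 0.2, 0.3],
    [0.9, 1.0, 0.1, 0.2],
    [0.2, 0.1, 1.0, 0.8],
    [0.3, 0.2, 0.8, 1.0],
])
toy_families = ["cnn", "cnn", "vit", "vit"]
toy_score = family_dprime(toy_similarity, toy_families)
print(f"d' = {toy_score:.2f}")
```

A larger value means the two similarity distributions overlap less, which is the sense in which fusion "improves separation".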

Abstract

Large vision models differ widely in architecture and training paradigm, yet we lack principled methods to determine which aspects of their representations are shared across families and which reflect distinctive computational strategies. We leverage a suite of representational similarity metrics, each capturing a different facet (geometry, unit tuning, or linear decodability), and assess family separability using multiple complementary measures. Metrics preserving geometry or tuning (e.g., RSA, Soft Matching) yield strong family discrimination, whereas flexible mappings such as Linear Predictivity show weaker separation. These findings indicate that geometry and tuning carry family-specific signatures, while linearly decodable information is more broadly shared. To integrate these complementary facets, we adapt Similarity Network Fusion (SNF), a method inspired by multi-omics integration. SNF achieves substantially sharper family separation than any individual metric and produces robust composite signatures. Clustering of the fused similarity matrix recovers both expected and surprising patterns: supervised ResNets and ViTs form distinct clusters, yet all self-supervised models group together across architectural boundaries. Hybrid architectures (ConvNeXt, Swin) cluster with masked autoencoders, suggesting convergence between architectural modernization and reconstruction-based training. This biology-inspired framework provides a principled typology of vision models, showing that emergent computational strategies, shaped jointly by architecture and training objective, define representational structure beyond surface design categories.
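
To make the fusion step concrete, here is a simplified sketch of the cross-diffusion idea behind Similarity Network Fusion. It is not the authors' adaptation, whose details are not given in this summary. It assumes nonnegative, roughly symmetric similarity matrices, uses plain row normalization in place of the original SNF normalization, and the names `fuse_similarities`, `k`, and `n_iter` are hypothetical.

```python
import numpy as np

def _row_normalize(matrix):
    # Scale each row to sum to 1 (assumes every row has a positive sum).
    return matrix / matrix.sum(axis=1, keepdims=True)

def _knn_kernel(weights, k):
    """Keep each row's k largest off-diagonal entries, then row-normalize."""
    weights = weights.copy()
    np.fill_diagonal(weights, 0.0)
    local = np.zeros_like(weights)
    top_k = np.argsort(weights, axis=1)[:, -k:]
    rows = np.arange(weights.shape[0])[:, None]
    local[rows, top_k] = weights[rows, top_k]
    return _row_normalize(local)

def fuse_similarities(matrices, k=2, n_iter=10):
    """Simplified SNF-style fusion of several model-by-model similarity matrices."""
    if len(matrices) < 2:
        raise ValueError("need at least two similarity matrices to fuse")
    sym = [(np.asarray(m, dtype=float) + np.asarray(m, dtype=float).T) / 2
           for m in matrices]
    status = [_row_normalize(m) for m in sym]   # dense "global" view per metric
    local = [_knn_kernel(m, k) for m in sym]    # sparse "local" view per metric
    for _ in range(n_iter):
        updated = []
        for v in range(len(sym)):
            # Diffuse the average of the other metrics through this metric's
            # local neighborhood structure.
            others = np.mean([status[u] for u in range(len(sym)) if u != v], axis=0)
            diffused = local[v] @ others @ local[v].T
            updated.append(_row_normalize((diffused + diffused.T) / 2))
        status = updated
    fused = np.mean(status, axis=0)
    return (fused + fused.T) / 2  # return a symmetric consensus matrix

# Two hypothetical metrics over the same four models.
metric_a = np.array([[1.0, 0.9, 0.2, 0.3], [0.9, 1.0, 0.1, 0.2],
                     [0.2, 0.1, 1.0, 0.8], [0.3, 0.2, 0.8, 1.0]])
metric_b = np.array([[1.0, 0.8, 0.3, 0.2], [0.8, 1.0, 0.2, 0.1],
                     [0.3, 0.2, 1.0, 0.9], [0.2, 0.1, 0.9, 1.0]])
fused_toy = fuse_similarities([metric_a, metric_b], k=2, n_iter=10)
```

The fused matrix can be passed to the same separability measures and clustering routines that are applied to the individual metrics.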

Paper Structure

This paper contains 40 sections, 2 equations, 23 figures, 1 table.

Figures (23)

  • Figure 1: Top left: Each representational metric defines a pairwise similarity matrix over models. Bottom left: Each matrix is visualized as an affinity graph, with nodes representing models and edge widths reflecting pairwise similarity strength; weak similarities below a threshold are omitted for clarity. Right: A consensus matrix obtained via Similarity Network Fusion (SNF) highlights relations consistently supported across metrics while leveraging complementary signals. In the fused graph, solid edges denote agreement across all metrics and dotted edges indicate partial support; strong but uncorroborated edges may persist with reduced weight (e.g., edge 4–5), whereas weak and metric-specific connections are typically suppressed (e.g., edge 1–5).
  • Figure 2: Model-family separability on ImageNet under $d'$, silhouette score, and contrastive ratio. Columns correspond to nine similarity metrics: two fusion-based methods (SNF, Average) and seven commonly used representational metrics (the aspect of representation emphasized by each metric is shown in brackets). Fusion-based metrics consistently yield higher scores, highlighting their effectiveness in capturing family-level distinctions.
  • Figure 3: Mean model-family separability on ImageNet, evaluated using $d'$, silhouette score, and contrastive ratio. Fusion-based metrics (SNF, Average) outperform individual similarity metrics across all datasets, with SNF yielding the most consistent and robust separation. Scores are shown in their native scales and are not directly comparable across measures.
  • Figure 4: Hierarchical clustering of models using three functionally distinct representational similarity metrics and a similarity-metric–averaging baseline (see Fig. \ref{fig:metrics_clustering_imagenet_2} for additional metrics). Clustering is performed with average linkage and optimal leaf ordering, based on induced distances ($1 -$ similarity score). Rows and columns are deliberately re-ordered to match the leaf ordering produced by the clustering algorithm. Lighter colors indicate higher similarity; diagonal entries (self-comparisons) are omitted.
  • Figure 5: SNF-based clustering reveals that models naturally group by architecture and supervision regime. Supervised CNNs and ViTs each form distinct clusters; hybrid models (ConvNeXt, Swin) and MAE ViTs cluster together; and self-supervised models (e.g., DINO ViTs, self-supervised CNNs) form coherent groups. The heatmap shows the SNF-fused similarity matrix reordered by leaf ordering. Leaf labels are colored by the cluster (formed by SNF) they belong to; dendrogram cuts yield up to six flat clusters aligned with canonical categories.
  • ...and 18 more figures
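
The clustering procedure described in the Figure 4 caption (average linkage, optimal leaf ordering, distances induced as $1 -$ similarity) can be sketched with SciPy as follows. This is an illustrative reconstruction under those stated settings, not the authors' code; the function name `cluster_models`, the `n_clusters` parameter, and the toy matrix are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, leaves_list, linkage
from scipy.spatial.distance import squareform

def cluster_models(similarity, n_clusters=2):
    """Average-linkage clustering of models from a similarity matrix."""
    similarity = np.asarray(similarity, dtype=float)
    # Induced distance: 1 - similarity, symmetrized, with a zero diagonal.
    distance = 1.0 - (similarity + similarity.T) / 2
    np.fill_diagonal(distance, 0.0)
    condensed = squareform(distance, checks=False)
    tree = linkage(condensed, method="average", optimal_ordering=True)
    order = leaves_list(tree)  # leaf ordering used to arrange the heatmap
    flat = fcluster(tree, t=n_clusters, criterion="maxclust")
    # Re-order rows and columns to match the dendrogram leaves.
    reordered = similarity[np.ix_(order, order)]
    return order, flat, reordered

# Hypothetical similarity matrix with two clear groups: {0, 1} and {2, 3}.
toy_sim = np.array([
    [1.0, 0.9, 0.2, 0.3],
    [0.9, 1.0, 0.1, 0.2],
    [0.2, 0.1, 1.0, 0.8],
    [0.3, 0.2, 0.8, 1.0],
])
toy_order, toy_flat, toy_reordered = cluster_models(toy_sim, n_clusters=2)
print("leaf order:", toy_order, "flat clusters:", toy_flat)
```

Cutting the same tree at a larger `n_clusters` corresponds to the finer dendrogram cuts mentioned in the Figure 5 caption.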