A Data-driven Typology of Vision Models from Integrated Representational Metrics
Jialin Wu, Shreya Saha, Yiqing Bo, Meenakshi Khosla
TL;DR
The paper addresses how to distinguish representations that are shared across diverse vision models from those specific to a model family. It introduces a data-driven framework that combines multiple representational similarity metrics with Similarity Network Fusion (SNF) to produce robust composite model signatures and a typology that goes beyond traditional architecture-based grouping. Metrics that preserve geometry or unit tuning discriminate strongly between families, whereas linearly decodable information is more widely shared; the SNF-fused similarity yields sharper family separation than any individual metric (e.g., $d' \approx 11.84$) and coherent model clusters. The resulting typology shows that self-supervised models form a single cluster across architectures, while hybrid architectures cluster with masked autoencoders (MAE), pointing to emergent computational signatures shaped jointly by architecture and training objective. The approach offers a principled, scalable way to compare new vision models, predict transfer behavior, and understand the representational principles underlying successful visual processing.
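To make the separability index quoted above concrete, here is a minimal Python sketch that computes a d'-style score from a model-by-model similarity matrix and a list of family labels. The function name `family_dprime`, the pooled-variance form of d', and the toy similarity values are assumptions made for illustration; the paper's exact definition may differ.

```python
import numpy as np

def family_dprime(sim, families):
    """d'-style separability of within-family vs. between-family similarity.

    Assumed form: (mean_within - mean_between) / sqrt(0.5 * (var_within + var_between)).
    `sim` is a symmetric model-by-model similarity matrix; `families` gives one
    family label per model. Both names are hypothetical.
    """
    sim = np.asarray(sim, dtype=float)
    families = np.asarray(families)
    # Use each unordered model pair once (upper triangle, no diagonal).
    rows, cols = np.triu_indices(len(families), k=1)
    same_family = families[rows] == families[cols]
    pair_sims = sim[rows, cols]
    within = pair_sims[same_family]
    between = pair_sims[~same_family]
    pooled_sd = np.sqrt(0.5 * (within.var(ddof=1) + between.var(ddof=1)))
    return (within.mean() - between.mean()) / pooled_sd

# Toy example: four models in two families (values are made up).
toy_sim = np.array([
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.2],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.2, 0.8, 1.0],
])
toy_families = ["a", "a", "b", "b"]
toy_dprime = family_dprime(toy_sim, toy_families)
```

A larger value means within-family similarities sit further above between-family similarities relative to their spread.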
Abstract
Large vision models differ widely in architecture and training paradigm, yet we lack principled methods to determine which aspects of their representations are shared across families and which reflect distinctive computational strategies. We leverage a suite of representational similarity metrics, each capturing a different facet (geometry, unit tuning, or linear decodability), and assess family separability using multiple complementary measures. Metrics preserving geometry or tuning (e.g., RSA, Soft Matching) yield strong family discrimination, whereas flexible mappings such as Linear Predictivity show weaker separation. These findings indicate that geometry and tuning carry family-specific signatures, while linearly decodable information is more broadly shared. To integrate these complementary facets, we adapt Similarity Network Fusion (SNF), a method inspired by multi-omics integration. SNF achieves substantially sharper family separation than any individual metric and produces robust composite signatures. Clustering of the fused similarity matrix recovers both expected and surprising patterns: supervised ResNets and ViTs form distinct clusters, yet all self-supervised models group together across architectural boundaries. Hybrid architectures (ConvNeXt, Swin) cluster with masked autoencoders, suggesting convergence between architectural modernization and reconstruction-based training. This biology-inspired framework provides a principled typology of vision models, showing that emergent computational strategies, shaped jointly by architecture and training objective, define representational structure beyond surface design categories.
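The fusion step can be illustrated with a simplified Python sketch of Similarity Network Fusion, in which each metric's model-by-model affinity matrix is repeatedly diffused through the other metrics' networks and the results are averaged. This is not the authors' implementation: the function names, the defaults `k=3` and `n_iter=20`, the renormalization after each step, and the toy matrices are all assumptions, and the sketch expects at least two symmetric affinity matrices with strictly positive off-diagonal entries.

```python
import numpy as np

def _full_kernel(W):
    """Row-normalize off-diagonal affinities to sum to 0.5; set the diagonal to 0.5."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)
    P = W / (2.0 * W.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P

def _knn_kernel(W, k):
    """Keep each row's k strongest neighbors (excluding self) and row-normalize."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)
    S = np.zeros_like(W)
    nearest = np.argsort(-W, axis=1)[:, :k]
    rows = np.arange(W.shape[0])[:, None]
    S[rows, nearest] = W[rows, nearest]
    return S / S.sum(axis=1, keepdims=True)

def snf(affinities, k=3, n_iter=20):
    """Simplified SNF over a list of affinity matrices (one per metric)."""
    mats = [np.asarray(W, dtype=float) for W in affinities]
    P = [_full_kernel(W) for W in mats]
    S = [_knn_kernel(W, k) for W in mats]
    m = len(mats)
    for _ in range(n_iter):
        P_next = []
        for v in range(m):
            # Diffuse the average of the other views through this view's local graph.
            others = sum(P[u] for u in range(m) if u != v) / (m - 1)
            P_v = S[v] @ others @ S[v].T
            P_v = (P_v + P_v.T) / 2.0
            P_next.append(_full_kernel(P_v))  # renormalize (a simplification)
        P = P_next
    fused = sum(P) / m
    return (fused + fused.T) / 2.0

def _toy_view(within, between):
    """Hypothetical affinity matrix for six models in two families of three."""
    W = np.full((6, 6), between)
    W[:3, :3] = within
    W[3:, 3:] = within
    np.fill_diagonal(W, 1.0)
    return W

fused = snf([_toy_view(0.9, 0.2), _toy_view(0.7, 0.1)], k=2)
```

The fused matrix could then be passed to a clustering routine; which clustering method to use is left open here.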
