Comparing Computational Pathology Foundation Models using Representational Similarity Analysis
Vaibhav Mishra, William Lotter
TL;DR
This work examines how computational pathology foundation models organize information, looking beyond downstream task accuracy by applying Representational Similarity Analysis (RSA) to six models spanning vision-language contrastive learning and vision-only self-distillation. Using H&E patches from TCGA, the study constructs and compares Representational Dissimilarity Matrices to measure cross-model similarity, slide- and disease-dependence, and intrinsic dimensionality. UNI2 and Virchow2 are the most distinct, while Prov-GigaPath has the highest average similarity to the other models. Slide-specific signals dominate over disease signals, and stain normalization reduces slide-dependence. Vision-language models occupy more compact representational spaces than vision-only models, suggesting that training objectives shape how information is encoded. The findings inform model ensembling and robustness considerations, and the RSA framework can be extended to foundation models in other medical imaging domains.
Abstract
Foundation models are increasingly developed in computational pathology (CPath) given their promise in facilitating many downstream tasks. While recent studies have evaluated task performance across models, less is known about the structure and variability of their learned representations. Here, we systematically analyze the representational spaces of six CPath foundation models using techniques popularized in computational neuroscience. The models analyzed span vision-language contrastive learning (CONCH, PLIP, KEEP) and self-distillation (UNI2, Virchow2, Prov-GigaPath) approaches. Through representational similarity analysis using H&E image patches from TCGA, we find that UNI2 and Virchow2 have the most distinct representational structures, whereas Prov-GigaPath has the highest average similarity across models. Sharing a training paradigm (vision-only vs. vision-language) did not guarantee higher representational similarity. The representations of all models showed high slide-dependence but relatively low disease-dependence. Stain normalization decreased slide-dependence for all models, with reductions ranging from 5.5% (CONCH) to 20.5% (PLIP). In terms of intrinsic dimensionality, vision-language models demonstrated relatively compact representations compared to the more distributed representations of vision-only models. These findings highlight opportunities to improve robustness to slide-specific features, inform model ensembling strategies, and provide insights into how training paradigms shape model representations. Our framework is extendable across medical imaging domains, where probing the internal representations of foundation models can support their effective development and deployment.
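As a rough illustration of the representational similarity analysis described above, the following minimal sketch builds a Representational Dissimilarity Matrix (RDM) for each model from its patch embeddings and then compares two models by correlating the upper triangles of their RDMs. The abstract does not state the dissimilarity measure or the comparison statistic, so the choices here (1 minus Pearson correlation for the RDM, Spearman rank correlation between RDMs) are assumptions, as are all function names and the synthetic embeddings standing in for real model outputs.

```python
import numpy as np


def rdm(embeddings):
    """RDM over patches: 1 - Pearson correlation between embedding rows.

    Assumption: correlation distance; the paper may use a different measure.
    """
    return 1.0 - np.corrcoef(embeddings)


def upper_triangle(matrix):
    """Flatten the strictly upper triangle (each patch pair counted once)."""
    rows, cols = np.triu_indices(matrix.shape[0], k=1)
    return matrix[rows, cols]


def ordinal_ranks(values):
    """Ordinal ranks; ties are not averaged, which is adequate for continuous values."""
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks


def rsa_similarity(emb_a, emb_b):
    """Spearman-style correlation between the RDMs of two models.

    Both inputs must describe the same patches in the same row order;
    the embedding widths may differ.
    """
    ranks_a = ordinal_ranks(upper_triangle(rdm(emb_a)))
    ranks_b = ordinal_ranks(upper_triangle(rdm(emb_b)))
    return float(np.corrcoef(ranks_a, ranks_b)[0, 1])


# Synthetic stand-ins for real embeddings (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 16))              # 50 patches, shared latent structure
model_a = latent @ rng.normal(size=(16, 64))    # "model A": 64-dim view of the latent
model_b = latent @ rng.normal(size=(16, 128))   # "model B": 128-dim view of the latent
model_c = rng.normal(size=(50, 32))             # "model C": unrelated to the latent

sim_aa = rsa_similarity(model_a, model_a)
sim_ab = rsa_similarity(model_a, model_b)
sim_ac = rsa_similarity(model_a, model_c)
print(f"A vs A: {sim_aa:.3f}  A vs B: {sim_ab:.3f}  A vs C: {sim_ac:.3f}")
```

Because the comparison operates on patch-by-patch dissimilarities rather than on the embeddings themselves, models with different embedding dimensions can be compared directly, which is what makes this kind of analysis applicable across architectures.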
